Platform Engineering 2025: Build Internal Developer Platforms That Actually Work

Introduction

Let’s talk about the elephant in the DevOps room: we asked developers to own their entire stack from code to production, and now they’re drowning. The “you build it, you run it” philosophy was supposed to empower teams, but instead it created a fragmented mess where every squad reinvents deployment pipelines, monitoring, and infrastructure management.

Enter platform engineering - the discipline that’s taken the industry by storm in 2024-2025. It’s not just DevOps rebranded. It’s a fundamental shift in how we think about enabling development teams: build self-service platforms that provide golden paths while allowing flexibility when needed.

I’ve spent the last 18 months building and evolving an internal developer platform (IDP) for a 200+ engineer organization. We’ve gone from 15+ different deployment methods and zero standardization to a cohesive platform that’s actually beloved by our developers (shocking, I know). In this guide, I’ll share what worked, what failed spectacularly, and the principles that separate great platforms from shelfware.

What Platform Engineering Actually Is (And Isn’t)

The Core Idea

Platform engineering is about treating infrastructure and developer tooling as a product, with your developers as the customers.

Key principles:

  • Self-service by default: Developers shouldn’t need tickets to deploy, create databases, or provision environments
  • Golden paths, not golden cages: Provide opinionated, easy defaults but allow customization when needed
  • Developer experience first: If your platform is painful to use, it will be avoided and routed around
  • Product mindset: Gather feedback, iterate, measure adoption, celebrate wins

What It’s NOT

  • Not a renamed DevOps team: Platform engineering builds products for developers; it doesn’t do ops on their behalf
  • Not enforced standardization: You can’t just lock devs in a cage and call it a platform
  • Not just Kubernetes: While K8s is often involved, the platform layer sits above infrastructure
  • Not a dashboard: Building a UI over kubectl is not a platform

The Platform Engineering Stack in 2025

Modern IDPs typically include:

Developer Portal:

  • Backstage, Port, or Humanitec (optionally with a platform framework such as Kratix behind the portal)
  • Service catalog
  • API documentation
  • Golden path templates

Infrastructure Provisioning:

  • Terraform modules with self-service wrappers
  • Crossplane for declarative infrastructure
  • Cloud provider abstractions

Deployment & Runtime:

  • GitOps with ArgoCD or Flux
  • Kubernetes (often multi-cluster)
  • Service mesh for advanced routing

Observability:

  • Standardized logging, metrics, tracing
  • Pre-configured dashboards
  • Alert templates

Security & Compliance:

  • Policy as code (OPA, Kyverno)
  • Automated security scanning
  • Secrets management

Building Your IDP: A Practical Roadmap

Phase 1: Foundation - Understand Your Developers’ Pain (Weeks 1-4)

Don’t start by picking tools. Start by understanding what’s actually broken.

What I did:

  1. Developer interviews (15-20 one-on-ones)

    • “Walk me through your last deployment”
    • “What takes longer than it should?”
    • “What do you wish just worked?”
  2. Process archaeology

    • Map out every deployment pipeline variant
    • Document all the tribal knowledge and runbooks
    • Identify common failure modes
  3. Metric collection

    • Time from commit to production
    • Mean time to environment provisioning
    • Frequency of “DevOps help needed” tickets
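A baseline like "time from commit to production" is easy to compute once you can export two timestamps per change. The snippet below is a minimal sketch, not the tooling used here: the function names and the sample timestamps are hypothetical, and it assumes you can pull commit and deploy times as ISO 8601 strings from your VCS and CD system.

```python
from datetime import datetime
from statistics import median


def lead_time_hours(commit_at: str, deployed_at: str) -> float:
    """Hours between a commit and its production deploy (ISO 8601 strings)."""
    delta = datetime.fromisoformat(deployed_at) - datetime.fromisoformat(commit_at)
    return delta.total_seconds() / 3600


def median_lead_time_hours(deploys: list[tuple[str, str]]) -> float:
    """Median lead time over (commit_at, deployed_at) pairs."""
    return median(lead_time_hours(commit, deploy) for commit, deploy in deploys)


# Hypothetical sample data, for illustration only
sample = [
    ("2025-01-06T09:00:00", "2025-01-08T15:00:00"),  # 54 h
    ("2025-01-07T10:00:00", "2025-01-10T10:00:00"),  # 72 h
    ("2025-01-09T08:30:00", "2025-01-11T14:30:00"),  # 54 h
]

print(f"median lead time: {median_lead_time_hours(sample):.1f} h")
```

The median is used rather than the mean so that one stuck release doesn't dominate the baseline.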

Findings from my org:

  • Average time to production: 2.5 days (should be hours)
  • 8 different CI/CD patterns in use
  • 60% of platform team time spent on repetitive requests
  • Developers spending ~40% of time on non-feature work

Phase 2: Quick Wins - Prove Value Fast (Weeks 5-12)

Pick ONE high-impact, low-complexity problem and solve it beautifully.

My first project: Self-service staging environments

Before:

  • File a ticket
  • Wait 1-3 days
  • Get a manually provisioned namespace
  • Manually configure DNS, secrets, databases

After:

# Developer workflow
platform create-env --name my-feature --type staging

# Behind the scenes: Terraform + ArgoCD
# - Provisions namespace with resource quotas
# - Configures DNS (my-feature.staging.company.com)
# - Deploys database (isolated schema)
# - Sets up secrets from Vault
# - Creates GitOps application
# - Ready in 3 minutes
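To make the "behind the scenes" part more concrete, here is a rough sketch of the planning step such a command might run before handing off to Terraform and ArgoCD. It is an illustration under assumptions, not the real `platform` CLI: `plan_environment`, `EnvPlan`, and the naming scheme (`<type>-<name>` namespaces, `<name>.<type>.<domain>` hostnames) are all hypothetical.

```python
import re
from dataclasses import dataclass

# RFC 1123 DNS label: the format Kubernetes requires for namespace names
DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?")


@dataclass(frozen=True)
class EnvPlan:
    namespace: str
    hostname: str
    db_schema: str
    steps: tuple[str, ...]


def plan_environment(name: str, env_type: str = "staging",
                     domain: str = "company.com") -> EnvPlan:
    """Derive every name up front so a bad input fails before anything is provisioned."""
    namespace = f"{env_type}-{name}"
    if not DNS_LABEL.fullmatch(name) or not DNS_LABEL.fullmatch(namespace):
        raise ValueError(f"{name!r} does not produce a valid namespace name")
    hostname = f"{name}.{env_type}.{domain}"
    db_schema = name.replace("-", "_")  # hyphens are awkward in SQL identifiers
    steps = (
        f"terraform: namespace {namespace} with resource quotas",
        f"terraform: DNS record {hostname}",
        f"terraform: database schema {db_schema}",
        f"vault: secrets for {namespace}",
        f"argocd: GitOps application for {namespace}",
    )
    return EnvPlan(namespace, hostname, db_schema, steps)


plan = plan_environment("my-feature")
print(plan.hostname)
for step in plan.steps:
    print(" -", step)
```

Validating names before any provisioning starts is the design choice worth copying: a failed request should cost the developer seconds, not a half-built environment.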

Impact:

  • Environment creation time: 2 days → 3 minutes
  • Developer satisfaction score: +45 points
  • Platform team requests: -70%

Lesson: One great experience beats ten mediocre features.

Phase 3: Golden Paths - Make the Right Way the Easy Way (Weeks 13-26)

Golden paths are opinionated, batteries-included workflows for common tasks.

Example: Service scaffolding

We built templates for common service types:

platform new-service \
  --name payments-api \
  --type rest-api \
  --language python \
  --database postgres

# Generated:
# - Git repo from template
# - CI/CD pipeline (GitHub Actions)
# - Kubernetes manifests (Kustomize)
# - Observability (Prometheus, Grafana, OpenTelemetry)
# - Security scanning (Trivy, SonarQube)
# - Documentation (OpenAPI spec, README)

What gets configured automatically:

  • Health check endpoints
  • Metrics exposition
  • Structured logging
  • Distributed tracing
  • Database migrations
  • Feature flags integration
  • Secrets from Vault
  • Resource limits and autoscaling

Adoption rate: 85% of new services use golden paths

Why it works:

  • Easier to use the template than start from scratch
  • Bakes in best practices by default
  • Still allows customization for edge cases
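At its core, scaffolding is template rendering plus a way to override the defaults. The sketch below shows that idea with Python's standard `string.Template`; it is not the actual `platform new-service` implementation, and `render_service`, the `SUPPORTED` set, and the two templates are hypothetical stand-ins for a much larger template repo.

```python
from string import Template
from typing import Optional

README_TEMPLATE = Template("# $name\n\nA $language $service_type owned by $owner.\n")

CATALOG_TEMPLATE = Template("""\
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: $name
spec:
  type: service
  lifecycle: experimental
  owner: $owner
""")

# Hypothetical list of combinations the golden path covers
SUPPORTED = {("rest-api", "python")}


def render_service(name: str, service_type: str, language: str, owner: str,
                   overrides: Optional[dict[str, str]] = None) -> dict[str, str]:
    """Return {path: content} for a new service; overrides replace generated files."""
    if (service_type, language) not in SUPPORTED:
        raise ValueError(f"no golden path for {language} {service_type} yet")
    values = {"name": name, "service_type": service_type,
              "language": language, "owner": owner}
    files = {
        "README.md": README_TEMPLATE.substitute(values),
        "catalog-info.yaml": CATALOG_TEMPLATE.substitute(values),
    }
    files.update(overrides or {})  # the escape hatch: callers can swap any file
    return files


files = render_service("payments-api", "rest-api", "python", "payments-team")
print(files["catalog-info.yaml"])
```

The `overrides` parameter is the "customization for edge cases" part: the template stays opinionated, but nothing it generates is untouchable.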

Phase 4: Developer Portal - Single Pane of Glass (Weeks 27-40)

We chose Backstage (Spotify’s open-source developer portal) as our foundation.

What we surfaced:

  1. Service Catalog

    • All services, libraries, and infrastructure
    • Ownership (team, on-call, Slack channel)
    • Dependencies and dependents
    • SLA/SLO commitments
  2. Documentation Hub

    • Getting started guides
    • API references (auto-generated from OpenAPI)
    • Runbooks and troubleshooting
  3. Software Templates

    • Golden path scaffolding
    • One-click service creation
  4. Tech Insights

    • Per-service scorecards
    • Security posture
    • Dependency health

Custom plugins we built:

  • Cost Dashboard: Per-service AWS/GCP spend
  • Deployment Status: Real-time view of all environments
  • On-call Integration: PagerDuty schedules and incidents
  • Compliance Checker: Security and policy violations

Integration points:

# Example: Backstage catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Handles payment processing
spec:
  type: service
  lifecycle: production
  owner: payments-team
  system: checkout
  dependsOn:
    # Entity refs follow kind:namespace/name
    - resource:default/payments-db
    - component:default/user-service
  providesApis:
    - payments-v1
  consumesApis:
    - user-v1
    - fraud-detection-v2
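Backstage entity references follow the shape `[kind:][namespace/]name`, with the namespace falling back to `default`. The sketch below parses that shape and inverts a dependency map to get the "dependents" view a catalog shows. It is a simplified illustration, not Backstage's own parser; `parse_entity_ref`, `dependents`, and the `checkout-web` service are hypothetical.

```python
from typing import NamedTuple


class EntityRef(NamedTuple):
    kind: str
    namespace: str
    name: str


def parse_entity_ref(ref: str, default_kind: str = "component",
                     default_namespace: str = "default") -> EntityRef:
    """Split '[kind:][namespace/]name' into its three parts."""
    kind, sep, rest = ref.partition(":")
    if not sep:  # no "kind:" prefix present
        kind, rest = default_kind, ref
    namespace, sep, name = rest.partition("/")
    if not sep:  # no "namespace/" prefix present
        namespace, name = default_namespace, rest
    if not (kind and namespace and name):
        raise ValueError(f"malformed entity ref: {ref!r}")
    return EntityRef(kind, namespace, name)


def dependents(depends_on: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert a 'depends on' map into a 'who depends on me' map."""
    reverse: dict[str, list[str]] = {}
    for entity, deps in depends_on.items():
        for dep in deps:
            reverse.setdefault(dep, []).append(entity)
    return reverse


print(parse_entity_ref("resource:default/payments-db"))
print(dependents({
    "payments-api": ["payments-db", "user-service"],
    "checkout-web": ["payments-api", "user-service"],  # hypothetical service
}))
```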

Phase 5: Continuous Improvement - Listen and Iterate (Ongoing)

Platform engineering is never “done.”

What we do:

  1. Weekly office hours: Developers can ask questions, demo features, give feedback
  2. Monthly developer surveys: NPS, feature requests, pain points
  3. Quarterly roadmap reviews: Share what’s coming, prioritize based on feedback
  4. Changelog and release notes: Every platform update communicated clearly

Metrics we track:

  • Platform adoption rate (% of services using golden paths)
  • Time to production (commit to live)
  • Developer satisfaction (NPS)
  • Self-service ratio (automated vs. manual requests)
  • Cognitive load (time spent on undifferentiated work)
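Two of these metrics are simple arithmetic once the raw numbers exist. NPS is the percentage of promoters (scores 9-10) minus the percentage of detractors (scores 0-6) on a 0-10 survey question, and the self-service ratio is automated requests divided by all requests. A minimal sketch, with hypothetical function names and sample data:

```python
def nps(scores: list[int]) -> int:
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("need at least one survey response")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return round(100 * (promoters - detractors) / len(scores))


def self_service_ratio(automated: int, manual: int) -> float:
    """Share of requests fulfilled without a human from the platform team."""
    total = automated + manual
    if total == 0:
        raise ValueError("no requests recorded")
    return automated / total


# Hypothetical survey responses: 6 promoters, 2 passives, 2 detractors
survey = [10, 9, 9, 8, 7, 6, 10, 9, 3, 10]
print("NPS:", nps(survey))
print("self-service ratio:", self_service_ratio(automated=46, manual=4))
```

Note that passives (7-8) count toward the total but toward neither side, which is why NPS can swing sharply on small teams.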

Common Pitfalls and How to Avoid Them

Pitfall 1: Building in Isolation

Mistake: Platform team builds what they think devs need without asking.

Solution:

  • Embed platform engineers with product teams
  • Dogfood your own platform
  • Public roadmap with developer input

Pitfall 2: The Big Bang Launch

Mistake: Build for 18 months, then unveil the “perfect platform.”

Solution:

  • Ship incrementally
  • Get feedback early and often
  • Iterate based on real usage

Pitfall 3: Too Much Abstraction

Mistake: Hide so much complexity that troubleshooting is impossible.

Solution:

  • “Escape hatches” for power users
  • Transparent abstractions (show the underlying commands)
  • Progressive disclosure (simple by default, powerful when needed)

Example:

# Simple mode (90% of use cases)
platform deploy --env production

# Power user mode (full control)
platform deploy --env production --dry-run --show-manifest
# Outputs the actual kubectl commands to run manually
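A transparent abstraction can be as simple as building the underlying commands as data, then choosing whether to print them, run them, or both. The sketch below shows one way to wire that up with `argparse`. It is not the real `platform` CLI: the `--service` flag, the `deploy/<env>` Kustomize layout, and the exact meaning given to each flag here are assumptions.

```python
import argparse
import subprocess
from typing import Callable, Optional, Sequence


def kubectl_commands(env: str, service: str) -> list[list[str]]:
    """The plain kubectl calls a deploy boils down to in this sketch."""
    return [
        ["kubectl", "apply", "-k", f"deploy/{env}", "--namespace", service],
        ["kubectl", "rollout", "status", f"deployment/{service}", "--namespace", service],
    ]


def deploy(argv: Optional[Sequence[str]] = None,
           runner: Callable[[list[str]], object] = subprocess.check_call) -> list[str]:
    """Run (or only reveal) the commands; returns the lines shown to the user."""
    parser = argparse.ArgumentParser(prog="platform deploy")
    parser.add_argument("--env", required=True)
    parser.add_argument("--service", default="payments-api")  # hypothetical flag
    parser.add_argument("--dry-run", action="store_true", help="execute nothing")
    parser.add_argument("--show-manifest", action="store_true",
                        help="print the underlying commands")
    args = parser.parse_args(argv)

    shown = []
    for command in kubectl_commands(args.env, args.service):
        if args.show_manifest:
            shown.append(" ".join(command))
        if not args.dry_run:
            runner(command)
    return shown


if __name__ == "__main__":
    for line in deploy(["--env", "production", "--dry-run", "--show-manifest"]):
        print(line)
```

Because the commands are plain lists, the simple mode and the power-user mode share one code path, so what gets printed is exactly what would have been executed.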

Pitfall 4: Treating It Like Infrastructure

Mistake: Platform team operates like a traditional ops team - reactive, ticket-driven.

Solution:

  • Act like a product team
  • Have a product manager for your platform
  • Roadmap driven by developer needs, not ops convenience

Pitfall 5: Ignoring the Long Tail

Mistake: Optimize for the most common case, ignore edge cases.

Solution:

  • 80/20 rule: Golden paths for 80%, escape hatches for 20%
  • Allow “bring your own” for special needs
  • Document when and why to diverge

Organizational Structure: Who Builds the Platform?

Team Composition

For a 100-200 developer org, I recommend:

  • 1 Product Manager (owns the platform as a product)
  • 4-6 Platform Engineers (full-stack, infrastructure-savvy)
  • 1 Developer Experience Engineer (focus on DX, docs, training)
  • 1 SRE/Ops liaison (bridge to production operations)

Skills needed:

  • Strong infrastructure as code (Terraform, Crossplane)
  • Kubernetes and cloud platforms
  • CI/CD expertise
  • Developer empathy (many of our platform engineers came from product engineering)
  • Product thinking and communication

Reporting Structure

Platform teams work best when they report to Engineering leadership, not Operations.

Why?

  • Incentives stay aligned with developer productivity
  • A product mindset takes priority over cost-cutting
  • Innovation is weighed against stability instead of stability winning by default

Interaction Model

Don’t: Be a ticketing system for infra requests

Do: Enable self-service with support

Support tiers:

  1. Self-service docs and automation (80% of needs)
  2. Office hours and Slack support (15%)
  3. Direct eng help for truly unique cases (5%)

Measuring Success: Platform KPIs

Developer Productivity Metrics

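| Metric                             | Before platform | After 12 months |
|------------------------------------|-----------------|-----------------|
| Time to first deploy (new service) | 2 weeks         | 1 day           |
| Time from commit to production     | 2.5 days        | 45 min          |
| Environment provisioning           | 2 days          | 3 min           |
| Developer time on toil             | 40%             | 15%             |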

Adoption Metrics

  • Platform usage rate: 85% of services
  • Golden path adoption: 78% of new services
  • Self-service ratio: 92% (vs. manual requests)

Satisfaction Metrics

  • Developer NPS: +62 (from +12)
  • Platform team satisfaction: +48
  • Time spent on meaningful work: +25%

Technology Choices: What We Use and Why

Developer Portal: Backstage

Why:

  • Open source, extensible
  • Plugin ecosystem
  • A CNCF project (incubating), originally open-sourced by Spotify

Alternatives considered:

  • Port (SaaS, less customizable)
  • Humanitec (more opinionated)
  • Build custom (too much effort)

Infrastructure Provisioning: Terraform + Crossplane

Terraform for foundational infrastructure:

  • VPCs, IAM, databases
  • Mature ecosystem
  • State management is well understood by the team

Crossplane for developer-facing resources:

  • Declarative and Kubernetes-native
  • Self-service via CRDs
  • GitOps-friendly

Example Crossplane claim:

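# The API group and kind come from the platform's own Crossplane definition (XRD),
# not from Crossplane itself
apiVersion: database.platform.company/v1
kind: PostgresInstance
metadata:
  name: payments-db
  namespace: payments  # claims are namespaced; this matches the service's namespace
spec:
  storageGB: 100
  instanceClass: db.r5.large
  engineVersion: "15.3"
  backupRetention: 7  # days
  encrypted: true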
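Self-service works best when the guardrails run before the claim ever reaches the cluster. The sketch below builds a claim like the one above from a few inputs and rejects values outside an allow-list. It is illustrative only: `postgres_claim`, the allowed instance classes, and the storage bounds are hypothetical, and the output is JSON (which `kubectl apply -f` also accepts) to avoid a YAML dependency.

```python
import json

# Hypothetical guardrails; a real platform would load these from config
ALLOWED_INSTANCE_CLASSES = {"db.t3.medium", "db.r5.large"}
MIN_STORAGE_GB, MAX_STORAGE_GB = 10, 500


def postgres_claim(name: str, namespace: str, storage_gb: int = 100,
                   instance_class: str = "db.r5.large") -> dict:
    """Build a PostgresInstance claim, failing fast on out-of-policy requests."""
    if instance_class not in ALLOWED_INSTANCE_CLASSES:
        raise ValueError(f"instance class {instance_class!r} is not on the allow-list")
    if not MIN_STORAGE_GB <= storage_gb <= MAX_STORAGE_GB:
        raise ValueError(f"storage must be {MIN_STORAGE_GB}-{MAX_STORAGE_GB} GB")
    return {
        "apiVersion": "database.platform.company/v1",
        "kind": "PostgresInstance",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "storageGB": storage_gb,
            "instanceClass": instance_class,
            "engineVersion": "15.3",
            "backupRetention": 7,  # not user-configurable in this sketch
            "encrypted": True,     # always on; developers cannot opt out
        },
    }


print(json.dumps(postgres_claim("payments-db", "payments"), indent=2))
```

Hard-coding encryption and backup retention while exposing size and class is the golden-path idea in miniature: developers choose what varies, the platform fixes what shouldn't.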

Deployment: ArgoCD (GitOps)

Why ArgoCD over Flux:

  • Better UI for troubleshooting
  • RBAC model fits our org
  • ApplicationSet for multi-tenant deployments

Config:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api-production
  namespace: argocd  # assumes Argo CD runs in its default namespace
spec:
  project: payments
  source:
    repoURL: https://github.com/company/payments-api
    targetRevision: main
    path: deploy/production
  destination:
    server: https://prod-cluster.company.com
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Observability: Grafana Stack

  • Prometheus for metrics
  • Loki for logs
  • Tempo for traces
  • Grafana for visualization

Pre-configured dashboards for every service.

Security: Policy as Code

OPA Gatekeeper for Kubernetes admission control:

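# Policy: All containers must have resource limits
# Requires a K8sRequiredResources ConstraintTemplate to be installed first
# (the Gatekeeper policy library ships one with this name)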
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResources
metadata:
  name: container-must-have-limits
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    limits: ["memory", "cpu"]
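Gatekeeper evaluates this rule in Rego inside the ConstraintTemplate, but the logic is small enough to sketch in a few lines, which is handy for a pre-commit check that gives developers the same answer before admission control does. The function below is a hypothetical illustration of the rule, not Gatekeeper's implementation, and it only inspects regular containers (not init containers).

```python
def missing_limits(pod: dict, required: tuple[str, ...] = ("memory", "cpu")) -> list[str]:
    """Return one violation message per container per missing resource limit."""
    violations = []
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        for resource in required:
            if resource not in limits:
                violations.append(
                    f"container {container.get('name', '?')} has no {resource} limit"
                )
    return violations


# Hypothetical pod: one compliant container, one missing its CPU limit
pod = {
    "spec": {
        "containers": [
            {"name": "api", "resources": {"limits": {"memory": "512Mi", "cpu": "500m"}}},
            {"name": "sidecar", "resources": {"limits": {"memory": "128Mi"}}},
        ]
    }
}

for message in missing_limits(pod):
    print(message)
```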

Real-World Case Study: From Chaos to Platform

Before Platform Engineering

Deployment methods in use: 15 different approaches

  • Some teams: Jenkins
  • Others: GitHub Actions
  • A few: GitLab CI
  • One team: Manual kubectl

Environment provisioning: Manual, ticket-based, 2-5 days

Observability: Each team rolled their own (or didn’t)

Security scanning: Inconsistent, mostly absent

Developer frustration: High (lots of “how do I…?” questions)

The Transformation

Months 1-3: Research and quick wins (self-service environments)

Months 4-6: Golden path templates, standardized CI/CD

Months 7-9: Backstage portal, service catalog

Months 10-12: Observability standardization, cost visibility

Months 13-18: Advanced features (policy enforcement, cost optimization, ML platform)

Results

Quantitative:

  • Deploy frequency: 2x per week → 20x per week
  • Lead time: 2.5 days → 45 minutes
  • Change failure rate: 23% → 8%
  • MTTR: 4 hours → 35 minutes
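Change failure rate and MTTR fall out of a simple deploy log: the first is failed deploys divided by all deploys, the second is the average time to restore across the failed ones. A minimal sketch follows; the `Deploy` record and the sample numbers are hypothetical, not the data behind the figures above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deploy:
    failed: bool
    minutes_to_restore: Optional[float] = None  # only set for failed deploys


def change_failure_rate(deploys: list[Deploy]) -> float:
    """Fraction of deploys that caused a failure in production."""
    if not deploys:
        raise ValueError("no deploys recorded")
    return sum(1 for d in deploys if d.failed) / len(deploys)


def mttr_minutes(deploys: list[Deploy]) -> float:
    """Mean time to restore, averaged over failed deploys with a recorded recovery."""
    times = [d.minutes_to_restore for d in deploys
             if d.failed and d.minutes_to_restore is not None]
    if not times:
        raise ValueError("no restored failures recorded")
    return sum(times) / len(times)


# Hypothetical week: 20 deploys, 2 of which failed
week = [Deploy(failed=False)] * 18 + [Deploy(True, 30), Deploy(True, 50)]
print(f"change failure rate: {change_failure_rate(week):.0%}")
print(f"MTTR: {mttr_minutes(week):.0f} min")
```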

Qualitative:

  • Developers focus on features, not infra
  • Consistent security posture
  • Easier onboarding (new devs productive in days)
  • Platform team went from firefighting to innovation

Getting Started: Your First 90 Days

Weeks 1-2: Discovery

  • Interview 15-20 developers
  • Map current deployment processes
  • Identify top 3 pain points

Weeks 3-4: Strategy

  • Define platform principles
  • Choose initial focus area (recommend: environment provisioning)
  • Get leadership buy-in and budget

Weeks 5-8: First Feature

  • Build one self-service capability
  • Make it amazing
  • Launch to friendly beta users

Weeks 9-12: Iterate and Expand

  • Gather feedback
  • Improve based on usage
  • Add second capability
  • Start building developer portal

Beyond 90 Days

  • Continuous iteration
  • Regular communication
  • Measure and improve
  • Grow team as adoption increases

Best Practices Checklist

  • Treat your platform as a product with a product manager
  • Interview developers to understand real pain points
  • Start with quick wins to prove value
  • Build golden paths that make the right way easy
  • Provide escape hatches for power users
  • Deploy a developer portal (e.g., Backstage)
  • Measure adoption, satisfaction, and productivity
  • Have regular office hours and feedback loops
  • Automate toil and repetitive requests
  • Document everything clearly
  • Celebrate wins and share success stories
  • Iterate continuously based on feedback

Final Thoughts

Platform engineering isn’t a silver bullet, and it’s definitely not easy. But after 18 months of building, iterating, and listening, I can confidently say it’s transformed how our organization ships software.

The key insight: platforms succeed when they genuinely make developers’ lives better. Not when they enforce compliance, not when they reduce costs (though both happen as side effects), but when they eliminate friction and let developers focus on what they do best - building products.

Start small, prove value quickly, and grow organically. Your platform should feel like a product your developers love, not infrastructure they tolerate.

Build platforms that empower, not constrain.

Ship with joy.