Introduction
Let’s talk about the elephant in the DevOps room: we asked developers to own their entire stack from code to production, and now they’re drowning. The “you build it, you run it” philosophy was supposed to empower teams, but instead it created a fragmented mess where every squad reinvents deployment pipelines, monitoring, and infrastructure management.
Enter platform engineering - the discipline that’s taken the industry by storm in 2024-2025. It’s not just DevOps rebranded. It’s a fundamental shift in how we think about enabling development teams: build self-service platforms that provide golden paths while allowing flexibility when needed.
I’ve spent the last 18 months building and evolving an internal developer platform (IDP) for a 200+ engineer organization. We’ve gone from 15+ different deployment methods and zero standardization to a cohesive platform that’s actually beloved by our developers (shocking, I know). In this guide, I’ll share what worked, what failed spectacularly, and the principles that separate great platforms from shelfware.
What Platform Engineering Actually Is (And Isn’t)
The Core Idea
Platform engineering is about treating infrastructure and developer tooling as a product, with your developers as the customers.
Key principles:
- Self-service by default: Developers shouldn’t need tickets to deploy, create databases, or provision environments
- Golden paths, not golden cages: Provide opinionated, easy defaults but allow customization when needed
- Developer experience first: If your platform is painful to use, it will be avoided and routed around
- Product mindset: Gather feedback, iterate, measure adoption, celebrate wins
What It’s NOT
- Not a renamed DevOps team: platform engineering builds products for developers; it doesn’t do ops on their behalf
- Not enforced standardization: You can’t just lock devs in a cage and call it a platform
- Not just Kubernetes: While K8s is often involved, the platform layer sits above infrastructure
- Not a dashboard: Building a UI over kubectl is not a platform
The Platform Engineering Stack in 2025
Modern IDPs typically include:
Developer Portal:
- Backstage, Port, or Humanitec
- Service catalog
- API documentation
- Golden path templates
Infrastructure Provisioning:
- Terraform modules with self-service wrappers
- Crossplane for declarative infrastructure
- Cloud provider abstractions
Deployment & Runtime:
- GitOps with ArgoCD or Flux
- Kubernetes (often multi-cluster)
- Service mesh for advanced routing
Observability:
- Standardized logging, metrics, tracing
- Pre-configured dashboards
- Alert templates
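To make “alert templates” concrete, here is a sketch of a Prometheus alerting rule a platform might stamp out for every service. The metric name `http_requests_total`, the service label, and the thresholds are illustrative assumptions, not values from this article:

```yaml
# Hypothetical per-service alert template; the scaffolding step
# would substitute the real service name and tuned thresholds.
groups:
  - name: payments-api-defaults
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="payments-api", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="payments-api"}[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "payments-api 5xx rate above 5% for 10 minutes"
```

Shipping a tuned default like this with every new service is what turns observability from a per-team chore into a platform guarantee.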
Security & Compliance:
- Policy as code (OPA, Kyverno)
- Automated security scanning
- Secrets management
Building Your IDP: A Practical Roadmap
Phase 1: Foundation - Understand Your Developers’ Pain (Weeks 1-4)
Don’t start by picking tools. Start by understanding what’s actually broken.
What I did:
1. Developer interviews (15-20 one-on-ones)
   - “Walk me through your last deployment”
   - “What takes longer than it should?”
   - “What do you wish just worked?”
2. Process archaeology
   - Map out every deployment pipeline variant
   - Document all the tribal knowledge and runbooks
   - Identify common failure modes
3. Metric collection
   - Time from commit to production
   - Mean time to environment provisioning
   - Frequency of “DevOps help needed” tickets
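These baseline metrics don’t need fancy tooling; a short script over exported CI metadata is enough. A minimal sketch of the lead-time calculation, assuming you can export (commit time, deploy time) pairs from your CI system:

```python
from datetime import datetime
from statistics import median

def lead_times(deployments):
    """Commit-to-production durations for a list of
    (commit_time, deploy_time) datetime pairs."""
    return [deploy - commit for commit, deploy in deployments]

def summarize(deployments):
    """Median lead time: the headline number to track over time."""
    return median(lead_times(deployments))

# Toy data: two deployments with different lead times.
deploys = [
    (datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 8, 15, 0)),   # 2.25 days
    (datetime(2025, 1, 7, 10, 0), datetime(2025, 1, 10, 4, 0)),  # 2.75 days
]
print(summarize(deploys))  # median lead time: 2.5 days
```

Even a rough script like this gives you a defensible before/after number when it’s time to justify the platform investment.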
Findings from my org:
- Average time to production: 2.5 days (should be hours)
- 8 different CI/CD patterns in use
- 60% of platform team time spent on repetitive requests
- Developers spending ~40% of time on non-feature work
Phase 2: Quick Wins - Prove Value Fast (Weeks 5-12)
Pick ONE high-impact, low-complexity problem and solve it beautifully.
My first project: Self-service staging environments
Before:
- File a ticket
- Wait 1-3 days
- Get a manually provisioned namespace
- Manually configure DNS, secrets, databases
After:
```shell
# Developer workflow
platform create-env --name my-feature --type staging

# Behind the scenes: Terraform + ArgoCD
# - Provisions namespace with resource quotas
# - Configures DNS (feature-123.staging.company.com)
# - Deploys database (isolated schema)
# - Sets up secrets from Vault
# - Creates GitOps application
# - Ready in 3 minutes
```
Impact:
- Environment creation time: 2 days → 3 minutes
- Developer satisfaction score: +45 points
- Platform team requests: -70%
Lesson: One great experience beats ten mediocre features.
Phase 3: Golden Paths - Make the Right Way the Easy Way (Weeks 13-26)
Golden paths are opinionated, batteries-included workflows for common tasks.
Example: Service scaffolding
We built templates for common service types:
```shell
platform new-service \
  --name payments-api \
  --type rest-api \
  --language python \
  --database postgres

# Generated:
# - Git repo from template
# - CI/CD pipeline (GitHub Actions)
# - Kubernetes manifests (Kustomize)
# - Observability (Prometheus, Grafana, OpenTelemetry)
# - Security scanning (Trivy, SonarQube)
# - Documentation (OpenAPI spec, README)
```
What gets configured automatically:
- Health check endpoints
- Metrics exposition
- Structured logging
- Distributed tracing
- Database migrations
- Feature flags integration
- Secrets from Vault
- Resource limits and autoscaling
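To make “health check endpoints” and “structured logging” concrete, here is a minimal sketch of what a template might bake into a Python service, using only the standard library. The endpoint path `/healthz` and the log field names are illustrative assumptions, not this platform’s actual contract:

```python
import json
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer

# Structured (JSON) logging: one parseable object per line,
# so the platform's log pipeline can index fields.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

_handler = logging.StreamHandler()
_handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments-api")
log.addHandler(_handler)
log.setLevel(logging.INFO)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Health check endpoint the platform's liveness probes would hit.
        if self.path == "/healthz":
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(b'{"status": "ok"}')
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        # Route access logs through the structured logger.
        log.info(fmt % args)

# To run it: HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```

In the real template these concerns come from shared libraries and sidecars rather than hand-written handlers, but the point stands: every service answers the same probe and emits the same log shape by default.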
Adoption rate: 78% of new services start from a golden path
Why it works:
- Easier to use the template than start from scratch
- Bakes in best practices by default
- Still allows customization for edge cases
Phase 4: Developer Portal - Single Pane of Glass (Weeks 27-40)
We chose Backstage (Spotify’s open-source developer portal) as our foundation.
What we surfaced:
1. Service Catalog
   - All services, libraries, and infrastructure
   - Ownership (team, on-call, Slack channel)
   - Dependencies and dependents
   - SLA/SLO commitments
2. Documentation Hub
   - Getting started guides
   - API references (auto-generated from OpenAPI)
   - Runbooks and troubleshooting
3. Software Templates
   - Golden path scaffolding
   - One-click service creation
4. Tech Insights
   - Per-service scorecards
   - Security posture
   - Dependency health
Custom plugins we built:
- Cost Dashboard: Per-service AWS/GCP spend
- Deployment Status: Real-time view of all environments
- On-call Integration: PagerDuty schedules and incidents
- Compliance Checker: Security and policy violations
Integration points:
```yaml
# Example: Backstage catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Handles payment processing
spec:
  type: service
  lifecycle: production
  owner: payments-team
  system: checkout
  dependsOn:
    - component:database/payments-db
    - component:service/user-service
  providesApis:
    - payments-v1
  consumesApis:
    - user-v1
    - fraud-detection-v2
```
Phase 5: Continuous Improvement - Listen and Iterate (Ongoing)
Platform engineering is never “done.”
What we do:
- Weekly office hours: Developers can ask questions, demo features, give feedback
- Monthly developer surveys: NPS score, feature requests, pain points
- Quarterly roadmap reviews: Share what’s coming, prioritize based on feedback
- Changelog and release notes: Every platform update communicated clearly
Metrics we track:
- Platform adoption rate (% of services using golden paths)
- Time to production (commit to live)
- Developer satisfaction (NPS score)
- Self-service ratio (automated vs. manual requests)
- Cognitive load (time spent on undifferentiated work)
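Two of these metrics fall out of simple counting over exported records. A toy sketch, where the record shapes (`automated`, `golden_path` flags) are assumptions about what your ticketing system and service catalog can export:

```python
def self_service_ratio(requests):
    """Share of fulfilled requests that needed no human,
    from records like {"kind": "env-create", "automated": True}."""
    automated = sum(1 for r in requests if r["automated"])
    return automated / len(requests)

def adoption_rate(services):
    """Share of services on the platform's golden paths,
    from records like {"name": "payments-api", "golden_path": True}."""
    on_path = sum(1 for s in services if s["golden_path"])
    return on_path / len(services)

# Toy data mirroring the kind of ratios worth tracking.
requests = [{"automated": True}] * 23 + [{"automated": False}] * 2
services = [{"golden_path": True}] * 17 + [{"golden_path": False}] * 3
print(self_service_ratio(requests))  # 0.92
print(adoption_rate(services))       # 0.85
```

What matters is the trend line, not the snapshot: these ratios should be computed the same way every month so movement is attributable to platform work.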
Common Pitfalls and How to Avoid Them
Pitfall 1: Building in Isolation
Mistake: Platform team builds what they think devs need without asking.
Solution:
- Embed platform engineers with product teams
- Dogfood your own platform
- Public roadmap with developer input
Pitfall 2: The Big Bang Launch
Mistake: Build for 18 months, then unveil the “perfect platform.”
Solution:
- Ship incrementally
- Get feedback early and often
- Iterate based on real usage
Pitfall 3: Too Much Abstraction
Mistake: Hide so much complexity that troubleshooting is impossible.
Solution:
- “Escape hatches” for power users
- Transparent abstractions (show the underlying commands)
- Progressive disclosure (simple by default, powerful when needed)
Example:
```shell
# Simple mode (90% of use cases)
platform deploy --env production

# Power user mode (full control)
platform deploy --env production --dry-run --show-manifest
# Outputs the actual kubectl commands to run manually
```
Pitfall 4: Treating It Like Infrastructure
Mistake: Platform team operates like a traditional ops team - reactive, ticket-driven.
Solution:
- Act like a product team
- Have a product manager for your platform
- Roadmap driven by developer needs, not ops convenience
Pitfall 5: Ignoring the Long Tail
Mistake: Optimize for the most common case, ignore edge cases.
Solution:
- 80/20 rule: Golden paths for 80%, escape hatches for 20%
- Allow “bring your own” for special needs
- Document when and why to diverge
Organizational Structure: Who Builds the Platform?
Team Composition
For a 100-200 developer org, I recommend:
- 1 Product Manager (platform as product owner)
- 4-6 Platform Engineers (full-stack, infrastructure-savvy)
- 1 Developer Experience Engineer (focus on DX, docs, training)
- 1 SRE/Ops liaison (bridge to production operations)
Skills needed:
- Strong infrastructure as code (Terraform, Crossplane)
- Kubernetes and cloud platforms
- CI/CD expertise
- Developer empathy (many came from product engineering)
- Product thinking and communication
Reporting Structure
Platform teams work best when they report to Engineering leadership, not Operations.
Why?
- Incentives aligned with developer productivity
- Product mindset over cost-cutting
- Innovation vs. stability balance
Interaction Model
Don’t: Be a ticketing system for infra requests
Do: Enable self-service with support
Support tiers:
- Self-service docs and automation (80% of needs)
- Office hours and Slack support (15%)
- Direct eng help for truly unique cases (5%)
Measuring Success: Platform KPIs
Developer Productivity Metrics
| Metric | Before Platform | After 12 Months |
|---|---|---|
| Time to first deploy (new service) | 2 weeks | 1 day |
| Time from commit to production | 2.5 days | 45 min |
| Environment provisioning | 2 days | 3 min |
| Developer time on toil | 40% | 15% |
Adoption Metrics
- Platform usage rate: 85% of services
- Golden path adoption: 78% of new services
- Self-service ratio: 92% (vs. manual requests)
Satisfaction Metrics
- Developer NPS: +62 (from +12)
- Platform team satisfaction: +48
- Time spent on meaningful work: +25%
Technology Choices: What We Use and Why
Developer Portal: Backstage
Why:
- Open source, extensible
- Plugin ecosystem
- Backed by CNCF
Alternatives considered:
- Port (SaaS, less customizable)
- Humanitec (more opinionated)
- Build custom (too much effort)
Infrastructure Provisioning: Terraform + Crossplane
Terraform for foundational infrastructure:
- VPCs, IAM, databases
- Mature ecosystem
- State management understood
Crossplane for developer-facing resources:
- Declarative K8s-native
- Self-service via CRDs
- GitOps-friendly
Example Crossplane claim:
```yaml
apiVersion: database.platform.company/v1
kind: PostgresInstance
metadata:
  name: payments-db
spec:
  storageGB: 100
  instanceClass: db.r5.large
  engineVersion: "15.3"
  backupRetention: 7
  encrypted: true
```
Deployment: ArgoCD (GitOps)
Why ArgoCD over Flux:
- Better UI for troubleshooting
- RBAC model fits our org
- ApplicationSet for multi-tenant deployments
Config:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api-production
spec:
  project: payments
  source:
    repoURL: https://github.com/company/payments-api
    targetRevision: main
    path: deploy/production
  destination:
    server: https://prod-cluster.company.com
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
Observability: Grafana Stack
- Prometheus for metrics
- Loki for logs
- Tempo for traces
- Grafana for visualization
Pre-configured dashboards for every service.
Security: Policy as Code
OPA Gatekeeper for Kubernetes admission control:
```yaml
# Policy: All containers must have resource limits
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResources
metadata:
  name: container-must-have-limits
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    limits: ["memory", "cpu"]
```
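The `K8sRequiredResources` kind is not built into Gatekeeper; it is defined by a ConstraintTemplate. A sketch of what that template might look like, with the Rego adapted from the common container-limits pattern (the package name and message wording are illustrative):

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredresources
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredResources
      validation:
        openAPIV3Schema:
          type: object
          properties:
            limits:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredresources

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          required := input.parameters.limits[_]
          not container.resources.limits[required]
          msg := sprintf("container %v has no %v limit", [container.name, required])
        }
```

Keeping the template in the platform repo and letting teams instantiate constraints is what makes policy feel like a product feature rather than a gate.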
Real-World Case Study: From Chaos to Platform
Before Platform Engineering
Deployment methods in use: 15 different approaches
- Some teams: Jenkins
- Others: GitHub Actions
- A few: GitLab CI
- One team: Manual kubectl
Environment provisioning: Manual, ticket-based, 2-5 days
Observability: Each team rolled their own (or didn’t)
Security scanning: Inconsistent, mostly absent
Developer frustration: High (lots of “how do I…?” questions)
The Transformation
Month 1-3: Research and quick wins (self-service environments)
Month 4-6: Golden path templates, standardized CI/CD
Month 7-9: Backstage portal, service catalog
Month 10-12: Observability standardization, cost visibility
Month 13-18: Advanced features (policy enforcement, cost optimization, ML platform)
Results
Quantitative:
- Deploy frequency: 2x per week → 20x per week
- Lead time: 2.5 days → 45 minutes
- Change failure rate: 23% → 8%
- MTTR: 4 hours → 35 minutes
Qualitative:
- Developers focus on features, not infra
- Consistent security posture
- Easier onboarding (new devs productive in days)
- Platform team went from firefighting to innovation
Getting Started: Your First 90 Days
Week 1-2: Discovery
- Interview 15-20 developers
- Map current deployment processes
- Identify top 3 pain points
Week 3-4: Strategy
- Define platform principles
- Choose initial focus area (recommend: environment provisioning)
- Get leadership buy-in and budget
Week 5-8: First Feature
- Build one self-service capability
- Make it amazing
- Launch to friendly beta users
Week 9-12: Iterate and Expand
- Gather feedback
- Improve based on usage
- Add second capability
- Start building developer portal
Beyond 90 Days
- Continuous iteration
- Regular communication
- Measure and improve
- Grow team as adoption increases
Best Practices Checklist
- Treat your platform as a product with a product manager
- Interview developers to understand real pain points
- Start with quick wins to prove value
- Build golden paths that make the right way easy
- Provide escape hatches for power users
- Deploy a developer portal (e.g., Backstage)
- Measure adoption, satisfaction, and productivity
- Have regular office hours and feedback loops
- Automate toil and repetitive requests
- Document everything clearly
- Celebrate wins and share success stories
- Iterate continuously based on feedback
Resources & Further Learning
- Backstage Documentation
- Crossplane for Infrastructure
- Team Topologies Book (platform team structure)
- CNCF Platform Engineering Maturity Model
- Humanitec’s Platform Engineering Guide
Final Thoughts
Platform engineering isn’t a silver bullet, and it’s definitely not easy. But after 18 months of building, iterating, and listening, I can confidently say it’s transformed how our organization ships software.
The key insight: platforms succeed when they genuinely make developers’ lives better. Not when they enforce compliance, not when they reduce costs (though both happen as side effects), but when they eliminate friction and let developers focus on what they do best - building products.
Start small, prove value quickly, and grow organically. Your platform should feel like a product your developers love, not infrastructure they tolerate.
Build platforms that empower, not constrain.
Ship with joy.