Introduction
Let’s talk about the elephant in the DevOps room: we asked developers to own their entire stack from code to production, and now they’re drowning. The “you build it, you run it” philosophy was supposed to empower teams, but instead it created a fragmented mess where every squad reinvents deployment pipelines, monitoring, and infrastructure management.
Enter platform engineering - the discipline that’s taken the industry by storm in 2024-2025. It’s not just DevOps rebranded. It’s a fundamental shift in how we think about enabling development teams: build self-service platforms that provide golden paths while allowing flexibility when needed.
I’ve spent the last 18 months building and evolving an internal developer platform (IDP) for a 200+ engineer organization. We’ve gone from 15+ different deployment methods and zero standardization to a cohesive platform that’s actually beloved by our developers (shocking, I know). In this guide, I’ll share what worked, what failed spectacularly, and the principles that separate great platforms from shelfware.
What Platform Engineering Actually Is (And Isn’t)
The Core Idea
Platform engineering is about treating infrastructure and developer tooling as a product, with your developers as the customers.
Key principles:
- Self-service by default: Developers shouldn’t need tickets to deploy, create databases, or provision environments
- Golden paths, not golden cages: Provide opinionated, easy defaults but allow customization when needed
- Developer experience first: If your platform is painful to use, it will be avoided and routed around
- Product mindset: Gather feedback, iterate, measure adoption, celebrate wins
What It’s NOT
- Not a renamed DevOps team: platform engineering builds products for developers; it doesn’t do ops on their behalf
- Not enforced standardization: You can’t just lock devs in a cage and call it a platform
- Not just Kubernetes: While K8s is often involved, the platform layer sits above infrastructure
- Not a dashboard: Building a UI over kubectl is not a platform
The Platform Engineering Stack in 2025
Modern IDPs typically include:
Developer Portal:
- Backstage, Port, or Humanitec
- Service catalog
- API documentation
- Golden path templates
Infrastructure Provisioning:
- Terraform modules with self-service wrappers
- Crossplane for declarative infrastructure
- Cloud provider abstractions
Deployment & Runtime:
- GitOps with ArgoCD or Flux
- Kubernetes (often multi-cluster)
- Service mesh for advanced routing
Observability:
- Standardized logging, metrics, tracing
- Pre-configured dashboards
- Alert templates
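To make “alert templates” concrete, here is a sketch of a Prometheus alerting rule a platform might stamp out for every service. The metric name `http_requests_total`, the service label, and the thresholds are illustrative assumptions, not values from this article:

```yaml
# Hypothetical per-service alert template; the scaffolding step
# would substitute the real service name and tuned thresholds.
groups:
  - name: payments-api-defaults
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{service="payments-api", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="payments-api"}[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "payments-api 5xx rate above 5% for 10 minutes"
```

Shipping a tuned default like this with every new service is what turns observability from a per-team chore into a platform guarantee.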
Security & Compliance:
- Policy as code (OPA, Kyverno)
- Automated security scanning
- Secrets management
Building Your IDP: A Practical Roadmap
Phase 1: Foundation - Understand Your Developers’ Pain (Weeks 1-4)
Don’t start by picking tools. Start by understanding what’s actually broken.
What I did:
1. Developer interviews (15-20 one-on-ones)
   - “Walk me through your last deployment”
   - “What takes longer than it should?”
   - “What do you wish just worked?”
2. Process archaeology
   - Map out every deployment pipeline variant
   - Document all the tribal knowledge and runbooks
   - Identify common failure modes
3. Metric collection
   - Time from commit to production
   - Mean time to environment provisioning
   - Frequency of “DevOps help needed” tickets
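These baseline metrics don’t need fancy tooling; a short script over exported CI metadata is enough. A minimal sketch of the lead-time calculation, assuming you can export (commit time, deploy time) pairs from your CI system:

```python
from datetime import datetime
from statistics import median

def lead_times(deployments):
    """Commit-to-production durations for a list of
    (commit_time, deploy_time) datetime pairs."""
    return [deploy - commit for commit, deploy in deployments]

def summarize(deployments):
    """Median lead time: the headline number to track over time."""
    return median(lead_times(deployments))

# Toy data: two deployments with different lead times.
deploys = [
    (datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 8, 15, 0)),   # 2.25 days
    (datetime(2025, 1, 7, 10, 0), datetime(2025, 1, 10, 4, 0)),  # 2.75 days
]
print(summarize(deploys))  # median lead time: 2.5 days
```

Even a rough script like this gives you a defensible before/after number when it’s time to justify the platform investment.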
Findings from my org:
- Average time to production: 2.5 days (should be hours)
- 8 different CI/CD patterns in use
- 60% of platform team time spent on repetitive requests
- Developers spending ~40% of time on non-feature work
Phase 2: Quick Wins - Prove Value Fast (Weeks 5-12)
Pick ONE high-impact, low-complexity problem and solve it beautifully.
My first project: Self-service staging environments
Before:
- File a ticket
- Wait 1-3 days
- Get a manually provisioned namespace
- Manually configure DNS, secrets, databases
After:
```shell
# Developer workflow
platform create-env --name my-feature --type staging

# Behind the scenes: Terraform + ArgoCD
# - Provisions namespace with resource quotas
# - Configures DNS (feature-123.staging.company.com)
# - Deploys database (isolated schema)
# - Sets up secrets from Vault
# - Creates GitOps application
# - Ready in 3 minutes
```
Impact:
- Environment creation time: 2 days → 3 minutes
- Developer satisfaction score: +45 points
- Platform team requests: -70%
Lesson: One great experience beats ten mediocre features.
Phase 3: Golden Paths - Make the Right Way the Easy Way (Weeks 13-26)
Golden paths are opinionated, batteries-included workflows for common tasks.
Example: Service scaffolding
We built templates for common service types:
```shell
platform new-service \
  --name payments-api \
  --type rest-api \
  --language python \
  --database postgres

# Generated:
# - Git repo from template
# - CI/CD pipeline (GitHub Actions)
# - Kubernetes manifests (Kustomize)
# - Observability (Prometheus, Grafana, OpenTelemetry)
# - Security scanning (Trivy, SonarQube)
# - Documentation (OpenAPI spec, README)
```
What gets configured automatically:
- Health check endpoints
- Metrics exposition
- Structured logging
- Distributed tracing
- Database migrations
- Feature flags integration
- Secrets from Vault
- Resource limits and autoscaling
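To make “health check endpoints” and “structured logging” concrete, here is a minimal sketch of what a template might bake into a Python service, using only the standard library. The endpoint path `/healthz` and the log field names are illustrative assumptions, not this platform’s actual contract:

```python
import json
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer

# Structured (JSON) logging: one parseable object per line,
# so the platform's log pipeline can index fields.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

_handler = logging.StreamHandler()
_handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments-api")
log.addHandler(_handler)
log.setLevel(logging.INFO)

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Health check endpoint the platform's liveness probes would hit.
        if self.path == "/healthz":
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(b'{"status": "ok"}')
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        # Route access logs through the structured logger.
        log.info(fmt % args)

# To run it: HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```

In the real template these concerns come from shared libraries and sidecars rather than hand-written handlers, but the point stands: every service answers the same probe and emits the same log shape by default.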
Adoption rate: 78% of new services start from a golden path
Why it works:
- Easier to use the template than start from scratch
- Bakes in best practices by default
- Still allows customization for edge cases
Phase 4: Developer Portal - Single Pane of Glass (Weeks 27-40)
We chose Backstage (Spotify’s open-source developer portal) as our foundation.
What we surfaced:
1. Service Catalog
   - All services, libraries, and infrastructure
   - Ownership (team, on-call, Slack channel)
   - Dependencies and dependents
   - SLA/SLO commitments
2. Documentation Hub
   - Getting started guides
   - API references (auto-generated from OpenAPI)
   - Runbooks and troubleshooting
3. Software Templates
   - Golden path scaffolding
   - One-click service creation
4. Tech Insights
   - Per-service scorecards
   - Security posture
   - Dependency health
Custom plugins we built:
- Cost Dashboard: Per-service AWS/GCP spend
- Deployment Status: Real-time view of all environments
- On-call Integration: PagerDuty schedules and incidents
- Compliance Checker: Security and policy violations
Integration points:
```yaml
# Example: Backstage catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-api
  description: Handles payment processing
spec:
  type: service
  lifecycle: production
  owner: payments-team
  system: checkout
  dependsOn:
    - component:database/payments-db
    - component:service/user-service
  providesApis:
    - payments-v1
  consumesApis:
    - user-v1
    - fraud-detection-v2
```
Phase 5: Continuous Improvement - Listen and Iterate (Ongoing)
Platform engineering is never “done.”
What we do:
- Weekly office hours: Developers can ask questions, demo features, give feedback
- Monthly developer surveys: NPS score, feature requests, pain points
- Quarterly roadmap reviews: Share what’s coming, prioritize based on feedback
- Changelog and release notes: Every platform update communicated clearly
Metrics we track:
- Platform adoption rate (% of services using golden paths)
- Time to production (commit to live)
- Developer satisfaction (NPS score)
- Self-service ratio (automated vs. manual requests)
- Cognitive load (time spent on undifferentiated work)
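Two of these metrics fall out of simple counting over exported records. A toy sketch, where the record shapes (`automated`, `golden_path` flags) are assumptions about what your ticketing system and service catalog can export:

```python
def self_service_ratio(requests):
    """Share of fulfilled requests that needed no human,
    from records like {"kind": "env-create", "automated": True}."""
    automated = sum(1 for r in requests if r["automated"])
    return automated / len(requests)

def adoption_rate(services):
    """Share of services on the platform's golden paths,
    from records like {"name": "payments-api", "golden_path": True}."""
    on_path = sum(1 for s in services if s["golden_path"])
    return on_path / len(services)

# Toy data mirroring the kind of ratios worth tracking.
requests = [{"automated": True}] * 23 + [{"automated": False}] * 2
services = [{"golden_path": True}] * 17 + [{"golden_path": False}] * 3
print(self_service_ratio(requests))  # 0.92
print(adoption_rate(services))       # 0.85
```

What matters is the trend line, not the snapshot: these ratios should be computed the same way every month so movement is attributable to platform work.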
Common Pitfalls and How to Avoid Them
Pitfall 1: Building in Isolation
Mistake: Platform team builds what they think devs need without asking.
Solution:
- Embed platform engineers with product teams
- Dogfood your own platform
- Public roadmap with developer input
Pitfall 2: The Big Bang Launch
Mistake: Build for 18 months, then unveil the “perfect platform.”
Solution:
- Ship incrementally
- Get feedback early and often
- Iterate based on real usage
Pitfall 3: Too Much Abstraction
Mistake: Hide so much complexity that troubleshooting is impossible.
Solution:
- “Escape hatches” for power users
- Transparent abstractions (show the underlying commands)
- Progressive disclosure (simple by default, powerful when needed)
Example:
```shell
# Simple mode (90% of use cases)
platform deploy --env production

# Power user mode (full control)
platform deploy --env production --dry-run --show-manifest
# Outputs the actual kubectl commands to run manually
```
Pitfall 4: Treating It Like Infrastructure
Mistake: Platform team operates like a traditional ops team - reactive, ticket-driven.
Solution:
- Act like a product team
- Have a product manager for your platform
- Roadmap driven by developer needs, not ops convenience
Pitfall 5: Ignoring the Long Tail
Mistake: Optimize for the most common case, ignore edge cases.
Solution:
- 80/20 rule: Golden paths for 80%, escape hatches for 20%
- Allow “bring your own” for special needs
- Document when and why to diverge
Organizational Structure: Who Builds the Platform?
Team Composition
For a 100-200 developer org, I recommend:
- 1 Product Manager (platform as product owner)
- 4-6 Platform Engineers (full-stack, infrastructure-savvy)
- 1 Developer Experience Engineer (focus on DX, docs, training)
- 1 SRE/Ops liaison (bridge to production operations)
Skills needed:
- Strong infrastructure as code (Terraform, Crossplane)
- Kubernetes and cloud platforms
- CI/CD expertise
- Developer empathy (many came from product engineering)
- Product thinking and communication
Reporting Structure
Platform teams work best when they report to Engineering leadership, not Operations.
Why?
- Incentives aligned with developer productivity
- Product mindset over cost-cutting
- Innovation vs. stability balance
Interaction Model
Don’t: Be a ticketing system for infra requests
Do: Enable self-service with support
Support tiers:
- Self-service docs and automation (80% of needs)
- Office hours and Slack support (15%)
- Direct eng help for truly unique cases (5%)
Measuring Success: Platform KPIs
Developer Productivity Metrics
| Metric | Before Platform | After 12 Months |
|---|---|---|
| Time to first deploy (new service) | 2 weeks | 1 day |
| Time from commit to production | 2.5 days | 45 min |
| Environment provisioning | 2 days | 3 min |
| Developer time on toil | 40% | 15% |
Adoption Metrics
- Platform usage rate: 85% of services
- Golden path adoption: 78% of new services
- Self-service ratio: 92% (vs. manual requests)
Satisfaction Metrics
- Developer NPS: +62 (from +12)
- Platform team satisfaction: +48
- Time spent on meaningful work: +25%
Technology Choices: What We Use and Why
Developer Portal: Backstage
Why:
- Open source, extensible
- Plugin ecosystem
- Backed by CNCF
Alternatives considered:
- Port (SaaS, less customizable)
- Humanitec (more opinionated)
- Build custom (too much effort)
Infrastructure Provisioning: Terraform + Crossplane
Terraform for foundational infrastructure:
- VPCs, IAM, databases
- Mature ecosystem
- State management understood
Crossplane for developer-facing resources:
- Declarative K8s-native
- Self-service via CRDs
- GitOps-friendly
Example Crossplane claim:
```yaml
apiVersion: database.platform.company/v1
kind: PostgresInstance
metadata:
  name: payments-db
spec:
  storageGB: 100
  instanceClass: db.r5.large
  engineVersion: "15.3"
  backupRetention: 7
  encrypted: true
```
Deployment: ArgoCD (GitOps)
Why ArgoCD over Flux:
- Better UI for troubleshooting
- RBAC model fits our org
- ApplicationSet for multi-tenant deployments
Config:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api-production
spec:
  project: payments
  source:
    repoURL: https://github.com/company/payments-api
    targetRevision: main
    path: deploy/production
  destination:
    server: https://prod-cluster.company.com
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
Observability: Grafana Stack
- Prometheus for metrics
- Loki for logs
- Tempo for traces
- Grafana for visualization
Pre-configured dashboards for every service.
Security: Policy as Code
OPA Gatekeeper for Kubernetes admission control:
```yaml
# Policy: All containers must have resource limits
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredResources
metadata:
  name: container-must-have-limits
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    limits: ["memory", "cpu"]
```
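The `K8sRequiredResources` kind is not built into Gatekeeper; it is defined by a ConstraintTemplate. A sketch of what that template might look like, with the Rego adapted from the common container-limits pattern (the package name and message wording are illustrative):

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredresources
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredResources
      validation:
        openAPIV3Schema:
          type: object
          properties:
            limits:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredresources

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          required := input.parameters.limits[_]
          not container.resources.limits[required]
          msg := sprintf("container %v has no %v limit", [container.name, required])
        }
```

Keeping the template in the platform repo and letting teams instantiate constraints is what makes policy feel like a product feature rather than a gate.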
Real-World Case Study: From Chaos to Platform
Before Platform Engineering
Deployment methods in use: 15 different approaches
- Some teams: Jenkins
- Others: GitHub Actions
- A few: GitLab CI
- One team: Manual kubectl
Environment provisioning: Manual, ticket-based, 2-5 days
Observability: Each team rolled their own (or didn’t)
Security scanning: Inconsistent, mostly absent
Developer frustration: High (lots of “how do I…?” questions)
The Transformation
Month 1-3: Research and quick wins (self-service environments)
Month 4-6: Golden path templates, standardized CI/CD
Month 7-9: Backstage portal, service catalog
Month 10-12: Observability standardization, cost visibility
Month 13-18: Advanced features (policy enforcement, cost optimization, ML platform)
Results
Quantitative:
- Deploy frequency: 2x per week → 20x per week
- Lead time: 2.5 days → 45 minutes
- Change failure rate: 23% → 8%
- MTTR: 4 hours → 35 minutes
Qualitative:
- Developers focus on features, not infra
- Consistent security posture
- Easier onboarding (new devs productive in days)
- Platform team went from firefighting to innovation
Getting Started: Your First 90 Days
Week 1-2: Discovery
- Interview 15-20 developers
- Map current deployment processes
- Identify top 3 pain points
Week 3-4: Strategy
- Define platform principles
- Choose initial focus area (recommend: environment provisioning)
- Get leadership buy-in and budget
Week 5-8: First Feature
- Build one self-service capability
- Make it amazing
- Launch to friendly beta users
Week 9-12: Iterate and Expand
- Gather feedback
- Improve based on usage
- Add second capability
- Start building developer portal
Beyond 90 Days
- Continuous iteration
- Regular communication
- Measure and improve
- Grow team as adoption increases
Best Practices Checklist
- Treat your platform as a product with a product manager
- Interview developers to understand real pain points
- Start with quick wins to prove value
- Build golden paths that make the right way easy
- Provide escape hatches for power users
- Deploy a developer portal (e.g., Backstage)
- Measure adoption, satisfaction, and productivity
- Have regular office hours and feedback loops
- Automate toil and repetitive requests
- Document everything clearly
- Celebrate wins and share success stories
- Iterate continuously based on feedback
Resources & Further Learning
- Backstage Documentation
- Crossplane for Infrastructure
- Team Topologies Book (platform team structure)
- CNCF Platform Engineering Maturity Model
- Humanitec’s Platform Engineering Guide
Final Thoughts
Platform engineering isn’t a silver bullet, and it’s definitely not easy. But after 18 months of building, iterating, and listening, I can confidently say it’s transformed how our organization ships software.
The key insight: platforms succeed when they genuinely make developers’ lives better. Not when they enforce compliance, not when they reduce costs (though both happen as side effects), but when they eliminate friction and let developers focus on what they do best - building products.
Start small, prove value quickly, and grow organically. Your platform should feel like a product your developers love, not infrastructure they tolerate.
Build platforms that empower, not constrain.
Ship with joy.