Introduction
You build it, you run it.
Marta, engineering director at Mountain Lab, arrives at work one Monday morning. Coffee still in hand, she opens her laptop and sees three Slack messages waiting. One from Diego, her CTO. Another from Javi, the tech lead of the infrastructure team. And one more from the Payments team, who have been waiting two weeks for a staging environment.
She opens Diego’s message first.
“Good morning, Marta! We need to talk about the infrastructure team. Work is piling up and they’ve become a huge bottleneck. They’re operating production services they know nothing about, and they can’t keep up with everything the product teams are developing. With the funding round coming in Q2, we want to set a goal of tripling our feature velocity. We need an organization that scales better.”
Marta sighs. Damn, she knew it. She’s been seeing the signs for months.
The infrastructure team started with three people four years ago, when Mountain Lab had barely twenty people. Now the company has almost a hundred developers organized into twelve product teams, and the infrastructure team is still six people. Each week they receive between fifteen and twenty tickets: new environments, configuration changes, permission adjustments, incident investigation, critical dependency updates. Response times have gone from hours to weeks.
She reads Javi’s message.
“Marta, we need to talk. The team is burned out. Yesterday Carlos spent four hours debugging a problem in the recommendations service, code he’s never seen. Turns out the data team had changed the Python version without telling us. Meanwhile, ten more tickets piled up. We can’t continue like this. Either we hire ten more people, or we need to rethink how we’re organized. This is unsustainable.”
Ten more people! Marta does the mental math: with onboarding time, it would take six months before they’re productive. And by then, the company will have grown even more. It’s a losing race.
She returns to Diego’s message and responds.
“Hi! Sure, let’s talk. What do you have in mind?”
The response comes immediately.
“I’ve been reading about DevOps, SRE and Platform Engineering. Companies like Spotify and Netflix have solved this by building internal platforms that allow teams to self-manage. Developers can deploy their services without depending on a centralized team. I think we need some kind of self-service model like that. Could you look into it? Talk to Javi, I think he’s also been researching and has some ideas.”
Marta smiles. She’s been mulling over this exact thing for weeks too.
“Of course! I recently saw a talk by Kelsey Hightower about the evolution from DevOps to Platform Engineering. The theory makes sense to me. But I’m almost more worried about the cultural change. Give me a week to analyze the situation and prepare a proposal.”
She closes Slack and opens her text editor. She begins to write.
Product teams have been waiting weeks for things that should be trivial. The Payments team has been waiting fourteen days for a staging environment. Fourteen fucking days for an environment. While they wait, code accumulates untested and they’re losing momentum. There’s a lot of pressure to launch the new payment system. Let’s see how they explain to Ana, the CEO, that there are delays due to not having machines to set up an environment, after all the money they’re investing.
And then there’s the mess of operating services you don’t know. She remembers last week’s 3 AM call. The recommendations service down. Carlos trying to debug a system he’d never seen, waking up Laura from the data team to ask her what the hell that service did. “It’s the recommendations engine, it uses product embeddings…”, Laura explained half asleep. But Carlos needed to know more basic things: “what database does it use? What are its dependencies? How do you restart it correctly?” Questions that Laura answered in five minutes, but which cost thirty minutes of downtime and pissed-off customers on the other side of the Atlantic.
Knowledge is locked inside individual engineers’ heads and outdated runbooks. In truth, only three people really know how to operate the longest-running services properly. Marta checks the calendar. Javi is going on vacation in two weeks. Two full weeks without him. Who the hell is going to be able to make that critical network configuration change that only he knows how to do? The runbooks that should document these processes have been obsolete for years. Nobody has time to update them when they’re constantly putting out fires.
And the inconsistency. Marta opens the infrastructure catalog. What chaos! The Integrations team uses Java, Ubuntu 20.04 and PostgreSQL 13. The Payments team uses C#, Windows Server 2022 and SQL Server 2019. Some services have monitoring dashboards configured, others don’t even output logs. Each team requested their infrastructure at different times, from different people on the infrastructure team, without any criteria. The result: a heterogeneous disaster where each change is unique because nothing is standard.
She thinks about Javi and his team. Three months of almost permanent on-call. Putting out fires they don’t understand. Doing manual and repetitive work that they know should be automated. “As soon as I have a moment, I’ll automate it,” Javi says every week. But that moment never comes. The next fire has been burning for days.
Marta takes a deep breath. The solution isn’t to hire ten more people for infrastructure. We need to change how the organization works. We need development teams to self-manage, always within security and infrastructure standardization limits. They have to take on part of the responsibility for both provisioning and maintaining production services. We need the infrastructure team to stop being a bottleneck and become an enabler. To act more like a best practices team and be in charge of simplifying developers’ lives. To be there to provide support, rather than to do the work for them.
But where do we start? DevOps? SRE? Platform Engineering? What’s the difference? What are the pros and cons?
She has one week to find the answer.
This situation surely sounds familiar. If we change the names and profiles of the people mentioned, it almost certainly fits some company you’ve worked at. I’m not clairvoyant, but this situation is quite common, a product of both the rapid evolution of technologies and the way companies work. We think everything has already been invented, yet every day new tools and new practices appear that force us to rethink how we do things. Often they’re passing fads, but other times they’re here to stay.
When a company is small, it’s easy to manage everything that happens. The architecture is simple and you’re aware of what other teams are doing. A centralized team responsible for deploying and operating services in production won’t have problems running those services. But as the company grows, system complexity increases exponentially, and communication between teams becomes more difficult. Development teams want to move fast, while the operations team wants stability. It also becomes more difficult to train new operations team members, due to the magnitude and nuances of the systems they have to learn to manage.
Over the years, as technology has evolved, we’ve been adopting new ways of working. Always seeking to bring development and operations closer together. This fusion of responsibilities has led to the creation of new roles and practices. In this chapter, we’re going to review the historical context that has brought us here, to the need to write this book you have in your hands. This will help us understand why Platform Engineering has become such a popular discipline. But I doubt it will be the last, given the rise of AI and generative models.
The Classic Sysadmin Era
Since the beginnings of computing, installing and operating systems has been a fundamental part of software development. In the early days, developers were responsible for everything, from writing code to deploying and maintaining it in production. As applications grew in complexity, specialized roles began to emerge, such as system administrators and operators. Curiously, my first professional title was a “Degree in Computer Systems Administration,” 25 years ago. At that time, there were two main groups of IT professionals: developers, who wrote the code, and sysadmins (system administrators), who managed servers and infrastructure. But this segregation goes back much further, to the era when mainframes dominated the technology landscape.
This working model is manual and reactive. The typical scenario can be summarized like this:
- A developer needs a server for their applications
- They open a ticket, or send an email, to the sysadmins, with a project code to charge the cost to, a description of the server’s intended use, and their manager’s approval
- The sysadmins review the ticket. If a server happens to be free, they’re in luck; otherwise they have to wait for a new one to be purchased, which can take weeks
- The sysadmins physically install the hardware in the data center rack
- The sysadmins manually install the operating system from CDs or images
- The sysadmins manually configure networks, firewalls, storage
- The sysadmins install necessary dependencies (Java, Apache, databases) following documentation that’s probably outdated
- The sysadmins deliver credentials and access to the developer so they can install and configure their application
I experienced this firsthand in the 2000s, having to plan capacity upfront and ask the IT department to allocate servers for projects. The process easily took 2-4 weeks in the best case. Does it seem inefficient and frustrating? Because it is. And the worst part is that some companies still operate exactly like this in 2025. Especially in traditional sectors with old legacy infrastructure. But fortunately, there are fewer and fewer, because this model has systemic problems:
- Slowness: Every change requires coordinated human intervention and bureaucracy. Something as simple as increasing a server’s memory can take days of approvals and maintenance windows.
- Inconsistency: It’s common for configurations to vary between environments. This is due to manual changes, human error, and outdated documentation. Sysadmins do what they’re asked with the documentation they’ve been provided. But without the proper context, they’ll rarely question what they’re asked to do.
- Rigidity: As organizations grow, the operations team becomes a bottleneck. Hiring more sysadmins only solves the problem temporarily, because team capacity doesn’t scale linearly with headcount.
- Knowledge silos: Documentation exists in theory, but in practice it lives in sysadmins’ heads. It’s common for documentation to become outdated the instant it’s written, especially in teams with heavy workloads and little discipline.
- Cultural silos: Developers and operators work in separate worlds, often with contradictory goals. Developers want to move fast and operators want stability. Different teams don’t necessarily have to be managed the same way.
That said, this model does offer some advantages: highly specialized teams with deep infrastructure knowledge, tight control over security and compliance, and expertise in managing specialized hardware. However, these benefits come at the cost of agility. The question becomes whether there’s a better way to balance specialization with speed. We don’t want to burden developers (yet) with the full responsibility of operating their applications, but we do want to close that gap between development and operations. Developers need to gain visibility into how their applications run in production.
Site Reliability Engineering
While the rest of the industry was dealing with traditional operations problems, Google was facing these challenges at a different scale. In the early 2000s, Google was growing exponentially. The traditional sysadmin model simply couldn’t scale to manage thousands of servers and critical services serving millions of simultaneous users.
In 2003, Ben Treynor Sloss, then VP of engineering at Google, posed a question: what if we treat operations as a software problem? From his proposal came Site Reliability Engineering (SRE) teams. Instead of hiring more sysadmins to manage more servers, Google hired software engineers and asked them to automate operational work. The proposal was radical. SREs should dedicate 50% of their time to operational work (incidents, manual deployments, housekeeping) and the other 50% to writing code to automate it.
SRE is based on several fundamental principles:
- Automation: If you do something twice, automate it the third time. If a process requires manual intervention more than five times a year, automate it completely.
- Engineering before operations: SREs are software engineers who specialize in reliability. They write code to solve operational problems.
- Service Level Objectives (SLOs): Instead of pursuing “100% uptime” –which is impossible–, define quantifiable objectives based on what users actually need. For example, 99.9% uptime means 43 minutes of allowed downtime per month.
- Error budgets: If your SLO is 99.9%, you have a 0.1% “error budget”. As long as you’re within budget, the development team can work more on new features and take more risks. If you exceed that budget, changes are frozen until normalcy is restored.
- Blameless postmortems: When something fails –and at some point it will fail–, analyze what failed in the system and why, not who was to blame.
- Toil reduction: “Toil” is manual, repetitive operational work with no lasting value. What nobody likes to do. SREs identify what it is and automate everything possible to eradicate it.
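The SLO and error-budget arithmetic behind these principles is simple enough to sketch in a few lines. This is a minimal illustration (the function names are invented for this example, not part of any SRE tooling); the numbers match the 99.9% example above:

```python
# Error-budget arithmetic for an availability SLO.
# A 99.9% SLO over a 30-day window allows 0.1% downtime.

def downtime_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (minutes) for a given SLO over a time window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)

def budget_remaining(slo: float, downtime_so_far_min: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means frozen)."""
    budget = downtime_budget_minutes(slo, window_days)
    return (budget - downtime_so_far_min) / budget

print(downtime_budget_minutes(0.999))   # → ~43.2 minutes per 30-day month
print(budget_remaining(0.999, 10.0))    # → ~0.77: most of the budget is left
```

Under this model, a team that has already burned 10 minutes of downtime this month still has roughly 77% of its budget and can keep shipping risky changes; once the remaining fraction goes negative, changes are frozen.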
Google published a book [Beyer, 2016], sharing these practices with the world. Companies of all types adopted –and continue to adopt– SRE as a discipline, because it’s proven that treating operations as an engineering problem scales better.
However, SRE adoption comes with challenges. The discipline requires a higher skills bar. SREs need to be software engineers who also understand operations deeply. This makes hiring difficult. Additionally, SRE has a specific focus: reliability and availability of services in production. It focuses on keeping systems running, not necessarily on improving the developer experience to produce better software. Let’s say SRE is a model that moves from operations toward development, not the other way around. Friction between the two disciplines appears precisely where SRE meets the development teams. While development pushes changes, SRE has to ensure those changes don’t compromise SLOs. We’re back to having two teams with potentially opposing interests. One wants to move fast, the other wants stability. So one proposal is to embed some SRE practices within development teams. That is, continue closing the gap between development and operations.
For deeper understanding of SRE principles and practices, see [Beyer, 2016][Beyer, 2018][Adkins, 2020] in Appendix A.
The DevOps Culture
While Google was developing SRE internally –and even secretly for years–, another movement was emerging in the community: DevOps. The term was coined by Patrick Debois in 2009, and popularized during the first “DevOpsDays” in October of the same year, in Ghent, Belgium.
DevOps is not a tool or a specific technology. It’s a cultural revolution and philosophical response to the dysfunctional silos between development and operations. The goal is to eliminate the barriers between these two worlds, and different organizations have implemented this in different ways:
- Cross-functional teams: Development teams take on operational responsibilities for their services. There’s no separate operations team. This is the model Amazon pioneered under Werner Vogels’ famous principle: developers who build the service are also responsible for operating it in production, including on-call duties. Netflix follows a similar approach.
- Embedded operations: Operations engineers are embedded within development teams or work very closely with them, but maintain their specialized expertise. Teams collaborate from day one on architecture, deployment, and operations. Much as QA engineers work embedded with development teams nowadays.
- Centralized platform teams: Which we will discuss in the next section about Platform Engineering.
Regardless of the specific implementation, DevOps seeks to break down the traditional wall between development and operations. No more “throwing over the wall.”
The fundamental principles of DevOps are:
- Shared responsibility: Teams that write the code are responsible for maintaining it in production. Gone is the “developers code, operations suffers the result” model.
- Collaboration: Whether teams are merged or separate, development and operations work together from the beginning. Architecture decisions consider both functionality and operability.
- Automation: Not just scripts, but complete CI/CD pipelines. From commit to production should be an automated, reproducible, and auditable process.
- Observability: Instrument everything. Metrics, logs, traces. Learn from data, not assumptions. Fast feedback on performance and availability will allow us to improve continuously.
- Experimentation: Foster a culture of controlled experiments. Failing is acceptable if you learn and improve the system. Incidents are learning opportunities, not reasons for punishment.
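As a toy illustration of the “instrument everything” principle above, here is a minimal Python sketch. Everything in it –the `track` decorator, the in-memory `METRICS` store– is invented for illustration; a real system would emit these numbers to a metrics backend such as a Prometheus or OpenTelemetry client, not a dictionary:

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory metrics store: a stand-in for a real metrics backend.
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_secs": 0.0})

def track(fn):
    """Decorator that records call counts, errors, and latency per function."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            METRICS[fn.__name__]["errors"] += 1
            raise
        finally:
            m = METRICS[fn.__name__]
            m["calls"] += 1
            m["total_secs"] += time.perf_counter() - start
    return wrapper

@track
def handle_request(ok: bool = True) -> str:
    if not ok:
        raise RuntimeError("boom")
    return "200 OK"

handle_request()
try:
    handle_request(ok=False)
except RuntimeError:
    pass

print(METRICS["handle_request"]["calls"])   # → 2
print(METRICS["handle_request"]["errors"])  # → 1
```

The point is not the mechanism but the habit: every service call leaves behind data (counts, errors, latency) that the team can learn from, instead of guessing.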
The emergence of virtualization and the public cloud was the perfect catalyst for DevOps:
- AWS EC2 (2006): On-demand infrastructure via APIs
- GitHub (2008): Collaboration on code and workflows
- Docker (2013): Containers that package applications with all their dependencies
- Terraform (2014): Infrastructure as code
- Kubernetes (2014): Container orchestration at scale
With these tools, it’s now possible to provision infrastructure through code, not tickets or emails to dedicated teams. Teams can deploy changes multiple times a day, not multiple times a year. Additionally, they can get rid of infrastructure they don’t need, paying only for what they use. This has democratized access to infrastructure and empowered developers, but has also overloaded them with new responsibilities.
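The shift from tickets to code can be illustrated with a toy version of the declarative model that tools like Terraform popularized: you describe the desired state as data, and a reconciler computes the actions needed to reach it. Everything here –the resource names, the `plan` function– is a heavily simplified sketch, not a real provider:

```python
# Toy "infrastructure as code": desired state is data, and a plan is
# computed by diffing it against the actual state (the declarative model
# popularized by tools like Terraform, heavily simplified).

def plan(desired: dict, actual: dict) -> list:
    """Return the actions needed to make `actual` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:
            actions.append(("update", name, spec))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions

desired = {
    "web-server": {"type": "vm", "size": "medium"},
    "orders-db":  {"type": "postgres", "version": "16"},
}
actual = {
    "web-server":   {"type": "vm", "size": "small"},  # drifted from desired
    "legacy-queue": {"type": "rabbitmq"},             # no longer wanted
}

for action in plan(desired, actual):
    print(action)
# Plans an update for web-server, a create for orders-db,
# and a delete for legacy-queue.
```

The design choice matters more than the code: because the desired state lives in version control, provisioning becomes reviewable, repeatable, and auditable, instead of an email thread with the infrastructure team.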
This is a double-edged sword. On one hand, developers have more control over their applications and can iterate faster. On the other hand, developers don’t always have the knowledge needed to manage infrastructure when their applications are deployed at large scale. Nor do they always fully understand the cost implications. This is a problem, because modern applications are increasingly complex, with more abstractions, and require solid knowledge to design and operate them correctly.
DevOps remains the predominant culture in many companies, but presents challenges that have become more evident over time:
- Ambiguity: Ask 10 people today what DevOps is and you’ll get 10 different answers. Is it a culture? A role? A set of tools? All of the above? This confusion persists.
- Inconsistency: Each team, in the same company, implements DevOps differently. One team uses Jenkins, another CircleCI, another GitLab for CI/CD. One team deploys on AWS, another on Kubernetes on-prem. Even when tools serve the same purpose, the lack of standardization creates a heterogeneous landscape that’s difficult to maintain and support at the organizational level.
- Cognitive overload: Developers are expected to understand not only their code, but also Docker, Kubernetes, Terraform, AWS, networks, security, observability, regulations… The mental load is unsustainable, especially for small teams.
- Reinventing the wheel: Beyond choosing different tools, each team often builds their own custom solutions from scratch. Everyone needs CI/CD pipelines, but instead of using shared templates or common frameworks, each team writes their own pipeline logic. Everyone needs monitoring, but each team sets up their own dashboards and alerting. Everyone needs secrets management, but each team implements their own approach. This duplication of effort doesn’t scale and generates enormous extra costs in time and maintenance.
- “DevOps engineer” is an antipattern: Many companies hire “DevOps engineers” thinking it’s a role. These engineers frequently become the infrastructure team with a cooler name. They don’t bring the cultural transformation that DevOps promises.
Wow! It seems we’ve now overshot. We’ve suddenly closed the gap between development and operations, but we’ve created new problems. Now we need to backtrack a bit, and look for someone to take on part of the DevOps responsibilities, with the intention of standardizing and homogenizing infrastructure and services. But who?
For essential readings on DevOps culture, practices, and transformation, see [Kim, 2013][Kim, 2016][Humble, 2010][Forsgren, 2018] in Appendix A.
Platform Engineering
Platform engineering emerges as a direct response to the limitations –or confusions– of DevOps and SRE. The central idea is powerful and simple:
Instead of asking each team to become infrastructure experts, let’s build common tools and capabilities once, correctly, as an internal product that everyone can consume via self-service.
Platform Engineering represents the current state of the art in how the most advanced organizations manage their infrastructure and DevEx.
Pioneer companies like Spotify, Netflix, Airbnb, and Uber built –and continue evolving– sophisticated internal platforms that allow hundreds of development teams to deploy applications independently, securely, and quickly. These aren’t casual platform projects. We’re talking about strategic investments with dedicated teams that treat the platform as a product they adapt to the needs of development teams.
Spotify, for example, built Backstage internally –now an open source tool–, a developer portal that centralizes:
- Service and API catalog
- Project templates
- Technical documentation
- Deployment status
Netflix also developed multiple internal tools like Spinnaker (continuous delivery), Chaos Monkey (resilience testing), and maintains a complete platform that allows teams to deploy services without intervention from a centralized infrastructure team.
Platform Engineering doesn’t replace DevOps or SRE, but complements them. DevOps remains the necessary and cross-functional culture in all engineering teams. SRE is also necessary, but is more integrated into application teams and responsibility is shared with the platform team. Platform Engineering is the mechanism that makes all this scalable and sustainable. All three disciplines coexist and mutually reinforce each other in modern organizations.
Today, in 2025, platform engineering is a mature and recognized discipline. Conferences like PlatformCon bring together thousands of attendees. Companies of all sizes actively invest in platform teams.
However, adopting platform engineering requires significant upfront investment and organizational buy-in. There’s a risk of over-engineering a platform that developers won’t adopt if it doesn’t truly address their needs. The reality is heterogeneous. While some tech companies already have sophisticated platforms, many traditional organizations still operate with classic sysadmin models or are in the middle of their DevOps transformation. There’s no single “industry state.” The fit of each model depends on the context, culture, and specific needs of each organization. Hence the need for this book. I hope that by the end of it, the reader will be on the right track to build or evolve their own platform.
For comprehensive guides on platform engineering, team organization, and practical implementation, see [Fournier, 2023][Skelton, 2019][Salatino, 2023] in Appendix A.
Summary
In this chapter we’ve traced the historical evolution from classic sysadmins, through SRE and DevOps, to reach Platform Engineering. Each stage has emerged as a response to the limitations and challenges of the previous one, adapting to new technological and organizational realities. Virtualization and the explosion of the public cloud have been catalysts in this evolution. They’ve introduced new ways to provision and manage infrastructure, but have also brought new challenges.
This historical positioning helps us understand the fit of Platform Engineering in the current landscape. The future remains uncertain, particularly with the rapid evolution of generative artificial intelligence promising to reshape how we build and operate systems. Emerging concepts like vibe coding –where AI interprets high-level intent to generate infrastructure and code– are still speculative and unproven at scale. Whether these become mainstream or give way to other abstractions remains to be seen. What’s clear is that system complexity won’t decrease, but rather the opposite. We’re headed for increased complexity with more layers of abstraction.