Infrastructure
If you think good architecture is expensive, try bad architecture.
The Slack message from Diego arrived at 4:45 PM on a Friday.
“Hey Javi, great news! The business just signed off on our European expansion to comply with GDPR. We need to deploy the user-service and checkout APIs to an EU region. Can the team just change the region flag in our deployment manifests and push?”
Javi, the lead platform engineer at Mountain Lab, stared at the screen. He typed, deleted, and re-typed his response three times.
“Uh, well… no,” Javi finally replied. “It’s not that simple.”
Diego’s typing indicator flashed immediately. “Why not? I thought the whole point of this internal developer platform was that we could deploy anywhere without opening tickets. You said the infrastructure was abstracted.”
Javi sighed and opened a video call link. When Diego joined, Javi shared his screen, which showed a tangled web of AWS console windows and spreadsheet tabs. “The application deployment is abstracted,” Javi explained, rubbing his temples. “But the underlying infrastructure isn’t dynamic yet. If we spin up a new region, I have to manually allocate a new non-overlapping CIDR block for the European VPC. If I get the math wrong, we’ll have IP collisions when we peer the networks.”
“Okay, so allocate the block. How long does that take?” Diego asked.
“That’s just step one,” Javi continued. “Then I have to configure the Transit Gateway to route traffic between eu-central-1 and us-east-1 so the checkout service can still hit the legacy inventory database. Then I need to create a new API Gateway, set up the regional WAF rules, and submit a Jira ticket to InfoSec to approve them. After that, we have to generate a new wildcard SSL certificate, validate it, and manually update the Cloudflare global DNS routing.”
Diego was silent for a moment. “Marketing already announced the European launch for next Tuesday.”
Javi opened a new Jira epic. “Give me three weeks. If InfoSec approves the WAF rules on the first try.”
Marta, the VP of Platform Engineering, had been silently listening to the call. She unmuted herself. “We haven’t built a platform,” she said flatly. “We’ve just built a very efficient ticketing system where Javi is the sole API.”
The realization hit them hard. They had successfully abstracted the CI/CD pipelines, but the base infrastructure—networking, DNS, SSL, and compute provisioning—was still rooted in “ClickOps” and manual coordination. To support real self-service, they needed an infrastructure approach where adding a region, establishing secure network segments, and managing edge routing were entirely declarative and automated by the platform.
We often think of internal developer platforms (IDPs) purely in terms of developer portals or deployment pipelines. But if the underlying infrastructure relies on manual provisioning, ticketing systems, or UI-driven configuration, the platform is merely a shiny facade over a traditional IT bottleneck. This toil restricts flow and prevents scaling.
In this chapter, we will explore the infrastructure primitives required to support a scalable internal platform. We will cover how to lay a declarative foundation with Infrastructure as Code, implement physical network segmentation, enforce guardrails through resource quotas, and automate edge routing across public cloud providers.
Infrastructure as Code
Before we can discuss networking patterns or compute abstractions, we must establish the absolute baseline for platform engineering: everything must be code. Infrastructure as Code (IaC) using tools like Terraform, OpenTofu, or Pulumi is non-negotiable.
Manual infrastructure provisioning—“ClickOps”—leads directly to configuration drift. When an engineer manually configures an Azure Virtual Network (VNet) or an AWS Transit Gateway, that state exists only in the cloud provider’s proprietary database. It cannot be peer-reviewed, it cannot be audited for security regressions, and crucially, it cannot be reliably replicated to a new region or environment.
Vending Machine Infrastructure
For a platform team, IaC is not just about writing scripts to spin up servers; it is about creating reusable, opinionated modules. The platform team acts as a vendor of infrastructure components.
Developers should not be writing raw cloud provider blocks. Instead, the platform team provides a standard module (a “Golden Path”) that encapsulates the organization’s security, compliance, and high-availability requirements. Furthermore, these modules automatically enforce non-functional requirements like Resource Naming Conventions and Tagging Strategies.
# Developers consume a simplified, platform-vended module.
# The module enforces tagging, naming conventions, and security defaults.
module "database" {
  source = "git::https://git.mountainlab.io/tf-modules//rds-postgres?ref=v2.1.0"

  # Resources map directly to the segmentation strategy
  sector      = "ecommerce"
  tier        = "live"
  region      = "eu01"
  tenant      = "payments"
  environment = "production"

  instance_class = "db.r6g.large"

  # The module internally forces storage_encrypted = true,
  # multi_az = true, generates standardized resource names,
  # and applies mandatory cost-allocation tags based on coordinates.
}
By abstracting the complexity, you guarantee that every database deployed on the platform encrypts data at rest, pushes logs to the central aggregator, resides in the correct network segment, and possesses the standardized tags required by the finance team for cost attribution.
Resource Naming Conventions
One of the most tedious aspects of infrastructure provisioning is adhering to naming conventions. Some cloud providers enforce strict character limits or require globally unique names (e.g., AWS S3 buckets or Azure Storage Accounts).
The platform team should entirely abstract this away from the developer. I recommend a structured naming approach for resources that do not need to be globally unique (e.g., vnet-ecommerce-live-eu01), combined with random hashes or suffixes for resources that require global uniqueness (e.g., mybucket-a7x9).
Developers should never need to think about naming cloud resources.
Sometimes, developers want to specify a custom name for a resource for easier identification. That is fine: the platform team can expose an optional name parameter in selected modules, as long as the golden path still appends the standardized suffix to ensure uniqueness and consistency. This should remain the exception, not the norm, and not every module needs to offer it.
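The two naming styles described above can be sketched in a few lines. This is a minimal illustration, not the module's actual logic: the function names, the four-character suffix, and the 63-character default limit are assumptions, and the suffix is derived deterministically from a seed (rather than being random) so that repeated runs produce the same name.

```python
import hashlib
import re


def structured_name(kind: str, sector: str, tier: str, region: str) -> str:
    """Name for resources that only need to be unique inside the organization."""
    return "-".join([kind, sector, tier, region])


def globally_unique_name(base: str, seed: str, max_len: int = 63) -> str:
    """Append a short deterministic suffix for globally unique resources.

    The suffix is derived from `seed`, so the same coordinates always
    produce the same name. `max_len` is a hypothetical provider limit.
    """
    suffix = hashlib.sha256(seed.encode("utf-8")).hexdigest()[:4]
    # Keep only lowercase letters, digits, and hyphens in the base part.
    cleaned = re.sub(r"[^a-z0-9-]", "-", base.lower()).strip("-")
    # Trim the base so that base + "-" + suffix fits within max_len.
    cleaned = cleaned[: max_len - len(suffix) - 1].rstrip("-")
    return f"{cleaned}-{suffix}"


print(structured_name("vnet", "ecommerce", "live", "eu01"))
print(globally_unique_name("mybucket", "ecommerce/live/eu01/payments"))
```

Character rules differ per provider and resource type, so a real module would carry one rule set per resource rather than a single regular expression.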
Tagging Strategy
Every resource deployed by the platform must be tagged (or labeled) with a standard set of metadata. This allows both humans and automation to query, identify, and govern resources without relying on the physical resource name. This is critical for both cost management and operational efficiency.
To avoid conflicts with cloud provider-generated tags or third-party tools, you should prefix all custom tags with your company domain. However, the format is not homogeneous across providers. You must adapt the syntax to the context:
- AWS / Azure / GCP Tags: <company>:<key> (e.g., mountainlab.io:owner)
- Kubernetes Labels: <company>/<key> (e.g., mountainlab.io/owner)
The key is a consistent prefix and a well-defined taxonomy enforced globally. A standard set of mandatory tags vended through your IaC modules should include:
- <company>:tenant: The product team that owns the workload.
- <company>:sector: The bounded context or business domain.
- <company>:tier: The environment classification (e.g., Tier("live"), Tier("sandbox")).
- <company>:region: The platform region.
- <company>:tf_module: The name of the Terraform module that created the resource.
- <company>:tf_module_version: The specific version of the module, invaluable for tracking infrastructure drift and upgrade campaigns.
Beyond these structural tags, I recommend implementing a broader set of best-practice tags for governance and operations:
- Ownership: <company>:owner or <company>:business-unit.
- Financials: <company>:cost-center for precise chargebacks and <company>:project for initiative tracking.
- Security: <company>:data-classification (e.g., public, confidential, pci) and <company>:criticality (e.g., tier-1, mission-critical).
- Automation: <company>:maintenance-window (e.g., sun-0400-0800-utc) and <company>:created-by (e.g., terraform, crossplane).
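To make the tagging rules concrete, here is a small sketch of how a module could build the mandatory tag set and switch the separator between cloud tags and Kubernetes labels. The function name `build_tags` and the `target` parameter are hypothetical; the prefix and the mandatory keys come from the lists above.

```python
COMPANY = "mountainlab.io"  # prefix taken from the examples above

MANDATORY_KEYS = (
    "tenant", "sector", "tier", "region", "tf_module", "tf_module_version",
)


def build_tags(values: dict[str, str], target: str = "cloud") -> dict[str, str]:
    """Return prefixed tags; raise if a mandatory key is missing.

    target="cloud" uses "<company>:<key>", target="kubernetes" uses
    "<company>/<key>".
    """
    missing = [key for key in MANDATORY_KEYS if key not in values]
    if missing:
        raise ValueError(f"missing mandatory tags: {', '.join(missing)}")
    separator = "/" if target == "kubernetes" else ":"
    return {f"{COMPANY}{separator}{key}": value for key, value in values.items()}


coordinates = {
    "tenant": "payments",
    "sector": "ecommerce",
    "tier": "live",
    "region": "eu01",
    "tf_module": "rds-postgres",
    "tf_module_version": "v2.1.0",
}
print(build_tags(coordinates))
print(build_tags(coordinates, target="kubernetes"))
```

Failing fast on a missing key is a design choice: a resource without its mandatory tags is easier to reject at plan time than to find later in a cost report.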
The Shared Responsibility Model
With a declarative foundation in place, we must define who owns what. Treating infrastructure as a self-service product requires a clear, codified Shared Responsibility Model. This model acts as a contract between three distinct parties. Each level abstracts the underlying complexity, offers support, and enables the next level to build safely on top of it:
flowchart TD
Product["Product Teams<br>(Business Logic, App Config, Micro-segmentation)"]
Platform["Platform Team<br>(Self-service APIs, Base Network, Security Guardrails, IaC Modules)"]
Cloud["Cloud Provider<br>(Physical Data Centers, Hardware, Managed Services)"]
Product -->|Consumes Platform Services and Modules| Platform
Platform -->|Abstracts & Governs| Cloud
- The Cloud Provider (AWS/GCP/Azure): Abstracts away the physical data centers, hardware, and network, offering managed services and their control planes (e.g., EKS, GKE, AKS, Cloud SQL). I favor managed services whenever possible to reduce the platform team’s operational toil.
- The Platform Team: Builds on top of the cloud provider’s managed services by creating internal platforms that provide self-service capabilities, automation, and guardrails. For example, the platform team implements baseline security controls (like enforcing transit encryption or centralizing logs), provisions base networking, manages global DNS, and vends the standardized IaC modules.
- The Product Teams (Development Teams): Consume those internal platforms to build customer-facing products. They use the platform’s self-service capabilities to deploy a new feature without waiting on a centralized operations team. They are responsible for application-level configuration, defining environment variables, selecting compute “t-shirt sizes”, and implementing micro-segmentation for their specific workloads (e.g., writing Kubernetes NetworkPolicy resources).
This contract ensures that product teams do not need to understand BGP routing or TLS certificate negotiations to deploy a service securely, while the platform team isn’t burdened with configuring application-specific business logic.
Cloud Primitives
Segmentation, as discussed in Chapter 4, relies heavily on cloud primitives to establish hard boundaries between different segments. These primitives are what we use to isolate workloads, set up guardrails, apply resource quotas, and enforce strict security controls. Using our functional notation, we can formally define these infrastructure boundaries as types: CloudPrimitive(Sector, Tier), ResourceGroup(Sector, Tier, Region, Tenant), Namespace(Sector, Tier, Region, Tenant, Name), and so on.
flowchart TD
Org["Organization Root / Tenant"]
Folder["Organizational Unit / Folder / Management Group"]
Boundary["CloudPrimitive / Account / Project / Subscription"]
Group["Resource Group"]
Compute["Kubernetes Namespace / Space"]
Org --> Folder
Folder --> Boundary
Boundary --> Group
Group --> Compute
Modern architectures leverage the native structural hierarchies provided by cloud vendors to manage governance and scale. While each provider uses different terminology, they all follow a layered approach that separates broad policy control from hard resource isolation:
- AWS: Organization -> Organizational Units (OUs) -> Accounts. OUs provide hierarchical policy control through Service Control Policies (SCPs), while Accounts serve as the strongest isolation boundary for identity, billing, and API quotas.
- GCP: Organization -> Folders -> Projects. Folders allow for nested policy inheritance and logical grouping, while Projects are the fundamental unit for resource isolation, IAM, and quota management.
- Azure: Tenant -> Management Groups -> Subscriptions -> Resource Groups. Management Groups aggregate subscriptions for broad policy enforcement, while Subscriptions provide the primary billing and quota boundary. Resource Groups are used for granular RBAC delegation and lifecycle management of related resources.
It is crucial to distinguish between tools used for organization and tools used for isolation. Constructs like AWS OrganizationalUnit (OUs), GCP Folder, and Azure ManagementGroup are primarily for hierarchical organization and applying broad governance policies—they do not provide workload isolation.
Structural Mapping per Provider
To implement the platform in a real environment, we must map our abstract dimensions to these concrete cloud vendor structures. There is no single “right” mapping; your organization must choose based on blast radius tolerance and operational capacity.
The following examples show a common, proven design for each major provider using our functional notation:
Microsoft Azure
Azure provides a well-defined hierarchy of management scopes. One common approach maps Sector to ManagementGroup and Tier to Subscription:
- CloudPrimitive(Sector) => AzureManagementGroup(Sector)
- CloudPrimitive(Sector, Tier) => AzureSubscription(Sector, Tier)
- Network(Sector, Tier, Region) => AzureVNet(Sector, Tier, Region)
- Namespace(Sector, Tier, Region, Tenant) => AzureResourceGroup(Sector, Tier, Region, Tenant)
- KubernetesCluster(Sector, Tier, Region) => AzureAKS(Sector, Tier, Region)
- KubernetesNamespace(Sector, Tier, Region, Tenant, Name)
graph TD
Root["TenantRootGroup"]
Root --> PMG["AzureManagementGroup('platform')"]
Root --> ECMG["AzureManagementGroup('ecommerce')"]
PMG --> PS["AzureSubscription('platform', 'sandbox')"]
PMG --> PL["AzureSubscription('platform', 'live')"]
ECMG --> ECS["AzureSubscription('ecommerce', 'sandbox')"]
ECMG --> ECL["AzureSubscription('ecommerce', 'live')"]
ECS --> RG1["AzureResourceGroup('ecommerce', 'sandbox', 'eu01', 'payments')"]
ECL --> RG2["AzureResourceGroup('ecommerce', 'live', 'eu01', 'payments')"]
Management Groups enforce Azure Policy and RBAC at the Sector level. Subscriptions provide billing, identity, and API limit isolation—here mapped to (Sector, Tier). Resource Groups provide RBAC delegation and resource grouping—here mapped to the full coordinate (Sector, Tier, Region, Tenant).
For organizations requiring stricter tenant isolation, an alternative is to create subscriptions per (Sector, Tier, Tenant)—for example, a dedicated payments-live subscription.
Amazon Web Services
AWS relies primarily on isolated accounts bound together by an Organization. Because AWS does not have an equivalent to Azure’s ResourceGroup, the account boundary often absorbs the Tenant dimension:
- CloudPrimitive(Sector) => AWSOrganizationalUnit(Sector)
- CloudPrimitive(Sector, Tier, Tenant) => AWSAccount(Sector, Tier, Tenant)
- Network(Sector, Tier, Region) => AWSVPC(Sector, Tier, Region)
- KubernetesCluster(Sector, Tier, Region) => AWSEKS(Sector, Tier, Region)
- KubernetesNamespace(Sector, Tier, Region, Tenant, Name)
graph TD
Root["AWSOrganizationRoot"]
Root --> POU["AWSOrganizationalUnit('platform')"]
Root --> ECOU["AWSOrganizationalUnit('ecommerce')"]
POU --> PSA["AWSAccount('platform', 'sandbox')"]
POU --> PLA["AWSAccount('platform', 'live')"]
ECOU --> SBOU["SubOU('sandbox')\n(Tier grouping)"]
ECOU --> LOU["SubOU('live')\n(Tier grouping)"]
SBOU --> PaySb["AWSAccount('ecommerce', 'sandbox', 'payments')"]
LOU --> PayLv["AWSAccount('ecommerce', 'live', 'payments')"]
PayLv --> VPC["AWSVPC('ecommerce', 'live', 'eu01', 'payments')"]
Organizational Units (OUs) enforce Service Control Policies (SCPs) at the Sector level. Accounts provide the hardest isolation boundary—IAM, billing, and API limits are all account-scoped. Here they’re mapped to (Sector, Tier, Tenant), providing a stronger isolation level than the minimum recommended in Chapter 4 for these providers. Region-specific resources like AWSVPC are created inside each account.
For smaller organizations, a simpler alternative is to use accounts per (Sector, Tier) only—ecommerce-sandbox and ecommerce-live—with tenant isolation handled via IAM boundaries within the account.
Google Cloud Platform
GCP uses a resource hierarchy of Organization, Folder, and Project. Projects are the primary isolation primitive, analogous to AWS accounts:
- CloudPrimitive(Sector) => GCPFolder(Sector)
- CloudPrimitive(Sector, Tier, Tenant) => GCPProject(Sector, Tier, Tenant)
- Network(Sector, Tier, Region) => GCPVPC(Sector, Tier, Region)
- KubernetesCluster(Sector, Tier, Region) => GCPGKE(Sector, Tier, Region)
- KubernetesNamespace(Sector, Tier, Region, Tenant, Name)
graph TD
Root["GCPOrganization"]
Root --> PF["GCPFolder('platform')"]
Root --> ECF["GCPFolder('ecommerce')"]
PF --> PP["GCPProject('platform', 'sandbox')"]
PF --> PPL["GCPProject('platform', 'live')"]
ECF --> SBF["GCPFolder('sandbox')\n(Tier grouping)"]
ECF --> LF["GCPFolder('live')\n(Tier grouping)"]
SBF --> PSP["GCPProject('ecommerce', 'sandbox', 'payments')"]
LF --> PLP["GCPProject('ecommerce', 'live', 'payments')"]
Folders provide OrganizationPolicy inheritance and IAM grouping. Projects are the hard boundary: billing, IAM, and quotas are all project-scoped. Like AWS, mapping projects to the (Sector, Tier, Tenant) coordinate provides maximum isolation. Region-specific resources are created within projects, mapping to the full coordinate (Sector, Tier, Region, Tenant).
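The three mappings differ mainly in which dimensions the hard boundary absorbs. The following sketch, with a hypothetical `Coordinate` type and `isolation_primitive` function, returns the hard-boundary primitive for a coordinate in the chapter's notation. It encodes only the mappings shown above, not the alternatives mentioned alongside them.

```python
from typing import NamedTuple


class Coordinate(NamedTuple):
    sector: str
    tier: str
    region: str
    tenant: str


def isolation_primitive(provider: str, c: Coordinate) -> str:
    """Return the hard-boundary primitive for a coordinate."""
    if provider == "azure":
        # Azure: the subscription carries (Sector, Tier);
        # tenants are separated by resource groups inside it.
        return f"AzureSubscription('{c.sector}', '{c.tier}')"
    if provider == "aws":
        # AWS: the account absorbs the Tenant dimension.
        return f"AWSAccount('{c.sector}', '{c.tier}', '{c.tenant}')"
    if provider == "gcp":
        # GCP: the project plays the same role as the AWS account.
        return f"GCPProject('{c.sector}', '{c.tier}', '{c.tenant}')"
    raise ValueError(f"unknown provider: {provider}")


payments = Coordinate("ecommerce", "live", "eu01", "payments")
for provider in ("azure", "aws", "gcp"):
    print(provider, "->", isolation_primitive(provider, payments))
```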
Tenant Isolation
Within these boundaries, we must isolate individual tenants (product teams). How this is achieved depends on the cloud provider:
- Azure: Azure Resource Groups act as a hard boundary where Role-Based Access Control (RBAC) can be applied.
- AWS: AWS Resource Groups are strictly for organizational tagging and do not provide security isolation; thus, tenant isolation in AWS often requires entirely separate AWS Accounts.
- GCP: Tenant isolation is achieved using separate GCP Projects.
Regardless of the primitive used, tenants should never manage these boundaries themselves. The platform team provisions and governs the Resource Group, Account, or Project to ensure strict security policies and permissions are in place. The tenant is merely granted access to deploy into it. Down at the compute layer, the platform team provides Kubernetes Namespaces to isolate the tenant’s cluster resources.
Design Patterns
Choosing how to map your segmentation strategy to these cloud primitives is a high-stakes decision. These patterns and anti-patterns help guide the way.
Anti-Pattern: The Mega-Cluster
The mega-cluster is the practice of putting all teams, all tiers, and all applications into a single massive cloud account or a single oversized Kubernetes cluster. While it significantly reduces initial complexity and cost, it creates a single point of failure for the entire organization. A single misconfigured IAM policy or a noisy neighbor within the shared cluster can bring down every service.
Anti-Pattern: Premature Granularity
The opposite of the mega-cluster. For example, a small startup with only a handful of engineers decides to give every developer their own dedicated cloud account and separate Kubernetes cluster. The infrastructure overhead instantly grinds the engineering team to a halt, as they spend all their time updating versions and managing cross-account networking instead of building product.
Pattern: CloudPrimitive(Sector, Tier) Matrix
A proven approach for organizations of all sizes is mapping Sector and Tier directly to the cloud provider’s hardest boundary primitive. A standard core setup involves one dedicated account/subscription for each (Sector, Tier) combination:
- CloudPrimitive("platform", "sandbox")
- CloudPrimitive("platform", "live")
- CloudPrimitive("ecommerce", "sandbox")
- CloudPrimitive("ecommerce", "live")
By adopting this matrix, you leverage the cloud’s strongest security and cost-attribution boundaries for your most critical segments. You then safely use softer, more cost-effective boundaries—like namespaces and resource groups—for Region and Tenant isolation operating inside those hardened accounts.
Pattern: Network(Sector, Tier, Region)
Create separate virtual networks for each (Sector, Tier, Region) combination. By default, there is no routing between them. Cross-network connectivity is explicit and controlled through centralized peering or transit hubs. This enforces network-level isolation aligned with geographic and data residency requirements.
Networking
Networking is the physical manifestation of your segmentation strategy. A flat network where every workload can route to every other workload is a critical security anti-pattern.
The Hub and Spoke Architecture
To enforce strict boundaries at the infrastructure level, the industry standard across all public clouds is the Hub and Spoke network topology.
The Hub is designed for VPN, global connectivity, and centralized inspection. It is a unique, central routing point (using AWS Transit Gateway, Azure Virtual WAN, or GCP Network Connectivity Center) that connects the organization worldwide.
Instead of deploying all services into the Hub, workloads live in isolated Spoke networks (VPCs in AWS/GCP, VNets in Azure). A best practice is to provision one Spoke for each (Sector, Tier, Region) combination.
By default, Spoke networks cannot communicate with each other. If the checkout-api in the ecommerce Spoke needs to talk to the ledger-db in the finance Spoke, the traffic must traverse the Hub, where the platform team can enforce centralized firewall rules.
Crucially, this architecture relies on non-overlapping IP address allocation. For the Hub to route traffic between Spokes, every network must possess a unique CIDR block. If two Spokes share the same IP range, the Hub cannot tell which of them a destination address belongs to, so they cannot be peered or routed through the same Hub without address translation. That situation is notoriously difficult to remediate once workloads are running.
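Overlap is easy to check mechanically before anything is deployed. The sketch below uses Python's standard `ipaddress` module; the function name `find_overlaps` and the sample networks are hypothetical, chosen so that one pair collides.

```python
import ipaddress
from itertools import combinations


def find_overlaps(networks: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named networks whose CIDR blocks overlap."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(parsed.items(), 2)
        if net_a.overlaps(net_b)
    ]


plan = {
    "hub": "10.0.0.0/16",
    "spoke-a": "10.16.0.0/16",
    "spoke-b": "10.16.128.0/17",  # sits inside spoke-a's range
}
print(find_overlaps(plan))
```

A check like this can run in CI against the declared network plan, so a colliding range is rejected in review instead of being discovered at peering time.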
Tenants’ workloads are usually isolated in their own namespaces within a shared Spoke. In some cases with strict compliance or data residency needs, a specific tenant might require their own dedicated Spoke, though this trade-off significantly increases operational overhead.
Furthermore, we must design for Egress (outbound traffic to the internet). There are two primary approaches:
- Centralized Egress: All outbound traffic is routed from the Spokes back through the Hub’s NAT Gateways and egress proxies to prevent data exfiltration and centralize inspection.
- Decentralized Egress: Spokes have their own local NAT Gateways for direct internet egress. This can reduce latency and data transfer costs, but requires distributing security policies to every Spoke.
Subnets
Within a Spoke network, the IP space is further divided into Subnets. A common and highly recommended pattern is to strictly separate public and private subnets.
Public subnets contain resources that must be exposed to the internet (such as load balancers or NAT gateways). Private subnets contain the actual workloads, databases, and internal services, which have no direct inbound path from the internet. This drastically minimizes the attack surface.
IP Address Management (IPAM)
A common failure mode for scaling platforms is IP address exhaustion and overlap. If CIDR blocks are allocated organically, you will sooner or later end up with overlapping IP ranges, which makes network peering impossible without complex NAT workarounds. Centralized, non-overlapping IP Address Management (IPAM) is a day-one architectural requirement.
The Hub is globally unique, while Spokes multiply across regions and environments. The platform team must map out the RFC 1918 private IP space and partition it strategically. For example, allocating one large block for the Hub, and distinct blocks for each Spoke:
- Hub Network: 10.0.0.0/16
- Spoke ("ecommerce", "live", "us01", _): 10.16.0.0/16
- Spoke ("ecommerce", "sandbox", "us01", _): 10.17.0.0/16
- Spoke ("finance", "live", "eu01", _): 10.32.0.0/16
Furthermore, when carving up a Spoke’s 10.16.0.0/16 block into smaller subnets, you must account for cloud provider nuances. Some providers have specific delegation requirements. For example, Azure expects dedicated subnets, some of them formally delegated, for managed services like Azure Kubernetes Service (AKS) or Azure Database for PostgreSQL. Your IPAM strategy (and IaC logic) must reserve these sub-blocks in advance to avoid exhaustion within the Spoke.
By strictly defining this math upfront, expanding to a new region becomes a simple parameter change in an IaC module, completely eliminating the risk of IP collisions.
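One way to turn this math into code is a small allocator that hands out the next free block per coordinate. The sketch below is an assumption-laden illustration: the per-sector /12 pools are hypothetical (chosen so the results match the example addresses above), the class keeps its state in memory, and a real platform would persist allocations in its IaC state or a managed IPAM service.

```python
import ipaddress

# Hypothetical plan: each sector owns a /12 and each Spoke takes the
# next free /16 inside it. The Hub keeps 10.0.0.0/16 outside these pools.
SECTOR_POOLS = {
    "ecommerce": ipaddress.ip_network("10.16.0.0/12"),
    "finance": ipaddress.ip_network("10.32.0.0/12"),
}


class SpokeAllocator:
    def __init__(self, pools):
        self._pools = pools
        self._allocated = {}  # (sector, tier, region) -> network

    def allocate(self, sector, tier, region):
        key = (sector, tier, region)
        if key in self._allocated:
            # Idempotent: the same coordinate always gets the same block.
            return self._allocated[key]
        taken = set(self._allocated.values())
        for candidate in self._pools[sector].subnets(new_prefix=16):
            if candidate not in taken:
                self._allocated[key] = candidate
                return candidate
        raise RuntimeError(f"no free /16 left in the {sector} pool")


def carve_subnets(spoke, new_prefix=20):
    """Split a Spoke block into equally sized subnets (sixteen /20s per /16)."""
    return list(spoke.subnets(new_prefix=new_prefix))


allocator = SpokeAllocator(SECTOR_POOLS)
print(allocator.allocate("ecommerce", "live", "us01"))
print(allocator.allocate("ecommerce", "sandbox", "us01"))
print(allocator.allocate("finance", "live", "eu01"))
print(carve_subnets(allocator.allocate("ecommerce", "live", "us01"))[:2])
```

Because allocation is keyed by coordinate, adding a region amounts to calling `allocate` with a new region value; subnets reserved for delegated services can be taken from the carved list in a fixed order.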
DNS
Just as we partition IP space, we must partition the organization’s Domain Name System (DNS) namespace. A well-structured DNS architecture prevents routing conflicts and simplifies service discovery across the Hub and Spoke network.
I recommend a dual-zone approach: provisioning two hosted zones per Spoke (or even per tenant)—one internal and one public.
When establishing the DNS naming convention, you must decide how to handle top-level domains. A scalable approach is to assign a unique top-level domain to each sector (e.g., company.com for the ecommerce sector, and company.io for internal tools). This prevents naming collisions and elegantly separates platform-managed domains from domains managed by the marketing team. Alternatively, you can use subdomains like ecommerce.company.com and tools.company.com.
For the internal hosted zone, service records should follow an explicit internal pattern:
{region}.{tier}.internal.{sector_domain} (e.g., eu01.live.internal.company.com)
For the public hosted zone, the public modifier is redundant and should be omitted:
{region}.{tier}.{sector_domain} (e.g., eu01.live.company.com)
Some organizations prefer to mix public and internal records in a single hosted zone to save on administrative overhead. I strongly advise against this. Mixing zones can expose sensitive internal service names and infrastructure details to the public internet. Keeping internal and public zones strictly separate ensures that internal routing information remains completely invisible to external actors.
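The two zone patterns can be generated from the same coordinates, which keeps the internal and public names aligned while the zones themselves stay separate. The helper names `zone_names` and `record_name` are hypothetical, and placing a service name as the leftmost label of a record is an assumption made for illustration.

```python
def zone_names(sector_domain: str, tier: str, region: str) -> dict[str, str]:
    """Build the internal and public hosted zone names for one Spoke."""
    return {
        "internal": f"{region}.{tier}.internal.{sector_domain}",
        "public": f"{region}.{tier}.{sector_domain}",
    }


def record_name(service: str, zone: str) -> str:
    """Fully qualified record for a service inside a zone."""
    return f"{service}.{zone}"


zones = zone_names("company.com", tier="live", region="eu01")
print(zones["internal"])
print(zones["public"])
print(record_name("checkout-api", zones["internal"]))
```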
Compute
With the network segmented, we need a substrate to run the workloads. While organizations use varied compute models, the platform team must provide standard, governed paths for all of them.
Kubernetes
I consider Kubernetes to be the quintessential technology for building platforms. Kubernetes is far more than just a container orchestrator; it is a universal control plane. It provides a standardized, declarative API that sits above the cloud provider, abstracting a vast amount of underlying infrastructure complexity. From a developer’s perspective, a Deployment manifest looks exactly the same whether it is applied to Amazon EKS, Google GKE, or Azure AKS.
Furthermore, through the use of the Operator pattern (such as Crossplane, Azure Service Operator, or AWS Controllers for Kubernetes), the Kubernetes API can be extended to manage almost any external cloud resource, allowing developers to provision a database using the exact same workflow they use to deploy their application.
Kubernetes is not strictly mandatory: you can build a successful IDP on virtual machines or on other container platforms such as Azure Container Apps (ACA) or Amazon ECS. Still, it is highly recommended because it is the industry's de facto standard. Throughout this book, I will use Kubernetes in my examples and diagrams as the default abstraction layer.
Multi-Tenancy
We must avoid the “Mega-Cluster” anti-pattern—running the entire organization on a single cluster. However, managing hundreds of physical clusters for every development team creates unbearable operational toil. Instead, platforms use logical isolation.
The bread and butter of platform compute is providing namespaces as a service. Developers request a namespace and receive it automatically, fully pre-configured with the correct guardrails, network policies, and RBAC permissions. Developers can request this isolation boundary either by opening a pull request to a Git repository that the platform team manages, or by invoking a self-service API through an internal developer portal.
For teams that require deeper isolation—such as the ability to install their own cluster-wide CustomResourceDefinitions (CRDs)—the industry is adopting Virtual Clusters (vcluster). A virtual cluster runs a fully functional Kubernetes control plane inside a namespace of a host cluster. To the developer, they have admin access to their own dedicated cluster. To the platform team, it is managed as just another pod within the standard infrastructure lifecycle.
Node Pools
Sharing a Kubernetes cluster does not necessarily mean sharing the exact same compute resources. The platform team should provide a General Purpose Node Pool as the default landing zone for the vast majority of workloads. The specific instance types (e.g., standard m5 or burstable t3 instances on AWS) within this default pool depend on the nature of the organization’s workloads.
To optimize this default pool, platform teams increasingly rely on intelligent provisioning services like Karpenter. Instead of managing rigid autoscaling groups, Karpenter automatically provisions the most appropriate, cost-effective compute instances (often leveraging Spot instances) dynamically based on the specific CPU and memory requests of the pending pods. This significantly reduces idle capacity costs while improving workload performance.
However, sometimes specific workloads require specialized hardware or deeper isolation. A data science team might need a node pool with GPU instances for machine learning, while another team might need high-memory instances for an in-memory database. Alternatively, a specific tenant might require strict physical compute isolation for compliance reasons.
The platform team handles this by provisioning Dedicated Node Pools. Product teams request these specific compute configurations through the same self-service channels used for quotas. The platform team evaluates the request and, if approved, provisions the pool. Kubernetes taints keep other teams’ workloads off the dedicated nodes, while matching tolerations plus a node selector (or node affinity) ensure that the owning team’s workloads are scheduled onto them. This provides physical compute isolation while still benefiting from the shared control plane and centralized management of a single cluster.
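A toleration only permits a pod to land on tainted nodes; it does not steer the pod there. The sketch below therefore adds both a toleration and a node selector to a pod spec. The key `mountainlab.io/dedicated` is a hypothetical name, and the sketch assumes the platform both taints and labels the pool's nodes with it.

```python
import copy

TAINT_KEY = "mountainlab.io/dedicated"  # hypothetical taint and label key


def pin_to_dedicated_pool(pod_spec: dict, pool: str) -> dict:
    """Return a copy of a pod spec that tolerates and selects the dedicated pool."""
    spec = copy.deepcopy(pod_spec)
    # The toleration lets the pod onto nodes tainted
    # <TAINT_KEY>=<pool>:NoSchedule.
    spec.setdefault("tolerations", []).append(
        {"key": TAINT_KEY, "operator": "Equal", "value": pool, "effect": "NoSchedule"}
    )
    # The node selector keeps the pod off every other pool.
    spec.setdefault("nodeSelector", {})[TAINT_KEY] = pool
    return spec


original = {"containers": [{"name": "trainer", "image": "example/trainer:1.0"}]}
pinned = pin_to_dedicated_pool(original, "gpu-data-science")
print(pinned["tolerations"])
print(pinned["nodeSelector"])
```

On a platform, this mutation would more likely live in an admission controller or in the vended workload template than in each team's manifests.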
Alternative Compute Models
While Kubernetes is our primary abstraction, a mature platform strategy must also accommodate alternative compute models. Some workloads, such as highly burstable asynchronous tasks, are better suited for Serverless offerings (like AWS Lambda or Google Cloud Run). Others, like legacy vendor software, may demand raw Virtual Machines. High Performance Computing (HPC) workloads might require dedicated bare-metal clusters.
In all these cases, the platform team must provide standard IaC modules that ensure these alternative compute environments are automatically placed into the correct Spoke network, tagged accurately, and governed by the same central IAM and security policies as the Kubernetes workloads.
Quotas
Self-service infrastructure does not mean infinite resources. To ensure platform stability and predictable costs, the platform must strictly manage the consumption of product teams. We implement this through a multi-layered quota system that acts as a hard boundary on resource allocation.
Cloud-Level Quotas
At the highest level, cloud providers allow you to set quotas on the maximum number of resources an account or subscription can provision (e.g., maximum vCPUs, maximum load balancers). These cloud-level quotas are strictly the platform team’s concern; product teams do not manage them and generally should not even know they exist.
The platform team uses the organizational primitives defined earlier (OUs, Folders, Management Groups) to centralize the management of these limits across the organization. By governing these quotas at the root level, you prevent a single runaway experiment in a sandbox account from exhausting the entire organization’s cloud provider service limits.
Kubernetes Quotas
Inside the Kubernetes abstraction, product teams should be intimately aware of the quotas assigned to their namespaces.
Resource limits are enforced natively using ResourceQuota objects. When the platform team vends a Namespace to a product team, it is pre-configured with a ResourceQuota acting as a hard boundary on the total CPU and memory the namespace can consume collectively.
To simplify quota management and prevent product teams from requesting arbitrary, hard-to-predict allocations, I recommend implementing Predefined Quota Templates:
- Extra Small: 2 CPU, 4Gi Memory
- Small: 4 CPU, 8Gi Memory
- Medium: 8 CPU, 16Gi Memory
- Large: 16 CPU, 32Gi Memory
- Custom: Requires specific architectural justification.
Product teams request a specific template for their namespace via a self-service internal portal or a pull request. The platform team then approves or denies it based on overall cluster capacity and cost management goals.
# A standard 'Medium' ResourceQuota applied to a vended namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota-medium
  namespace: tenant-checkout
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
Kubernetes Limits
While a ResourceQuota restricts the total consumption of a namespace, it does not prevent a single misconfigured application within that namespace from hogging all the allotted resources. To prevent resource starvation at the micro-level, the platform team must also deploy a LimitRange object alongside every vended namespace.
A LimitRange enforces constraints on individual Pods and containers. If a developer forgets to specify CPU or memory requests or limits in their deployment manifest, the LimitRange automatically injects default values at admission time. More importantly, it enforces maximum limits per container and rejects any Pod that exceeds them. By guaranteeing that every pod has a defined footprint, the Kubernetes scheduler can efficiently pack workloads onto underlying compute nodes without fear of a single memory leak bringing down adjacent services.
Like quotas, the configuration of the LimitRange is strictly the platform team’s responsibility, and product teams must formally request architectural exceptions.
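A LimitRange vended alongside the namespace might look like the sketch below. The object name and every value here are illustrative assumptions chosen to sit comfortably inside the 'Medium' quota, not recommended defaults.

```yaml
# Hypothetical LimitRange deployed with a vended namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-limits
  namespace: tenant-checkout
spec:
  limits:
    - type: Container
      # Injected when a container declares no requests
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      # Injected when a container declares no limits
      default:
        cpu: 500m
        memory: 512Mi
      # Containers asking for more than this are rejected at admission
      max:
        cpu: "2"
        memory: 4Gi
```

Because the defaults are applied at admission, a manifest that omits its resources block still counts against the namespace ResourceQuota with known values instead of being rejected for missing requests.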
Capacity Planning
These default quotas are intentionally finite. When a development team hits their limit, they cannot simply deploy more pods. They must ask the platform team to increase their quota or provision a dedicated pool with a documented argument for why they need more resources.
This friction is a feature, not a bug. A quota increase request acts as a circuit breaker, triggering a conversation about architecture design, capacity planning, and cost. It provides an opportunity to ask: “Do you actually need 40Gi of memory, or is there a memory leak we need to investigate?”
Furthermore, requests for custom quotas or expensive dedicated node pools often require further approval from management to avoid overprovisioning and cost overruns. In these cases, the platform team acts as the single point of contact, escalating the request to engineering leadership or FinOps, and communicating the decision back to the product team. This centralizes capacity management and ensures that infrastructure costs remain predictable and aligned with business goals.
API-Driven Edge and Routing
The final piece of the infrastructure puzzle is how traffic enters the system. Manual, ticket-driven edge configuration is a massive bottleneck. But beyond just automating toil, edge routing represents another critical Shared Responsibility boundary.
North-South traffic entering your platform is intercepted at the perimeter by edge services:
- Global Server Load Balancing (GSLB) and DNS: Route traffic to the closest or healthiest region.
- Web Application Firewalls (WAFs): Provide DDoS protection and filter malicious payloads.
- Edge API Gateways / Load Balancers: Terminate SSL/TLS, enforce global rate limiting, and route traffic into the specific Kubernetes cluster.
These outer edge components are strictly managed by the platform team. Developers should never be writing WAF rules or configuring Transit Gateway attachments.
However, once the traffic hits the cluster’s ingress controller, responsibility shifts. Product teams define how traffic is routed to their specific pods using standard Kubernetes primitives (like Ingress or Gateway API objects).
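For teams on the Gateway API, the product-team side of this boundary could be sketched as the HTTPRoute below. The Gateway name `platform-gateway`, its namespace `platform-ingress`, the hostname, and the backend Service are all hypothetical. The sketch also assumes the platform-owned Gateway is configured to accept routes from tenant namespaces.

```yaml
# Hypothetical HTTPRoute owned by the product team
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout-api
  namespace: tenant-checkout
spec:
  # The Gateway itself is owned and operated by the platform team
  parentRefs:
    - name: platform-gateway
      namespace: platform-ingress
  hostnames:
    - checkout.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /checkout
      # Traffic is forwarded to the team's own Service
      backendRefs:
        - name: checkout-api
          port: 8080
```

The split mirrors the responsibility boundary: the platform team controls which listeners and namespaces the Gateway admits, while the product team controls only the routing rules for its own services.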
Tools like ExternalDNS and cert-manager automate the bridge between these two boundaries. When a developer deploys an Ingress manifest, ExternalDNS automatically detects the change and calls out to the platform-managed DNS provider, such as Route 53 or Cloudflare, to create the necessary DNS records. Simultaneously, cert-manager requests, provisions, and attaches a valid SSL/TLS certificate.
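A minimal sketch of such an Ingress follows. It assumes ExternalDNS is running with Ingress as a source, and that the platform team has created a cert-manager ClusterIssuer named `platform-issuer`. That name, the ingress class, the hostname, and the Service are hypothetical.

```yaml
# Hypothetical Ingress that triggers DNS and certificate automation
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: checkout-api
  namespace: tenant-checkout
  annotations:
    # Asks cert-manager to issue a certificate for the hosts under spec.tls
    cert-manager.io/cluster-issuer: platform-issuer
spec:
  ingressClassName: platform-nginx
  tls:
    - hosts:
        - checkout.example.com
      # cert-manager stores the issued certificate in this Secret
      secretName: checkout-api-tls
  rules:
    # ExternalDNS reads this host and creates the matching DNS record
    - host: checkout.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: checkout-api
                port:
                  number: 8080
```

The developer never touches the DNS provider or the certificate authority directly; both controllers act on the manifest using credentials that the platform team manages.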
By treating the edge, the network, and the compute layers as declarative, heavily abstracted, and automated services, the platform team removes itself as the bottleneck. As Marta and Javi learned, a multi-region expansion shouldn’t be a three-week Jira epic; it should be a simple parameter change in a pull request.
Skills for This Chapter
- design-landing-zone: An AI skill that maps the Platform Notation to a concrete cloud organizational hierarchy: which accounts, subscriptions, projects, OUs, and folders exist per coordinate, and which guardrails apply at each structural level.
- define-naming-convention: An AI skill that defines the resource naming patterns and mandatory tag schema for your platform, ensuring every resource’s coordinate is readable from its name.
- design-networking: An AI skill that designs the hub-and-spoke network topology for your platform: IPAM CIDR allocation, spoke VPC/VNet definitions per coordinate, DNS zone structure, and egress strategy.
- design-compute: An AI skill that designs the Kubernetes cluster topology, node pool strategy, and multi-tenancy model for your platform, including ResourceQuota tier templates that tenants choose from.
- manage-azure-landing-zone: An AI skill that provisions and syncs the Azure landing zone: Management Groups, Subscriptions, Resource Groups, Azure Policy assignments, and budget alerts.
- manage-aws-landing-zone: An AI skill that provisions and syncs the AWS landing zone: Organizational Units, Accounts, Service Control Policies, centralized audit configuration, and budget alerts.
- manage-gcp-landing-zone: An AI skill that provisions and syncs the GCP landing zone: Folders, Projects, Billing Account links, Organization Policies, and budget alerts.
- manage-azure-networking: An AI skill that provisions and syncs Azure networking: Virtual WAN Hubs, Spoke VNets, subnets, NSGs, and Private DNS zones.
- manage-aws-networking: An AI skill that provisions and syncs AWS networking: Transit Gateway, TGW route tables with tier-isolation enforcement, Spoke VPCs, subnets, and Route 53 private hosted zones.
- manage-gcp-networking: An AI skill that provisions and syncs GCP networking: Network Connectivity Center or Shared VPC topology, subnets, firewall rules, Cloud NAT, and Cloud DNS private zones.
- manage-k8s-namespaces: An AI skill that provisions and syncs tenant namespaces in Kubernetes: ResourceQuota from quota templates, LimitRange defaults, and default-deny NetworkPolicies as the platform’s compute guardrail floor.