Crafting Platforms' Book
Chapter 05

Identity and Access Management

The most secure system is one that is turned off, covered in concrete, and kept in a lead-lined room with armed guards — and even then I have my doubts.

— Gene Spafford
Author Note
This chapter is still being drafted and may contain inaccuracies or incomplete information. Please check back later and report any issues you find to ezequiel+book@foncubierta.com.
Story

Javi was halfway through his second cup of coffee at the Mountain Lab office on a rainy Wednesday morning when the Slack notification arrived. It wasn’t an alert from the monitoring system, but a direct message from Diego, the CTO.

“Javi, why is the payments-live-db cluster being deleted?”

Javi nearly choked. He scrambled to his terminal, fingers flying across the keys. A quick check of the cloud console confirmed the nightmare: the production database for the payments team was in the middle of a teardown.

“I didn’t do it,” Javi typed back, his heart racing. “Checking logs now.”

Three minutes later, he found the culprit. It wasn’t a malicious actor or a rogue script. It was a terraform destroy command executed by a CI/CD runner in the payments-sandbox account.

“Wait,” Javi muttered to himself. “How does a sandbox runner have permission to delete a live database?”

Marta joined him at his desk, her face grim. “I see it too. It looks like when we migrated the payments team last month, someone added the sandbox runner’s service principal to the live-contributor group ‘just for a quick test’ to verify connectivity. We forgot to remove it.”

The database was restored from a snapshot within twenty minutes, but the damage to the team’s confidence was done. A simple human error — a stale permission — had almost wiped out the company’s revenue stream.

At the post-mortem the next day, the mood was somber.

“We’re playing with fire,” Marta said, looking at the assembled engineering leads. “We have developers with permanent ‘Contributor’ access to production ‘just in case’ they need to fix a bug at 2:00 AM. We have CI/CD keys sitting in GitHub secrets for years without rotation. Our IAM is a collection of ‘quick fixes’ and ‘temporary’ groups that became permanent.”

“But we can’t just lock everyone out,” one of the developers argued. “If I have to open a ticket with IT every time I need to check a log or restart a service, we’ll never ship anything.”

Marta nodded. “I agree. We’re not going back to TicketOps. But we are done with standing privileges. From now on, your default state in the cloud is a ghost. You can see, but you cannot touch. And when you need to touch, you’ll have to ask — not a person, but the platform.”

In the previous chapters, we defined our segmentation strategy and built the “coordinate system” that organizes our infrastructure. But a segment is only as secure as the people and processes allowed inside it.

Identity and Access Management (IAM) is often seen as the domain of the Security team or Corporate IT. But in a modern platform, IAM is a core infrastructure capability. It is the invisible mesh that connects our users to their resources. If our IAM is too restrictive, we kill developer velocity. If it’s too permissive, we risk the entire business on a single stale credential.

To get this right, we need to move away from the “all-or-nothing” access models of the past and toward a platform-driven, self-service authorization layer.

Identity vs. Access

The first step in building a scalable IAM strategy is recognizing a fundamental architectural split: the difference between Identity and Access.

  • Identity (Who you are): This is the responsibility of Corporate IT. They manage the Identity Provider (IdP) — whether it’s Microsoft Entra ID (formerly Azure AD), Okta, or Google Workspace. They handle the user lifecycle: onboarding, offboarding, Multi-Factor Authentication (MFA), and broad organizational groupings (e.g., “The Engineering Team”).
  • Access (What you can do): This is the responsibility of the platform team. We take the identities provided by IT and map them to specific infrastructure permissions across our sectors, tiers, and tenants.

This division is critical for operational sanity. The platform team shouldn’t be in the business of resetting passwords or managing who joined the company last Monday. Similarly, Corporate IT shouldn’t need to understand the nuances of Kubernetes RBAC or AWS IAM Policy syntax.

Instead, we create an authorization interface. IT provides the “Who,” and the platform team provides the “What” via declarative, version-controlled definitions. To simplify the cognitive load on developers, we must abstract away the thousands of granular cloud permissions into a handful of standardized roles and groups.

Roles

A Role defines a specific set of capabilities. Using the Platform Notation, we formally define this as a functional type: Role(Name).

In our platform, we standardize on four primary roles:

  • Role("operator"): Reserved for CI/CD runners and infrastructure provisioners. It has the power to create and destroy resources, but — crucially — it has no human members. We don’t log in as an operator. Instead, we use federated identity (such as Azure Workload Identity or AWS OIDC) to grant these permissions to our automation pipelines.
  • Role("admin"): The highest-privilege human-facing role. It allows for managing IAM policies, creating public endpoints, and deleting critical data stores. Because of its power, this role is never a “standing” privilege. It is reserved for emergency situations or initial tenant setup, and it always requires Just-In-Time (JIT) escalation with a mandatory justification.
  • Role("contributor"): The developer’s “fast path” for rapid iteration in Tier("sandbox") environments or urgent troubleshooting in Tier("live"). It allows “write” operations (deploying code manually, updating configurations, or scaling resources) without granting the ability to manage IAM policies or shared infrastructure. However, it is not the primary deployment path: routine deployments still go through the Role("operator") pipelines.
  • Role("reader"): The standing access for everyone in the engineering organization. It allows you to see resource configurations, view metrics, and read logs. In practice, this covers most of a developer’s daily troubleshooting. Because it is read-only, it carries a much lower risk profile.

These standard roles are a pragmatic starting point, but they are often too broad for high-security environments. In particular, Role("contributor") is meant for human troubleshooting; applications should never run with it. Instead, provision a dedicated identity for each workload with the minimum permissions it requires (e.g., s3:GetObject on a single bucket).
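The role taxonomy above can be sketched as a small lookup table. This is only an illustrative reading of the prose: the `RolePolicy` fields and the `requires_jit` helper are hypothetical names, and treating Role("contributor") as non-standing is an assumption based on the “reader by default” stance taken in this chapter.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    """The four standard roles described in this chapter."""
    OPERATOR = "operator"
    ADMIN = "admin"
    CONTRIBUTOR = "contributor"
    READER = "reader"


@dataclass(frozen=True)
class RolePolicy:
    allows_humans: bool  # may a person be a member at all?
    standing: bool       # may membership be permanent (no JIT window)?
    can_write: bool      # does the role permit changes to resources?


# Assumption: an illustrative encoding of the role descriptions, not a real cloud API.
ROLE_POLICIES = {
    Role.OPERATOR: RolePolicy(allows_humans=False, standing=True, can_write=True),
    Role.ADMIN: RolePolicy(allows_humans=True, standing=False, can_write=True),
    Role.CONTRIBUTOR: RolePolicy(allows_humans=True, standing=False, can_write=True),
    Role.READER: RolePolicy(allows_humans=True, standing=True, can_write=False),
}


def requires_jit(role: Role, is_human: bool) -> bool:
    """Return True when a member must escalate just-in-time to hold the role."""
    policy = ROLE_POLICIES[role]
    if is_human and not policy.allows_humans:
        raise ValueError(f"{role.value} has no human members")
    return is_human and not policy.standing


print(requires_jit(Role.READER, is_human=True))       # False
print(requires_jit(Role.CONTRIBUTOR, is_human=True))  # True
```

Keeping the rules in one table means a change to a role’s policy is made in a single place rather than per tenant.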

Groups

A Group is the container where we place members (human or non-human) and assign them a specific role within a specific context. The functional type is Group(Tenant, Tier, Role).

This notation allows us to express the “address” of a permission clearly. Following the grammar, we use positional inference for the parameters:

  • Group("payments", "live", "operator"): The group for the CI/CD runner that deploys the Tenant("payments") services to the Tier("live").
  • Group("payments", "live", "admin"): The group for Tenant("payments") members needing emergency administrative access in Tier("live").
  • Group("payments", "live", "contributor"): The group for Tenant("payments") developers needing to perform active troubleshooting or manual operations in Tier("live").
  • Group("payments", "live", "reader"): The group for everyone in Tenant("payments") who needs to view the state of their Tier("live") environment.

By separating roles from groups, we ensure that the “what” (the role’s permissions) remains consistent across the entire organization, while the “who” and “where” (the group’s membership and scope) are managed independently per tenant.
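As a rough sketch, Group(Tenant, Tier, Role) maps naturally onto a small immutable record. The `name` property below uses a hypothetical `tenant-tier-role` naming convention for the IdP group, and `groups_for` is an illustrative helper; neither is prescribed by the Platform Notation.

```python
from dataclasses import dataclass

ROLES = ("operator", "admin", "contributor", "reader")


@dataclass(frozen=True)
class Group:
    tenant: str
    tier: str
    role: str

    def __post_init__(self) -> None:
        # Only the four standard roles are valid in this sketch.
        if self.role not in ROLES:
            raise ValueError(f"unknown role: {self.role!r}")

    @property
    def name(self) -> str:
        # Hypothetical naming convention for the group in the IdP.
        return f"{self.tenant}-{self.tier}-{self.role}"


def groups_for(tenant: str, tiers: tuple[str, ...]) -> list[Group]:
    """Enumerate every Group(Tenant, Tier, Role) for one tenant."""
    return [Group(tenant, tier, role) for tier in tiers for role in ROLES]


for group in groups_for("payments", ("sandbox", "live")):
    print(group.name)
```

Because the role list is fixed and only the tenant and tier vary, onboarding a new tenant produces the same predictable set of groups every time.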

Tip
Delegation over Control: Don’t try to own the user directory. Align your platform to the existing corporate IdP. Use Single Sign-On (SSO) for everything: cloud consoles, Kubernetes clusters, and internal portals. If a user is deactivated in the corporate IdP, their access to the entire platform should vanish instantly.

Just-In-Time (JIT) Access

The Mountain Lab incident happened because a runner had “standing” privileges — permissions that were active even when they weren’t being used. The solution to this is Just-In-Time (JIT) Access.

JIT access means that permissions are granted only when needed, for a specific window of time, and are automatically revoked when that window closes. It is the architectural equivalent of a “break-glass” procedure, but for daily operations.

The JIT Workflow

  1. Default State: A developer is a member of Group(Tenant, Tier, "reader").
  2. The Request: The developer needs to fix a bug. They request to join Group(Tenant, Tier, "contributor") via a CLI or portal.
  3. The Justification: The developer must provide a reason for the escalation (e.g., “incident-402”, “debugging memory leak”). This creates a permanent audit trail, linking every elevated action to a specific business need.
  4. The Activation/Approval: The platform verifies that the developer is eligible for the requested group. For Role("contributor") in Tier("sandbox"), activation may be automatic. For Role("admin"), or for any escalation in Tier("live"), it may require a peer’s approval, adding a second pair of eyes to high-risk operations.
  5. The Window: The developer is added to the group for a 4-hour window. Every action they take is logged and tied to this specific activation.
  6. The Revocation: After the window expires, the platform automatically removes the developer from the group.
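The workflow above can be sketched in a few functions. This is a simplified model, not a real PIM or Identity Center integration: `activate`, `is_active`, and `needs_peer_approval` are hypothetical names, the eligibility set is passed in by hand, and revocation is modelled as an expiry check rather than an actual group removal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

WINDOW = timedelta(hours=4)  # the 4-hour window from step 5


@dataclass(frozen=True)
class Activation:
    user: str
    group: tuple[str, str, str]  # (tenant, tier, role)
    justification: str
    expires_at: datetime


def needs_peer_approval(tier: str, role: str) -> bool:
    # Step 4: admin anywhere, or any escalation in live, needs a second person.
    return role == "admin" or tier == "live"


def activate(
    user: str,
    group: tuple[str, str, str],
    justification: str,
    eligible: set[str],
    now: datetime,
    approved_by: Optional[str] = None,
) -> Activation:
    """Steps 2-5: validate the request and open a time-boxed activation."""
    _tenant, tier, role = group
    if user not in eligible:
        raise PermissionError(f"{user} is not eligible for {group}")
    if not justification.strip():
        raise ValueError("a justification is required")
    if needs_peer_approval(tier, role) and approved_by in (None, user):
        raise PermissionError("approval from a different person is required")
    return Activation(user, group, justification, now + WINDOW)


def is_active(activation: Activation, now: datetime) -> bool:
    """Step 6: an expired activation no longer grants anything."""
    return now < activation.expires_at


now = datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc)
grant = activate(
    "javi@mountainlab.io",
    ("payments", "live", "contributor"),
    "incident-402",
    eligible={"javi@mountainlab.io", "ana@mountainlab.io"},
    now=now,
    approved_by="ana@mountainlab.io",
)
print(is_active(grant, now + timedelta(hours=1)))  # True
print(is_active(grant, now + timedelta(hours=5)))  # False
```

The `Activation` record carries the justification alongside the expiry, which is what lets every elevated action be tied back to a specific request in the audit trail.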

This “Reader by Default” philosophy significantly reduces the attack surface. If a developer’s laptop is stolen or their credentials are leaked, the attacker only gains read access. To do any damage, they would have to successfully navigate the JIT escalation process, which often involves MFA and audit logging.

Note
Native JIT Tools: You don’t need to build a JIT system from scratch. Azure has Privileged Identity Management (PIM) built directly into Entra ID. AWS offers IAM Identity Center (formerly AWS SSO), which supports session-based permission sets. For cloud-agnostic environments, tools like Teleport or Okta Advanced Server Access provide similar time-bound escalation for SSH, Kubernetes, and databases.

Non-Human Identity

Managing human access is only half the battle. We also have to manage the “non-human” identities: the CI/CD runners, the cron jobs, and the microservices that need to talk to our cloud providers.

Historically, we’ve managed this by creating “service accounts” and generating long-lived API keys or JSON secret files. We then take these secrets and shove them into GitHub Secrets, GitLab Variables, or Jenkins.

This is a security disaster waiting to happen. Secrets leak. They don’t get rotated. They are often over-privileged.

The Shift to Federated Identity

Modern platforms are moving away from static secrets in favor of Federated Identity (such as Azure/GCP Workload Identity or AWS OIDC-based authentication).

Instead of a secret key, the cloud provider (AWS/Azure/GCP) is configured to trust an external Identity Provider (like GitHub Actions, GitLab, or a Kubernetes cluster). When a CI/CD runner or an application needs to authenticate, it requests a short-lived token from its own IdP. It then exchanges this token with the cloud provider for a temporary set of credentials scoped to its specific task.

```mermaid
sequenceDiagram
    participant App as Application / Runner
    participant IdP as External IdP (GitHub/K8s)
    participant Cloud as Cloud Provider (AWS/Azure/GCP)
    participant Resource as Cloud Resource

    App->>IdP: Request OIDC/Workload Token
    IdP-->>App: Signed JWT Token
    App->>Cloud: Exchange JWT for Ephemeral Credentials
    Cloud->>Cloud: Verify IdP Signature
    Cloud-->>App: Short-lived Access Token / Role Session
    App->>Resource: Access Resource (SDK/API)
```

By using Federated Identity, we eliminate the need to store, rotate, or protect long-lived cloud secrets. This approach isn’t just for the platform team; it’s our primary recommendation for developers building applications. Whether they are using the AWS SDK, the Azure libraries, or GCP APIs, they should always rely on federated identities (such as Workload Identity) rather than managing their own tokens. The platform provides the identity; the SDK handles the exchange. If a workload is compromised, the attacker only has a token that expires in minutes, not a key that works forever.
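To make the exchange step more concrete, here is a minimal sketch of the checks a cloud provider might apply to the claims of an incoming token before issuing temporary credentials. It assumes the JWT signature has already been verified; `TrustRule` and `exchange` are hypothetical names, the repository `mountainlab/payments` and the role name are invented for illustration, and real providers express these conditions in their own trust-policy syntax.

```python
import fnmatch
from dataclasses import dataclass


@dataclass(frozen=True)
class TrustRule:
    issuer: str           # which external IdP is trusted
    audience: str         # who the token must have been minted for
    subject_pattern: str  # which workloads may use this rule
    role: str             # hypothetical cloud role granted on success
    ttl_seconds: int = 900


def exchange(claims: dict, rule: TrustRule, now: int) -> dict:
    """Return short-lived credentials if the token's claims satisfy the rule.

    Assumes the signature was already verified against the IdP's public keys,
    and that "aud" is a single string (real tokens may carry a list).
    """
    if claims.get("iss") != rule.issuer:
        raise PermissionError("untrusted issuer")
    if claims.get("aud") != rule.audience:
        raise PermissionError("wrong audience")
    if not fnmatch.fnmatchcase(claims.get("sub", ""), rule.subject_pattern):
        raise PermissionError("subject not allowed")
    if claims.get("exp", 0) <= now:
        raise PermissionError("token expired")
    return {"role": rule.role, "expires_at": now + rule.ttl_seconds}


rule = TrustRule(
    issuer="https://token.actions.githubusercontent.com",
    audience="sts.amazonaws.com",
    subject_pattern="repo:mountainlab/payments:ref:refs/heads/main",
    role="payments-live-operator",
)
claims = {
    "iss": rule.issuer,
    "aud": rule.audience,
    "sub": "repo:mountainlab/payments:ref:refs/heads/main",
    "exp": 1_700_000_600,
}
creds = exchange(claims, rule, now=1_700_000_000)
print(creds["expires_at"] - 1_700_000_000)  # 900
```

Scoping the subject pattern to one repository and branch is what would have stopped the sandbox runner in the opening story: a token minted for a different repository fails the subject check and never receives live credentials.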

Identity in Kubernetes

While the coordinate system governs our cloud-level permissions, many developers spend their daily lives inside a Kubernetes cluster. The principles of standardized roles and JIT access apply here just as strongly, but the primitives change.

Mapping Groups to RBAC

In Kubernetes, the boundary for the Core Space is the Cluster while the primary boundary for a tenant is the Namespace (more about this in the next chapter). We map our standard platform groups to Kubernetes’ built-in ClusterRole resources:

| Platform Group | Space | K8s ClusterRole | K8s Scope |
| --- | --- | --- | --- |
| Group("platform", _, "reader") | Core | view | Cluster |
| Group("platform", _, "contributor") | Core | edit | Cluster |
| Group("platform", _, "admin") | Core | admin | Cluster |
| Group("platform", _, "operator") | Core | admin | Cluster |
| Group("payments", _, "reader") | Tenant | view | Namespace |
| Group("payments", _, "contributor") | Tenant | edit | Namespace |
| Group("payments", _, "admin") | Tenant | admin | Namespace |
| Group("payments", _, "operator") | Tenant | admin | Namespace |

The platform automation creates RoleBinding objects within each tenant’s namespace (and ClusterRoleBinding objects for the Core groups, whose scope is the whole cluster), linking the cloud-managed groups to these roles. For more granular permissions (e.g., allowing a specific service to read Secret objects but not modify them), the platform team can define custom Role and RoleBinding resources, but the view/edit/admin triad covers the majority of human use cases.
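As an illustration of what the automation might generate for a tenant, the sketch below builds a namespaced RoleBinding manifest from the role mapping in the table. The function name, the binding name, and the group name `payments-live-reader` are hypothetical; the actual subject name depends on how your cluster surfaces cloud groups (it may be a group ID rather than a readable name).

```python
import json

# Platform role -> built-in ClusterRole, following the table above.
CLUSTER_ROLE_FOR = {
    "reader": "view",
    "contributor": "edit",
    "admin": "admin",
    "operator": "admin",
}


def role_binding(tenant: str, role: str, group_name: str, namespace: str) -> dict:
    """Build a namespaced RoleBinding that points a group at a ClusterRole."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{tenant}-{role}", "namespace": namespace},
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": CLUSTER_ROLE_FOR[role],
        },
        "subjects": [
            {
                "apiGroup": "rbac.authorization.k8s.io",
                "kind": "Group",
                "name": group_name,  # assumption: the group name as seen by the cluster
            }
        ],
    }


binding = role_binding("payments", "reader", "payments-live-reader", "payments")
print(json.dumps(binding, indent=2))
```

Binding a ClusterRole through a namespaced RoleBinding grants its permissions only inside that namespace, which is why the same built-in roles can be reused for every tenant.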

Throughout this book, we assume that your Kubernetes clusters are integrated with your cloud provider’s IAM system (e.g., EKS with AWS IAM, AKS with Entra ID, or GKE with Google IAM). This integration allows us to use the same identities and groups for both cloud resources and Kubernetes resources, providing a unified audit trail and a single source of truth for membership.

If you are running Kubernetes on-premises or in a “bare-metal” environment without this native integration, managing human identities becomes significantly more complex. Kubernetes has no built-in User or Group objects, so you may need to issue client certificates yourself or set up a standalone OIDC provider. While the implementation details of those setups are beyond the scope of this book, the underlying principles (least privilege, JIT access, and standardized roles) remain the foundation of a secure platform regardless of where the cluster lives.

Implementation

To manage this at scale, we need a single source of truth for membership. We don’t want to manually click around in the Entra ID or AWS console to add people to groups. Instead, we define our tenants and their members as code.

I prefer a People-First IAM approach. We define the members once at the tenant level, and our platform automation handles the complex mapping to Sector, Tier, and roles.

```yaml
# tenants/payments.yaml
name: payments
sectors:
  - ecommerce
  - platform

# The human members of this tenant
members:
  admins:
    - marta@mountainlab.io
  contributors:
    - javi@mountainlab.io
    - ana@mountainlab.io
  readers:
    - pedro@mountainlab.io
```

When this file is committed to our central tenants repository, a Platform Operator (a pipeline) runs and performs the following actions:

  1. Group Sync: It ensures that the cloud-native groups (e.g., Group("payments", "sandbox", "contributor"), Group("payments", "live", "reader")) exist in the IdP, creating any that are missing.
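  2. Membership Sync: It reconciles group membership with the listed individuals: everyone gets standing membership of the reader groups, while the admins and contributors lists define who is eligible to activate the higher roles.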
  3. Role Assignment: It creates the role assignments in the cloud accounts that map these groups to the correct Role (e.g., mapping Group("payments", "live", "reader") to the Role("reader") in the ("ecommerce", "live") subscription).
  4. JIT Configuration: It configures the escalation rules (e.g., setting up Azure PIM so that only members of the contributors list can activate the Role("contributor")).
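The sync steps above could start from a planning function like the following. It is a sketch under several assumptions: the tenant file is shown as an equivalent Python dict to avoid a YAML dependency, the tiers are limited to "sandbox" and "live", every listed member receives standing reader access, and the admins and contributors lists are treated as JIT eligibility rather than standing membership. Operator groups are omitted because they have no human members.

```python
TIERS = ("sandbox", "live")  # assumption: the two tiers named in this chapter

# Same content as tenants/payments.yaml, written as a dict for this sketch.
tenant = {
    "name": "payments",
    "members": {
        "admins": ["marta@mountainlab.io"],
        "contributors": ["javi@mountainlab.io", "ana@mountainlab.io"],
        "readers": ["pedro@mountainlab.io"],
    },
}

LIST_TO_ROLE = {"admins": "admin", "contributors": "contributor", "readers": "reader"}


def plan(tenant: dict) -> dict:
    """Return {(tenant, tier, role): {"standing": [...], "eligible": [...]}}."""
    everyone = sorted({m for members in tenant["members"].values() for m in members})
    result = {}
    for tier in TIERS:
        for list_name, role in LIST_TO_ROLE.items():
            listed = sorted(tenant["members"].get(list_name, []))
            key = (tenant["name"], tier, role)
            if role == "reader":
                # Reader by default: every member gets standing read access.
                result[key] = {"standing": everyone, "eligible": []}
            else:
                # Higher roles are eligibility only; activation goes through JIT.
                result[key] = {"standing": [], "eligible": listed}
    return result


for group, members in plan(tenant).items():
    print(group, members)
```

A pipeline would then diff this plan against the current state of the IdP and apply only the changes, so that removing a name from the file also removes the access.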

By managing IAM through these simple YAML files, we provide developers with a self-service interface. If the Tenant("payments") team hires a new engineer, they don’t open a ticket. They open a Pull Request against their tenant file (tenants/payments.yaml in this example). The platform team reviews the PR (ensuring the new person actually belongs to that team), and once merged, the infrastructure is updated automatically.

Warning
Always Individual-Based Membership: Avoid adding groups inside your tenant.yaml (e.g., adding group-all-devs to the contributors list). While it’s tempting for convenience, it obscures the audit trail and makes it impossible to implement least-privilege effectively. Always list individuals. Let the platform automation do the work of grouping them.
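One way to enforce this rule is a small check in the pull request pipeline. The sketch below flags any entry that does not look like a personal email address; the function name and the regular expression are assumptions, and a shared mailbox such as a team alias would still pass, so human review remains necessary.

```python
import re

# Loose pattern: something@domain.tld with no whitespace.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def non_individual_entries(members: dict[str, list[str]]) -> list[str]:
    """Return every entry that does not look like a personal email address."""
    return [
        f"{list_name}: {entry}"
        for list_name, entries in members.items()
        for entry in entries
        if not EMAIL.match(entry)
    ]


members = {
    "contributors": ["javi@mountainlab.io", "group-all-devs"],
    "readers": ["pedro@mountainlab.io"],
}
print(non_individual_entries(members))  # ['contributors: group-all-devs']
```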

Summary

  • Identity vs. Access: Let IT manage the “Who” (the user directory); the platform team manages the “What” (the infrastructure permissions).
  • Standardize on Roles: Use a simple taxonomy of Role("reader"), Role("contributor"), Role("admin"), and Role("operator") to reduce cognitive load.
  • Roles vs. Groups: Maintain a strict separation between the capabilities of a Role and the membership scope of a Group(Tenant, Tier, Role).
  • Reader by Default: Eliminate standing write privileges. Use Just-In-Time (JIT) escalation to join a contributor or admin group.
  • Federate Your Workloads: Stop using static API keys for CI/CD. Use federated identity (such as Azure Workload Identity or AWS OIDC) for short-lived, cryptographically proven access.
  • IAM as Code: Use a central tenant.yaml as the source of truth for membership. Allow teams to manage their own access through pull requests and automated sync.

By moving IAM from a manual, ticket-based process to a platform-driven capability, we don’t just make the system more secure — we make it faster. We give developers the power they need to do their jobs, but we wrap that power in a safety net of ephemeral access and automated governance.

Skills for This Chapter

AI Skill
define-core-iam — An AI skill that defines the IAM groups and roles the platform team needs to operate its own sectors: Role("operator"), Role("admin"), Role("contributor"), and Role("reader") groups per tier, with JIT escalation policies and Federated Identity requirements.
AI Skill
define-tenant-iam — An AI skill that defines the IAM groups and membership model for a tenant team. It produces a tenant.yaml file — the source of truth for group provisioning and JIT escalation configuration that feeds all cloud-specific provisioning skills.
AI Skill
manage-azure-iam — An AI skill that provisions and syncs Azure IAM resources from your definition files: Entra ID groups, role assignments at Management Group and Subscription scope, PIM eligible assignments, and Workload Identity Federation credentials.
AI Skill
manage-aws-iam — An AI skill that provisions and syncs AWS IAM resources from your definition files: IAM Identity Center permission sets, account assignments, OIDC identity providers, and IAM roles with permission boundaries.
AI Skill
manage-gcp-iam — An AI skill that provisions and syncs GCP IAM resources from your definition files: Cloud Identity groups, IAM bindings at Organization/Folder/Project scope, Workload Identity pools and providers, and Organization Policy constraints.
AI Skill
manage-k8s-iam — An AI skill that provisions and syncs Kubernetes RBAC resources from your definition files: namespace-scoped RoleBindings mapping platform groups to built-in ClusterRoles, and annotated Operator ServiceAccounts for cloud IAM integration.
