

Beyond Productivity: Scaling Cloud Dev Environments for Faster Feedback & Sustainable Engineering

Presented at KubeCon India 2025 - https://kccncind2025.sched.com/event/23Evj

Local dev setups worked fine when teams were small. But when you’re dealing with hundreds or thousands of developers, things start to break: slow onboarding, dependency hell, inconsistent environments, and wasted compute cycles.

Cloud Developer Environments (CDEs) promise instant, reproducible workspaces, but shifting from local machines to cloud-first workflows is easier said than done. Latency, security, adoption hurdles, and cost can turn a promising initiative into an operational headache.

This talk will go deep into:

* Why local dev is unsustainable—from wasted CPU cycles to lost engineering hours.
* How cloud environments reduce friction—ephemeral, pre-configured workspaces that just work.
* Optimizing for speed—pre-warmed environments, AI-assisted debugging, and workload-aware compute allocation.
* Measuring impact—tracking developer velocity, infra costs, and sustainability improvements.
* Lessons from real-world rollouts—what works, what breaks, and how to get buy-in.


Siddhant Khare

August 06, 2025


Transcript


  2. About me: Siddhant Khare
    • Software engineer at Gitpod
    • Maintainer at OpenFGA
    • Co-creator & maintainer of github1s
    github.com/Siddhant-K-code · x.com/Siddhant_K_code
  3. Agenda
    1. The standardization problem: why team consistency matters
    2. CDEs: What, Why, How
    3. Real-world implementation case studies
    4. Infrastructure design decisions: network, storage, security
    5. Platform engineering at scale: operations and cost models
    6. Implementation roadmap for platform teams
  4. 1. The standardization problem
    The real problem: Environment consistency at scale
    • Core challenge: How do you ensure 1000+ developers work in identical environments?
    • Technical reality:
      - Different OS versions (macOS 13.2 vs 14.1 vs Ubuntu 22.04)
      - Node.js versions (16.14.0 vs 18.15.0 vs 20.5.0)
      - Docker Desktop versions with different behaviors
      - Python environments with conflicting dependencies
    • Result: "Works on my machine" becomes "doesn't work in CI/CD"
    Source: Survey of 223 developers across 15 enterprises (Coder research, 2024)
  5. 1. The standardization problem
    Environment drift: The hidden tax
    Measured impact (DORA research 2024):
    • Average 2.5 days to resolve "works locally" issues
    • 40% increase in support tickets for setup problems
    • 15% of developer time spent on environment management
    The question: How do platform teams solve standardization without sacrificing developer autonomy?
  6. 1. The standardization problem
    Traditional solutions and their limits
    • Approach 1: Strict standardization
      Problem: Developers need flexibility for different projects
      Reality: Shadow IT and workarounds
    • Approach 2: Containerization (Docker, etc.)
      Problem: Still requires local setup and maintenance
      Reality: "Docker works differently on my Mac"
    • Approach 3: Documentation and scripts
      Problem: Documentation goes stale, scripts break
      Reality: Onboarding still takes days
    Missing piece: Centralized, ephemeral, consistent compute
  7. 1. The standardization problem
    The local development chaos spiral
    • Stage 1: Individual workarounds
      - "Install Node 20, not 22"
      - "Use Docker Desktop 4.12, not 4.15"
      - Slack channels full of setup instructions
    • Stage 2: Team fragmentation
      - Different OS environments
      - Inconsistent toolchain versions
      - "Works for me" becomes the team motto
    • Stage 3: Organizational crisis
      - New hires blocked for days
      - Production bugs unreproducible locally
      - Security incidents from unpatched machines
  8. 2. CDEs: What, Why, How
    Cloud Development Environment (CDE) = Compute + Storage + Network + Developer Interface
    Key components:
    • Compute: Kubernetes pods, VMs, or containers
    • Storage: Ephemeral + persistent volumes
    • Network: Low-latency connection (WebRTC, SSH, HTTPS)
    • Interface: VS Code Server, JetBrains Gateway/Toolbox, web IDEs, terminals
    Critical insight: Move environment complexity from the developer's machine to the platform team.
    [Diagram: VM-based or K8s-based compute]
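The CDE formula on this slide can be expressed as a small data model. The Python sketch below is illustrative only: the class, field names, and defaults are hypothetical and not taken from any CDE product. It shows the four components as explicit, validated configuration owned by the platform team rather than by each laptop.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

SUPPORTED_TRANSPORTS = {"webrtc", "ssh", "https"}


class ComputeKind(Enum):
    KUBERNETES_POD = "kubernetes-pod"
    VM = "vm"
    CONTAINER = "container"


@dataclass
class CDESpec:
    """One cloud development environment: Compute + Storage + Network + Interface."""

    # Compute
    compute: ComputeKind
    cpu_cores: int
    memory_gib: int
    # Storage: scratch space is discarded on stop; the persistent volume survives restarts
    ephemeral_disk_gib: int
    persistent_volume_gib: int
    # Network: how the developer's client reaches the environment
    transport: str = "ssh"
    # Developer interface(s) exposed by the environment
    interfaces: List[str] = field(default_factory=lambda: ["vscode-server", "terminal"])

    def validate(self) -> None:
        """Raise ValueError if the spec is internally inconsistent."""
        if self.transport not in SUPPORTED_TRANSPORTS:
            raise ValueError(f"unsupported transport: {self.transport}")
        if self.cpu_cores <= 0 or self.memory_gib <= 0:
            raise ValueError("cpu_cores and memory_gib must be positive")
        if not self.interfaces:
            raise ValueError("at least one developer interface is required")
```

For example, `CDESpec(ComputeKind.VM, cpu_cores=4, memory_gib=16, ephemeral_disk_gib=20, persistent_volume_gib=50).validate()` passes, while an unknown transport raises `ValueError`.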
  9. 2. CDEs: What, Why, How
    Kubernetes-based:
    • Uber's Devpod (internal)
    • GitLab Workspaces
    • Gitpod Classic
    • Coder (Kubernetes provider)
    • Eclipse Che
    • and possibly more
    VM-based:
    • GitHub Codespaces
    • Gitpod
    • Google Cloud Workstations
    • Coder (VM provider)
    • Google Firebase Studio (previously Google IDX)
    • Google Jules
    • Shopify's Spin (internal)
    Container-based (non-K8s):
    • StackBlitz WebContainers
    • CodeSandbox
    • Replit (Nix-based dev environments)
    • GitHub.dev (editor only, no backend)
  10. 2. CDEs: What, Why, How
    Traditional local development:
    • Environment setup: 2-3 days (if it works)
    • Code → Build → Test → Debug (manual process)
    • Git commit → CI/CD → Manual code review
    • Documentation and testing: Manual, often skipped
    • Success rate: ~60% on first try
    • Developer focus: 40% coding, 60% environment management
    CDE development:
    • Environment ready: 30 seconds (99.9% success) (Sources: Shopify, Uber)
    • Code → Build → Test → Debug (automated pipeline)
    • Git commit → Pre-tested, consistent environment
    • Standardized tooling and processes
    • Developer focus: 85% coding, 15% environment
    AI-enhanced CDE (next generation):
    • Environment + AI context: 30 seconds
    • Human + AI pair programming
    • Real-time code completion with multiple agents
    • AI review: Automated code analysis
    • Test generation: AI creates edge-case tests
    • Parallel AI exploration: Multiple solution paths
    • Performance: 55% faster task completion (Source: GitHub Copilot productivity study, September 2024)
    • Developer focus: 95% problem-solving, 5% tooling
  11. 2. CDEs: What, Why, How
    Infrastructure as code meets AI development
    • Standardization using Dev Containers (https://containers.dev/): an open-source specification for standardizing development environments, initiated by Microsoft
    • Claude Code setup with a one-liner
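As a sketch of what a Dev Containers definition contains, the Python snippet below builds a minimal `devcontainer.json` payload. The keys `name`, `image`, `features`, and `postCreateCommand` come from the Dev Containers specification, and the base image and Node.js Feature references are published by the devcontainers project. The environment name is hypothetical, and installing Claude Code through its npm package in `postCreateCommand` is an assumption, not necessarily the one-liner shown on the slide.

```python
import json
from typing import Any, Dict

NODE_FEATURE = "ghcr.io/devcontainers/features/node:1"


def devcontainer_config(node_version: str = "20") -> Dict[str, Any]:
    """Minimal devcontainer.json content, using keys from the Dev Containers spec."""
    return {
        "name": "team-standard",  # hypothetical environment name
        "image": "mcr.microsoft.com/devcontainers/base:ubuntu",
        # Pin the toolchain once in the repo instead of on every laptop.
        "features": {NODE_FEATURE: {"version": node_version}},
        # Assumption: Claude Code is installed from its npm package after creation.
        "postCreateCommand": "npm install -g @anthropic-ai/claude-code",
    }


if __name__ == "__main__":
    # Save the output as .devcontainer/devcontainer.json in the repository.
    print(json.dumps(devcontainer_config(), indent=2))
```

Because the file lives in the repository, every workspace built from it gets the same Node.js version, which addresses the version drift described in section 1.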
  12. 3. Real-world implementation case studies
    Sources:
    • Uber's Devpod (Kubernetes + Bazel)
    • Shopify's Spin (GCE → Isospin VMs)
    • Gitpod (K8s → custom VM-based infrastructure)
  13. 3. Real-world rollouts
    Devpod: Kubernetes-native monorepo solution
    • Challenge: Fragmented codebase across thousands of repositories
    • 10+ programming languages, 4,000+ services, 500+ web apps
    • 9+ build tools, 6+ configuration tools, trunk-based development
    • Solution: Devpod on Kubernetes with the Bazel build system (Source)
    • Scale achievement: 60% adoption among the engineering workforce (Source)
    • Performance: 2.5x faster complex builds, 1.8x faster local binary builds, git status improvements
    Spin: Evolution to VM-like simplicity
    • Journey: Local dev tool → GCE VMs → Kubernetes pods → Isospin (VM-like abstraction)
    • Key insight: "A development environment is an application"; moved from scripts to application-framework thinking
    • Philosophy: Developer productivity over technical elegance, reducing cognitive load
    • Implementation: systemd-orchestrated processes in a single Linux environment
    Source: Shopify Engineering
    Gitpod: Kubernetes experiment, and now VMs
    • Scale: 1.5+ million users served on Kubernetes, thousands of environments daily
    • Kubernetes challenges discovered: Noisy-neighbor effects, unpredictable resource allocation
    • Complex state management requirements Kubernetes wasn't designed for
    • Managing Kubernetes at scale is complex, with significant operational overhead
    • New Gitpod architecture: VM-level isolation with up to 896 cores and 12TB RAM per environment, 70% cost reduction
    Source: Gitpod Engineering
    Key architectural insight: Success depends on organizational context and developer experience priorities, not pure technical metrics.
  14. 3. Real-world rollouts
    Devpod: Kubernetes CRDs + Bazel at scale
    Technical foundation:
    • Monorepo: 70,000+ files in a Go monorepo, likely the largest using Bazel (Source: Mobile Developer Productivity at Uber Scale [Droidcon NYC 2022], Speaker Deck)
    • Kubernetes: Custom Resource Definitions, no internal database needed (Source: DevPod: Uber's Monorepo-Based Remote Development Platform, The New Stack)
    • Management: kubectl-based operations, custom controllers (Source: DevPod: Uber's Monorepo-Based Remote Development Platform)
    Performance results:
    • 1.8-2.5x improvement in git status and Go build times (Source: Building Uber’s Go Monorepo with Bazel)
    • Zero setup time vs hours for local monorepo cloning (Source: Uber Improves Productivity with Remote Development Environment Devpod)
    • 60% engineering workforce adoption (Nov 2022)
    Key innovation: "Flavors", Docker image presets with tools and configurations for specific groups
    [Diagram: Architecture overview]
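Keeping workspace state in Kubernetes custom resources, with no separate database, can be sketched as follows. Uber's actual CRD schema is not described on the slide, so the API group `devenv.example.com`, the `Workspace` kind, and the flavor catalogue below are all hypothetical. The sketch only renders a manifest that a custom controller would then reconcile.

```python
import json
from typing import Any, Dict

# Hypothetical flavor catalogue: per-team presets of image + resources.
FLAVORS: Dict[str, Dict[str, str]] = {
    "backend-go": {"image": "registry.example.com/flavors/go:latest", "cpu": "8", "memory": "32Gi"},
    "mobile": {"image": "registry.example.com/flavors/android:latest", "cpu": "16", "memory": "64Gi"},
}


def workspace_resource(user: str, flavor: str) -> Dict[str, Any]:
    """Render a hypothetical Workspace custom resource for one developer.

    Desired state lives in the Kubernetes API (backed by etcd), so no separate
    database is needed; a custom controller would reconcile it into a pod.
    """
    preset = FLAVORS[flavor]  # raises KeyError for unknown flavors
    return {
        "apiVersion": "devenv.example.com/v1alpha1",
        "kind": "Workspace",
        "metadata": {
            "name": f"{user}-{flavor}",
            "labels": {"owner": user, "flavor": flavor},
        },
        "spec": {
            "image": preset["image"],
            "resources": {"cpu": preset["cpu"], "memory": preset["memory"]},
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests on stdin: python render.py | kubectl apply -f -
    print(json.dumps(workspace_resource("alice", "backend-go"), indent=2))
```

Applying this output with `kubectl apply -f -` only succeeds on a cluster where a matching CustomResourceDefinition is installed.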
  15. 3. Real-world rollouts
    Spin: VM-like simplicity journey
    The problem: "Nothing fit in the box anymore, so it was time to find a new one. Laptops couldn't handle the computational demands." (Source: The Journey to Cloud Development: How Shopify Went All-in on Spin, Shopify)
    Evolution journey:
      1. Local dev crisis → Resource exhaustion, slow builds
      2. GCE VMs experiment → Surprise success with self-contained teams
      3. Kubernetes pods → Developer feedback: "dual context was confusing"
      4. Isospin (systemd) → "Single place to work, developers don't need to understand infrastructure"
    CEO insight: "Create abstractions that let developers defer understanding until curious"
    Key realizations:
    • "It's better to be first than last": control the OS
    • "Development environment is an application"
    [Diagram: Isospin architecture]
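A rough illustration of "systemd-orchestrated processes in a single Linux environment": each process a developer needs (app server, database, cache) becomes an ordinary systemd service inside one VM. This is not Shopify's Isospin implementation. The service names and command path are hypothetical, while `After=`, `Wants=`, `ExecStart=`, `Restart=`, and `WantedBy=` are standard systemd unit directives.

```python
from typing import List, Optional


def systemd_unit(name: str, exec_start: str, depends_on: Optional[List[str]] = None) -> str:
    """Render a systemd service unit for one process inside the dev environment."""
    deps = [f"{d}.service" for d in (depends_on or [])]
    unit = [
        "[Unit]",
        f"Description=Dev environment service: {name}",
        # Start after networking and after the services this one needs.
        "After=" + " ".join(["network.target"] + deps),
    ]
    if deps:
        unit.append("Wants=" + " ".join(deps))  # pull dependencies in when this unit starts
    unit += [
        "",
        "[Service]",
        f"ExecStart={exec_start}",
        "Restart=on-failure",  # systemd restarts the process if it crashes
        "",
        "[Install]",
        "WantedBy=multi-user.target",
    ]
    return "\n".join(unit) + "\n"


if __name__ == "__main__":
    # Hypothetical app server that needs a database and a cache in the same VM.
    print(systemd_unit("web", "/usr/local/bin/app-server --port 3000", ["mysql", "redis"]))
```

The design point is that developers interact with one machine and one familiar process manager rather than a pod-plus-laptop "dual context".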
  16. 3. Real-world rollouts
    Kubernetes lessons: why Gitpod left after 6 years
    Scale context: 1.5M+ users, thousands of environments daily, 6 years of Kubernetes experience (Source: We’re leaving Kubernetes, Gitpod blog)
    Why development environments are different:
    • Extremely stateful and interactive: can't be moved between nodes
    • Unpredictable resource patterns: need cores within 100ms
    • Far-reaching permissions: root access, network capabilities
    Kubernetes problems at scale:
    • CPU: VS Code disconnects when the server is starved of CPU time
    • Storage: PVCs had unpredictable timing and reliability issues
    • Networking: Service scaling became unreliable due to sheer numbers
    • Security: Complex user namespaces, performance impacts
    The decision: "Achieving all of this with Kubernetes is possible, but comes at significant cost"
    New architecture results:
    • VM-level isolation: Up to 896 cores, 12TB RAM per environment
    • 70% infrastructure cost reduction
    • Eliminated Kubernetes operational complexity
    Source: Gitpod docs
  17. 4. Scaling the infrastructure: Architecture tradeoffs
    VMs vs Kubernetes: Infrastructure architecture (choose your trade-offs)
    When to choose each:
    • Kubernetes: Cloud-native teams, sophisticated orchestration needs
      Pros: Rich ecosystem, advanced scheduling, service mesh integration
      Cons: Operational complexity, noisy-neighbor effects for dev workloads (Source: We’re leaving Kubernetes, Gitpod blog)
    • VMs: Security-first organizations, compliance requirements
      Pros: Hardware isolation, familiar mental model, root access (or Docker daemon access)
      Cons: Higher resource overhead and slower provisioning if not managed well

    Criteria                           | Kubernetes | VMs
    Startup time                       | ~5-15s     | ~30-50s
    Security                           | Medium     | Strong
    Resource efficiency                | Medium     | High (if handled well)
    Developer UX                       | Complex    | Familiar
    Operational overhead (large scale) | High       | Medium
    Cost per environment               | Medium     | Medium
  18. 4. Scaling the infrastructure: Architecture tradeoffs
    Beyond VDI: Zero-trust development environments
    Why VDI fails for development:
    • Network latency: 100-200ms kills developer flow state
    • Resource constraints: Shared GPU, limited device access
    • Cost: $200-400/month per seat + infrastructure overhead
    • Experience: Developers report "coding through molasses"
    Security benefits over local dev:
    • Patch management: 100% coverage vs 40% on local machines
    • Network control: All traffic through monitored gateways
    • Data protection: Sensitive data never on developer laptops
    • Incident response: Centralized logging enables forensics
    • Compliance: Built-in controls vs hoping developers follow policies
    Real impact: 90% reduction in security incidents from development environments
    [Diagram: Zero-trust CDE architecture]
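The "all traffic through monitored gateways" point can be illustrated with a deny-by-default egress check that logs every decision. This is a minimal sketch under stated assumptions: the allowlist, the `EgressRequest` fields, and treating SSO verification as a boolean are hypothetical simplifications of what a real zero-trust proxy evaluates.

```python
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("cde.gateway.audit")

# Hypothetical allowlist of destinations workspaces may reach via the gateway.
ALLOWED_HOSTS = frozenset({"github.com", "registry.npmjs.org", "proxy.golang.org"})


@dataclass(frozen=True)
class EgressRequest:
    workspace_id: str
    user: str
    sso_verified: bool  # identity already asserted by the SSO provider (assumed)
    host: str


def authorize(req: EgressRequest) -> bool:
    """Deny by default: allow only verified users going to allowlisted hosts."""
    allowed = req.sso_verified and req.host in ALLOWED_HOSTS
    # Log every decision, allowed or not, so incidents can be reconstructed later.
    audit_log.info("workspace=%s user=%s host=%s allowed=%s",
                   req.workspace_id, req.user, req.host, allowed)
    return allowed
```

Because the check runs centrally rather than on laptops, the same audit trail exists for every workspace, which is what makes the forensics point above practical.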
  19. 4. Scaling the infrastructure: Architecture tradeoffs
    Multi-tenancy & cost control at scale
    Cost control mechanisms:
    • Auto-hibernation: Sleep after 30 min idle
    • Spot instances: 60-91% cost savings for non-critical workloads
    • Time-based scaling: Business-hours optimization
    • Resource pooling: Shared base infrastructure
    Real cost optimization results:
    • Uber: Automatic overnight updates, hibernation during off-hours (Source: Uber Improves Productivity with Remote Development Environment Devpod, InfoQ)
    • Industry average: 68% of environments hibernate during off-hours
    • Annual savings: $2.1M from hibernation alone (1000-developer org)
    Governance automation:
    • Policy as Code (Open Policy Agent)
    • Automated compliance scanning
    • Cost threshold alerting
    • Chargeback reporting for team accountability
    [Diagram: Resource attribution pattern]
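Auto-hibernation reduces to comparing each workspace's last activity against an idle threshold. The sketch below uses the slide's 30-minute limit. What counts as "activity" (IDE heartbeat, terminal input, a running build) is an assumption left to the caller, and actually stopping the VM or pod is out of scope.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

IDLE_LIMIT = timedelta(minutes=30)  # "Sleep after 30 min idle"


def workspaces_to_hibernate(last_activity: Dict[str, datetime],
                            now: datetime,
                            idle_limit: timedelta = IDLE_LIMIT) -> List[str]:
    """Return IDs of workspaces whose last activity is at least `idle_limit` ago.

    `last_activity` maps workspace ID -> timezone-aware timestamp of the most
    recent activity signal (what counts as activity is up to the caller).
    """
    return sorted(
        ws_id for ws_id, seen in last_activity.items() if now - seen >= idle_limit
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    activity = {
        "alice-api": now - timedelta(minutes=45),  # idle: hibernate
        "bob-web": now - timedelta(minutes=5),     # active: keep running
    }
    print(workspaces_to_hibernate(activity, now))  # ['alice-api']
```

A scheduler would run this periodically and pass the result to whatever stop or snapshot API the platform exposes.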
  20. 4. Scaling the infrastructure: Architecture tradeoffs
    When to choose what: decision framework
    Specific recommendations:
    • Startups (< 100 devs): SaaS solutions
      - Gitpod, GitHub Codespaces, CodeSandbox, Replit
      - Approx. $10-30/developer/month (pay-as-you-grow)
      - Zero operational overhead
    • Scale-ups (100-500 devs): Self-hosted + self-managed solutions
      - Gitpod Core, GitLab Workspaces, GitHub Codespaces
      - 3-5 platform engineers required
      - Balanced control vs convenience
    • Enterprise (500+ devs): Build vs buy decision
      - If building, consider Uber's Devpod approach
      - If buying, Gitpod Enterprise (self-hosted + vendor-managed) or Coder Enterprise
      - Factor in compliance and security requirements
    • Storage design decision: Ephemeral or persistent volumes?
    • CPU/memory isolation: cgroups vs hardware-backed VMs
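The tiers on this slide can be encoded as a simple lookup, which is handy when comparing options in a planning notebook. This is only a restatement of the slide. The function name is hypothetical, cost is computed only where the slide gives a per-developer price, and splitting the overlapping "100-500" and "500+" bands at 500 is an assumption.

```python
from typing import Dict


def recommend_cde_approach(developers: int) -> Dict[str, object]:
    """Map team size to the slide's tiers (assumes 500 developers counts as enterprise)."""
    if developers < 0:
        raise ValueError("developers must be non-negative")
    if developers < 100:
        return {
            "tier": "startup",
            "approach": "SaaS",
            "examples": ["Gitpod", "GitHub Codespaces", "CodeSandbox", "Replit"],
            # $10-30 per developer per month, pay-as-you-grow
            "monthly_cost_usd": (10 * developers, 30 * developers),
            "platform_engineers": (0, 0),  # zero operational overhead
        }
    if developers < 500:
        return {
            "tier": "scale-up",
            "approach": "self-hosted + self-managed",
            "examples": ["Gitpod Core", "GitLab Workspaces", "GitHub Codespaces"],
            "monthly_cost_usd": None,  # not given on the slide
            "platform_engineers": (3, 5),
        }
    return {
        "tier": "enterprise",
        "approach": "build vs buy",
        "examples": ["Uber's Devpod approach (build)", "Gitpod Enterprise", "Coder Enterprise"],
        "monthly_cost_usd": None,
        "platform_engineers": None,  # depends on the build/buy decision
    }
```

For a 50-developer startup this yields a SaaS budget of roughly $500-$1,500 per month.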
  21. 5. Platform engineering at scale: operations & cost models
    Beyond Copilot in enterprise AI development: Technical patterns
    • Architecture 1: Client-side inference
      Pros: Low latency, private code
      Cons: Limited by client hardware
      Example: GitHub Copilot, Claude Code, Gemini CLI, Ampcode in VS Code
    • Architecture 2: Server-side inference
      Pros: Powerful models, centralized management
      Cons: Network latency, data privacy concerns, limited editor extensions
      Example: Cursor IDE, Windsurf
    • Architecture 3: Hybrid approach
      Local caching of common completions
      Server inference for complex queries
      Example: JetBrains AI Assistant
    Key technical challenge: Context window management for large codebases
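Architecture 3 (hybrid) can be illustrated with a local least-recently-used cache in front of a remote model. In this sketch `server_complete` is a hypothetical stand-in for whatever inference API is used. Caching by exact prompt text is a deliberate simplification; production assistants key on richer context and decide per request which path to take.

```python
from collections import OrderedDict
from typing import Callable


class HybridCompleter:
    """Answer repeated prompts from a local LRU cache; otherwise call the server.

    `server_complete` is a hypothetical stand-in for a remote inference API.
    """

    def __init__(self, server_complete: Callable[[str], str], capacity: int = 1024) -> None:
        self._server = server_complete
        self._capacity = capacity
        self._cache: "OrderedDict[str, str]" = OrderedDict()

    def complete(self, prompt: str) -> str:
        if prompt in self._cache:
            self._cache.move_to_end(prompt)  # mark as most recently used
            return self._cache[prompt]
        result = self._server(prompt)  # cache miss: pay the network round trip
        self._cache[prompt] = result
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return result
```

Cache hits avoid network latency entirely, which is the trade-off the hybrid approach targets; misses fall through to the more capable server model.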
  22. 5. Platform engineering at scale: operations & cost models
    Beyond Copilot in enterprise AI development: From single tools to agent orchestration
    Example: GitHub's Copilot Workspace
  23. 5. Platform engineering at scale: operations & cost models
    Beyond Copilot in enterprise AI development: From single tools to agent orchestration
    Example: Gitpod's SWE agent, Ona
  24. 5. Platform engineering at scale: operations & cost models
    Measured productivity impact
    GitHub research (10,000+ developers):
    • 55% faster task completion with Copilot
    • 85% report increased confidence
    • Measured across multiple programming languages
    DORA State of DevOps 2024:
    • Platform engineering teams report 3.4% code quality improvement
    • 7.5% documentation quality increase
    • Temporary 1.5% deployment frequency decrease during implementation
    Cost model example (1000+ developers) (Sources: Coder, Gitpod):
    • Time savings: 75% reduction in onboarding time (10 days → 2.5 days)
    • Productivity recovery: 5 hours/week per developer currently lost to environment issues
    • Annual value creation: $2M in recovered engineering capacity
    • Break-even: Under 12 months based on productivity gains alone
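The cost model on this slide rests on simple arithmetic that platform teams can redo with their own numbers. In the sketch below every input is an assumption to be replaced: loaded hourly cost, working weeks per year (defaulting to 46 here), and a `recovery_rate` for the share of lost hours that actually turn into productive work.

```python
def annual_recovered_value(developers: int, hours_lost_per_week: float,
                           hourly_cost: float, working_weeks: int = 46,
                           recovery_rate: float = 1.0) -> float:
    """Yearly value of engineering time recovered from environment issues."""
    hours = developers * hours_lost_per_week * working_weeks * recovery_rate
    return hours * hourly_cost


def break_even_months(upfront_cost: float, monthly_run_cost: float,
                      annual_value: float) -> float:
    """Months until cumulative net savings cover the upfront investment."""
    monthly_net = annual_value / 12 - monthly_run_cost
    if monthly_net <= 0:
        return float("inf")  # never pays back with these inputs
    return upfront_cost / monthly_net
```

For example, an upfront investment of 120,000 with 5,000 per month in running costs and 180,000 per year of recovered capacity breaks even after 12 months.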
  25. 5. Platform engineering at scale: operations & cost models
    Sustainability impact: Beyond developer productivity
    Environmental baseline:
    • ~300-800 kg CO2 per developer workstation (Source: Oxford University study)
    • 1000 developers = up to ~800 tons CO2 annually from hardware alone
    • Plus: 150W continuous power per idle workstation
    CDE environmental benefits:
    • Hardware lifecycle extension: 3-year laptop refresh cycle (traditional) vs 5-year cycle (CDE model, where hardware becomes a thin client)
    • CO2 reduction: 778 tons/year → 156 tons/year (80% improvement)
    • Power consumption: 1,314 MWh/year for 1000 idle workstations (traditional) vs 525 MWh/year (CDE model), a 60% reduction
    Total environmental impact:
    • CO2 savings: 859 tons annually
    • Cost savings: $800K annually
    • E-waste reduction: 60% less hardware disposal
    ESG business value:
    • Corporate sustainability commitments
    • Regulatory compliance (EU sustainability reporting)
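The power figures can be reproduced directly: 150 W drawn continuously by 1,000 machines for 8,760 hours is 1,314 MWh per year. The short sketch below checks that arithmetic and the quoted percentage reductions; like the slide, it assumes constant draw all year.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous draw


def annual_energy_mwh(watts_per_device: float, devices: int) -> float:
    """Annual energy for devices drawing constant power, in MWh."""
    watt_hours = watts_per_device * devices * HOURS_PER_YEAR
    return watt_hours / 1_000_000  # Wh -> MWh


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


baseline = annual_energy_mwh(150, 1000)
print(baseline)                                 # 1314.0
print(round(percent_reduction(baseline, 525)))  # 60 (power)
print(round(percent_reduction(778, 156)))       # 80 (CO2)
```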
  26. 6. Implementation roadmap for platform teams
    • As a platform engineering team, treat engineers as your customers: listen to their primary pain points.
    • Align on the solution: which tools to build vs buy.
    30-60-90 day plan for platform rollout:
    Days 1-30: Foundation
    • Week 1: Current state assessment
      - Developer environment audit (setup time, failure rate, security)
      - Network performance baseline (<50ms latency requirement)
      - Cost analysis (hardware, power, IT support, incidents)
    • Weeks 2-3: Architecture planning
      - Security integration (SSO, certificates, policies)
      - Technology selection (SaaS vs self-hosted vs build)
      - Success criteria definition
    • Week 4: Pilot preparation
      - Security team onboarding (5-10 participants)
      - Measurement framework implementation
    Days 31-60: Pilot execution
    • Security-first pilot with champions
    • Comprehensive feedback collection
    • Technical integration refinement
    • Document lessons learned
    Days 61-90: Expansion planning
    • Early adopter program (50-100 developers)
    • Change management strategy
    • Cost governance policies
    • Enterprise rollout roadmap
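For the Week 1 network baseline, one rough approach is to time repeated TCP handshakes to the CDE gateway and compare the 95th percentile with the 50 ms target. A handshake approximates one network round trip and ignores application-level latency. The function names are hypothetical, and so is the gateway hostname in the usage note.

```python
import socket
import statistics
import time
from typing import List


def tcp_connect_samples_ms(host: str, port: int = 443, samples: int = 20,
                           timeout: float = 2.0) -> List[float]:
    """Time repeated TCP handshakes to host:port, in milliseconds.

    DNS is resolved once up front so lookup time is not counted.
    """
    ip = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)[0][4][0]
    results: List[float] = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((ip, port), timeout=timeout):
            pass  # connection established; close immediately
        results.append((time.perf_counter() - start) * 1000.0)
    return results


def p95(values: List[float]) -> float:
    """95th percentile: the last of the 19 cut points from quantiles(n=20)."""
    return statistics.quantiles(values, n=20)[-1]


def meets_baseline(values: List[float], target_ms: float = 50.0) -> bool:
    """True if the 95th-percentile latency is under the target."""
    return p95(values) < target_ms
```

Run it from where developers actually work, for example `meets_baseline(tcp_connect_samples_ms("cde-gateway.example.com"))`, substituting your own endpoint for the placeholder hostname.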
  27. 6. Implementation roadmap for platform teams
    Success metrics for platform teams
    Technical metrics:
    • Environment startup time < 2 minutes
    • Network latency < 100ms (95th percentile)
    • Uptime SLA > 99.5%
    • Resource utilization 60-80%
    Developer experience metrics:
    • Time to first commit < 30 minutes
    • Developer satisfaction > 4.0/5
    • Support ticket reduction 40%+
    • "Works on my machine" incidents < 5%
    Business metrics:
    • Onboarding time reduction 50%+
    • Infrastructure cost per developer
    • Security incident reduction
    • Compliance audit results
    Key insight: CDE success is 20% technology, 80% change management
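The technical and onboarding targets above lend themselves to an automated scorecard. The sketch hard-codes the slide's thresholds. The `PlatformMetrics` field names are hypothetical, and collecting the underlying measurements is left to your monitoring stack. The helper also shows what a 99.5% uptime SLA permits: 3.6 hours of downtime in a 30-day month.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PlatformMetrics:
    startup_seconds: float          # environment start time
    latency_ms_p95: float           # network latency, 95th percentile
    uptime_percent: float           # measured availability over the period
    utilization_percent: float      # resource utilization
    minutes_to_first_commit: float  # new-hire onboarding signal


def scorecard(m: PlatformMetrics) -> Dict[str, bool]:
    """Check measured values against the slide's target thresholds."""
    return {
        "startup < 2 min": m.startup_seconds < 120,
        "latency p95 < 100 ms": m.latency_ms_p95 < 100,
        "uptime > 99.5%": m.uptime_percent > 99.5,
        "utilization 60-80%": 60 <= m.utilization_percent <= 80,
        "first commit < 30 min": m.minutes_to_first_commit < 30,
    }


def allowed_downtime_hours(sla_percent: float, period_hours: float = 30 * 24) -> float:
    """Downtime budget implied by an uptime SLA (default period: 30 days)."""
    return period_hours * (100 - sla_percent) / 100
```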
  28. 7. Q&A + key takeaways
    Key technical takeaways for platform teams:
      1. Architecture follows organization
         • Uber (K8s expertise) → Kubernetes-native success
         • Shopify (simplicity focus) → VM-like abstraction
         • Gitpod (scale lessons) → Post-Kubernetes architecture
      2. Developer experience is non-negotiable
         • If it's slower than local, adoption fails.
         • Familiar mental models beat technical sophistication.
         • "Create abstractions that let developers defer understanding until curious" (Source: The Journey to Cloud Development: How Shopify Went All-in on Spin, Shopify)
      3. Economics favor cloud development
         • Break-even: <12 months for 1000+ developers
         • Hidden costs: Security incidents, hardware refresh, productivity loss
         • Environmental impact: ~859 tons CO2 saved annually
    Identify the following:
    • What's your biggest development environment pain point?
    • How do you measure developer productivity today?
    • What security/compliance requirements drive your architecture decisions?
    • Understand your infra bottlenecks before choosing Kubernetes or VMs.
    Next steps: Start with a pilot, measure everything, scale based on results.
  29. Thank you
    Connect with me: Siddhant Khare
    • github.com/Siddhant-K-code
    • x.com/Siddhant_K_code
    • linkedin.com/in/siddhantkhare24/
    Resources
    Please complete the session survey in the Sched mobile app.