Containers and Hadoop

Dinesh Subhraveti

July 24, 2014

Transcript

  1. Brief History of Containers

    Timeline (2001, 2002, 2003, 2004, 2005) with these milestones:
    • Enterprise Linux container solution (Meiosys)
    • First research paper on Linux containers (OSDI’02)
    • IBM acquires Meiosys; focus shifted to AIX
    • First container-based distributed checkpointing (HP Labs)
    • First implementation of containers based on syscall interposition (Columbia)
    • Most core kernel changes finally made it into Linux mainline
  2. Containers

    ❏ Three dimensions of isolation
      • Software dependencies and configuration
      • Resource consumption (CPU, RAM, disk, network)
      • Information integrity and confidentiality (aka security)
    ❏ Why containers as process abstractions of DC-OS?
      • Provide mechanism for software isolation (via file system namespace)
      • Provide defense-in-depth for security
      • Better suited for distributed apps
      • Containers contain (possibly multiple) restartable processes
      • Support for checkpoint/restore, live migration, live OS upgrades, record/replay
  3. Why not Virtual Machines? Application-Hardware Misalignment

    • Applications have round edges: the system call interface
    • Hypervisors have square holes: the hardware interface
    Diagram labels: Virtual Machine, Container, Application
  4. Why not Virtual Machines? Application-Hardware Misalignment

    • Applications have round edges: the system call interface
    • Hypervisors have square holes: the hardware interface
    Diagram labels: Virtual Machine, Container, Guest OS, Application
  5. Why not Virtual Machines? Layers of Intermediate Software

    High IO overhead due to many intermediate layers
    Diagram labels (VMs vs. Containers): Host, iSCSI / NFS, Image Format Interpreter, Virtual Device, VM Exit (Context Switch), Guest Driver, Guest File System, Application
  6. Why not Virtual Machines? The Unwelcome Guest!

    • Slow startup time
    • Guest OS licensing and maintenance burden
    • Poor scalability
    • High resource consumption due to duplication
    • Obfuscated network / storage / compute topologies
    • Application semantic information is lost
  7. Evolution of Hadoop from Map Reduce to YARN

    Isolation is an immediate challenge
    Diagram labels: Hadoop (Resource Manager, Map Reduce); YARN (Map Reduce, Spark, HBase, ...)
  8. Containers on YARN

    Containers provide a simple and elegant solution
    Diagram labels: Hadoop (Resource Manager, Map Reduce); YARN (Map Reduce, Spark, HBase, ...); Container; Container Virtualization
  9. Containers on YARN: Node Manager Spawned Tasks as Containers

    Tasks representing the same job share the same container
    Diagram labels: Node Manager; Container Virtualization; Container; Job A Task 1; Job A Task 2; Job B Task 1; Job C Task 1
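The rule on this slide (tasks of the same job share one container) can be sketched as a simple grouping step. This is only an illustration: the function name `group_tasks_by_job` and the tuple representation of tasks are hypothetical, and nothing here reflects how the YARN Node Manager is actually implemented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_tasks_by_job(tasks: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Place each (job, task) pair in the container keyed by its job.

    Hypothetical helper for illustration; not a YARN API.
    """
    containers: Dict[str, List[int]] = defaultdict(list)
    for job, task in tasks:
        containers[job].append(task)
    return dict(containers)


# The four tasks shown on the slide, in the order they appear.
tasks = [("Job A", 1), ("Job B", 1), ("Job A", 2), ("Job C", 1)]
containers = group_tasks_by_job(tasks)

for job, task_ids in containers.items():
    print(f"container for {job}: tasks {task_ids}")
```

With the slide's four tasks, this yields three containers, with both Job A tasks in the same one.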
  10. Containers on YARN: Advantages

    • Secure multitenancy
    • Performance isolation
    • Utilization via coscheduling IO and CPU tasks
    • Consistent cluster environment
    • Isolation of software dependencies / configuration
    • Reproducible way to define app environment
    • Rapid provisioning
  11. Privilege Isolation through UID Namespaces

    Diagram labels: Host; Container; UID Virtualization; Container root UID 0; Regular user UID 100; Host root UID 0
    • UID namespaces are a recent addition to the Linux kernel (DATE)
    • UIDs in containers can be mapped to different UIDs in host
    • Tricky, because you need to translate UIDs of files and other resources
    • Provides privilege isolation
    • Map superuser in container to regular user in host
    • Great for YARN, and Docker in general
    • Docker predates UID namespaces
    • Docker support for UID namespaces is forthcoming
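The mapping of container UIDs to host UIDs can be sketched in a few lines. On Linux, a user namespace's mapping is exposed in `/proc/<pid>/uid_map` as lines of three numbers: first UID inside the namespace, first UID outside it, and range length. The sketch below parses that format and translates a container UID to a host UID. The sample map is an assumption chosen to match the slide (container root, UID 0, appears as regular user UID 100 on the host); the second range and the helper names are hypothetical.

```python
from typing import List, NamedTuple, Optional


class UidMapEntry(NamedTuple):
    inside_start: int   # first UID as seen inside the container
    outside_start: int  # first UID as seen on the host
    length: int         # number of consecutive UIDs covered


def parse_uid_map(text: str) -> List[UidMapEntry]:
    """Parse text in the /proc/<pid>/uid_map format (three integers per line)."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        inside, outside, length = (int(f) for f in fields)
        entries.append(UidMapEntry(inside, outside, length))
    return entries


def to_host_uid(entries: List[UidMapEntry], container_uid: int) -> Optional[int]:
    """Translate a container UID to a host UID, or None if it is unmapped."""
    for e in entries:
        if e.inside_start <= container_uid < e.inside_start + e.length:
            return e.outside_start + (container_uid - e.inside_start)
    return None


# Assumed example map: container root -> host UID 100 (as on the slide),
# plus a hypothetical range for the container's remaining users.
EXAMPLE_MAP = "0 100 1\n1 100000 65535\n"
entries = parse_uid_map(EXAMPLE_MAP)

print(to_host_uid(entries, 0))      # container root on the host
print(to_host_uid(entries, 1000))   # an ordinary container user
print(to_host_uid(entries, 70000))  # outside every mapped range
```

The same translation has to be applied to file ownership and other UID-bearing resources, which is the "tricky" part the slide mentions.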
  12. References

    ❏ Blog post describing UID virtualization support in Docker
      https://www.altiscale.com/making-docker-work-yarn/
    ❏ Apache wiki page tracking work status across Docker and YARN projects
      https://wiki.apache.org/hadoop/dineshs/IsolatingYarnAppsInDockerContainers
    ❏ JIRA tracking Docker integration into YARN
      https://issues.apache.org/jira/browse/YARN-1964
    ❏ Related Docker tickets
      Several tickets linked from: https://github.com/dotcloud/docker/pull/4572

    Thank You
  13. Hadoop on Separate Physical Clusters

    Awesomely secure
    Everybody gets private hardware running private services
    Diagram labels: Customer 1, Customer 2, Customer 3
  14. Hadoop on Separate Physical Clusters

    Cannot scale the business this way!
    • Poor utilization
    • Host platform is a huge maintenance burden
      ❖ Customer A needs R
      ❖ Customer B needs Matlab
      ❖ xyz needs ß∂ø…
    Diagram labels: Customer 1, Customer 2, Customer 3
    Per-cluster figures:
      Utilization: 6, Spare: 0, Unused: 3
      Utilization: 1, Spare: 6, Unused: 2
      Utilization: 4, Spare: 3, Unused: 2
  15. Container Clusters to Decouple Host from Customer

    Each customer gets a container image
      ❖ Encapsulates customer-specific software and configuration
      ❖ Host platform remains lean and simple
    Poor utilization
    Diagram labels: Customer 1, Customer 2, Customer 3
    Per-cluster figures:
      Utilization: 6, Spare: 0, Unused: 3
      Utilization: 1, Spare: 6, Unused: 2
      Utilization: 4, Spare: 3, Unused: 2
  16. Container Clusters to Drive Utilization

    Global Pool of Resources
    Global figures: Utilization: 11, Spare: 16, Unused: 0
    Each customer gets a container image
      ❖ Encapsulates customer-specific software and configuration
      ❖ Host platform remains lean and simple
    Densely pack containers together
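The global figures follow from the per-cluster figures on the two preceding slides. A minimal sketch of the arithmetic, assuming that once the clusters are pooled, capacity previously counted as "unused" becomes usable "spare" capacity (the slides do not state the units or spell out this rule; the function name `pool` is hypothetical):

```python
from typing import Dict, List

# Per-cluster figures as printed on the slides (units are not stated there).
CLUSTERS: List[Dict[str, int]] = [
    {"utilization": 6, "spare": 0, "unused": 3},
    {"utilization": 1, "spare": 6, "unused": 2},
    {"utilization": 4, "spare": 3, "unused": 2},
]


def pool(clusters: List[Dict[str, int]]) -> Dict[str, int]:
    """Combine separate clusters into one global pool.

    Assumption: in a shared pool, formerly stranded ("unused") capacity
    counts as spare, so the pooled "unused" figure drops to zero.
    """
    used = sum(c["utilization"] for c in clusters)
    idle = sum(c["spare"] + c["unused"] for c in clusters)
    return {"utilization": used, "spare": idle, "unused": 0}


GLOBAL = pool(CLUSTERS)
print(GLOBAL)
```

Under that assumption the sums are 6 + 1 + 4 = 11 for utilization and (0 + 6 + 3) + (3 + 2 + 2) = 16 for spare, which matches the global figures on this slide.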
  17. Containers with Fine-grain Resources

    Global Pool of Resources
      ❖ Container resource levels adjusted dynamically per customer
        ➢ As dictated by business policy
      ❖ Fractional resource allocation
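One way to picture fractional, policy-driven allocation is a weighted split of the global pool. The sketch below is an assumption-laden illustration, not the scheme used in the talk: the pool size of 27 (11 + 16 from the previous slide), the per-customer weights, and the function name `allocate` are all hypothetical. Exact fractions are used so that the shares always sum to the pool size.

```python
from fractions import Fraction
from typing import Dict


def allocate(total: int, weights: Dict[str, int]) -> Dict[str, Fraction]:
    """Split `total` resource units across customers in proportion to weights.

    The weights stand in for "business policy"; changing them and calling
    this function again models a dynamic adjustment.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one positive weight is required")
    return {name: Fraction(total * w, total_weight) for name, w in weights.items()}


# Hypothetical policy: weight each customer by its current utilization.
shares = allocate(27, {"Customer 1": 6, "Customer 2": 1, "Customer 3": 4})
for name, share in shares.items():
    print(f"{name}: {share} units (about {float(share):.2f})")

# A policy change simply recomputes the shares with new weights.
rebalanced = allocate(27, {"Customer 1": 1, "Customer 2": 1, "Customer 3": 1})
```

Each share is a fraction of the pool rather than a whole machine, which is the kind of granularity containers make practical.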
  18. Disaggregated Compute and Storage

    Global Pool of Resources
    Diagram labels: DN (DataNode), NM (NodeManager)
      ❖ Add more storage to the Customer 1 cluster from a storage-rich node
        ➢ While a compute-intensive job from Customer 2 utilizes the available compute capacity on the same node
    Independently scale compute and storage