Apache Hadoop and Java

Quick recap of Hadoop dev, coverage of Java 7 upgrade issues and what Hadoop's problems are (classpaths!), then thoughts about Java 8+

Steve Loughran

June 11, 2014

Transcript

  1. © Hortonworks Inc. 2014 Apache Hadoop & Java Steve Loughran –

    Hortonworks stevel at hortonworks.com @steveloughran June 2014
  2. © Hortonworks Inc. HDFS: goals • Store Petabytes of web data:

    logs, web snapshots • Keep per-node costs down to afford more nodes • Commodity x86 servers, storage (SAS), GbE LAN • Open source software: O(1) costs • O(1) operations • Accept failure as background noise • Support computation in each server. Written for location-aware applications: MapReduce, Pregel/Giraph & others that can tolerate partial failures Page 4
  3. © Hortonworks Inc. HDFS: what • Open Source: hadoop.apache.org • Java code

    on Linux, Unix, Windows • Replication rather than RAID – break file into blocks – store across servers and racks – delivers bandwidth and more locations for work • Background work handles failures – replication of under-replicated blocks – rebalancing of unbalanced servers – checksum verification of stored files Location data for work schedulers Page 5
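
As a minimal sketch of that last point, the code below asks HDFS where a file's blocks live through the public FileSystem API; it assumes a Hadoop 2.x client on the classpath, a cluster reachable via fs.defaultFS, and a made-up path passed on the command line.

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // the default (ideally HDFS) filesystem
        Path file = new Path(args[0]);                 // e.g. /data/logs/2014-06-11.log (placeholder)
        FileStatus status = fs.getFileStatus(file);
        // Each BlockLocation names the datanodes holding a replica of one block;
        // schedulers use this to place work on or near those hosts.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println(block.getOffset() + "+" + block.getLength()
              + " on " + Arrays.toString(block.getHosts()));
        }
        fs.close();
      }
    }
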
  4. © Hortonworks Inc. Page 6 Hadoop HDFS: replication is the key

    [Diagram: DataNodes grouped behind top-of-rack (ToR) switches, an active NameNode with a failover NameNode, and a file split into block1, block2, block3 … with replicas spread across servers and racks]
  5. © Hortonworks Inc. HDD → HDD + SSD → SSD • New

    solid state storage technologies emerging • When will HDDs go away? • How to take advantage of mixed storage • SSD retains the HDD metaphor, hides the details (access bus, wear levelling) Page 7 We need to give the OS and DFS control of the storage, work with the application
  6. © Hortonworks Inc. Hadoop MapReduce 1. Map: events → <k, v>* pairs

    2. Reduce: <k, [v1, v2, … vn]> → <k, v'> • Map trivially parallelisable on blocks in a file • Reduce parallelises on keys • MapReduce engine can execute Map and Reduce sequences against data • HDFS provides data location for work placement Page 9
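
For a concrete, if hackneyed, instance of that <k,v> model, here is a word-count sketch against the standard org.apache.hadoop.mapreduce API; a real job also needs a driver that sets input/output formats and submits it, which is omitted here.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: each input line (an "event") becomes <word, 1> pairs.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
          if (!token.isEmpty()) {
            word.set(token);
            context.write(word, ONE);
          }
        }
      }
    }

    // Reduce: <word, [1, 1, ...]> → <word, total>.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
          sum += c.get();
        }
        context.write(word, new IntWritable(sum));
      }
    }
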
  7. © Hortonworks Inc. MapReduce democratised big data • Conceptual model easy

    to grasp • Can write and test locally, superlinear scaleup • Tools and stack Page 10 You don't need to understand parallel coding to run apps across 1000 machines
  8. [Slide: layered stack of Network, Host OS, HDFS, YARN, SQL & Stats…, and your code, annotated with the people behind each layer: Kernighan, Cerf, Lamport, Codd, Knuth, Swinehart et al.]
  9. © Hortonworks Inc. 2014 YARN runs code across the cluster

    Page 17 [Diagram: servers each running HDFS and a YARN Node Manager, plus one running the YARN Resource Manager, “The RM”] • Servers run YARN Node Managers • NMs heartbeat to the Resource Manager • RM schedules work over the cluster • RM allocates containers to apps • NMs start containers • NMs report container health
  10. © Hortonworks Inc. 2014 Client creates App Master Page 18

    [Diagram: the client talks to the YARN Resource Manager, “The RM”, and a Node Manager starts the Application Master in a container; the other servers run HDFS and YARN Node Managers]
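
A rough sketch of that client step with the YARN client API (Hadoop 2.x) might look like the following; the application name, launch command and container size are placeholders, and a real submission also localizes the AM jar and handles security, which is skipped here.

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class SubmitToYarn {
      public static void main(String[] args) throws Exception {
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());
        client.start();

        // Ask the RM for a new application id and submission context.
        YarnClientApplication app = client.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo");

        // Describe the container in which a Node Manager will launch the Application Master.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("/bin/sleep 60"));
        ctx.setAMContainerSpec(amContainer);
        ctx.setResource(Resource.newInstance(512, 1));   // 512 MB, 1 vcore

        ApplicationId id = client.submitApplication(ctx);
        System.out.println("Submitted application " + id);
        client.stop();
      }
    }
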
  11. © Hortonworks Inc. Hadoop is CS-Hard • Core HDFS, MR and

    YARN – Distributed Computing – Consensus Protocols & Consistency Models – Work Scheduling & Data Placement – Reliability theory – CPU Architecture; x86 assembler • Others – Machine learning – Distributed Transactions – Graph Theory – Queue Theory – Correctness proofs Page 22
  12. © Hortonworks Inc. Testing needs cluster time • Full time core

    business @ Hortonworks + Cloudera • Full time projects at others: LinkedIn, IBM, MSFT, VMWare • Single developers can't compete “worked on my VM” • Small test runs take too long • Review-then-Commit neglects everyone's patches • at-scale tests on limited OS/JVM/Network setups Page 24
  13. © Hortonworks Inc. Fear of damage The worth of Hadoop

    is the data in HDFS ➢ the worth of all companies whose data it is ➢ cost to individuals of data loss ➢ cost to governments of losing their data Scheduling performance worth $100Ks to individual organisations Reliability costs time & support, even when Hadoop recovers Page 25
  14. © Hortonworks Inc. Fear of change in dependencies OS level:

    Kernel, Filesystem • Memory options (huge pages, vm stickiness, … ) • Scheduling • FS: performance vs durability • slow uptake of ext4 • interest in ZFS • RHAT and MSFT engagement bodes well for OS support Page 26
  15. © Hortonworks Inc. General troublespots Classpaths • JAR versions, transitive pain,

    especially: • Web layer: stuck on an old Jetty version for Java 6; known weaknesses resurfacing in webhdfs • Google Guava • protobuf: protoc versions in the OS; protobuf.jar update hell Networking • IPv4 only for now • hostname, getLocalHostname, DNS caching • Exception wrapping & diags (hosts, ports, wiki links) Page 27
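
To show the spirit of that last bullet, here is a stand-alone sketch of the pattern (not Hadoop's actual helper): wrap the low-level failure with the exact host, port and a pointer to further documentation, so that “Connection refused” stops being a mystery.

    import java.io.IOException;
    import java.net.ConnectException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class ConnectDiagnostics {
      // Re-throw with the details needed to debug the failure: host, port, and a wiki link.
      static IOException wrap(String host, int port, IOException cause) {
        return new IOException("Failed to connect to " + host + ":" + port
            + "; see http://wiki.apache.org/hadoop/ConnectionRefused", cause);
      }

      public static void main(String[] args) throws IOException {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 8020;
        try (Socket socket = new Socket()) {
          socket.connect(new InetSocketAddress(host, port), 5000);
          System.out.println("Connected to " + host + ":" + port);
        } catch (ConnectException e) {
          throw wrap(host, port, e);
        }
      }
    }
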
  16. © Hortonworks Inc. 2014 Hadoop, Java 6 & its dependencies

    There was enough change in 2.2 to worry about Page 28
  17. <dependency>

      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
      <exclusions>
        <exclusion>
          <groupId>org.codehaus.jackson</groupId>
          <artifactId>jackson-core-asl</artifactId>
        </exclusion>
        <exclusion>
          <groupId>com.google.guava</groupId>
          <artifactId>guava</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.httpcomponents</groupId>
          <artifactId>httpclient</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.httpcomponents</groupId>
          <artifactId>httpcore</artifactId>
        </exclusion>
      </exclusions>
      </dependency>
  18. © Hortonworks Inc. What I'd like in Hadoop 2.x/3.0 • Java

    7+ baseline; Java 8 tested • Adopt Java 7 file IO APIs for the file:// fs • Client-side code to implement Closeable • OSGi containers for hosting YARN apps/MR jobs • Move up all the JARs we depend on • Move to Jersey for REST/Web • Gradual switch to SLF4J logging • Language features? multi-catch probably best Page 30
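
As a small, hypothetical illustration of what the java.nio.file APIs and Java 7 language features offer the file:// filesystem (none of this is Hadoop code), consider:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.attribute.PosixFileAttributes;

    public class ListLocalDir {
      public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        // try-with-resources: the directory stream is Closeable and is closed automatically.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
          for (Path entry : entries) {
            PosixFileAttributes attrs = Files.readAttributes(entry, PosixFileAttributes.class);
            System.out.println(attrs.permissions() + " " + attrs.size() + " " + entry.getFileName());
          }
        } catch (UnsupportedOperationException | IOException e) {
          // Java 7 multi-catch: one handler for non-POSIX filesystems and IO failures alike.
          System.err.println("Cannot list " + dir + ": " + e);
        }
      }
    }
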
  19. © Hortonworks Inc. Java 7 :what broke • JUnit test re-ordering

    • Optimisations for String buffering • race condition in stream closing (found by Yahoo!) • (+ nervous of -XX:+UseCompressedOops since Java 6) • TreeMap/TreeSet constructor and comparator semantic changes • ConcurrentHashMap structure size changes broke heap usage estimations (again in Java 8) Page 31
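
One concrete case of the TreeMap/TreeSet change, as a sketch: from Java 7, put() compares the key even when the map is empty, so code that relied on sneaking a null key into a naturally ordered TreeMap as its first entry now fails.

    import java.util.TreeMap;

    public class TreeMapNullKey {
      public static void main(String[] args) {
        TreeMap<String, String> map = new TreeMap<String, String>();  // natural ordering, no comparator
        try {
          // Java 6 let this first put through, as nothing was compared yet;
          // Java 7+ compares the key with itself and throws NullPointerException.
          map.put(null, "value");
          System.out.println("null key accepted (Java 6 behaviour)");
        } catch (NullPointerException e) {
          System.out.println("null key rejected (Java 7+ behaviour)");
        }
      }
    }
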
  20. © Hortonworks Inc. Short term Java 8 strategies • While waiting

    for language-level feature adoption, sell the JRE 8 runtime as a superior platform to 7: GC and perf enhancements are landing there, not in 7 • design callbacks and classes for lambda expressions • replace raw {Thread, Runnable} use with java.util.concurrent • add Java 8-only libraries. Twill? Get Hadoop developers experienced with writing Java 8 apps, even small ones. The further up the stack, the faster-moving you can be Page 32
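
A sketch of the “design for lambdas, lean on java.util.concurrent” advice: keep callbacks as single-method interfaces so Java 8 callers can pass lambdas while Java 7 callers still use anonymous classes, and hand threading to an executor rather than raw Threads. The interface and names here are invented for illustration.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class LambdaFriendly {
      // Single abstract method: lambda-compatible on Java 8, anonymous-class-compatible on Java 7.
      public interface Callback {
        void onCompleted(String result);
      }

      static void process(ExecutorService pool, String name, Callback callback) {
        // Work is queued on the pool instead of spawning a raw Thread per task.
        pool.execute(() -> callback.onCompleted("processed " + name));
      }

      public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Java 8 caller: a lambda replaces new Thread(new Runnable() { ... }).start().
        process(pool, "block-0001", result -> System.out.println(result));
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
      }
    }
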
  21. © Hortonworks Inc. Testing: how to help 1.  The core

    Hadoop test suites are in svn 2. Testing Hadoop releases against Java 8+ will catch regressions (issue: whose fault?) 3. Same for HBase & other common in-cluster apps. 4. Apache have Jenkins-managed builds; could adopt Java 8 machines/VMs with help (issue: who worries?) Getting people to care about breaking builds on future Java versions/other platforms is always trouble. Page 33
  22. © Hortonworks Inc 2014 Get involved! Page 34 svn.apache.org issues.apache.org

    {hadoop, hbase, mahout, pig, oozie, …}.apache.org