An Introduction To Hadoop

Shankar
June 06, 2011

Transcript

1. - State of the Data
   - What is Hadoop
   - Hadoop Ecosystem
   - References
2. - Data driven businesses
   - Businesses have been collecting information all the time
   - Mine more == Collect more (and vice versa)
   - Challenges
     - Application complexities
     - Data growth
     - Infrastructure
     - Economics
   - Need of the day
4. - Applications
     - Searches, message posts, comments, emails, blogs, photos, video clips, product listings
     - ERP, CRM, databases, internal applications, customer/consumer-facing products
     - Mobile
   - Context
     - Web, customers, products, business systems, processes, services
   - Support systems
     - CRM, SOA, recommendation systems/processes, data warehouses, business intelligence, BPM
6. - Drivers
     - ROI
     - Customer retention
     - Product affinity
     - Market trends
     - Research analysis
     - Customer/consumer analytics
   - Process
     - Clustering
     - Classification
     - Build relationships
     - Regression
   - Types
     - Structured
     - Semi-structured
     - Unstructured
8. - Complex applications
     - Data integration is a good but complex problem to solve
   - Data growth
     - Growth is exponential
   - Infrastructure
     - Availability
     - Unscalable hardware
   - Economics
     - Managing high data volume comes at a price
     - Failures are very costly
9. - System that can handle high-volume data
   - System that can perform complex operations
   - Scalable
   - Robust
   - Highly available
   - Fault tolerant
   - Cheap
10. - Top-level Apache project
    - Open source
    - Inspired by Google's white papers on Map/Reduce (MR) and the Google File System (GFS)
    - Originally developed to support the Apache Nutch search engine
    - Software framework written in Java
    - Designed
      - For sophisticated analysis
      - To deal with structured and unstructured complex data
11. - Runs on commodity hardware
    - Shared-nothing architecture
    - Scale hardware whenever you want
    - System compensates for hardware scaling and issues (if any)
    - Runs large-scale, high-volume data processes
    - Scales well with complex analysis jobs
    - Handles failures
    - Ideal for consolidating data from both new and legacy data sources
    - Value to the business
12. - HDFS: Hadoop Distributed File System
    - Map/Reduce: software framework for clustered, distributed data processing
    - ZooKeeper: distributed coordination service
    - Avro: data serialization
    - Chukwa: data collection system for monitoring distributed systems
    - HBase: data storage for large distributed tables
    - Hive: data warehousing infrastructure
    - Pig: high-level query language
13. - Master/slave architecture
    - Runs on commodity hardware
    - Fault tolerant
    - Handles large volumes of data
    - Provides high throughput
    - Streaming data access
    - Simple file coherency model
    - Portable to heterogeneous hardware and software
    - Robust
      - Handles disk failures, replication (and re-replication)
      - Performs cluster rebalancing, data integrity checks
14. - Name node
      - File system operations
      - Maps data blocks to data nodes
    - Data node
      - Processes reads/writes
      - Handles data blocks
      - Replication
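The name-node/data-node split above can be sketched as a toy model (plain Python, purely illustrative; the names and sizes here are invented and bear no relation to Hadoop's real Java API): the name node holds only metadata mapping each file's blocks to the data nodes that store them, while the data nodes hold the actual bytes and replication lets reads survive a node failure.

```python
# Toy model of the HDFS master/slave split (illustrative sketch, not Hadoop's API).
BLOCK_SIZE = 8       # bytes per block for the demo (real HDFS blocks are tens of MB)
REPLICATION = 3      # copies kept of every block

class DataNode:
    """Slave: stores raw block contents and serves reads/writes."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}                    # block_id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]

class NameNode:
    """Master: tracks which data nodes hold each block of each file."""
    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.block_map = {}                 # filename -> [(block_id, [node_ids])]
        self._next_block = 0

    def write_file(self, name, data):
        """Split data into blocks and place each on REPLICATION data nodes."""
        entries = []
        for off in range(0, len(data), BLOCK_SIZE):
            chunk = data[off:off + BLOCK_SIZE]
            block_id = self._next_block
            self._next_block += 1
            # round-robin placement across data nodes
            targets = [self.data_nodes[(block_id + i) % len(self.data_nodes)]
                       for i in range(REPLICATION)]
            for node in targets:
                node.write_block(block_id, chunk)
            entries.append((block_id, [n.node_id for n in targets]))
        self.block_map[name] = entries

    def read_file(self, name, dead=()):
        """Reassemble a file, skipping data nodes that have failed."""
        out = b""
        for block_id, node_ids in self.block_map[name]:
            live = [n for n in node_ids if n not in dead]
            node = next(d for d in self.data_nodes if d.node_id == live[0])
            out += node.read_block(block_id)
        return out

nodes = [DataNode(i) for i in range(4)]
nn = NameNode(nodes)
nn.write_file("log.txt", b"hadoop handles failures gracefully")
# With one data node down, every block still has a live replica to read from.
print(nn.read_file("log.txt", dead={0}))
```

Note that the client-visible data path in real HDFS goes directly to data nodes; the name node only answers metadata queries, which is what keeps the master from becoming an I/O bottleneck.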
15. - Tagged by a job
    - Splits the input data-set into separate chunks
    - Processed by map tasks, in parallel
    - Sorts the output of the maps
    - Processed by reduce tasks, in parallel
    - Typically stored and processed in a file system
    - Framework takes care of
      - Scheduling tasks
      - Monitoring
      - Re-executing failed tasks
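The split, map, sort/shuffle, reduce flow above can be sketched in plain Python (a single-process simulation of the programming model, not Hadoop's Java API, and the function names here are invented for the demo); word counting is the canonical example.

```python
from itertools import groupby
from operator import itemgetter

# Single-process simulation of the MapReduce flow:
# split -> map (parallelizable) -> sort/shuffle -> reduce (parallelizable).
# Real Hadoop runs the map and reduce tasks as distributed processes over HDFS.

def map_task(chunk):
    """Emit intermediate (key, value) pairs: one ('word', 1) per word."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_task(key, values):
    """Combine all intermediate values for one key."""
    return (key, sum(values))

def run_job(input_data, n_splits=3):
    # 1. Split the input data-set into separate chunks.
    lines = input_data.splitlines()
    chunks = [" ".join(lines[i::n_splits]) for i in range(n_splits)]
    # 2. Map phase: each chunk is processed independently.
    mapped = [pair for chunk in chunks for pair in map_task(chunk)]
    # 3. Sort/shuffle: order map output so equal keys are adjacent.
    mapped.sort(key=itemgetter(0))
    # 4. Reduce phase: each key group is processed independently.
    return dict(reduce_task(k, [v for _, v in group])
                for k, group in groupby(mapped, key=itemgetter(0)))

counts = run_job("hadoop scales\nhadoop is fault tolerant\nhadoop is cheap")
print(counts["hadoop"])   # -> 3
```

The framework responsibilities listed on the slide (scheduling, monitoring, re-executing failed tasks) are exactly what this sketch leaves out: in Hadoop the user supplies only the map and reduce functions, and the framework handles everything between them.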