
Hadoop HDFS & AWS

Erica Li
December 04, 2015


In this lecture, you will learn the infrastructure of Hadoop and HDFS, and how to build a multi-node Hadoop cluster on AWS.


Transcript

  1. Erica Li • shrimp_li • ericalitw • Data Scientist • NPO side project • Girls in Tech Taiwan • Taiwan Spark User Group co-founder
  2. Agenda • Life of Big Data Technologies • Hadoop and

    its Ecosystem • Hadoop Architecture • Hadoop HDFS Introduction • Next Step to set up Hadoop
  3. Big data is high volume, high velocity, and/or high variety

    information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. - Doug Laney 2012
  4. What is Hadoop • A framework that allows for the

    distributed processing of large data sets across clusters of computers using simple programming models - hadoop.apache.org
  5. Question So, Hadoop is a framework that allows for the distributed processing of 1) small data? 2) large data?
  6. Ans Large data. It is also capable of processing small data; however, to experience the power of Hadoop, one needs data on the terabyte scale.
  7. Hadoop 2.X Core Components [diagram] • HDFS cluster: a NameNode plus multiple DataNodes • YARN: a Resource Manager plus a Node Manager on each worker node
  8. Hadoop 2.X Cluster Architecture
     Master ----- NameNode ----- Resource Manager
     Slave01 -- DataNode -- Node Manager
     Slave02 -- DataNode -- Node Manager
     Slave03 -- DataNode -- Node Manager
     Slave04 -- DataNode -- Node Manager
     Slave05 -- DataNode -- Node Manager
     Slave06 -- DataNode -- Node Manager
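     Hadoop's start-up scripts discover this topology from a plain-text worker list. A minimal sketch, assuming the stock hadoop-2.7.1 layout and the hostnames slave01..slave06 from the diagram (both are assumptions):

       # etc/hadoop/slaves: one worker hostname per line;
       # start-dfs.sh / start-yarn.sh ssh to each host listed here
       # to launch a DataNode and a NodeManager on it
       printf '%s\n' slave01 slave02 slave03 slave04 slave05 slave06 \
         > $HADOOP_HOME/etc/hadoop/slaves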
  9. What do we need? • Credit card • Ubuntu Server 14.04 LTS (HVM) • Java 7 • hadoop-2.7.1.tar.gz
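     A minimal provisioning sketch for each Ubuntu 14.04 instance (the mirror URL and the /opt install prefix are assumptions; adjust as needed):

       # Java 7 (OpenJDK) from the Ubuntu 14.04 repositories
       sudo apt-get update && sudo apt-get install -y openjdk-7-jdk
       # fetch and unpack Hadoop 2.7.1 (archive.apache.org hosts old releases)
       wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
       sudo tar -xzf hadoop-2.7.1.tar.gz -C /opt
       # environment for the hadoop user
       export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
       export HADOOP_HOME=/opt/hadoop-2.7.1
       export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin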
  10. What is HDFS • Hadoop Distributed File System • Good

    for ◦ Large dataset ◦ Streaming data access • Design ◦ Single master - namenode ◦ Multiple slaves - datanodes
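     One way to see the single-master design in practice is to ask the NameNode where a file's blocks and their replicas live (the path is just an example):

       # list a file's blocks and the DataNodes holding each replica
       hdfs fsck /user/hadoop/file -files -blocks -locations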
  11. [diagram: anatomy of an HDFS read. The HDFS client in the client JVM on the client node drives the flow through the DistributedFilesystem: 1: open → 2: get block locations (from the NameNode) → 3: read → 4: read (first DataNode) → 5: read (next DataNode) → 6: close]
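     Any read from the shell exercises exactly this path; for example (file name assumed):

       # open, fetch block locations from the NameNode,
       # stream the blocks from the DataNodes, then close
       hadoop fs -cat /user/hadoop/file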
  12. [diagram: anatomy of an HDFS write. The HDFS client in the client JVM drives the flow through the DistributedFilesystem: 1: create → 2: create (on the NameNode) → 3: write → 4: write packet (down the DataNode pipeline) → 5: ack packet (back up the pipeline) → 6: close → 7: complete (NameNode)]
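     This write pipeline is what a plain upload triggers under the hood (file names assumed):

       # create the file on the NameNode, stream packets down the
       # DataNode pipeline, collect acks, then report complete
       hadoop fs -put localfile /user/hadoop/hadoopfile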
  13. DataNode • Stores data blocks • Receives blocks from the client • Receives delete commands from the NameNode
      NameNode • File system metadata: file name -> blocks, block -> replicas • File system image • Edit log for every file system modification
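     The file system image and edit log are ordinary files in the NameNode's metadata directory; one way to locate them (exact contents will vary by deployment):

       # print the configured metadata directory (dfs.namenode.name.dir)
       hdfs getconf -confKey dfs.namenode.name.dir
       # its current/ subdirectory holds fsimage_* (the checkpointed
       # namespace image) and edits_* (logs of later modifications)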
  14. • Files within a directory: hadoop fs -ls /
      • Recursive version of ls: hadoop fs -lsr /
      • Create a directory: hadoop fs -mkdir /test
      • Copy src from local to HDFS: hadoop fs -put localfile /user/hadoop/hadoopfile
      • Copy file to local system: hadoop fs -get /user/hadoop/file localfile
      • Output file in text format: hadoop fs -text /user/hadoop/file
      • Delete files: hadoop fs -rm /user/hadoop/file
      • Per-command help: hadoop fs -help ls
      https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
  15. • Check the health of the HDFS cluster (NameNode and all DataNodes): hadoop dfsadmin -report • Display sizes of files and directories: hadoop fs -du /user/hadoop/file
  16. Training Materials • Cloudera VM ◦ Cloudera CDH 5.4 •

    Spark 1.3.0 • 64-bit host OS • RAM 4G • VMware, KVM, and VirtualBox
  17. TODO (HDFS) 1. Open your VM 2. Create a folder named “todo_1” under /user/cloudera 3. Copy the file /usr/lib/spark/NOTICE to /user/cloudera/todo_1 4. Output this HDFS file in text format, head 3 lines 5. Remove the todo_1 folder 6. Voilà (a solution sketch follows)
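     A possible solution sketch, assuming the Cloudera VM defaults and the paths from the slide:

       # 2. create the folder
       hadoop fs -mkdir /user/cloudera/todo_1
       # 3. copy the local NOTICE file into HDFS
       hadoop fs -put /usr/lib/spark/NOTICE /user/cloudera/todo_1
       # 4. print the first 3 lines as text
       hadoop fs -text /user/cloudera/todo_1/NOTICE | head -3
       # 5. remove the folder recursively
       hadoop fs -rm -r /user/cloudera/todo_1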