
Hadoop HDFS & AWS

Erica Li
December 04, 2015

In this lecture, you will learn the infrastructure of Hadoop and HDFS, and how to build a multi-node Hadoop cluster on AWS.

Transcript

  1. Hadoop Introduction Erica Li

  2. Erica Li • shrimp_li • ericalitw • Data Scientist •

    NPO side project • Girls in Tech Taiwan • Taiwan Spark User Group co-founder
  3. https://github.com/wlsherica/StarkTechnology

  4. Survey

  5. Agenda • Life of Big Data Technologies • Hadoop and

    its Ecosystem • Hadoop Architecture • Hadoop HDFS Introduction • Next Step to set up Hadoop
  6. Life of Big Data Technologies

  7. Big data is high volume, high velocity, and/or high variety

    information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. - Doug Laney 2012
  8. http://www.inside.com.tw/2015/02/06/big-data-1-origin-and-4vs

  9. None
  10. Hadoop and its Ecosystem

  11. Mike Gualtieri

  12. What is Hadoop • A framework that allows for the

    distributed processing of large data sets across clusters of computers using simple programming models - hadoop.apache.org
  13. Question So, Hadoop is a framework that allows for the distributed

    processing of 1) small data? 2) large data?
  14. Ans Large data # It is also capable of processing

    small ones. However, to experience the power of Hadoop, one needs data on the order of terabytes.
  15. http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview

  16. Hadoop 2.X Core Components • HDFS cluster: one NameNode plus

    multiple DataNodes • YARN: one ResourceManager plus a NodeManager on each DataNode host
  17. Hadoop 2.X Cluster Architecture Master ----- NameNode ----- Resource Manager

    Slave01 -- DataNode -- Node Manager Slave02 -- DataNode -- Node Manager Slave03 -- DataNode -- Node Manager Slave04 -- DataNode -- Node Manager Slave05 -- DataNode -- Node Manager Slave06 -- DataNode -- Node Manager (see the slaves-file sketch below)
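
One common way to realize this layout on Hadoop 2.x is the slaves file on the master; a minimal sketch, assuming the six hostnames resolve (via /etc/hosts or DNS) on every machine:

    # $HADOOP_HOME/etc/hadoop/slaves on the master node
    # (hostnames are illustrative; substitute your AWS instances' names)
    slave01
    slave02
    slave03
    slave04
    slave05
    slave06

The NameNode and ResourceManager run on the master itself; start-dfs.sh and start-yarn.sh then launch a DataNode and a NodeManager on every host listed here.
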
  18. Hadoop Setup (AWS)

  19. What do we need? • Credit card • Ubuntu Server 14.04

    LTS (HVM) • Java 7 • hadoop-2.7.1.tar.gz (a setup sketch follows below)
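
A minimal setup sketch for one node, assuming an Ubuntu 14.04 EC2 instance with shell access; the OpenJDK package name, the JAVA_HOME path, and the Apache archive URL are typical but worth verifying:

    # install Java 7 and fetch the Hadoop 2.7.1 tarball
    sudo apt-get update
    sudo apt-get install -y openjdk-7-jdk
    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
    tar -xzf hadoop-2.7.1.tar.gz
    # point Hadoop at Java 7 and put its scripts on the PATH
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
    export HADOOP_HOME=$HOME/hadoop-2.7.1
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

The same steps are repeated on every node of a multi-node cluster.
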
  20. Why not Java 8? https://cwiki.apache.org/confluence/display/Hive/GettingStarted

  21. Hadoop HDFS Introduction

  22. What is HDFS • Hadoop Distributed File System • Good

    for ◦ Large dataset ◦ Streaming data access • Design ◦ Single master - namenode ◦ Multiple slaves - datanodes
  23. HDFS read path (client node / client JVM, via DistributedFileSystem

    and an HDFS client): 1: open 2: get block locations (from the NameNode) 3: read 4: read 5: read (from the DataNodes) 6: close
  24. HDFS write path (client node / client JVM, via DistributedFileSystem

    and an HDFS client): 1: create 2: create (on the NameNode) 3: write 4: write packet 5: ack packet (pipelined through the DataNodes) 6: close 7: complete (to the NameNode)
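
Both paths are what the HDFS shell drives under the hood; a quick way to exercise them (file names here are placeholders):

    # write path: create on the NameNode, packets pipelined through DataNodes
    hadoop fs -put localfile /user/hadoop/demo.txt
    # read path: open, get block locations, read from DataNodes, close
    hadoop fs -cat /user/hadoop/demo.txt
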
  25. DataNode • Store data blocks • Receive blocks from Client

    • Receive delete commands from NameNode
    NameNode • File system metadata • File name -> blocks • Block -> replicas • File system image • Edit log for every file system modification
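
On a running cluster you can ask where the NameNode keeps its image and edit log; the directory printed depends entirely on your configuration:

    # print the configured NameNode metadata directory
    hdfs getconf -confKey dfs.namenode.name.dir
    # the fsimage and edits files live under its "current" subdirectory
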
  26. Hadoop HDFS Operations

  27. • Files within a directory: hadoop fs -ls /

    • Recursive version of ls: hadoop fs -lsr / (see the note below)
    • Create a directory: hadoop fs -mkdir /test
    • Copy src from local to HDFS: hadoop fs -put localfile /user/hadoop/hadoopfile
    • Copy file to local system: hadoop fs -get /user/hadoop/file localfile
    • Output file in text format: hadoop fs -text /user/hadoop/file
    • Delete files: hadoop fs -rm /user/hadoop/file
    • Per-command help: hadoop fs -help ls
    https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
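
Note on -lsr: in Hadoop 2.x it still works but is deprecated in favor of the -R flag:

    hadoop fs -ls -R /
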
  28. • Check the health of the HDFS cluster (NameNode

    and all DataNodes): hadoop dfsadmin -report • Display sizes of files and directories: hadoop fs -du /user/hadoop/file
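
Two Hadoop 2.x spellings worth knowing: dfsadmin also lives under the hdfs script, and -du takes -h for human-readable sizes:

    hdfs dfsadmin -report
    hadoop fs -du -h /user/hadoop
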
  29. Training Materials • Cloudera VM ◦ Cloudera CDH 5.4 •

    Spark 1.3.0 • 64-bit host OS • RAM 4G • VMware, KVM, and VirtualBox
  30. TODO (HDFS) 1. Open your VM 2. Create a folder named

    “todo_1” under /user/cloudera 3. Copy the file /usr/lib/spark/NOTICE to /user/cloudera/todo_1 4. Output this HDFS file in text format, first 3 lines 5. Remove the todo_1 folder 6. Voila (a solution sketch follows below)
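
A possible solution sketch, assuming the Cloudera VM's default cloudera user and the paths from the slide:

    hadoop fs -mkdir /user/cloudera/todo_1
    hadoop fs -put /usr/lib/spark/NOTICE /user/cloudera/todo_1
    hadoop fs -text /user/cloudera/todo_1/NOTICE | head -3
    hadoop fs -rm -r /user/cloudera/todo_1
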
  31. https://www.openhub.net/p/_compare?project_0=Apache+Spark&project_1=Apache+Hadoop 04 Dec 2015

  32. Spark World