Slide 1

Hadoop Introduction
Erica Li

Slide 2

Erica Li ● shrimp_li ● ericalitw ● Data Scientist ● NPO side project ● Girls in Tech Taiwan ● Taiwan Spark User Group co-founder

Slide 3

https://github.com/wlsherica/StarkTechnology

Slide 4

Survey

Slide 5

Agenda ● Life of Big Data Technologies ● Hadoop and its Ecosystem ● Hadoop Architecture ● Hadoop HDFS Introduction ● Next Step to set up Hadoop

Slide 6

Life of Big Data Technologies

Slide 7

Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. - Doug Laney, 2012

Slide 8

http://www.inside.com.tw/2015/02/06/big-data-1-origin-and-4vs

Slide 9

No content

Slide 10

Hadoop and its Ecosystem

Slide 11

Mike Gualtieri

Slide 12

What is Hadoop ● A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models - hadoop.apache.org

Slide 13

Question: So, Hadoop is a framework that allows for the distributed processing of 1) small data? 2) large data?

Slide 14

Answer: Large data. (It is also capable of processing small data; however, to experience the power of Hadoop, one needs data in the terabytes.)

Slide 15

http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview

Slide 16

Hadoop 2.X Core Components (diagram): an HDFS cluster with one NameNode and multiple DataNodes, and YARN with one ResourceManager and a NodeManager on each node.
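
A quick way to see which of these daemons runs where on a live cluster is the JDK's jps tool (a minimal sketch, assuming a running Hadoop 2.x cluster with the JDK on the PATH):

# On the master node: expect NameNode and ResourceManager among the listed JVMs
jps
# On a worker node: expect DataNode and NodeManager
jps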

Slide 17

Hadoop 2.X Cluster Architecture
Master ----- NameNode ----- Resource Manager
Slave01 -- DataNode -- Node Manager
Slave02 -- DataNode -- Node Manager
Slave03 -- DataNode -- Node Manager
Slave04 -- DataNode -- Node Manager
Slave05 -- DataNode -- Node Manager
Slave06 -- DataNode -- Node Manager
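
In a vanilla Apache Hadoop 2.x install, this layout is wired up in two configuration files; a minimal sketch (the hostnames and port are assumptions, not from the slide):

# etc/hadoop/slaves - one worker hostname per line
slave01
slave02
slave03
slave04
slave05
slave06

# etc/hadoop/core-site.xml - every node points at the master's NameNode
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:9000</value>
</property>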

Slide 18

Hadoop Setup (AWS)

Slide 19

What do we need? ● Credit card ● Ubuntu Server 14.04 LTS (HVM) ● Java 7 ● hadoop-2.7.1.tar.gz
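
A rough install sketch for a fresh Ubuntu 14.04 instance (the mirror URL and JAVA_HOME path are assumptions; adjust for your image):

# Java 7
sudo apt-get update
sudo apt-get install -y openjdk-7-jdk
# Hadoop 2.7.1
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
tar -xzf hadoop-2.7.1.tar.gz
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME="$PWD/hadoop-2.7.1"
export PATH="$PATH:$HADOOP_HOME/bin"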

Slide 20

Why not Java 8? https://cwiki.apache.org/confluence/display/Hive/GettingStarted

Slide 21

Hadoop HDFS Introduction

Slide 22

What is HDFS ● Hadoop Distributed File System ● Good for ○ Large datasets ○ Streaming data access ● Design ○ Single master - NameNode ○ Multiple slaves - DataNodes
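
Once HDFS is running, the single-master design is easy to confirm from the shell (a minimal sketch using a standard HDFS utility):

# Print the host configured as the NameNode; DataNodes register with it at startup
hdfs getconf -namenodes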

Slide 23

HDFS read path (diagram): the HDFS client on the client node opens the file through DistributedFileSystem (1: open), the NameNode returns the block locations (2: get block locations), the client reads the blocks directly from the DataNodes (3, 4, 5: read), then closes the stream (6: close).

Slide 24

HDFS write path (diagram): the client creates the file through DistributedFileSystem (1, 2: create), writes data (3: write), packets stream through the DataNode pipeline (4: write packet) and are acknowledged back (5: ack packet), then the client closes the stream (6: close) and the NameNode is told the file is complete (7: complete).
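
Both paths can be exercised from the shell; a minimal sketch (the file name and HDFS path are illustrative):

# Write path: create, stream packets through the DataNode pipeline, close
hadoop fs -put NOTICE.txt /user/hadoop/NOTICE.txt
# Read path: open, fetch block locations from the NameNode, read from DataNodes
hadoop fs -cat /user/hadoop/NOTICE.txt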

Slide 25

DataNode
● Stores data blocks
● Receives blocks from the client
● Receives delete commands from the NameNode

NameNode
● File system metadata
● File name -> blocks mapping
● Block -> replicas mapping
● File system image
● Edit log recording every file system modification
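
The name-to-blocks and block-to-replica mappings the NameNode keeps can be inspected with fsck (a sketch; the path is illustrative):

# Show the file's blocks and the DataNodes holding each replica
hdfs fsck /user/hadoop/NOTICE.txt -files -blocks -locations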

Slide 26

Hadoop HDFS Operations

Slide 27

● Files within a directory: hadoop fs -ls /
● Recursive version of ls: hadoop fs -ls -R / (the older -lsr form is deprecated)
● Create a directory: hadoop fs -mkdir /test
● Copy a file from local to HDFS: hadoop fs -put localfile /user/hadoop/hadoopfile
● Copy a file to the local system: hadoop fs -get /user/hadoop/file localfile
● Output a file in text format: hadoop fs -text /user/hadoop/file
● Delete files: hadoop fs -rm /user/hadoop/file
● Built-in help for any command: hadoop fs -help ls
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html

Slide 28

● Check the health of the HDFS cluster (NameNode and all DataNodes): hdfs dfsadmin -report
● Display the sizes of files and directories: hadoop fs -du /user/hadoop/file
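
Both commands have handy variants; a short sketch (the paths are illustrative):

# Human-readable, summarized sizes
hadoop fs -du -s -h /user/hadoop
# Free and used space for the whole filesystem
hadoop fs -df -h /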

Slide 29

Training Materials ● Cloudera VM ○ Cloudera CDH 5.4 ● Spark 1.3.0 ● 64-bit host OS ● 4 GB RAM ● VMware, KVM, and VirtualBox

Slide 30

TODO (HDFS)
1. Open your VM
2. Create a folder named "todo_1" under /user/cloudera
3. Copy the file /usr/lib/spark/NOTICE to /user/cloudera/todo_1
4. Output this HDFS file in text format, head 3 lines
5. Remove the todo_1 folder
6. Voila
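
One possible solution sketch, assuming the Cloudera VM defaults and the paths given above:

# 2. Create the folder
hadoop fs -mkdir /user/cloudera/todo_1
# 3. Copy the local file into HDFS
hadoop fs -put /usr/lib/spark/NOTICE /user/cloudera/todo_1
# 4. Print it as text and keep the first 3 lines
hadoop fs -text /user/cloudera/todo_1/NOTICE | head -n 3
# 5. Remove the folder recursively
hadoop fs -rm -r /user/cloudera/todo_1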

Slide 31

https://www.openhub.net/p/_compare?project_0=Apache+Spark&project_1=Apache+Hadoop (04 Dec 2015)

Slide 32

Spark World