
Hadoop HDFS & AWS

Erica Li
December 04, 2015


In this lecture, you will learn the infrastructure of Hadoop and HDFS, and how to build a multi-node Hadoop cluster on AWS.


Transcript

  1. Erica Li • shrimp_li • ericalitw • Data Scientist • NPO side project • Girls in Tech Taiwan • Taiwan Spark User Group co-founder
  2. Agenda • Life of Big Data Technologies • Hadoop and

    its Ecosystem • Hadoop Architecture • Hadoop HDFS Introduction • Next Step to set up Hadoop
  3. Big data is high volume, high velocity, and/or high variety

    information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization. - Doug Laney 2012
  4. What is Hadoop • A framework that allows for the

    distributed processing of large data sets across clusters of computers using simple programming models - hadoop.apache.org
  5. Question So, Hadoop is a framework that allows for the distributed processing of 1) small data? 2) large data?
  6. Ans Large data. It is also capable of processing small data; however, to experience the power of Hadoop, one needs data on the terabyte scale.
  7. Hadoop 2.X Core Components [diagram] • HDFS cluster: a NameNode plus multiple DataNodes • YARN: a Resource Manager plus a Node Manager on each worker node
  8. Hadoop 2.X Cluster Architecture
     Master ----- NameNode ----- Resource Manager
     Slave01 -- DataNode -- Node Manager
     Slave02 -- DataNode -- Node Manager
     Slave03 -- DataNode -- Node Manager
     Slave04 -- DataNode -- Node Manager
     Slave05 -- DataNode -- Node Manager
     Slave06 -- DataNode -- Node Manager
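     Hadoop's start-up scripts discover this topology from a plain-text worker list. A minimal sketch, assuming the stock hadoop-2.7.1 layout and the hostnames slave01..slave06 from the diagram (both are assumptions):

       # etc/hadoop/slaves: one worker hostname per line;
       # start-dfs.sh / start-yarn.sh ssh to each host listed here
       # to launch a DataNode and a NodeManager on it
       printf '%s\n' slave01 slave02 slave03 slave04 slave05 slave06 \
         > $HADOOP_HOME/etc/hadoop/slaves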
  9. What do we need? • Credit card • Ubuntu Server 14.04 LTS (HVM) • Java 7 • hadoop-2.7.1.tar.gz
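     A minimal provisioning sketch for each Ubuntu 14.04 instance (the mirror URL and the /opt install prefix are assumptions; adjust as needed):

       # Java 7 (OpenJDK) from the Ubuntu 14.04 repositories
       sudo apt-get update && sudo apt-get install -y openjdk-7-jdk
       # fetch and unpack Hadoop 2.7.1 (archive.apache.org hosts old releases)
       wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
       sudo tar -xzf hadoop-2.7.1.tar.gz -C /opt
       # environment for the hadoop user
       export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
       export HADOOP_HOME=/opt/hadoop-2.7.1
       export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin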
  10. What is HDFS • Hadoop Distributed File System • Good

    for ◦ Large dataset ◦ Streaming data access • Design ◦ Single master - namenode ◦ Multiple slaves - datanodes
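     One way to see the single-master design in practice is to ask the NameNode where a file's blocks and their replicas live (the path is just an example):

       # list a file's blocks and the DataNodes holding each replica
       hdfs fsck /user/hadoop/file -files -blocks -locations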
  11. [diagram: anatomy of an HDFS read. The HDFS client in the client JVM on the client node drives the flow through the DistributedFilesystem: 1: open → 2: get block locations (from the NameNode) → 3: read → 4: read (first DataNode) → 5: read (next DataNode) → 6: close]
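     Any read from the shell exercises exactly this path; for example (file name assumed):

       # open, fetch block locations from the NameNode,
       # stream the blocks from the DataNodes, then close
       hadoop fs -cat /user/hadoop/file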
  12. [diagram: anatomy of an HDFS write. The HDFS client in the client JVM drives the flow through the DistributedFilesystem: 1: create → 2: create (on the NameNode) → 3: write → 4: write packet (down the DataNode pipeline) → 5: ack packet (back up the pipeline) → 6: close → 7: complete (NameNode)]
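     This write pipeline is what a plain upload triggers under the hood (file names assumed):

       # create the file on the NameNode, stream packets down the
       # DataNode pipeline, collect acks, then report complete
       hadoop fs -put localfile /user/hadoop/hadoopfile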
  13. DataNode • Stores data blocks • Receives blocks from the client • Receives delete commands from the NameNode
      NameNode • File system metadata: file name -> blocks, block -> replicas • File system image • Edit log for every file system modification
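     The file system image and edit log are ordinary files in the NameNode's metadata directory; one way to locate them (exact contents will vary by deployment):

       # print the configured metadata directory (dfs.namenode.name.dir)
       hdfs getconf -confKey dfs.namenode.name.dir
       # its current/ subdirectory holds fsimage_* (the checkpointed
       # namespace image) and edits_* (logs of later modifications)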
  14. • Files within a directory: hadoop fs -ls /
      • Recursive version of ls: hadoop fs -lsr /
      • Create a directory: hadoop fs -mkdir /test
      • Copy src from local to HDFS: hadoop fs -put localfile /user/hadoop/hadoopfile
      • Copy file to local system: hadoop fs -get /user/hadoop/file localfile
      • Output file in text format: hadoop fs -text /user/hadoop/file
      • Delete files: hadoop fs -rm /user/hadoop/file
      • Per-command help: hadoop fs -help ls
      https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
  15. • Check the health of the HDFS cluster (NameNode and all DataNodes): hadoop dfsadmin -report • Display sizes of files and directories: hadoop fs -du /user/hadoop/file
  16. Training Materials • Cloudera VM ◦ Cloudera CDH 5.4 •

    Spark 1.3.0 • 64-bit host OS • RAM 4G • VMware, KVM, and VirtualBox
  17. TODO (HDFS) 1. Open your VM 2. Create a folder named “todo_1” under /user/cloudera 3. Copy the file /usr/lib/spark/NOTICE to /user/cloudera/todo_1 4. Output this HDFS file in text format, head 3 lines 5. Remove the todo_1 folder 6. Voilà (a solution sketch follows)
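     A possible solution sketch, assuming the Cloudera VM defaults and the paths from the slide:

       # 2. create the folder
       hadoop fs -mkdir /user/cloudera/todo_1
       # 3. copy the local NOTICE file into HDFS
       hadoop fs -put /usr/lib/spark/NOTICE /user/cloudera/todo_1
       # 4. print the first 3 lines as text
       hadoop fs -text /user/cloudera/todo_1/NOTICE | head -3
       # 5. remove the folder recursively
       hadoop fs -rm -r /user/cloudera/todo_1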