Erica Li
● shrimp_li
● ericalitw
● Data Scientist
● NPO side project
● Girls in Tech Taiwan
● Taiwan Spark User Group co-founder
https://github.com/wlsherica/StarkTechnology
Survey
Agenda
● Life of Big Data Technologies
● Hadoop and its Ecosystem
● Hadoop Architecture
● Hadoop HDFS Introduction
● Next Steps to Set Up Hadoop
Life of Big Data Technologies
Big data is high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization.
- Doug Laney, 2012
What is Hadoop
● A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models - hadoop.apache.org
Question
So, Hadoop is a framework that allows for the distributed processing of
1) small data?
2) large data?
Ans
Large data
# It can also process small data. However, to experience the power of Hadoop, you need data on the order of terabytes.
What do we need?
● Credit card
● Ubuntu Server 14.04 LTS (HVM)
● Java 7
● hadoop-2.7.1.tar.gz
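A rough install sketch for a fresh Ubuntu 14.04 host (for example an EC2 instance); the package name, download mirror, and JAVA_HOME path below are assumptions and may differ in your environment:

sudo apt-get update
sudo apt-get install -y openjdk-7-jdk
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
tar -xzf hadoop-2.7.1.tar.gz
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64   # assumed OpenJDK 7 path on 64-bit Ubuntu
export HADOOP_HOME=$PWD/hadoop-2.7.1
export PATH=$PATH:$HADOOP_HOME/bin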
Why not Java 8?
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
Hadoop HDFS Introduction
What is HDFS
● Hadoop Distributed File System
● Good for
○ Large datasets
○ Streaming data access
● Design
○ Single master - NameNode
○ Multiple slaves - DataNodes
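A minimal pseudo-distributed configuration sketch for this single-master design (one NameNode plus DataNodes); the localhost address, port 9000, and replication factor 1 are assumptions for a one-node test setup:

# etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

# etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

# Format the NameNode once, then start HDFS
bin/hdfs namenode -format
sbin/start-dfs.sh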
[Diagram: HDFS read path - the HDFS client in the client JVM opens the file via the DistributedFileSystem (1: open), gets block locations from the NameNode (2: get block locations), reads the blocks directly from the DataNodes (3, 4, 5: read), and closes the stream (6: close).]
DataNode
● Store data blocks
● Receive blocks from the client
● Receive delete commands from the NameNode
NameNode
● File system metadata
● File system name -> blocks
● Block -> replicas
● File system image
● Edit log for every file system modification
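To see this metadata in action, fsck can print the file -> block -> replica mapping held by the NameNode (a sketch; the path is only an example):

hdfs fsck /user/hadoop/file -files -blocks -locations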
Hadoop HDFS Operations
● Files within a directory
hadoop fs -ls /
● Recursive version of ls
hadoop fs -lsr /
● Create a directory
hadoop fs -mkdir /test
● Copy src from local to HDFS
hadoop fs -put localfile /user/hadoop/hadoopfile
● Copy file to local system
hadoop fs -get /user/hadoop/file localfile
● Output file in text format
hadoop fs -text /user/hadoop/file
● Delete files
hadoop fs -rm /user/hadoop/file
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html
hadoop fs -help ls
● Check the health of the HDFS cluster (NameNode and all DataNodes)
hadoop dfsadmin -report
● Display sizes of files and directories
hadoop fs -du /user/hadoop/file
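On Hadoop 2.x the same report is also available through the hdfs command, and -du accepts -h for human-readable sizes:

hdfs dfsadmin -report
hadoop fs -du -h /user/hadoop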
Training Materials
● Cloudera VM
○ Cloudera CDH 5.4
● Spark 1.3.0
● 64-bit host OS
● RAM: 4 GB
● VMware, KVM, or VirtualBox
TODO (HDFS)
1. Open your VM
2. Create a folder named “todo_1” under /user/cloudera
3. Copy the file /usr/lib/spark/NOTICE to /user/cloudera/todo_1
4. Output this HDFS file in text format and show the first 3 lines
5. Remove the todo_1 folder
6. Voila (one possible command sequence is sketched below)
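One possible command sequence (a sketch; it assumes the Cloudera VM defaults and that /user/cloudera already exists):

hadoop fs -mkdir /user/cloudera/todo_1
hadoop fs -put /usr/lib/spark/NOTICE /user/cloudera/todo_1
hadoop fs -text /user/cloudera/todo_1/NOTICE | head -3
hadoop fs -rm -r /user/cloudera/todo_1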