Big Data Processing lol
@qcl
NTU NLP Lab 2014-08-06
Outline
● Hadoop
● MapReduce
● Spark
Berkeley Data Analytics Stack (BDAS)
Apache Hadoop
● http://hadoop.apache.org/
● Open source software for storage and large-scale processing of datasets on clusters.
● Storage
○ Hadoop Distributed File System (HDFS)
● Parallel processing of large datasets
○ Hadoop MapReduce
Hadoop Distributed File System
● Based on the Google File System
○ http://research.google.com/archive/gfs.html
● NameNode
○ Metadata
● DataNode
○ Data blocks
HDFS on NLP Lab Workstations
● on NLG-WKS-* series
○ nlg17.csie.ntu.edu.tw / nlg-wks.csie.org
● NameNode
○ NLG-WKS-9
○ hdfs://NLG-WKS-9:9000/
● DataNode
○ NLG-WKS-[3-8]
● Status
○ 192.168.28.9:50070
HDFS on NLP Lab Workstations
● Add “/usr/local/hadoop/bin” to your $PATH
○ Edit your ~/.bashrc
○ $ export PATH=$PATH:/usr/local/hadoop/bin
● Put your data on /user/username
HDFS Shell Commands
● $ hadoop fs -ls path
○ list files in path
● $ hadoop fs -mkdir path
○ create a dir
● $ hadoop fs -rm path
○ remove path
● $ hadoop fs -put local-path hadoop-path
○ upload local-path to hadoop-path
● $ hadoop fs -get hadoop-path local-path
○ download hadoop-path to local-path
● More commands: http://goo.gl/BBdAzK
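For example, uploading a local file into your HDFS home directory and listing it (the file and directory names are only placeholders):
$ hadoop fs -mkdir /user/username/data
$ hadoop fs -put corpus.txt /user/username/data
$ hadoop fs -ls /user/username/data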
Example: Word Count
● Mapper (see the sketch after the Reducer slide below)
○ Input: key = line number, value = text of the line
○ Output: key = word, value = one (for each word in the line)
○ The Mapper class declares the types of its input and output keys and values
Example: Word Count
● Reducer
○ Input: key = word, values = all the ones emitted for that word
○ Output: key = word, value = the total count
Code: https://gist.github.com/qcl/bc381f33dbe6976f2aa6
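The gist above holds the full implementation; the following is only a rough Python sketch of the same key/value flow (not the author's code), to make the mapper and reducer contracts explicit:
def mapper(line_number, line):
    # input: key = line number, value = text of the line
    for word in line.split():
        yield (word, 1)  # output: key = word, value = one

def reducer(word, counts):
    # Hadoop groups the mapper output by key, so counts holds all the
    # ones emitted for this word
    return (word, sum(counts))  # output: key = word, value = total count
The shuffle that groups the mapper output by key happens inside Hadoop, between the two phases.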
MapReduce on NLP Lab Workstations
● Status
○ 192.168.28.9:50030
● Hadoop *.jar location
○ /usr/local/hadoop/
MapReduce on NLP Lab Workstations
● Compile
○ $ mkdir YourMapReduce
○ $ javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d YourMapReduce YourMapReduce.java
● Pack
○ $ jar -cvf YourMapReduce.jar -C YourMapReduce .
● Run
○ $ hadoop jar YourMapReduce.jar your.class.name.YourMapReduce arguments
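For example, assuming the word-count job is in WordCount.java with a top-level class WordCount in the default package (the class name and HDFS paths are only placeholders):
$ mkdir WordCount
$ javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d WordCount WordCount.java
$ jar -cvf WordCount.jar -C WordCount .
$ hadoop jar WordCount.jar WordCount /user/username/input /user/username/output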
Apache Spark
● https://spark.apache.org/
● Fast and general engine for large-scale data processing
● Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
● Write applications quickly in Java/Scala/Python
Apache Spark
● RDD (Resilient Distributed Dataset)
● Transformations / Actions
○ Transformations (e.g. map, filter) are lazy: they only describe a new RDD
○ Actions (e.g. count, collect) trigger the actual computation
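A minimal PySpark sketch (assuming pyspark is available on the workstations; the app name is a placeholder) showing that transformations only build up an RDD, while actions make Spark actually compute:
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")  # placeholder app name

nums = sc.parallelize(list(range(10)))  # build an RDD from a local list

# Transformations are lazy: nothing is computed yet
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions trigger the computation
print(evens.count())    # 5
print(evens.collect())  # [0, 4, 16, 36, 64]

sc.stop()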
Example: Word Count
Code: https://gist.github.com/qcl/dfa0fd979c18738539c8
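The gist linked above is the author's version; a typical PySpark word count looks roughly like this (the HDFS paths are placeholders, using the NameNode address from the HDFS slides):
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

lines = sc.textFile("hdfs://NLG-WKS-9:9000/user/username/input.txt")
counts = (lines.flatMap(lambda line: line.split())  # one record per word
               .map(lambda word: (word, 1))         # (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))    # sum the ones per word
counts.saveAsTextFile("hdfs://NLG-WKS-9:9000/user/username/wordcount-output")

sc.stop()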
Spark on NLP Lab Workstations
● MasterNode
○ spark://NLG-WKS-9:7077
● Status
○ 192.168.28.9:8080
● Add “/usr/local/spark/bin” to your $PATH
○ Edit your ~/.bashrc
○ $ export PATH=$PATH:/usr/local/spark/bin
● Run
○ $ spark-submit --master spark://NLG-WKS-9:7077 YourPythonSparkScript.py
Any questions?
One more thing...
Virtualenv
● Build your own Python env
○ $ virtualenv env-name
○ $ virtualenv --no-site-packages env-name
■ No site packages
● Work in your env
○ $ source env-name/bin/activate
○ Do anything in your env ^_<
● Leave env
○ (env-name) $ deactivate
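A typical session might look like this (the env and package names are just examples):
$ virtualenv --no-site-packages myenv
$ source myenv/bin/activate
(myenv) $ pip install numpy
(myenv) $ python your_script.py
(myenv) $ deactivate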
Taking over all the CPUs on your first try
Taking over all the CPUs on your first try
● while True?
○ That only uses one CPU, which is too weak
● Process pool
○ multiprocessing
# Create a pool of worker processes
pool = multiprocessing.Pool(processes=NumberOfProcesses)
# Submit tasks; the arguments must be passed as a tuple
pool.apply_async(function, args)
# Close the pool: no more tasks can be submitted
pool.close()
# Wait for all tasks to finish
pool.join()
Taking over all the CPUs on your first try
import multiprocessing

def doSth(job):
    # do something with the job
    something = job  # placeholder: replace with your real work
    return something

jobs = [...]  # your list of jobs
results = []

# One worker process per CPU core
pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
for job in jobs:
    # apply_async expects the arguments as a tuple
    results.append(pool.apply_async(doSth, (job,)))
pool.close()
pool.join()

# get() blocks until that task is done and returns its result
for r in results:
    result = r.get()