Slide 1

Big Data Processing lol
@qcl
NTU NLP Lab
2014-08-06

Slide 2

Outline
● Hadoop
● MapReduce
● Spark

Slide 3

Berkeley Data Analytics Stack (BDAS)

Slide 4

Apache Hadoop
● http://hadoop.apache.org/
● Open-source software for storage and large-scale processing of datasets on clusters
● Storage
○ Hadoop Distributed File System (HDFS)
● Parallel processing of large datasets
○ Hadoop MapReduce

Slide 5

Hadoop Distributed File System
● Google File System
○ http://research.google.com/archive/gfs.html
● NameNode
○ Metadata
● DataNode
○ Data blocks
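
To see how the NameNode's metadata and the DataNodes' blocks fit together, the fsck tool reports the blocks that make up a file and which DataNodes hold them. A quick example (the path is only a placeholder):

$ hadoop fsck /user/username/corpus.txt -files -blocks -locations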

Slide 6

HDFS Architecture
Ref: http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

Slide 7

HDFS on NLP Lab Workstations
● On the NLG-WKS-* series
○ nlg17.csie.ntu.edu.tw / nlg-wks.csie.org
● NameNode
○ NLG-WKS-9
○ hdfs://NLG-WKS-9:9000/
● DataNode
○ NLG-WKS-[3-8]
● Status
○ 192.168.28.9:50070

Slide 8

HDFS on NLP Lab Workstations
● Add “/usr/local/hadoop/bin” to your $PATH
○ Edit your ~/.bashrc
○ $ export PATH=$PATH:/usr/local/hadoop/bin
● Put your data under /user/username

Slide 9

HDFS Shell Commands
● $ hadoop fs -ls path
○ List the files in path
● $ hadoop fs -mkdir path
○ Create a directory
● $ hadoop fs -rm path
○ Remove path
● $ hadoop fs -put local-path hadoop-path
○ Upload local-path to hadoop-path
● $ hadoop fs -get hadoop-path local-path
○ Download hadoop-path to local-path
● More commands: http://goo.gl/BBdAzK
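
For example, a typical session for getting a corpus onto HDFS and checking it might look like this (the directory and file names are only placeholders):

$ hadoop fs -mkdir /user/username/data
$ hadoop fs -put corpus.txt /user/username/data
$ hadoop fs -ls /user/username/data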

Slide 10

Hadoop MapReduce
● MapReduce
○ http://research.google.com/archive/mapreduce.html
● Mapper
○ Output: (key, value)
● Reducer
○ Input: (key, [values])
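
To make the (key, value) flow concrete, here is a tiny pure-Python illustration of what happens between the two phases; it is only a sketch of the data flow, not Hadoop API code:

# Map phase: every input record is turned into one or more (key, value) pairs.
records = [("doc1", "apple"), ("doc2", "apple"), ("doc1", "banana")]
mapped = [(word, doc) for doc, word in records]        # e.g. ("apple", "doc1")

# Shuffle: the framework groups together all values that share a key.
grouped = {}
for key, value in mapped:
    grouped.setdefault(key, []).append(value)          # ("apple", ["doc1", "doc2"])

# Reduce phase: each reducer sees one key with the list of all its values.
for key, values in grouped.items():
    print(key, sorted(values))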

Slide 11

MapReduce Framework

Slide 12

Example: Word Count

Slide 13

Example: Word Count
● Mapper
○ Input: (line number, text in line)
○ Output: key=word, value=one

Slide 14

Example: Word Count
● Reducer
Code: https://gist.github.com/qcl/bc381f33dbe6976f2aa6
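
The gist above links to the actual Java mapper and reducer. For reference, the same logic written as two small Python scripts in the style used by Hadoop Streaming (mapper and reducer reading from stdin and writing tab-separated key/value lines) might look roughly like this; treat it as a sketch, not the code from the slides:

# mapper.py: emit (word, 1) for every word on every input line
import sys
for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

# reducer.py: input arrives sorted by key, so counts can be summed per word
import sys
current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, count))
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print("%s\t%d" % (current_word, count))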

Slide 15

MapReduce on NLP Lab Workstations
● Status
○ 192.168.28.9:50030
● Hadoop *.jar path
○ /usr/local/hadoop/

Slide 16

MapReduce on NLP Lab Workstations
● Compile
○ $ mkdir YourMapReduce
○ $ javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d YourMapReduce YourMapReduce.java
● Pack
○ $ jar -cvf YourMapReduce.jar -C YourMapReduce .
● Run
○ $ hadoop jar YourMapReduce.jar your.class.name.YourMapReduce arguments

Slide 17

Apache Spark
● https://spark.apache.org/
● A fast and general engine for large-scale data processing
● Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
● Write applications quickly in Java, Scala, or Python

Slide 18

Apache Spark
● RDD (Resilient Distributed Dataset)
● Transformations / Actions
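
Roughly speaking (assuming pyspark is available), transformations such as filter and map only describe a new RDD lazily, while actions such as count and collect actually run the job. A minimal sketch:

from pyspark import SparkContext

sc = SparkContext(appName="RDDDemo")

nums = sc.parallelize(list(range(10)))     # build an RDD from a Python collection
evens = nums.filter(lambda x: x % 2 == 0)  # transformation: lazy, nothing runs yet
squares = evens.map(lambda x: x * x)       # transformation: still lazy

print(squares.count())                     # action: the job runs, prints 5
print(squares.collect())                   # action: [0, 4, 16, 36, 64]

sc.stop()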

Slide 19

Example: Word Count
Code: https://gist.github.com/qcl/dfa0fd979c18738539c8
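
The gist contains the script used in the lab. A typical PySpark word count looks roughly like the sketch below; the input path is only a placeholder:

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

lines = sc.textFile("hdfs://NLG-WKS-9:9000/user/username/input.txt")  # placeholder path
counts = (lines.flatMap(lambda line: line.split())   # one record per word
               .map(lambda word: (word, 1))          # (key, value) pairs
               .reduceByKey(lambda a, b: a + b))     # sum the counts for each word

for word, count in counts.collect():                 # action: run the job
    print(word, count)

sc.stop()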

Slide 20

Spark on NLP Lab Workstations
● Master node
○ spark://NLG-WKS-9:7077
● Status
○ 192.168.28.9:8080
● Add “/usr/local/spark/bin” to your $PATH
○ Edit your ~/.bashrc
○ $ export PATH=$PATH:/usr/local/spark/bin
● Run
○ $ spark-submit --master spark://NLG-WKS-9:7077 YourPythonSparkScript.py

Slide 21

Any questions?

Slide 22

One more thing...

Slide 23

Virtualenv
● Build your own Python env
○ $ virtualenv env-name
○ $ virtualenv --no-site-packages env-name
■ No site packages
● Work in your env
○ $ source env-name/bin/activate
○ Do anything in your env ^_<
● Leave the env
○ (env-name) $ deactivate

Slide 24

Taking over all your CPUs on the first try

Slide 25

Taking over all your CPUs on the first try
● while True ?
○ That only uses a single CPU, which is too weak
● Process pool
○ multiprocessing

# Create the pool
pool = multiprocessing.Pool(processes=NumberOfProcesses)
# Submit tasks asynchronously (the arguments go in a tuple)
pool.apply_async(function, (args,))
# Close the pool: no more tasks can be submitted
pool.close()
# Wait for all workers to finish
pool.join()

Slide 26

Taking over all your CPUs on the first try

import multiprocessing

def doSth(job):
    # do something with the job
    return something

jobs = [...]   # your list of work items
results = []

pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
for job in jobs:
    # apply_async expects the arguments as a tuple, hence (job,)
    results.append(pool.apply_async(doSth, (job,)))
pool.close()
pool.join()

for r in results:
    result = r.get()   # get() returns whatever doSth returned (or re-raises its exception)

Slide 27

ccli @ nlg.csie.ntu.edu.tw
qcl @ qcl.tw
Thanks ^_<