Big Data Processing lol

Big Data Processing lol @qcl NTU NLP Lab 2014-08-06

Outline • Hadoop • Map Reduce • Spark

Berkely Data Analysis Stack (BDAS)

Apache Hadoop • http://hadoop.apache.org/ • Open source software for storage
and large- scale processing of datasets on clusters. • Storage ◦ Hadoop Distributed File System (HDFS) • Parallel processing of large datasets ◦ Hadoop MapResuce

Hadoop Distributed File System • Google File System ◦ http://research.google.com/archive/gfs.html
• NameNode ◦ Metadata • DataNode ◦ Data blocks

HDFS Architecture Ref: http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

HDFS on NLP Lab Workstations • on NLG-WKS-* series ◦
nlg17.csie.ntu.edu.tw / nlg-wks.csie.org • NameNode ◦ NLG-WKS-9 ◦ hdfs://NLG-WKS-9:9000/ • DataNode ◦ NLG-WKS-[3-8] • Status ◦ 192.168.28.9:50070

HDFS on NLP Lab Workstations • Add “/usr/local/hadoop/bin” into your
$PATH ◦ Edit your ~/.bashrc ◦ $ export PATH=$PATH:/usr/local/hadoop/bin • Put your data on /user/username

HDFS Shell Commands • $ hadoop fs -ls path ◦
list files in path • $ hadoop fs -mkdir path ◦ create a dir • $ hadoop fs -rm path ◦ remove path • $ hadoop fs -put local-path hadoop-path ◦ upload local-path to hadoop-path • $ hadoop fs -get hadoop-path local-path ◦ download hadoop-path to local-path • More commands: http://goo.gl/BBdAzK

Hadoop MapReduce • MapReduce ◦ http://research.google.com/archive/mapreduce.html • Mapper ◦ output
(key, value) • Reducer ◦ input (key, [values])

MapReduce Framework

Example: Word Count

Example: Word Count • Mapper Line Number, Text in Line
Type of Key, Type of Value key=word, value=one

Example: Word Count • Reducer Code: https://gist.github.com/qcl/bc381f33dbe6976f2aa6

MapReduce on NLP Lab Workstations • Status ◦ 192.168.28.9:50030 •
hadoop *.jar path ◦ /usr/local/hadoop/

MapReduce on NLP Lab Workstations • Compile ◦ $ mkdir
YourMapReduce ◦ $ javac -classpath=/usr/local/hadoop/hadoop-core- 1.2.1.jar -d YourMapReduce YourMapReduce.java • Pack ◦ $ jar -cvf YourMapReduce.jar -C YourMapReduce . • Run ◦ $ hadoop jar YourMapReduce.jar your.class.name. YourMapReduce arguments

Apache Spark • https://spark.apache.org/ • Fast and general engine for
large-scale data processing • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. • Write applications quickly in Java/Scala/Python

Apache Spark • RDD • Transformations / Actions

Example: Word Count Code: https://gist.github.com/qcl/dfa0fd979c18738539c8

Spark on NLP Lab Workstations • MasterNode ◦ spark://NLG-WKS-9:7077 •
Status ◦ 192.168.28.9:8080 • Add “/usr/local/spark/bin” into your $PATH Edit your ~/.bashrc $ export PATH=$PATH:/usr/local/spark/bin • Run ◦ $ spark-submit --master spark://NLG-WKS-9:7077 YourPyhonSparkScript.py

有什麼問題嗎？ Questions?

One more thing...

Virtualenv • Build your own python env ◦ $ virtualenv
env-name ◦ $ vittualenv --no-site-packages env-name ▪ No site packages • Work in your env ◦ $ source env-name/bin/active ◦ Do anything in your env ^_< • Leave env ◦ (env-name) $ deactivate

第一次全面佔領CPU就上手

第一次全面佔領CPU就上手 • while True ? ◦ 這樣只用了一顆CPU，太弱了 • Processes pool
◦ multiprocessing # Create pool pool = multiprocessing.Pool(processes=NumberOfProcesses) # Apply tasks pool.apply_async(function, (args) ) # Close pool pool.close() # Wait pool.join()

第一次全面佔領CPU就上手 import multiprocessing def doSth(job): doSomething return something jobs =
[...] results = [] pool = multiprocessing.Pool(processes=multiprocessing.cpu_count()) for job in jobs: result.append( pool.apply_async(doSth, (job) ) ) pool.close() pool.join() for r in results: result = r.get()

ccli @ nlg.csie.ntu.edu.tw qcl @ qcl.tw Thanks ^_<

Big Data Processing lol

Big Data Processing lol

Qing-Cheng Li

More Decks by Qing-Cheng Li

Other Decks in Programming

Featured

Transcript