Big Data Processing lol

Introduction to Apache Hadoop (HDFS, MapReduce) and Spark.

Qing-Cheng Li

August 06, 2014
Transcript

  1. Apache Hadoop • http://hadoop.apache.org/ • Open source software for storage

    and large-scale processing of datasets on clusters. • Storage ◦ Hadoop Distributed File System (HDFS) • Parallel processing of large datasets ◦ Hadoop MapReduce
  2. HDFS on NLP Lab Workstations • on NLG-WKS-* series ◦

    nlg17.csie.ntu.edu.tw / nlg-wks.csie.org • NameNode ◦ NLG-WKS-9 ◦ hdfs://NLG-WKS-9:9000/ • DataNode ◦ NLG-WKS-[3-8] • Status ◦ 192.168.28.9:50070
  3. HDFS on NLP Lab Workstations • Add “/usr/local/hadoop/bin” into your

    $PATH ◦ Edit your ~/.bashrc ◦ $ export PATH=$PATH:/usr/local/hadoop/bin • Put your data on /user/username
  4. HDFS Shell Commands • $ hadoop fs -ls path ◦

    list files in path • $ hadoop fs -mkdir path ◦ create a dir • $ hadoop fs -rm path ◦ remove path • $ hadoop fs -put local-path hadoop-path ◦ upload local-path to hadoop-path • $ hadoop fs -get hadoop-path local-path ◦ download hadoop-path to local-path • More commands: http://goo.gl/BBdAzK
  5. Example: Word Count • Mapper ◦ input: key = line number, value = text in line ◦ output: key = word, value = one ◦ each pair has a declared type of key and type of value
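Production word-count mappers and reducers for Hadoop are usually written in Java, but the (key, value) flow the slide describes can be sketched in plain Python. This is an illustrative simulation, not the Hadoop API; `mapper`, `reducer`, and the sample lines are made up for the example:

```python
from collections import defaultdict

def mapper(line_number, text):
    # Input: key = line number, value = text in line.
    # Output: one (word, 1) pair per word.
    for word in text.split():
        yield (word, 1)

def reducer(word, counts):
    # Input: a word and every 1 emitted for it by the mappers.
    # Output: (word, total count).
    return (word, sum(counts))

# Simulate the shuffle phase: group mapper output by key.
lines = ["hello big data", "hello hadoop"]
grouped = defaultdict(list)
for i, line in enumerate(lines):
    for word, one in mapper(i, line):
        grouped[word].append(one)

counts = dict(reducer(w, c) for w, c in grouped.items())
print(counts)  # {'hello': 2, 'big': 1, 'data': 1, 'hadoop': 1}
```

The shuffle step in the middle is what Hadoop does for you between the map and reduce phases: it collects all values emitted under the same key and hands them to one reducer call.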
  6. MapReduce on NLP Lab Workstations • Compile ◦ $ mkdir

    YourMapReduce ◦ $ javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d YourMapReduce YourMapReduce.java • Pack ◦ $ jar -cvf YourMapReduce.jar -C YourMapReduce . • Run ◦ $ hadoop jar YourMapReduce.jar your.class.name.YourMapReduce arguments
  7. Apache Spark • https://spark.apache.org/ • Fast and general engine for

    large-scale data processing • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. • Write applications quickly in Java/Scala/Python
  8. Spark on NLP Lab Workstations • MasterNode ◦ spark://NLG-WKS-9:7077 •

    Status ◦ 192.168.28.9:8080 • Add “/usr/local/spark/bin” into your $PATH ◦ Edit your ~/.bashrc ◦ $ export PATH=$PATH:/usr/local/spark/bin • Run ◦ $ spark-submit --master spark://NLG-WKS-9:7077 YourPythonSparkScript.py
  9. Virtualenv • Build your own python env ◦ $ virtualenv

    env-name ◦ $ virtualenv --no-site-packages env-name ▪ No site packages • Work in your env ◦ $ source env-name/bin/activate ◦ Do anything in your env ^_< • Leave env ◦ (env-name) $ deactivate
  10. Taking Over Every CPU on Your First Try • while True ? ◦ That only uses one CPU, which is too weak • Processes pool

    ◦ multiprocessing # Create pool pool = multiprocessing.Pool(processes=NumberOfProcesses) # Apply tasks pool.apply_async(function, (args,)) # Close pool pool.close() # Wait pool.join()
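The pool pattern above in a minimal runnable form; the `square` worker and the job list are placeholders for your own task:

```python
import multiprocessing

def square(job):
    # Stand-in worker; real code would do the actual task here.
    return job * job

jobs = [1, 2, 3, 4]
results = []
# Create pool, one worker per CPU core
pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
# Apply tasks; note the one-element tuple (job,) for the argument list
for job in jobs:
    results.append(pool.apply_async(square, (job,)))
# Close pool: no more tasks will be submitted
pool.close()
# Wait for all workers to finish
pool.join()
values = [r.get() for r in results]
print(values)  # [1, 4, 9, 16]
```

On Windows or recent macOS, where child processes start via spawn rather than fork, the pool setup should live under an `if __name__ == "__main__":` guard.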
  11. Taking Over Every CPU on Your First Try import multiprocessing def doSth(job): doSomething return something jobs =

    [...] results = [] pool = multiprocessing.Pool(processes=multiprocessing.cpu_count()) for job in jobs: results.append( pool.apply_async(doSth, (job,)) ) pool.close() pool.join() for r in results: result = r.get()