
Big Data Processing lol

Introduction to Apache Hadoop (HDFS, MapReduce) and Spark.

Qing-Cheng Li

August 06, 2014

Transcript

  1. Apache Hadoop • http://hadoop.apache.org/ • Open source software for storage and large-scale processing of datasets on clusters • Storage ◦ Hadoop Distributed File System (HDFS) • Parallel processing of large datasets ◦ Hadoop MapReduce
  2. HDFS on NLP Lab Workstations • on NLG-WKS-* series ◦ nlg17.csie.ntu.edu.tw / nlg-wks.csie.org • NameNode ◦ NLG-WKS-9 ◦ hdfs://NLG-WKS-9:9000/ • DataNode ◦ NLG-WKS-[3-8] • Status ◦ 192.168.28.9:50070
  3. HDFS on NLP Lab Workstations • Add “/usr/local/hadoop/bin” into your $PATH ◦ Edit your ~/.bashrc ◦ $ export PATH=$PATH:/usr/local/hadoop/bin • Put your data on /user/username
  4. HDFS Shell Commands • $ hadoop fs -ls path ◦ list files in path • $ hadoop fs -mkdir path ◦ create a directory • $ hadoop fs -rm path ◦ remove path • $ hadoop fs -put local-path hadoop-path ◦ upload local-path to hadoop-path • $ hadoop fs -get hadoop-path local-path ◦ download hadoop-path to local-path • More commands: http://goo.gl/BBdAzK
  5. Example: Word Count • Mapper ◦ Input key/value: line number, text of the line ◦ Output key/value: key=word, value=one
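    The deck's mapper is written in Java (compiled on the next slide); purely as an illustration of the same mapping logic, here is a minimal Hadoop Streaming-style sketch in Python (the streaming setup itself is an assumption, not something the deck covers):

    import sys

    # Read input lines from stdin and emit a tab-separated (word, 1)
    # pair for every word, mirroring the mapper described above.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))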
  6. MapReduce on NLP Lab Workstations • Compile ◦ $ mkdir YourMapReduce ◦ $ javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d YourMapReduce YourMapReduce.java • Pack ◦ $ jar -cvf YourMapReduce.jar -C YourMapReduce . • Run ◦ $ hadoop jar YourMapReduce.jar your.class.name.YourMapReduce arguments
  7. Apache Spark • https://spark.apache.org/ • Fast and general engine for large-scale data processing • Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk • Write applications quickly in Java/Scala/Python
  8. Spark on NLP Lab Workstations • MasterNode ◦ spark://NLG-WKS-9:7077 • Status ◦ 192.168.28.9:8080 • Add “/usr/local/spark/bin” into your $PATH ◦ Edit your ~/.bashrc ◦ $ export PATH=$PATH:/usr/local/spark/bin • Run ◦ $ spark-submit --master spark://NLG-WKS-9:7077 YourPythonSparkScript.py
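    As a rough sketch of what such a script might contain (not from the deck; the application name and HDFS input path below are made-up examples), a minimal PySpark word count could look like this:

    from pyspark import SparkContext

    # Connect to the cluster; the --master URL passed to spark-submit
    # takes precedence over anything set here.
    sc = SparkContext(appName="WordCount")

    # Read a text file from HDFS (hypothetical path), split each line into
    # words, map every word to (word, 1), and sum the counts per word.
    counts = (sc.textFile("hdfs://NLG-WKS-9:9000/user/username/input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))

    for word, count in counts.collect():
        print("%s\t%d" % (word, count))

    sc.stop()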
  9. Virtualenv • Build your own Python env ◦ $ virtualenv env-name ◦ $ virtualenv --no-site-packages env-name ▪ No site packages • Work in your env ◦ $ source env-name/bin/activate ◦ Do anything in your env ^_< • Leave env ◦ (env-name) $ deactivate
  10. Taking over every CPU on your first try • while True ? ◦ That only uses a single CPU, which is too weak • Process pool ◦ multiprocessing

    # Create pool
    pool = multiprocessing.Pool(processes=NumberOfProcesses)
    # Apply tasks (the arguments must be passed as a tuple)
    pool.apply_async(function, args)
    # Close pool: no more tasks will be submitted
    pool.close()
    # Wait for all workers to finish
    pool.join()
  11. Taking over every CPU on your first try

    import multiprocessing

    def doSth(job):
        doSomething            # placeholder for the real work
        return something

    jobs = [...]
    results = []

    # One worker process per CPU core
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())

    for job in jobs:
        # apply_async takes the arguments as a tuple: (job,)
        results.append(pool.apply_async(doSth, (job,)))

    pool.close()   # no more tasks will be submitted
    pool.join()    # wait for all workers to finish

    for r in results:
        result = r.get()   # retrieve the return value of each task