Analyzing Data with Zeppelin (Spark)

VCNC
November 05, 2014

Analyzing Data with Zeppelin (powered by Apache Spark)

스사모 (Korea Spark User Group) https://www.facebook.com/groups/sparkkoreauser/

Transcript

  1. Analyzing Data with Zeppelin (powered by Apache Spark)
    2014-11-05, 스사모 (Korea Spark User Group) https://www.facebook.com/groups/sparkkoreauser/
    Sangwoo Kim, VCNC (Between) [email protected]
  2. Apache Spark?
    • Supports the same kinds of jobs as MapReduce
    • Extensibility (Spark SQL, Spark Streaming, MLlib, GraphX)
    • A much simpler interface than MapReduce, and easy to learn (Scala, REPL)
    • 5x to 50x faster than MapReduce, depending on the type of job (in-memory data)
    • Compatible with Hadoop storage (HDFS, HBase, S3, ..)
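  3. Why is it needed?
    • MapReduce, Hive (the dominant technologies so far)
    • They are very powerful, but they grow less efficient as jobs become more complex (intermediate results are repeatedly written to HDFS).
    • The API is complex, and a job built by chaining several MR jobs is hard to maintain.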
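  4. Spark Key Concept
    • RDD (Resilient Distributed Datasets)
      ‣ A list shared across the whole cluster and held in memory (spills to disk when memory runs short)
      ‣ Supports a wide range of operations: map, reduce, count, filter, join, and so on
      ‣ Operations are declared up front and computed lazily, only when a result is requested
    • Scala
      ‣ A very good language for data analysis
      ‣ Powerful expressions, compatibility with Java
      ‣ Interactive Shell (REPL)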
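The lazy evaluation described on this slide can be imitated on a single machine with a plain Scala `Iterator`, whose `map` and `filter` are also deferred until a result is requested. This is only a local analogy, not Spark code: the object name `LazyPipelineSketch`, the `evaluated` counter, and the sample data are all assumptions made for illustration.

```scala
object LazyPipelineSketch {
  // Counts how many elements the map step has actually processed.
  var evaluated = 0

  // Only declares the pipeline. Iterator.map and Iterator.filter are lazy,
  // so no element is touched while this method runs.
  def pipeline(data: Seq[Int]): Iterator[Int] =
    data.iterator
      .map { x => evaluated += 1; x * 2 }
      .filter(_ % 3 == 0)

  def main(args: Array[String]): Unit = {
    val p = pipeline(1 to 10)
    println(s"after declaring the pipeline: evaluated = $evaluated")

    // Asking for a result (here, the sum) is what forces the computation.
    val total = p.sum
    println(s"sum = $total, evaluated = $evaluated")
  }
}
```

In Spark terms, `map` and `filter` play the role of transformations and `sum` plays the role of an action; an RDD additionally distributes the work across the cluster and can be re-evaluated, which an `Iterator` cannot.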
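  5. Spark is good
    • Large jobs that used to need a Hadoop cluster of dozens of machines can run on a cluster of 10 or fewer
    • Jobs that used to need a cluster can run on 1 or 2 machines
    • Jobs that used to take tens of minutes finish in about a minute
    • The tedious cycle of writing MR job code, packaging it, and submitting it is replaced by typing one line of code in the shell
    • Easy to learn, even for first-time users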
  6. Word Count
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
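To see what the word-count pipeline computes without a cluster, the same steps can be sketched on ordinary Scala collections. This is a local stand-in under stated assumptions: `LocalWordCount` and `wordCount` are hypothetical names, and `groupBy` followed by a per-key `reduce` is used here in place of Spark's `reduceByKey`, which plain collections do not have.

```scala
object LocalWordCount {
  // Mirrors the slide's pipeline: split lines into words, pair each word
  // with 1, then add up the 1s per word.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))
      .filter(_.nonEmpty)              // ignore empty tokens from repeated spaces
      .map(word => (word, 1))
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => (word, pairs.map(_._2).reduce(_ + _)) }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("spark is fast", "spark is simple"))
    counts.foreach { case (word, n) => println(s"$word\t$n") }
  }
}
```

The difference that matters in practice is that `reduceByKey` combines values on each partition before shuffling, whereas this local version simply groups everything in memory.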
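  7. Getting Download Data
    // One row per download request; registered below as a table for Spark SQL.
    case class CloudFrontPcVerChart(date: String, country: String, ip: String,
                                    http_method: String, ua: String)

    // CloudFront access logs for October 2014, stored on S3.
    val cloudFrontPcVerLogs = "s3n://assets-between-pc-logs/*2014-10-*"

    // Keep only installer download requests and split each tab-separated line into fields.
    val cloudFrontPcVerDownloadLogs = sc.textFile(cloudFrontPcVerLogs)
      .filter(_ contains "/downloads/setup.exe")
      .map(x => x.split("\t"))
    cloudFrontPcVerDownloadLogs.first  // inspect one parsed record

    // IP2C is the speaker's own helper (not shown on the slide); it appears to map an IP address to a country.
    val cloudFrontPcVerDownloadChart = cloudFrontPcVerDownloadLogs.map(arr =>
      CloudFrontPcVerChart(arr(0), IP2C.get(arr(4)), arr(4), arr(5), arr(10)))
    cloudFrontPcVerDownloadChart.registerAsTable("pc_ver_download")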
  8. Querying Data
    select country, count(1) value
    from pc_ver_download
    group by country
    order by value desc
    limit 10

    Simple enough!
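For readers more used to collections than SQL, the query above corresponds roughly to a group, count, sort, and take. The sketch below runs on a local `Seq` rather than on the `pc_ver_download` table; the `Download` case class, the `topCountries` name, and the sample rows are hypothetical and exist only for illustration.

```scala
object TopCountries {
  // Hypothetical stand-in for one row of pc_ver_download; only `country`
  // is used by the query.
  final case class Download(country: String, ip: String)

  // group by country, count(1) as value, order by value desc, limit N
  def topCountries(rows: Seq[Download], limit: Int = 10): Seq[(String, Int)] =
    rows
      .groupBy(_.country)
      .map { case (country, group) => (country, group.size) }
      .toSeq
      .sortBy { case (_, value) => -value }
      .take(limit)

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Download("KR", "10.0.0.1"), Download("KR", "10.0.0.2"),
      Download("KR", "10.0.0.3"), Download("US", "10.0.1.1"),
      Download("US", "10.0.1.2"), Download("JP", "10.0.2.1"))
    topCountries(rows, limit = 2).foreach { case (c, v) => println(s"$c\t$v") }
  }
}
```

As with the SQL version, the relative order of countries with equal counts is not defined.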
  9. Extension projects
    • Spark SQL
    • Spark Streaming
    • MLlib
    • GraphX
    • SparkR (planned)
    • Zeppelin
  10. Zeppelin
    • A web-based notebook for Apache Spark (http://zeppelin-project.org)
    • Open source (https://github.com/NFLabs/zeppelin)
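  11. Zeppelin
    • An early-stage project (50 stars on GitHub)
    • A project expected to become hugely popular within the next 1 to 2 years
    • A welcoming project that lists you as a contributor even for a 10-line commit
    • Easy to install; when launched it starts Spark internally (connecting to an external cluster is also possible)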
  12. A live demo is hard to embed in Keynote, so screenshots are shown instead
    Combined with Spark SQL, it also has strong potential as a visualisation tool
    A simple SQL query produces a dashboard in moments
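  13. A live demo is hard to embed in Keynote, so screenshots are shown instead
    Combined with Spark SQL, it also has strong potential as a visualisation tool
    Position, height, and other layout settings can be adjusted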
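  14. Zeppelin
    • Recommended for people who want a simple way to get started with data analysis
    • Recommended for people who want to explore and analyze assorted data quickly
    • Recommended for people who want to build dashboards fast
    • Recommended for people who want to contribute to a hot open-source project
    • If you are new to Spark, try the Spark Shell first (until auto-completion in the Zeppelin code editor improves)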