Analyzing Data with Zeppelin (Spark)

VCNC
November 05, 2014

Analyzing data with Zeppelin (powered by Apache Spark)

스사모 (Korea Spark User Group) https://www.facebook.com/groups/sparkkoreauser/

Transcript

  1. Analyzing Data with Zeppelin (powered by Apache Spark)
     2014-11-05, 스사모 (Korea Spark User Group) https://www.facebook.com/groups/sparkkoreauser/
     Sangwoo Kim (김상우), VCNC (Between) kevin@between.us
  2. Apache Spark?
     • Can run the same kinds of jobs as MapReduce
     • Extensible (Spark SQL, Spark Streaming, MLlib, GraphX)
     • A much simpler interface than MapReduce, and easy to learn (Scala, REPL)
     • 5x to 50x faster than MapReduce, depending on the type of job (in-memory data)
     • Compatible with Hadoop storage (HDFS, HBase, S3, ..)
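  3. Why is it needed?
     • MapReduce and Hive are the existing dominant technologies
     • They are very powerful, but become less efficient as jobs grow more complex (intermediate results are written to HDFS again and again)
     • The API is complex, and a job built by chaining several MR jobs is hard to maintain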
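  4. Spark Key Concept
     • RDD (Resilient Distributed Datasets)
       ‣ A list shared across the whole cluster and held in memory (spills to disk when memory runs short)
       ‣ Supports many operations: map, reduce, count, filter, join, and so on
       ‣ Several operations can be set up in advance; they are computed lazily when a result is requested
     • Scala
       ‣ A very good language for data analysis
       ‣ Powerful expressions, compatibility with Java
       ‣ Interactive Shell (REPL)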
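The lazy evaluation mentioned for RDDs can be tried without a cluster. The sketch below is only an analogy: it uses a plain Scala collection view rather than Spark's API, and the names `LazyDemo`, `pipeline`, and `evaluations` are made up for illustration. It shows that describing a chain of operations does no work until a result is actually requested.

```scala
// Plain Scala collections only; no Spark dependency.
// A view stands in for an RDD's lazy transformations (an analogy, not Spark's API).
object LazyDemo {
  // Counts how many times the mapping function has actually run.
  var evaluations = 0

  // Describes the pipeline; nothing is computed when it is built.
  def pipeline(numbers: Seq[Int]) =
    numbers.view
      .map { n => evaluations += 1; n * 2 }
      .filter(_ % 3 == 0)

  def main(args: Array[String]): Unit = {
    val p = pipeline(1 to 10)
    println(s"after defining the pipeline: $evaluations evaluations")
    val result = p.toList // plays the role of an action: forces the computation
    println(s"result = $result, evaluations = $evaluations")
  }
}
```

In Spark the same split exists between transformations (such as `map` and `filter`) and actions (such as `count` or `saveAsTextFile`); only the latter trigger a job.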
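  5. Spark is good
     • Large jobs that used to need a Hadoop cluster of dozens of machines can run on a cluster of 10 or fewer
     • Jobs that used to need a cluster can run on 1 or 2 machines
     • Jobs that used to take tens of minutes finish in about a minute
     • The complicated routine of writing MR job code, packaging it, and submitting it is replaced by typing one line of code in the shell
     • Easy to learn, even for first-time users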
  6. Code Examples (1): Word Count

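  7. Word Count
       // `spark` is assumed to be the SparkContext here.
       val file = spark.textFile("hdfs://...")
       // Split each line into words, pair each word with 1, then sum the 1s per word.
       val counts = file.flatMap(line => line.split(" "))
                        .map(word => (word, 1))
                        .reduceByKey(_ + _)
       counts.saveAsTextFile("hdfs://...")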
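The same word-count shape can be checked locally before pointing it at HDFS. This is a minimal sketch on ordinary Scala collections, not Spark: `groupBy` plus a sum stands in for `reduceByKey`, and the object name, the sample input, and the extra empty-string filter are assumptions added for illustration.

```scala
// Local stand-in for the word count above, using plain Scala collections (no Spark).
object LocalWordCount {
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))
      .filter(_.nonEmpty)               // not in the slide: drops empty tokens from repeated spaces
      .map(word => (word, 1))
      .groupBy(_._1)                    // collections have no reduceByKey, so group then sum
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val sample = Seq("to be or not to be", "to do") // made-up input
    count(sample).toSeq.sortBy(_._1).foreach(println)
  }
}
```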
  8. Code Examples (2): Getting Between PC Ver. Download

  9. Getting Download Data
       // One row per download request parsed from the CloudFront logs.
       case class CloudFrontPcVerChart(val date: String, val country: String, val ip: String,
                                       val http_method: String, val ua: String)

       val cloudFrontPcVerLogs = "s3n://assets-between-pc-logs/*2014-10-*"
       // Keep only the installer download requests and split each tab-separated line.
       val cloudFrontPcVerDownloadLogs = sc.textFile(cloudFrontPcVerLogs)
         .filter(_ contains "/downloads/setup.exe")
         .map(x => x.split("\t"))
       cloudFrontPcVerDownloadLogs.first

       // IP2C.get presumably maps the client IP (column 4) to a country.
       val cloudFrontPcVerDownloadChart = cloudFrontPcVerDownloadLogs.map(arr =>
         CloudFrontPcVerChart(arr(0), IP2C.get(arr(4)), arr(4), arr(5), arr(10)))
       // Expose the rows to Spark SQL under the table name used by the query.
       cloudFrontPcVerDownloadChart.registerAsTable("pc_ver_download")
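The column indices used above (0, 4, 5, 10) appear to line up with the date, client IP, HTTP method, and user agent columns of a tab-separated CloudFront access log. The sketch below parses one such line without Spark. Everything in it is hypothetical: the `Download` class, the sample line, and the `lookup` parameter, which only stands in for the slide's `IP2C.get` helper whose implementation is not shown.

```scala
// Parses a single tab-separated log line; plain Scala, no Spark.
object LogParsing {
  final case class Download(date: String, country: String, ip: String,
                            httpMethod: String, ua: String)

  // `lookup` is a hypothetical stand-in for IP2C.get (IP address to country).
  // Lines with too few columns yield None instead of throwing.
  def parse(line: String, lookup: String => String): Option[Download] = {
    val arr = line.split("\t", -1) // -1 keeps trailing empty columns
    if (arr.length <= 10) None
    else Some(Download(arr(0), lookup(arr(4)), arr(4), arr(5), arr(10)))
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample line with 11 columns.
    val line = Seq("2014-10-01", "00:00:01", "ICN50", "1024", "192.0.2.1", "GET",
                   "example.cloudfront.net", "/downloads/setup.exe", "200", "-",
                   "Mozilla/5.0").mkString("\t")
    println(parse(line, _ => "KR"))
  }
}
```

Returning an `Option` is a design choice for the sketch: on real logs, header or truncated lines would otherwise fail with an index error inside the `map`.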
  10. Querying Data
        select country, count(1) value
        from pc_ver_download
        group by country
        order by value desc
        limit 10
      Simple enough!
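For readers more at home in Scala than SQL, the query above corresponds roughly to a group, count, sort, and take on a collection. This is a sketch on plain Scala collections with made-up data, not what Spark SQL executes; the object name and the tie-break on country name are assumptions (the SQL leaves the order of ties unspecified).

```scala
// Collection equivalent of:
//   select country, count(1) value from ... group by country order by value desc limit n
object TopCountries {
  def top(countries: Seq[String], n: Int): Seq[(String, Int)] =
    countries
      .groupBy(identity)
      .map { case (country, rows) => (country, rows.size) }
      .toSeq
      .sortBy { case (country, value) => (-value, country) } // descending count, then name
      .take(n)

  def main(args: Array[String]): Unit = {
    val sample = Seq("KR", "JP", "KR", "US", "JP", "KR") // made-up data
    top(sample, 2).foreach(println)
  }
}
```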
  11. Result * Visualization powered by Zeppelin

  12. Extension projects
      • Spark SQL
      • Spark Streaming
      • MLlib
      • GraphX
      • SparkR (planned)
      • Zeppelin
  13. Zeppelin
      • A web-based notebook for Apache Spark (http://zeppelin-project.org)
      • Open source (https://github.com/NFLabs/zeppelin)
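  14. Zeppelin
      • An early-stage project (50 stars on GitHub)
      • A project likely to become very well known within 1 to 2 years
      • A welcoming project that lists you as a contributor even for a 10-line commit
      • Easy to install; when started, it launches Spark internally (connecting to an external cluster is also possible)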
  15. Zeppelin: implementing a dashboard with a few lines of code and a few queries

  16. Zeppelin: Spark & Zeppelin Live Demo

  17. A live demo is hard to embed in Keynote, so screenshots stand in for it.
      ETL, analysis, and visualisation are all handled in a single tool.
  18. ETL, analysis, and visualisation are all handled in a single tool.
  19. Interactive! Enter code or a query and the result appears almost immediately.
  20. Combined with Spark SQL, Zeppelin also shows strong potential as a visualisation tool.
  21. Combined with Spark SQL, Zeppelin also shows strong potential as a visualisation tool.
  22. Combined with Spark SQL, Zeppelin also shows strong potential as a visualisation tool.
      A simple SQL query builds a dashboard in no time.
  23. Combined with Spark SQL, Zeppelin also shows strong potential as a visualisation tool.
      Position, width, and so on can be adjusted.
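  24. Zeppelin
      • Recommended for people who want a simple way to get started with data analysis
      • Recommended for people who want to explore and analyze all sorts of data quickly
      • Recommended for people who want to build a dashboard fast
      • Recommended for people who want to take part in a hot open source project
      • If you are new to Spark, try the Spark Shell first (until the auto-completion in Zeppelin's code editor improves)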
  25. Thank you