Analyzing Data with Zeppelin (Spark)

VCNC
November 05, 2014

Analyzing data with Zeppelin (powered by Apache Spark)

스사모 (Korea Spark User Group) https://www.facebook.com/groups/sparkkoreauser/

Transcript

  1. Analyzing Data with Zeppelin (powered by Apache Spark)

    2014-11-05, 스사모 (Korea Spark User Group) https://www.facebook.com/groups/sparkkoreauser/
    Sangwoo Kim, VCNC (Between) [email protected]
  2. Apache Spark?

    • Can run jobs similar to MapReduce
    • Extensible (Spark SQL, Spark Streaming, MLlib, GraphX)
    • A much simpler interface than MapReduce, and easy to learn (Scala, REPL)
    • 5x to 50x faster than MapReduce, depending on the type of job (in-memory data)
    • Compatible with Hadoop storage (HDFS, HBase, S3, ...)
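  3. Why is it needed?

    • MapReduce and Hive (the dominant technologies so far)
    • Very powerful, but increasingly inefficient as jobs grow more complex (intermediate results are repeatedly written to HDFS)
    • The API is complex, and a job built by chaining several MR jobs is hard to maintain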
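  4. Spark Key Concept

    • RDD (Resilient Distributed Datasets)
      ‣ A list shared across the whole cluster and held in memory (spills to disk when memory runs short)
      ‣ Supports many operations: map, reduce, count, filter, join, and so on
      ‣ Operations are set up in advance and computed lazily, only when a result is requested
    • Scala
      ‣ A very good language for data analysis
      ‣ Powerful expressions, compatibility with Java
      ‣ Interactive Shell (REPL)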
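
The lazy-evaluation point on this slide can be sketched as follows. This is a minimal sketch, assuming spark-core is on the classpath and a local master (`local[2]`); the object name `LazyRddSketch`, the `run` helper, and the number range are made up for illustration and do not come from the slides. `persist(StorageLevel.MEMORY_AND_DISK)` asks Spark to keep partitions in memory and to write the ones that do not fit to disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LazyRddSketch {
  // Returns (count, sum) of the squares of the even numbers in 1..1000.
  def run(): (Long, Long) = {
    // Assumption: a local two-thread master instead of a real cluster.
    val conf = new SparkConf().setAppName("lazy-rdd-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val numbers = sc.parallelize(1 to 1000)
      // Transformations only record what to do; no job runs yet.
      val squares = numbers.filter(_ % 2 == 0).map(n => n.toLong * n)
      // Keep the result in memory, falling back to disk if it does not fit.
      squares.persist(StorageLevel.MEMORY_AND_DISK)
      // Actions trigger the computation; the second one can reuse the persisted data.
      val count = squares.count()
      val sum = squares.reduce(_ + _)
      (count, sum)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (count, sum) = run()
    println(s"count=$count sum=$sum")
  }
}
```

Nothing is computed when `filter` and `map` are called; the work happens at `count()`, and `reduce` then runs against the persisted partitions rather than recomputing them from the source range.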
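  5. Spark is good

    • Large jobs that needed a Hadoop cluster of dozens of machines can run on a cluster of 10 or fewer
    • Jobs that used to need a cluster can run on 1 or 2 machines
    • Jobs that took tens of minutes finish in about a minute
    • The complicated cycle of writing MR job code, packaging it, and submitting it is replaced by typing one line of code in the shell
    • Easy to learn, even for first-time users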
  6. Word Count

    // `spark` is a SparkContext here (spark-shell exposes it as `sc`)
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
  7. Getting Download Data

    // One row per download request, parsed from a CloudFront access log line
    case class CloudFrontPcVerChart(date: String, country: String, ip: String,
                                    http_method: String, ua: String)

    val cloudFrontPcVerLogs = "s3n://assets-between-pc-logs/*2014-10-*"
    // Keep only installer downloads and split each tab-separated line into fields
    val cloudFrontPcVerDownloadLogs = sc.textFile(cloudFrontPcVerLogs)
      .filter(_ contains "/downloads/setup.exe")
      .map(x => x.split("\t"))
    cloudFrontPcVerDownloadLogs.first
    // IP2C.get fills the country field from the client IP address
    val cloudFrontPcVerDownloadChart = cloudFrontPcVerDownloadLogs.map(arr =>
      CloudFrontPcVerChart(arr(0), IP2C.get(arr(4)), arr(4), arr(5), arr(10)))
    cloudFrontPcVerDownloadChart.registerAsTable("pc_ver_download")
  8. Querying Data

    select country, count(1) value
    from pc_ver_download
    group by country
    order by value desc
    limit 10

    Simple enough!
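
To run a query like the one on slide 8 from Scala code, a minimal sketch follows. It assumes a recent Spark (2.x or later) with spark-sql on the classpath, where `SparkSession` and `createOrReplaceTempView` take the place of the `registerAsTable` call on slide 7. The `Download` case class, the sample rows, and the object name `TopCountriesSketch` are hypothetical stand-ins for the real log data.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical, simplified record; the case class on slide 7 has more fields.
case class Download(date: String, country: String)

object TopCountriesSketch {
  // Returns (country, download count) pairs, most downloads first.
  def run(): Seq[(String, Long)] = {
    // Assumption: a local two-thread master instead of a real cluster.
    val spark = SparkSession.builder()
      .appName("top-countries-sketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._
    try {
      // Made-up sample rows standing in for the parsed log lines.
      val downloads = Seq(
        Download("2014-10-01", "KR"),
        Download("2014-10-01", "JP"),
        Download("2014-10-02", "KR"),
        Download("2014-10-02", "US"),
        Download("2014-10-03", "KR"),
        Download("2014-10-03", "JP")
      ).toDS()
      // Expose the dataset to SQL under the table name used on the slides.
      downloads.createOrReplaceTempView("pc_ver_download")
      spark.sql(
        """select country, count(1) value
          |from pc_ver_download
          |group by country
          |order by value desc
          |limit 10""".stripMargin)
        .collect()
        .map(row => (row.getString(0), row.getLong(1)))
        .toSeq
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

The SQL text is the same as on the slide; only the table registration and the session setup differ from the 2014-era API.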
  9. Extension projects

    • Spark SQL
    • Spark Streaming
    • MLlib
    • GraphX
    • SparkR (planned)
    • Zeppelin
  10. Zeppelin

    • A web-based notebook for Apache Spark (http://zeppelin-project.org)
    • Open source (https://github.com/NFLabs/zeppelin)
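  11. Zeppelin

    • An early-stage project (50 stars on GitHub)
    • A project likely to become very well known within 1 to 2 years
    • A welcoming project: commit even 10 lines and you are added as a contributor
    • Easy to install; when started, it launches Spark internally (connecting to an external cluster is also possible)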
  12. A live demo is hard to embed in Keynote, so screenshots are shown instead

    Combined with Spark SQL, it also has great potential as a visualisation tool
    Build a dashboard in moments with a simple SQL query
  13. A live demo is hard to embed in Keynote, so screenshots are shown instead

    Combined with Spark SQL, it also has great potential as a visualisation tool
    Adjust position, width, and so on
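  14. Zeppelin

    • Recommended for people who want an easy way to start analyzing data
    • Recommended for people who want to explore and analyze all sorts of data nimbly
    • Recommended for people who want to build dashboards quickly
    • Recommended for people who want to contribute to a hot open source project
    • If you are new to Spark, try the Spark Shell first (until auto-completion in the Zeppelin code editor improves)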