Data Analysis with Zeppelin (Spark)

VCNC
November 05, 2014

Data Analysis with Zeppelin (powered by Apache Spark)

스사모 (Korea Spark User Group) https://www.facebook.com/groups/sparkkoreauser/

Transcript

  1. Data Analysis with Zeppelin (powered by Apache Spark)
    2014-11-05
    스사모 (Korea Spark User Group)
    https://www.facebook.com/groups/sparkkoreauser/
    Sangwoo Kim, VCNC (Between)
    [email protected]

  2. Apache Spark?
    • Supports the same kinds of jobs as MapReduce
    • Extensible (Spark SQL, Spark Streaming, MLlib, GraphX)
    • A much simpler interface than MapReduce, and easy to learn (Scala, REPL)
    • 5x to 50x faster than MapReduce, depending on the type of job (in-memory data)
    • Compatible with Hadoop storage (HDFS, HBase, S3, ..)

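  3. Why is it needed?
    • MapReduce and Hive (the dominant existing technologies)
    • Very powerful, but increasingly inefficient as jobs get more complex (intermediate results are repeatedly written to HDFS)
    • The API is complex, and a job built by chaining several MR jobs is hard to maintain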

  4. Spark Key Concept
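    • RDD (Resilient Distributed Datasets)
    ‣ A list shared across the whole cluster and held in memory (spills to disk when memory runs short)
    ‣ Supports a variety of operations such as map, reduce, count, filter, and join
    ‣ Operations are declared up front and computed lazily, only when a result is requested
    • Scala
    ‣ A very good language for data analysis
    ‣ Powerful expressions, compatibility with Java
    ‣ Interactive Shell (REPL)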
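
The lazy-evaluation point can be tried without a cluster. The sketch below is not Spark code: it uses a lazy view from the Scala standard library as a stand-in for an RDD, and the names `LazySketch`, `pipeline`, and `evaluated` are made up for illustration. Declaring the `filter` and `map` steps does no work; the work happens only when a result is requested, much as Spark transformations run only when an action is called.

```scala
object LazySketch {
  // Counts how many elements the map step has actually processed.
  var evaluated = 0

  // Declares filter and map steps on a lazy view; nothing is computed here.
  def pipeline(data: Seq[Int]) =
    data.view
      .filter(_ % 2 == 0)
      .map { n => evaluated += 1; n * n }

  def main(args: Array[String]): Unit = {
    val squares = pipeline(1 to 10)
    println(s"after declaring the steps: evaluated = $evaluated")
    val total = squares.sum // asking for a result forces the computation
    println(s"after computing sum = $total: evaluated = $evaluated")
  }
}
```

In this analogy `sum` plays the role of a Spark action such as `count` or `reduce`.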

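  5. Spark is good
    • A large job that needed a Hadoop cluster of dozens of machines can run on a cluster of 10 or fewer
    • Jobs that had to run on a cluster can run on 1 or 2 machines
    • Jobs that took tens of minutes finish in about a minute
    • The complex routine of writing MR job code, packaging it, and submitting it is replaced by typing a single line of code in the shell
    • Easy to learn, even for first-time users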

  6. Code Examples (1)
    Word Count

  7. Word Count
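    // `spark` is assumed to be a SparkContext (named `sc` in spark-shell and Zeppelin)
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" ")) // split each line into words
                     .map(word => (word, 1))            // pair each word with a count of 1
                     .reduceByKey(_ + _)                // sum the counts per word
    counts.saveAsTextFile("hdfs://...")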
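
To check the logic of this pipeline before pointing it at HDFS, the same steps can be run on an ordinary Scala collection. This is a minimal sketch, not Spark code: `WordCountSketch` and the sample sentences are made up for illustration, and `groupBy` followed by a per-group sum stands in for `reduceByKey`, which plain collections do not have.

```scala
object WordCountSketch {
  // Same shape as the Spark job: split lines into words, pair each with 1, sum per word.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val sample = Seq("to be or not to be", "to see or not to see") // made-up input
    wordCount(sample).toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```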

  8. Code Examples (2)
    Getting Between PC Ver. Download

  9. Getting Download Data
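    // One parsed download record
    case class CloudFrontPcVerChart(date: String, country: String, ip: String,
                                    http_method: String, ua: String)

    // CloudFront access logs for October 2014 (tab-separated fields)
    val cloudFrontPcVerLogs = "s3n://assets-between-pc-logs/*2014-10-*"

    // Keep only requests for the installer, then split each line into fields
    val cloudFrontPcVerDownloadLogs = sc.textFile(cloudFrontPcVerLogs)
      .filter(_ contains "/downloads/setup.exe")
      .map(x => x.split("\t"))

    cloudFrontPcVerDownloadLogs.first // inspect one parsed record

    // Fields used: 0 = date, 4 = client IP, 5 = HTTP method, 10 = user agent.
    // IP2C.get is not shown on the slide; it presumably maps an IP address to a country.
    val cloudFrontPcVerDownloadChart = cloudFrontPcVerDownloadLogs.map(arr =>
      CloudFrontPcVerChart(arr(0), IP2C.get(arr(4)), arr(4), arr(5), arr(10)))

    // Expose the RDD to Spark SQL as the table "pc_ver_download"
    cloudFrontPcVerDownloadChart.registerAsTable("pc_ver_download")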

  10. Querying Data
    select country, count(1) value
    from pc_ver_download
    group by country
    order by value desc
    limit 10
    Simple enough!
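
For readers more at home in Scala than SQL, the query's logic can be sketched on a plain collection: group by country, count the rows in each group, sort by that count in descending order, and keep the first 10. This is only an illustration of what the query computes, not how Spark SQL executes it; `TopCountriesSketch` and the sample country codes are made up.

```scala
object TopCountriesSketch {
  // Mirrors: select country, count(1) value ... group by country
  //          order by value desc limit 10
  def topCountries(countries: Seq[String], limit: Int = 10): Seq[(String, Int)] =
    countries
      .groupBy(identity)
      .map { case (country, hits) => (country, hits.size) }
      .toSeq
      .sortBy { case (_, value) => -value }
      .take(limit)

  def main(args: Array[String]): Unit = {
    val sample = Seq("KR", "JP", "KR", "US", "KR", "JP") // made-up country codes
    topCountries(sample).foreach { case (c, v) => println(s"$c\t$v") }
  }
}
```

As with the SQL version, the relative order of countries with equal counts is not specified.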

  11. Result
    * Visualization powered by Zeppelin

  12. Extension projects
    • Spark SQL
    • Spark Streaming
    • MLlib
    • GraphX
    • SparkR (planned)
    • Zeppelin

  13. Zeppelin
    • A web-based notebook for Apache Spark (http://zeppelin-project.org)
    • Open source (https://github.com/NFLabs/zeppelin)

  14. Zeppelin
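    • An early-stage project (50 stars on GitHub)
    • A project that will become hugely popular within 1 to 2 years
    • A friendly project that lists you as a contributor even for a 10-line commit
    • Easy to install; on startup it launches Spark internally (connecting to an external cluster is also possible)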

  15. Zeppelin
    Implementing a dashboard in Zeppelin with a few lines of code and a few queries

  16. Zeppelin
    Spark & Zeppelin
    Live Demo

  17. Screenshots stand in for the live demo, which was hard to embed in Keynote
    Everything from ETL to analysis to visualisation is handled in a single tool

  18. Screenshots stand in for the live demo, which was hard to embed in Keynote
    Everything from ETL to analysis to visualisation is handled in a single tool

  19. Screenshots stand in for the live demo, which was hard to embed in Keynote
    Interactive! Enter code or a query and the result appears almost immediately

  20. Screenshots stand in for the live demo, which was hard to embed in Keynote
    Combined with Spark SQL, it also shows strong potential as a visualisation tool

  21. Screenshots stand in for the live demo, which was hard to embed in Keynote
    Combined with Spark SQL, it also shows strong potential as a visualisation tool

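  22. Screenshots stand in for the live demo, which was hard to embed in Keynote
    Combined with Spark SQL, it also shows strong potential as a visualisation tool
    A dashboard can be built in moments with a simple SQL query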

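  23. Screenshots stand in for the live demo, which was hard to embed in Keynote
    Combined with Spark SQL, it also shows strong potential as a visualisation tool
    Position, width, and so on can be adjusted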

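  24. Zeppelin
    • Recommended for people who want a simple way to start analyzing data
    • Recommended for people who want to explore and analyze all sorts of data nimbly
    • Recommended for people who want to build dashboards quickly
    • Recommended for people who want to take part in a hot open-source project
    • For first-time Spark users, trying the Spark Shell first is recommended (until auto-completion in the Zeppelin code editor is improved)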

  25. Thank you
