Analyzing Data with Zeppelin (powered by Apache Spark)
스사모 (Korea Spark User Group) https://www.facebook.com/groups/sparkkoreauser/
2014-11-05
VCNC (Between)
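Apache Spark?
• Can run the same kinds of jobs as MapReduce
• Extensible (Spark SQL, Spark Streaming, MLlib, GraphX)
• Much simpler interface than MapReduce, and easy to learn (Scala, REPL)
• 5x to 50x faster than MapReduce, depending on the type of job (In-Memory Data)
• Compatible with Hadoop storage (HDFS, HBase, S3, ..)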
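Why is it needed?
• MapReduce, Hive (the existing batch technologies)
• Very powerful, but increasingly inefficient as jobs get more complex (intermediate results are repeatedly written to HDFS)
• The API is complex, and a job built by chaining several MR jobs is hard to maintain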
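Spark Key Concept
• RDD (Resilient Distributed Datasets)
‣ A list shared across the whole cluster and held in memory (spills to disk when memory runs short)
‣ Supports many operations: map, reduce, count, filter, join, and so on
‣ Operations are declared up front and computed lazily, only when a result is requested
• Scala
‣ A very good language for data analysis
‣ Powerful expressions, compatible with Java
‣ Interactive Shell (REPL)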
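The lazy-evaluation point above can be tried without a cluster. The following is a minimal sketch that uses a plain Scala collection view as a local stand-in for an RDD; it is an analogy, not Spark's API, and the names `LazySketch`, `pipeline`, and `evaluations` are made up for illustration.

```scala
object LazySketch {
  // Counts how many times the map function has actually run.
  var evaluations = 0

  // Declares a map and a filter on a view; nothing is computed here.
  def pipeline(data: Seq[Int]) =
    data.view
      .map { x => evaluations += 1; x * 2 }
      .filter(_ % 4 == 0)

  def main(args: Array[String]): Unit = {
    val p = pipeline(1 to 10)
    println(s"evaluations after declaring the pipeline: $evaluations")
    // Asking for a result (here, a sum) is what forces the computation.
    val total = p.sum
    println(s"sum = $total, evaluations after asking for the result: $evaluations")
  }
}
```

In Spark the same split exists between transformations such as map and filter, which only describe work, and actions such as count, which trigger it.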
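Spark advantages
• Jobs that used to need a Hadoop cluster of dozens of machines can run on a cluster of 10 or fewer
• Jobs that used to need a cluster can run on 1 or 2 machines
• Jobs that used to take tens of minutes finish in about a minute
• The complicated cycle of writing MR job code, packaging it, and submitting it is replaced by typing a line of code in the shell
• Easy to learn, even for first-time users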
Code Examples (1): Word Count
Word Count
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
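To see what the three steps of the word count compute, here is a rough local equivalent on ordinary Scala collections, with no Spark dependency. It is a sketch under that assumption: `groupBy` followed by a sum stands in for `reduceByKey`, and the object name `WordCountSketch` and the sample lines are made up.

```scala
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))   // one element per word
      .map(word => (word, 1))             // pair each word with a count of 1
      .groupBy { case (word, _) => word } // local stand-in for grouping by key
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    val counts = wordCount(Seq("to be or not to be", "to do"))
    counts.foreach { case (word, n) => println(s"$word\t$n") }
  }
}
```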
Code Examples (2): Getting Between PC Ver. Download
Getting Download Data
    case class CloudFrontPcVerChart(val date: String, val country: String, val ip: String,
                                    val http_method: String, val ua: String)
    val cloudFrontPcVerLogs = "s3n://assets-between-pc-logs/*2014-10-*"
    val cloudFrontPcVerDownloadLogs = sc.textFile(cloudFrontPcVerLogs)
      .filter(_ contains "/downloads/setup.exe")
      .map(x => x.split("\t"))
    cloudFrontPcVerDownloadLogs.first
    val cloudFrontPcVerDownloadChart = cloudFrontPcVerDownloadLogs
      .map(arr => CloudFrontPcVerChart(arr(0), IP2C.get(arr(4)), arr(4), arr(5), arr(10)))
    cloudFrontPcVerDownloadChart.registerAsTable("pc_ver_download")
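The parsing step can be sketched on its own, outside Spark. The version below keeps the field positions from the slide (0, 4, 5, 10) but is otherwise an assumption: the `Download` case class, the `parse` function, and the `lookupCountry` parameter (a stand-in for the slide's `IP2C.get`, whose implementation is not shown) are hypothetical names. It also guards against short lines, which the slide code does not.

```scala
object LogParseSketch {
  case class Download(date: String, country: String, ip: String,
                      httpMethod: String, ua: String)

  // lookupCountry is a hypothetical replacement for IP2C.get from the slide.
  def parse(line: String, lookupCountry: String => String): Option[Download] = {
    val arr = line.split("\t", -1) // -1 keeps trailing empty fields
    if (arr.length > 10)
      Some(Download(arr(0), lookupCountry(arr(4)), arr(4), arr(5), arr(10)))
    else
      None // too few fields to be a usable log record
  }

  def main(args: Array[String]): Unit = {
    // Made-up tab-separated line with 11 placeholder fields.
    val sample = (0 to 10).map(i => s"field$i").mkString("\t")
    println(parse(sample, _ => "unknown"))
    println(parse("too\tshort", _ => "unknown"))
  }
}
```

Returning an `Option` lets the caller drop malformed lines with `flatMap` instead of failing on an out-of-range index.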
Querying Data
    select country, count(1) value
    from pc_ver_download
    group by country
    order by value desc
    limit 10
Simple enough!
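For readers more at home in Scala than SQL, the query above (group by country, count, order descending, limit) can be mirrored on a local collection. This is a sketch on plain Scala collections rather than Spark SQL; `TopCountriesSketch`, `topCountries`, and the sample country codes are made up for illustration.

```scala
object TopCountriesSketch {
  def topCountries(countries: Seq[String], limit: Int = 10): Seq[(String, Int)] =
    countries
      .groupBy(identity)                                    // group by country
      .map { case (country, hits) => (country, hits.size) } // count(1) value
      .toSeq
      .sortBy { case (_, value) => -value }                 // order by value desc
      .take(limit)                                          // limit

  def main(args: Array[String]): Unit = {
    // Made-up sample data, one country code per download.
    val sample = Seq("KR", "KR", "KR", "JP", "JP", "US")
    topCountries(sample).foreach { case (country, value) =>
      println(s"$country\t$value")
    }
  }
}
```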
Result
* Visualization powered by Zeppelin
Extension projects
• Spark SQL
• Spark Streaming
• MLlib
• GraphX
• SparkR
• Zeppelin
Zeppelin
• A web-based notebook for Apache Spark (http://zeppelin-project.org)
• Open source (https://github.com/NFLabs/zeppelin)
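Zeppelin
• Early-stage project (50 stars on GitHub)
• A project likely to become very well known within 1 to 2 years
• A welcoming project that lists you as a contributor even for a 10-line commit
• Easy to install; when launched it starts Spark internally (connecting to an external cluster is also possible)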
Zeppelin
Implementing a dashboard in Zeppelin with a few lines of code and queries
Zeppelin
Spark & Zeppelin
Live Demo
The live demo was hard to embed in Keynote, so screenshots are used instead.
• Everything from ETL to analysis and visualisation is handled in a single tool
• Interactive! Enter code or a query and the result appears almost immediately
• Combined with Spark SQL, it also has strong potential as a visualisation tool
• A simple SQL query turns into a dashboard in an instant
• Position, size, and so on can be adjusted
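Zeppelin
• Recommended for people who want an easy way to get started with data analysis
• Recommended for people who want to browse through and analyze assorted data
• Recommended for people who want to build a dashboard quickly
• Recommended for people who want to contribute to a hot open source project
• If you are new to Spark, try the Spark Shell first (until auto-completion in the Zeppelin code editor is improved)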
Thank you