
Scala for Data Analysis

VCNC
December 03, 2014

Korea Spark User Group

Overview
- Scala overview
- Why Scala?
- A taste of Scala basics
- Digging a little deeper

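Summary
- Scala is a good language for data analysis (and for other uses too)
- Concise expression, good performance, functional programming
- REPL and scripting support
- Lets you implement the concepts you want in an elegant way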


Transcript

  1. Scala for Data Analysis
    2014-12-03
    스사모 (Korea Spark User Group)
    Kim Sangwoo (김상우), VCNC (Between)

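  2. Before we begin
    1. Scala is a general-purpose programming language that is not tied to any particular field. This deck focuses on data analysis and introduces a part of Scala for people who are new to it. If you want to learn more about Scala, I recommend the Korean user group '라 스칼라 코딩단' (La Scala Coding Group).
    2. The data analysis covered here is mainly the distributed processing and analysis of large-scale data, rather than the advanced analytics done with tools such as R or Matlab.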

  3. Word count in MapReduce (Java)
    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    public class WordCount {
      // Mapper: emits (word, 1) for every token in a line
      public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }
      // Reducer: sums the counts collected for each word
      public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }

  4. Word count in Spark (Scala), shown next to the MapReduce (Java) version from slide 3
    // `spark` is assumed to be an existing SparkContext
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")

  5. Index
    • Scala overview
    • Why Scala?
    • A taste of Scala basics
    • Digging a little deeper

  6. Scala overview

  7. Scalable Language!
    • A language for building larger programs through concise expression and powerful features
    • Many of Scala's features are well suited to data analysis

  8. Scala
    • Very concise syntax (like Python)
    • Supports both OOP and functional programming styles
    • Runs on the JVM, interoperable with Java
    • Good performance (== Java)
    • Statically typed (!= Python, == Java)
    • REPL (shell), scripting
    * Scala has many other good features; only those relevant to data analysis are listed here.

  9. Concise syntax (compared with Java)
    Person.java
    public class Person {
      private String name;
      private String work;
      public void setName(String name) {
        this.name = name;
      }
      public String getName() {
        return name;
      }
      public void setWork(String work) {
        this.work = work;
      }
      public String getWork() {
        return work;
      }
    }
    Job.java
    public class Job {
      public static void main(String[] args) {
        Person kevin = new Person();
        kevin.setName("Kevin");
        kevin.setWork("Between");
      }
    }
    (Java: running out of room on the slide..)
    job.scala
    class Person(val name: String, val work: String)
    val kevin = new Person("Kevin", "Between")
    (Scala: GOOD)

  10. OOP & Functional Programming
    • Misconception: OOP and functional programming are opposites? (X)
    • Scala is pure OOP
    class Person(val name: String, val work: String)
    val kevin = new Person("Kevin", "Between")
    • Scala also supports functional programming
    val list = List(1, 2, 3)
    def aMultiplyFunction(x: Int) = {
      x * 2
    }
    val result = list.map(aMultiplyFunction)
    Functions are 1st-class citizens! A function can be treated as data, passed as an argument, and so on.

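  11. Runs on the JVM, interoperable with Java
    • Compiling Scala code produces .class files, just as Java does
    • Runs on the JVM with almost the same runtime performance as Java
    • Java classes can be imported and used
    • Java files and Scala files can be mixed and compiled together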

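  12. Statically typed language
    • Static typing vs dynamic typing?
      • Each has clear strengths and weaknesses
      • Strengths of statically typed languages: type checking at compile time, good performance
      • Strengths of dynamically typed languages: easy to write, clean code
    • Scala is a statically typed language
      • Compile-time type checking, type safety, good performance
      • Relatively uncluttered type annotations: the compiler infers types (type inference) and fills them in
      • Mechanisms such as implicit conversion help keep code simple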
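As a small illustration of the type inference point on slide 12, the sketch below leaves out most type annotations and lets the compiler fill them in. The object name `TypeInferenceExample` and the sample values are made up for this example.

```scala
object TypeInferenceExample {
  // No annotations: the compiler infers Int, String and List[Int].
  val count = 50
  val title = "spark-techtalk"
  val numbers = List(1, 2, 3)

  // The result type (List[Int]) is inferred from the method body.
  def doubled(xs: List[Int]) = xs.map(_ * 2)

  // Types are still checked at compile time. Uncommenting the next line
  // would be rejected by the compiler as a type mismatch.
  // val wrong: Int = title

  def main(args: Array[String]): Unit =
    println(doubled(numbers)) // List(2, 4, 6)
}
```

The code reads almost like a dynamically typed language, yet every value has a fixed type known to the compiler.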

  13. Why Scala?

  14. Why Scala?
    • Concise syntax and powerful expressions
    • Functional Programming
    • Interoperable with Java (= interoperable with Hadoop!)
    • REPL, scripting
    • Apache Spark
    • Collection library, pattern matching, and other great tools

  15. Concise syntax, powerful expressiveness
    • (Obviously) concise syntax is a good thing.
    • if-else branches, try-catch and the like are all expressions
    // if-else is an expression!
    println(if (a == "A") "It's A!" else "It's not A")
    // try-catch is an expression!
    val value = try {
      doSomeDangerousOperation
    } catch {
      case _: Exception => "some value"
    }
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")

  16. Concise syntax, powerful expressiveness
    • Consistent operators
    // Java
    "A".equals("B")
    // Scala
    "A" == "B"
    • Sensible class equality
    case class Person(name: String, work: String)
    val kevin = Person("Kevin", "Between") // no `new` needed to create a case class instance
    val anotherKevin = Person("Kevin", "Between")
    kevin == anotherKevin // true

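  17. Functional Programming
    • Think of functions in the mathematical sense, not functions as in conventional programs!
    • y = sin(x): no side effects. Whatever the circumstances, putting in x gives the corresponding y
    • tan(x) = sin(x) / cos(x): functions can be treated like data, passed as parameters, combined, and so on
    • y = sin(x): once x is fixed, y does not change. No 'variables'!
    • Let's make variables immutable!
    * The examples and part of the explanation are borrowed from the book Programming Scala.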
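The sin/cos/tan idea from slide 17 can be sketched in Scala roughly as follows. The names `PureFunctionExample` and `divide` are invented for this illustration; only `math.sin` and `math.cos` come from the standard library.

```scala
object PureFunctionExample {
  // Pure functions held as values: the same input always gives the same output.
  val sin: Double => Double = x => math.sin(x)
  val cos: Double => Double = x => math.cos(x)

  // A higher-order function: takes two functions as parameters and
  // returns their quotient as a new function.
  def divide(f: Double => Double, g: Double => Double): Double => Double =
    x => f(x) / g(x)

  // tan(x) = sin(x) / cos(x), built by combining the two functions above.
  val tan: Double => Double = divide(sin, cos)

  def main(args: Array[String]): Unit =
    println(tan(0.5))
}
```

Because none of these functions touches shared state, `tan` can be called from anywhere, in any order, and always agrees with `math.tan` up to floating-point rounding.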

  18. Why are these properties of FP good?
    • Fewer bugs (no falling into unexpected behavior caused by variables)
    • A function, once written, can be trusted (no side effects!)
    • Immutable variables simplify problems (well suited to data sharing and parallelism)
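A minimal sketch of what immutability looks like with Scala's default `List`. The object name and values are illustrative only.

```scala
object ImmutableExample {
  val original = List(1, 2, 3)

  // "Changing" an immutable list builds a new list; the original is untouched,
  // so it can be shared between functions or threads without defensive copies.
  val extended = 0 :: original
  val doubled = original.map(_ * 2)

  def main(args: Array[String]): Unit = {
    println(original) // List(1, 2, 3)
    println(extended) // List(0, 1, 2, 3)
    println(doubled)  // List(2, 4, 6)
  }
}
```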

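  19. Java interoperability
    • Runs on the JVM -> good performance when processing large amounts of data!
    • Java libraries can be used as they are
    • The Java code of the Hadoop ecosystem can be used as is!
    • Existing code can be converted and reused with little effort
    • Can be compiled together with Java code
    • src/java/…, src/scala/…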
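To make the interoperability point concrete, here is a small sketch that calls Java standard-library classes directly from Scala. It assumes Java 8 or later, because it uses `java.time`. The object name `JavaInteropExample` is made up for the example.

```scala
import java.time.{DayOfWeek, LocalDate}
import java.util.{ArrayList => JArrayList}

object JavaInteropExample {
  // A class from the Java standard library, used directly from Scala.
  val talkDate: LocalDate = LocalDate.parse("2014-12-03")
  val dayOfWeek: DayOfWeek = talkDate.getDayOfWeek

  // A Java collection can be created and mutated from Scala as well.
  val words = new JArrayList[String]()
  words.add("scala")
  words.add("java")

  def main(args: Array[String]): Unit = {
    println(dayOfWeek)    // WEDNESDAY
    println(words.size()) // 2
  }
}
```

The same mechanism is what lets Scala code use Hadoop's Java classes without wrappers.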

  20. REPL
    • Read–Eval–Print Loop (aka shell)
    • Learn and try out a new language quickly!
    • When exploring data, being able to work step by step is a big plus
    • Errors show up immediately
    • Working with data becomes interactive!

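  21. Apache Spark
    • In-memory, high-performance distributed data processing system (10 to 100 times faster than previous systems)
    • Written in Scala, with an interface similar to Scala's collection library
    • Provides the Spark shell, a Scala shell with added features
    • A variety of related projects for general-purpose use
      • SQL, Machine Learning, Graph Analysis, and so on
    • Still being developed rapidly and attracting a lot of attention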

  22. And beyond that..
    • Collection library
    • Pattern matching
    • Elegant tools such as implicits
    • Covered in more detail later in the deck

  23. A taste of Scala basics
    *Only the parts related to data*

  24. Data structures
    • Collections such as List, Map and Set
      • List(1, 2, 3), Map(1 -> "a", 2 -> "b"), Set(1, 2)
    • Tuple
      • val sparkTechTalk = ("2014-12-03", 50)
      • sparkTechTalk._1
      • case (key, value) => println(key)
    • Option
      • Use it instead of null when a value may be absent (more convenient, safer programming)
      • a = 1, a = null (conventional) vs. a = Some(1), a = None (with Option)
      • a.nonEmpty, a.getOrElse(0)
    • Range
      • for (i <- …)
      • (0 to 10).foreach(println)
      • (0 until 10), (0 to 10), (0 to -10 by -1)
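The data structures on slide 24 can be tried out with a sketch like the one below. It reuses the slide's own sample values. The object name and the extra `names` map lookup are added only for illustration.

```scala
object DataStructureExample {
  // Tuple: positional access and destructuring.
  val sparkTechTalk = ("2014-12-03", 50)
  val date = sparkTechTalk._1
  val (day, count) = sparkTechTalk

  // Option instead of null: a value is either Some(x) or None.
  val present: Option[Int] = Some(1)
  val missing: Option[Int] = None
  val fallback = missing.getOrElse(0)

  // Looking up a key in a Map returns an Option rather than null.
  val names = Map(1 -> "a", 2 -> "b")
  val third = names.get(3)

  // Ranges: `until` excludes the end value, `to` includes it.
  val exclusive = (0 until 10).toList
  val inclusive = (0 to 10).toList
  val countdown = (0 to -10 by -1).toList

  def main(args: Array[String]): Unit = {
    println(fallback)       // 0
    println(third)          // None
    println(exclusive.last) // 9
    println(inclusive.last) // 10
    println(countdown.last) // -10
  }
}
```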

  25. Collections


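  26. Working with collections
    • (n), head, tail, last, contains, distinct, drop, …
    • Functional combinators
      • map: applies a function to each element, transforming it into another form
      • filter: applies a true/false predicate to each element and keeps only those for which it returns true
      • foreach: similar to map, but only iterates without transforming into another form
      • foldLeft (foldRight, reduce): starting from the leftmost element, combines the elements into one
    • Usage examples come later in the deck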
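A short sketch of the operations listed on slide 26, applied to a small list. The object name and sample numbers are made up for this example.

```scala
object CombinatorExample {
  val numbers = List(1, 2, 3, 4)

  // Basic access
  val third = numbers(2)       // 3
  val first = numbers.head     // 1
  val rest = numbers.tail      // List(2, 3, 4)
  val dropped = numbers.drop(2) // List(3, 4)

  // Functional combinators
  val doubled = numbers.map(_ * 2)     // List(2, 4, 6, 8)
  val small = numbers.filter(_ < 3)    // List(1, 2)
  val sum = numbers.foldLeft(0)(_ + _) // 10
  val product = numbers.reduce(_ * _)  // 24

  // foldLeft works from the left: ((("" + 1) + 2) + 3) + 4
  val joined = numbers.foldLeft("")((acc, n) => acc + n.toString) // "1234"

  def main(args: Array[String]): Unit =
    numbers.foreach(println) // iterates only; nothing is returned
}
```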

  27. Function Literal
    Getting the values below 3 from a sequence
    val list = List(1, 2, 3, 4)
    list.filter((x: Int) => x < 3)
    val testNumber1 = (x: Int) => x < 3 // function as a 1st-class object!
    list.filter(testNumber1)
    list.filter((x) => x < 3) // target typing
    list.filter(x => x < 3)
    list.filter(_ < 3) // placeholder
    def testNumber2(x: Int) = x < 3 // method, converted to a function when passed
    list.filter(x => testNumber2(x))
    list.filter(testNumber2(_))
    list.filter(testNumber2 _)
    list.filter(testNumber2)
    Much of it can be abbreviated!

  28. Pattern Matching & Case Class
    val input1 = "three"
    case class Chart(date: String, count: Int)
    val input2 = Chart("2014-12-02", 50)
    val input3 = ("spark-techtalk", 100)
    def matchTest(x: Any): Any = {
      x match {
        case 1 => "one"
        case "two" => 2
        case (key, value) => s"key: $key, value: $value"
        case Chart(date, count) => s"date: $date, count: $count"
        case _ => "others"
      }
    }
    matchTest(input1)
    res0: Any = others
    matchTest(input2)
    res1: Any = date: 2014-12-02, count: 50
    matchTest(input3)
    res2: Any = key: spark-techtalk, value: 100
    • Similar to Java's switch ~ case, but a far more powerful tool
    • Can match values of different types
    • Even more convenient when used with case or case classes
    • case class: convenient for structuring data

  29. For basic syntax and more detailed theory, please refer to a book.
    Recommended reading: Programming in Scala (a Korean edition is available)

  30. Example: computing simple metrics from a log
    // load log file (`path` is assumed to be defined elsewhere)
    val logFile = new java.io.File(path + "example_log.txt")
    val log = scala.io.Source.fromFile(logFile).getLines().toList
    // parse log into entries
    case class LogEntry(dateTime: String, action: String, id: String)
    val logEntries = log.map(csv => csv.split(",")).map(arr => LogEntry(arr(0), arr(1), arr(2)))
    // get sign up
    val logEntriesToday = logEntries.filter(_.dateTime.contains("2014-12-04"))
    val signUp = logEntriesToday.count(_.action == "SIGN_UP")
    // active user
    val userIds = logEntriesToday.map(_.id)
    val activeUser = userIds.distinct.size
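The log example on slide 30 reads from a file. To try the same steps without one, here is a self-contained sketch that swaps in a few in-memory lines. The sample log lines are hypothetical, invented only to exercise the code, and assume the same "dateTime,action,id" layout.

```scala
object LogMetricsExample {
  case class LogEntry(dateTime: String, action: String, id: String)

  // Hypothetical sample lines in "dateTime,action,id" form.
  val log = List(
    "2014-12-04 09:00:01,SIGN_UP,user1",
    "2014-12-04 09:05:12,LOGIN,user2",
    "2014-12-04 10:15:30,SIGN_UP,user3",
    "2014-12-04 11:00:00,LOGIN,user1",
    "2014-12-03 23:59:59,SIGN_UP,user4"
  )

  // Parse each CSV line into a LogEntry.
  val logEntries = log.map(_.split(",")).map(arr => LogEntry(arr(0), arr(1), arr(2)))

  // Keep only the entries for the day of interest.
  val logEntriesToday = logEntries.filter(_.dateTime.contains("2014-12-04"))

  // Number of sign-ups and number of distinct active users that day.
  val signUp = logEntriesToday.count(_.action == "SIGN_UP")
  val activeUser = logEntriesToday.map(_.id).distinct.size

  def main(args: Array[String]): Unit = {
    println(signUp)     // 2
    println(activeUser) // 3
  }
}
```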

  31. Bonus: Spark Version
    // load log file (`sc` is the SparkContext)
    val log = sc.textFile("file:///example_log.txt")
    // parse log into entries
    case class LogEntry(dateTime: String, action: String, id: String)
    val logEntries = log.map(csv => csv.split(",")).map(arr => LogEntry(arr(0), arr(1), arr(2)))
    // get sign up
    val logEntriesToday = logEntries.filter(_.dateTime.contains("2014-12-04"))
    val signUp = logEntriesToday.filter(_.action == "SIGN_UP").count()
    // active user
    val userIds = logEntriesToday.map(_.id)
    val activeUser = userIds.distinct().count()
    Almost exactly the same as the Scala collection API!

  32. Digging a little deeper
    Implicits

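  33. Implicit Conversion
    • For when you want to extend functionality conveniently
    • Define a function that converts to the expected type, and it is applied automatically
    implicit def stringToInt(number: String): Int = {
      number match {
        case "one" => 1
        case "two" => 2
        // any other string throws a MatchError
      }
    }
    def printNumber(n: Int) = println(n)
    printNumber("one")
    Normally this would be a compile error. Because an implicit conversion is declared, the String => Int conversion happens automatically.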

  34. Using implicit conversion
    DateParser.parse("2014-12-03") // Java style
    "2014-12-03".toDateTime // better solution using implicit conversion
    // Java style: a helper object
    object DateParser {
      def parse(dateString: String) = new java.util.Date // placeholder body: ignores dateString
    }
    DateParser.parse("2014-12-03")
    // implicit conversion: wrap String in a class that adds toDateTime
    class DateConverter(val s: String) {
      def toDateTime = DateParser.parse(s)
    }
    implicit def string2DateConverter(s: String) = new DateConverter(s)
    "2014-12-03".toDateTime
    This makes for more intuitive and consistent code!

  35. Implicit Parameter
    • For when you want to simplify a parameter that is passed repeatedly
    // before
    val date = "2014-12-03"
    calculateSignUp(date)
    calculateActiveUser(date)
    calculateActionCount(date)
    // after
    def calculateSignUp(implicit date: String) = ...
    implicit val date = "2014-12-03"
    calculateSignUp
    calculateActiveUser
    calculateActionCount(date) // passing it explicitly still works
    • But overusing implicits makes things hard!
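The slide leaves the method bodies elided, so the following is only a sketch of how the implicit-parameter pattern fits together. The bodies returning strings are hypothetical stand-ins for the real metric calculations, and the object name is invented.

```scala
object ImplicitParameterExample {
  // Hypothetical stand-ins for the real metric calculations.
  def calculateSignUp(implicit date: String): String = s"sign-ups on $date"
  def calculateActiveUser(implicit date: String): String = s"active users on $date"

  // One implicit value in scope is picked up by every call below.
  implicit val date: String = "2014-12-03"

  val signUp = calculateSignUp
  val activeUser = calculateActiveUser

  // Passing the argument explicitly still works.
  val nextDay = calculateSignUp("2014-12-04")

  def main(args: Array[String]): Unit = {
    println(signUp)     // sign-ups on 2014-12-03
    println(activeUser) // active users on 2014-12-03
    println(nextDay)    // sign-ups on 2014-12-04
  }
}
```

An implicit `String` mirrors the slide, but any implicit `String` in scope would match. Wrapping the date in a small dedicated type is one way to limit that risk.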

  36. Summary
    • Scala is a good language for data analysis (and for other uses too)
    • Concise expression, good performance, functional programming
    • REPL and scripting support
    • Lets you implement the concepts you want in an elegant way

  37. Thank you

  38. Resources worth a look
    • Learn Scala in 5 minutes
    http://learnxinyminutes.com/docs/scala/
    • Coursera Scala course
    https://www.coursera.org/course/progfun
    • Learning Scala (blog)
    http://joelabrahamsson.com/learning-scala/
    • Scala School (Twitter)
    http://twitter.github.io/scala_school/ko/
    • Programming in Scala (Korean edition)
    Written by Martin Odersky, the creator of Scala; available in bookstores nationwide