Data Science in Go

Xuanyi
October 12, 2017

Data science is a solved problem: just use Pandas. Or is it? In this talk, Xuanyi will explore why Go may be a better solution for some problems and introduce a few libraries for that.

Transcript

  1. Data Science in Go
    @chewxy

  2. WHY GO?
    Follow @chewxy on Twitter

  3. GO PROVERBS

  4. Go Proverbs
    gofmt's style is no one's favourite, yet
    gofmt is everyone's favourite.

  5. Go Proverbs
    gofmt's style is no one's favourite, yet
    gofmt is everyone's favourite.
    Clear is better than clever.

  6. Go Proverbs
    gofmt's style is no one's favourite, yet
    gofmt is everyone's favourite.
    Clear is better than clever.
    Don't just check errors. Handle them
    gracefully.

  7. THE ZEN OF PYTHON

  8. The Zen of Python
    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.
    Flat is better than nested.
    Sparse is better than dense.
    Readability counts.
    Special cases aren't special enough to break the rules.
    Although practicality beats purity.
    Errors should never pass silently.
    Unless explicitly silenced.
    In the face of ambiguity, refuse the temptation to guess.
    There should be one-- and preferably only one --obvious way to do it.
    Although that way may not be obvious at first unless you're Dutch.
    Now is better than never.
    Although never is often better than *right* now.
    If the implementation is hard to explain, it's a bad idea.
    If the implementation is easy to explain, it may be a good idea.
    Namespaces are one honking great idea -- let's do more of those!

  12. DATA SCIENCE, BRIEFLY

  13. Statistics S/W Eng
    X
    Data Science

  14. Ad-hocness of work
    longer-lived programs
    shorter-lived programs

  15. Ad-hocness of work
    work that exists in production
    exploratory work

  16. Ad-hocness of work
    longer-lived programs
    shorter-lived programs
    Complexity/Relative Effort

  17. Ad-hocness of work
    longer-lived programs
    shorter-lived programs
    Complexity/Relative Effort
    Python
    *Mere visual approximation. No hard data. Purely anecdotal with plenty of heuristics

  18. Ad-hocness of work
    longer-lived programs
    shorter-lived programs
    Complexity/Relative Effort
    Python
    Go
    *Mere visual approximation. No hard data. Purely anecdotal with plenty of heuristics

  19. Ad-hocness of work
    longer-lived programs
    shorter-lived programs
    Complexity/Relative Effort
    Python
    Go
    *Mere visual approximation. No hard data. Purely anecdotal with plenty of heuristics
    Haskell

  20. Ad-hocness of work
    longer-lived programs
    shorter-lived programs
    Complexity/Relative Effort
    Python
    Go
    *Mere visual approximation. No hard data. Purely anecdotal with plenty of heuristics
    Most data science programs are here

  21. "Nothing more permanent than a temporary hack"
    By Joy Leelawat (2016)

  22. ROBUST DATA SCIENCE

  23. Robust Data Science
    •  Good statistical understanding

  24. Robust Data Science
    •  Good statistical understanding
    – Use the right statistical underpinnings

  25. Robust Data Science
    •  Good statistical understanding
    – Use the right statistical underpinnings
    – Do it on pen and paper to check
    understanding

  26. Robust Data Science
    •  Good statistical understanding
    – Use the right statistical underpinnings
    – Do it on pen and paper to check
    understanding
    – Topic for another day

  27. Robust Data Science
    •  Good statistical understanding
    – Use the right statistical underpinnings
    – Do it on pen and paper to check
    understanding
    •  Robust software engineering

  28. WHAT DOES A
    DATA SCIENTIST DO?

  29. [Pie chart]
    Building Training Sets: 3%
    Cleaning Data: 60%
    Collecting Data: 19%
    Statistical Analysis: 9%
    Refining Algorithms: 4%
    Other: 4%
    Telling people you shouldn't use pie charts: 1%
    *Data from Forbes

  31. Robust software engineering to the rescue!

  32. PYTHON V GO
    DAWN OF ROBUST

  33. example.csv
    1,testval1
    2,testval2
    3,testval3
    *example taken from Dan Whitenack

  34. Python:
    import pandas as pd
    data = pd.read_csv('example.csv', names=['fst','snd'])
    print(data['fst'].max())

    Go:
    f, _ := os.Open("example.csv")
    r := csv.NewReader(bufio.NewReader(f))
    records, _ := r.ReadAll()

    var intMax int
    for _, record := range records {
        intVal, err := strconv.Atoi(record[0])
        if err != nil {
            err = errors.Wrap(err, "Parse failed")
            log.Fatal(err)
        }
        if intVal > intMax {
            intMax = intVal
        }
    }

    fmt.Println(intMax)

  36. $ python ex.py
    3
    $ go run ex.go
    3

  37. example.csv
    1,testval1
    2,testval2
    ,testval3

  38. $ python ex.py
    2.0
    $ go run ex.go
    Parse failed: strconv.Atoi: parsing "": invalid syntax
    exit status 1

  39. $ python ex.py
    2.0
    $ go run ex.go
    Parse failed: strconv.Atoi: parsing "": invalid syntax
    exit status 1
    WTF?
    •  Suddenly a float?!
    •  Why does it even work??

  40. The Zen of Python
    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.
    Flat is better than nested.
    Sparse is better than dense.
    Readability counts.
    Special cases aren't special enough to break the rules.
    Although practicality beats purity.
    Errors should never pass silently.
    Unless explicitly silenced.
    In the face of ambiguity, refuse the temptation to guess.
    There should be one-- and preferably only one --obvious way to do it.
    Although that way may not be obvious at first unless you're Dutch.
    Now is better than never.
    Although never is often better than *right* now.
    If the implementation is hard to explain, it's a bad idea.
    If the implementation is easy to explain, it may be a good idea.
    Namespaces are one honking great idea -- let's do more of those!

  41. A Closer Look
    f, _ := os.Open("example.csv")
    r := csv.NewReader(bufio.NewReader(f))
    records, _ := r.ReadAll()

    var intMax int
    for _, record := range records {
        intVal, err := strconv.Atoi(record[0])
        if err != nil {
            err = errors.Wrap(err, "Parse failed")
            log.Fatal(err)
        }
        if intVal > intMax {
            intMax = intVal
        }
    }

    fmt.Println(intMax)

  46. Go Proverbs
    gofmt's style is no one's favourite, yet
    gofmt is everyone's favourite.
    Clear is better than clever.
    Don't just check errors. Handle them
    gracefully.

  48. A Closer Look
    f, _ := os.Open("example.csv")
    r := csv.NewReader(bufio.NewReader(f))
    records, _ := r.ReadAll()

    var intMax int
    for i, record := range records {
        intVal, err := strconv.Atoi(record[0])
        if err != nil {
            err = errors.Wrapf(err, "Failed at %d", i)
            log.Fatal(err)
        }
        if intVal > intMax {
            intMax = intVal
        }
    }

    fmt.Println(intMax)

  50. Go Proverbs
    gofmt's style is no one's favourite, yet
    gofmt is everyone's favourite.
    Clear is better than clever.
    Don't just check errors. Handle them
    gracefully.
    Make the zero value useful.

  52. ON PANDAS

  53. On Pandas
    •  Pandas is great! I <3 Pandas

  54. On Pandas
    •  Pandas is great! I <3 Pandas.
    •  Pandas makes assumptions for you.

  55. On Pandas
    •  Pandas is great! I <3 Pandas.
    •  Pandas makes assumptions for you.
    •  90% of the time, the assumption works
    100% of the time.

  56. Ad-hocness of work
    longer-lived programs
    shorter-lived programs
    Complexity/Relative Effort
    Python
    Go
    *Mere visual approximation. No hard data. Purely anecdotal with plenty of heuristics
    Most data science programs are here

  57. Ad-hocness of work
    longer-lived programs
    shorter-lived programs
    Complexity/Relative Effort
    Python
    Go
    *Mere visual approximation. No hard data. Purely anecdotal with plenty of heuristics
    Pandas+Jupyter
    services this area

  58. On Pandas
    •  Pandas is great! I <3 Pandas.
    •  Pandas makes assumptions for you.
    •  90% of the time, the assumption works
    100% of the time.
    •  Pandas + Jupyter = match made in
    heaven.

  59. Ad-hocness of work
    longer-lived programs
    shorter-lived programs
    Complexity/Relative Effort
    Python
    Go
    *Mere visual approximation. No hard data. Purely anecdotal with plenty of heuristics
    What about here?

  60. C/C++ TO THE RESCUE!

  62. JAVA TO THE RESCUE?

  64. GO TO THE RESCUE!

  65. WHY GO?

  66. Why Go?
    •  Philosophy that drives robust software

  67. Why Go?
    •  Philosophy that drives robust software.
    •  Language that promotes mechanical
    sympathy.

  68. Why Go?
    •  Philosophy that drives robust software.
    •  Language that promotes mechanical
    sympathy.
    – Data structures map closely to machine
    layout.

  69. Why Go?
    •  Philosophy that drives robust software.
    •  Language that promotes mechanical
    sympathy.
    – Data structures map closely to machine
    layout.
    – As a result, fast(ish)!

  70. Why Go?
    •  Philosophy that drives robust software.
    •  Language that promotes mechanical
    sympathy.

  71. Why Go?
    •  Philosophy that drives robust software.
    •  Language that promotes mechanical
    sympathy.
    •  Right levels of abstraction.

  72. Why Go?
    •  Philosophy that drives robust software.
    •  Language that promotes mechanical
    sympathy.
    •  Right levels of abstraction.
    – Encourages users to understand underlying
    data structures and algorithms.

  73. Why Go?
    •  Philosophy that drives robust software.
    •  Language that promotes mechanical
    sympathy.
    •  Right levels of abstraction.
    – Encourages users to understand underlying
    data structures and algorithms.
    – High level enough to be productive.

  74. USING GO
    FOR DATA SCIENCE

  75. Introducing Go
    There are Go libraries for data science.

  76. Introducing Go
    There are Go libraries for data science.
    •  Gonum – set of packages for numerical
    and scientific algorithms

  77. Introducing Go
    There are Go libraries for data science.
    •  Gonum – set of packages for numerical
    and scientific algorithms
    •  Gophernotes – like Jupyter for Go

  78. Introducing Go
    There are Go libraries for data science.
    •  Gonum – set of packages for numerical
    and scientific algorithms
    •  Gophernotes – like Jupyter for Go
    •  Gota – data frames for Go

  79. Introducing Go
    There are Go libraries for data science.
    •  Gonum – set of packages for numerical
    and scientific algorithms
    •  Gophernotes – like Jupyter for Go
    •  Gota – data frames for Go
    •  Gorgonia* – packages for deep learning
    in Go
    * @chewxy is the author of Gorgonia

  80. Introducing Go
    There are Go libraries for data science.
    •  Gonum – set of packages for numerical
    and scientific algorithms
    •  Gophernotes – like Jupyter for Go
    •  Gota – data frames for Go
    •  Gorgonia – packages for deep learning in
    Go

  81. Gonum + Gorgonia = <3
    Coming from Numpy/Scipy?
    Handy guide here.

  82. Other Resources

  83. Q&A

  84. THE END
    FOLLOW @CHEWXY ON TWITTER