
Data Science in Go

Xuanyi
October 12, 2017

Data science is a solved problem: just use Pandas. Or is it? In this talk, Xuanyi will explore why Go may be a better solution for some problems and introduce a few libraries for that.

Transcript

  1. Data Science in Go @chewxy

  2. WHY GO? Follow @chewxy on Twitter

  3. GO PROVERBS

  4. Go Proverbs
    gofmt's style is no one's favourite, yet gofmt is everyone's favourite.

  5. Go Proverbs
    gofmt's style is no one's favourite, yet gofmt is everyone's favourite.
    Clear is better than clever.

  6. Go Proverbs
    gofmt's style is no one's favourite, yet gofmt is everyone's favourite.
    Clear is better than clever.
    Don't just check errors. Handle them gracefully.

  7. THE ZEN OF PYTHON

  8. The Zen of Python
    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.
    Flat is better than nested.
    Sparse is better than dense.
    Readability counts.
    Special cases aren't special enough to break the rules.
    Although practicality beats purity.
    Errors should never pass silently.
    Unless explicitly silenced.
    In the face of ambiguity, refuse the temptation to guess.
    There should be one-- and preferably only one --obvious way to do it.
    Although that way may not be obvious at first unless you're Dutch.
    Now is better than never.
    Although never is often better than *right* now.
    If the implementation is hard to explain, it's a bad idea.
    If the implementation is easy to explain, it may be a good idea.
    Namespaces are one honking great idea -- let's do more of those!

  12. DATA SCIENCE, BRIEFLY

  13. Statistics S/W Eng X Data Science

  14. [Chart] Ad-hocness of work, from longer-lived programs to shorter-lived programs

  15. [Chart] Ad-hocness of work, from work that exists in production to exploratory work

  16. [Chart] Ad-hocness of work (longer-lived programs to shorter-lived programs) vs. Complexity/Relative Effort

  17. [Chart] Ad-hocness of work (longer-lived programs to shorter-lived programs) vs. Complexity/Relative Effort. Plotted: Python.
    *Mere visual approximation. No hard data. Purely anecdotal with plenty of heuristics.

  18. [Chart] Ad-hocness of work (longer-lived programs to shorter-lived programs) vs. Complexity/Relative Effort. Plotted: Python, Go.
    *Mere visual approximation. No hard data. Purely anecdotal with plenty of heuristics.

  19. [Chart] Ad-hocness of work (longer-lived programs to shorter-lived programs) vs. Complexity/Relative Effort. Plotted: Python, Go, Haskell.
    *Mere visual approximation. No hard data. Purely anecdotal with plenty of heuristics.

  20. [Chart] Ad-hocness of work (longer-lived programs to shorter-lived programs) vs. Complexity/Relative Effort. Plotted: Python, Go.
    Annotation: Most data science programs are here.
    *Mere visual approximation. No hard data. Purely anecdotal with plenty of heuristics.

  21. "Nothing more permanent than a temporary hack" by Joy Leelawat (2016)

  22. ROBUST DATA SCIENCE

  23. Robust Data Science
    •  Good statistical understanding

  24. Robust Data Science
    •  Good statistical understanding
        – Use the right statistical underpinnings

  25. Robust Data Science
    •  Good statistical understanding
        – Use the right statistical underpinnings
        – Do it on pen and paper to check understanding

  26. Robust Data Science
    •  Good statistical understanding
        – Use the right statistical underpinnings
        – Do it on pen and paper to check understanding
        – Topic for another day

  27. Robust Data Science
    •  Good statistical understanding
        – Use the right statistical underpinnings
        – Do it on pen and paper to check understanding
    •  Robust software engineering

  28. WHAT DOES A DATA SCIENTIST DO?

  29. [Pie chart]
    3%   Building Training Sets
    60%  Cleaning Data
    19%  Collecting Data
    9%   Statistical Analysis
    4%   Refining Algorithms
    4%   Other
    1%   Telling people you shouldn't use pie charts
    *Data from Forbes

  31. Robust software engineering to the rescue!

  32. PYTHON V GO: DAWN OF ROBUST

  33. example.csv
        1,testval1
        2,testval2
        3,testval3
    *example taken from Dan Whitenack

  34. Python (ex.py):
        import pandas as pd
        data = pd.read_csv('example.csv', names=['fst','snd'])
        print(data['fst'].max())

    Go (ex.go):
        // errors.Wrap comes from github.com/pkg/errors
        f, _ := os.Open("example.csv")
        r := csv.NewReader(bufio.NewReader(f))
        records, _ := r.ReadAll()

        var intMax int
        for _, record := range records {
            intVal, err := strconv.Atoi(record[0])
            if err != nil {
                err = errors.Wrap(err, "Parse failed")
                log.Fatal(err)
            }
            if intVal > intMax {
                intMax = intVal
            }
        }

        fmt.Println(intMax)

  36. $ python ex.py
        3
        $ go run ex.go
        3

  37. example.csv
        1,testval1
        2,testval2
        ,testval3

  38. $ python ex.py
        2.0
        $ go run ex.go
        Parse failed: strconv.Atoi: parsing "": invalid syntax
        exit status 1

  39. $ python ex.py
        2.0
        $ go run ex.go
        Parse failed: strconv.Atoi: parsing "": invalid syntax
        exit status 1
    WTF?
    •  Suddenly a float?!
    •  Why does it even work??

  40. The Zen of Python
    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.
    Flat is better than nested.
    Sparse is better than dense.
    Readability counts.
    Special cases aren't special enough to break the rules.
    Although practicality beats purity.
    Errors should never pass silently.
    Unless explicitly silenced.
    In the face of ambiguity, refuse the temptation to guess.
    There should be one-- and preferably only one --obvious way to do it.
    Although that way may not be obvious at first unless you're Dutch.
    Now is better than never.
    Although never is often better than *right* now.
    If the implementation is hard to explain, it's a bad idea.
    If the implementation is easy to explain, it may be a good idea.
    Namespaces are one honking great idea -- let's do more of those!

  41. A Closer Look
        f, _ := os.Open("example.csv")
        r := csv.NewReader(bufio.NewReader(f))
        records, _ := r.ReadAll()

        var intMax int
        for _, record := range records {
            intVal, err := strconv.Atoi(record[0])
            if err != nil {
                err = errors.Wrap(err, "Parse failed")
                log.Fatal(err)
            }
            if intVal > intMax {
                intMax = intVal
            }
        }

        fmt.Println(intMax)

  46. Go Proverbs
    gofmt's style is no one's favourite, yet gofmt is everyone's favourite.
    Clear is better than clever.
    Don't just check errors. Handle them gracefully.

  48. A Closer Look
        f, _ := os.Open("example.csv")
        r := csv.NewReader(bufio.NewReader(f))
        records, _ := r.ReadAll()

        var intMax int
        for i, record := range records {
            intVal, err := strconv.Atoi(record[0])
            if err != nil {
                err = errors.Wrapf(err, "Failed at %d", i)
                log.Fatal(err)
            }
            if intVal > intMax {
                intMax = intVal
            }
        }

        fmt.Println(intMax)

  50. Go Proverbs
    gofmt's style is no one's favourite, yet gofmt is everyone's favourite.
    Clear is better than clever.
    Don't just check errors. Handle them gracefully.
    Make the zero value useful.

  52. ON PANDAS

  53. On Pandas
    •  Pandas is great! I <3 Pandas.

  54. On Pandas
    •  Pandas is great! I <3 Pandas.
    •  Pandas makes assumptions for you.

  55. On Pandas
    •  Pandas is great! I <3 Pandas.
    •  Pandas makes assumptions for you.
    •  90% of the time, the assumption works 100% of the time.

  56. [Chart] Ad-hocness of work (longer-lived programs to shorter-lived programs) vs. Complexity/Relative Effort. Plotted: Python, Go.
    Annotation: Most data science programs are here.
    *Mere visual approximation. No hard data. Purely anecdotal with plenty of heuristics.

  57. [Chart] Ad-hocness of work (longer-lived programs to shorter-lived programs) vs. Complexity/Relative Effort. Plotted: Python, Go.
    Annotation: Pandas+Jupyter services this area.
    *Mere visual approximation. No hard data. Purely anecdotal with plenty of heuristics.

  58. On Pandas
    •  Pandas is great! I <3 Pandas.
    •  Pandas makes assumptions for you.
    •  90% of the time, the assumption works 100% of the time.
    •  Pandas + Jupyter = match made in heaven.

  59. [Chart] Ad-hocness of work (longer-lived programs to shorter-lived programs) vs. Complexity/Relative Effort. Plotted: Python, Go.
    Annotation: What about here?
    *Mere visual approximation. No hard data. Purely anecdotal with plenty of heuristics.

  60. C/C++ TO THE RESCUE!

  61. [slide without text]

  62. JAVA TO THE RESCUE?

  63. [slide without text]

  64. GO TO THE RESCUE!

  65. WHY GO?

  66. Why Go?
    •  Philosophy that drives robust software.

  67. Why Go?
    •  Philosophy that drives robust software.
    •  Language that promotes mechanical sympathy.

  68. Why Go?
    •  Philosophy that drives robust software.
    •  Language that promotes mechanical sympathy.
        – Data structures map closely to machine layout.

  69. Why Go?
    •  Philosophy that drives robust software.
    •  Language that promotes mechanical sympathy.
        – Data structures map closely to machine layout.
        – As a result, fast(ish)!

  70. Why Go?
    •  Philosophy that drives robust software.
    •  Language that promotes mechanical sympathy.

  71. Why Go?
    •  Philosophy that drives robust software.
    •  Language that promotes mechanical sympathy.
    •  Right levels of abstraction.

  72. Why Go?
    •  Philosophy that drives robust software.
    •  Language that promotes mechanical sympathy.
    •  Right levels of abstraction.
        – Encourages users to understand underlying data structures and algorithms.

  73. Why Go?
    •  Philosophy that drives robust software.
    •  Language that promotes mechanical sympathy.
    •  Right levels of abstraction.
        – Encourages users to understand underlying data structures and algorithms.
        – High level enough to be productive.

  74. USING GO FOR DATA SCIENCE

  75. Introducing Go
    There are Go libraries for data science.

  76. Introducing Go
    There are Go libraries for data science.
    •  Gonum – set of packages for numerical and scientific algorithms

  77. Introducing Go
    There are Go libraries for data science.
    •  Gonum – set of packages for numerical and scientific algorithms
    •  Gophernotes – like Jupyter for Go

  78. Introducing Go
    There are Go libraries for data science.
    •  Gonum – set of packages for numerical and scientific algorithms
    •  Gophernotes – like Jupyter for Go
    •  Gota – data frames for Go

  79. Introducing Go
    There are Go libraries for data science.
    •  Gonum – set of packages for numerical and scientific algorithms
    •  Gophernotes – like Jupyter for Go
    •  Gota – data frames for Go
    •  Gorgonia* – packages for deep learning in Go
    * @chewxy is the author of Gorgonia

  80. Introducing Go
    There are Go libraries for data science.
    •  Gonum – set of packages for numerical and scientific algorithms
    •  Gophernotes – like Jupyter for Go
    •  Gota – data frames for Go
    •  Gorgonia – packages for deep learning in Go

  81. Gonum + Gorgonia = <3
    Coming from Numpy/Scipy? Handy guide here.

  82. Other Resources

  83. Q&A

  84. THE END FOLLOW @CHEWXY ON TWITTER