Data Science in Go

Xuanyi
October 12, 2017

Data science is a solved problem: just use Pandas. Or is it? In this talk, Xuanyi will explore why Go may be a better solution for some problems and introduce a few libraries for that.

Transcript

  1. Go Proverbs

    gofmt's style is no one's favourite, yet gofmt is everyone's favourite.

    Follow @chewxy on Twitter
  2. Go Proverbs

    gofmt's style is no one's favourite, yet gofmt is everyone's favourite.
    Clear is better than clever.
  3. Go Proverbs

    gofmt's style is no one's favourite, yet gofmt is everyone's favourite.
    Clear is better than clever.
    Don't just check errors. Handle them gracefully.
  4. The Zen of Python

    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.
    Flat is better than nested.
    Sparse is better than dense.
    Readability counts.
    Special cases aren't special enough to break the rules.
    Although practicality beats purity.
    Errors should never pass silently.
    Unless explicitly silenced.
    In the face of ambiguity, refuse the temptation to guess.
    There should be one-- and preferably only one --obvious way to do it.
    Although that way may not be obvious at first unless you're Dutch.
    Now is better than never.
    Although never is often better than *right* now.
    If the implementation is hard to explain, it's a bad idea.
    If the implementation is easy to explain, it may be a good idea.
    Namespaces are one honking great idea -- let's do more of those!
  8. [Chart] Axes: ad-hocness of work (from longer-lived programs to shorter-lived programs) and complexity/relative effort. Curve shown: Python.

    *Mere visual approximation. No hard data. Purely anecdotal, with plenty of heuristics.
  9. [Chart] Axes: ad-hocness of work (from longer-lived programs to shorter-lived programs) and complexity/relative effort. Curves shown: Python, Go.

    *Mere visual approximation. No hard data. Purely anecdotal, with plenty of heuristics.
  10. [Chart] Axes: ad-hocness of work (from longer-lived programs to shorter-lived programs) and complexity/relative effort. Curves shown: Python, Go, Haskell.

    *Mere visual approximation. No hard data. Purely anecdotal, with plenty of heuristics.
  11. [Chart] Axes: ad-hocness of work (from longer-lived programs to shorter-lived programs) and complexity/relative effort. Curves shown: Python, Go. Annotation: "Most data science programs are here".

    *Mere visual approximation. No hard data. Purely anecdotal, with plenty of heuristics.
  12. Robust Data Science

    •  Good statistical understanding
      –  Use the right statistical underpinnings
  13. Robust Data Science

    •  Good statistical understanding
      –  Use the right statistical underpinnings
      –  Do it on pen and paper to check understanding
  14. Robust Data Science

    •  Good statistical understanding
      –  Use the right statistical underpinnings
      –  Do it on pen and paper to check understanding
      –  Topic for another day
  15. Robust Data Science

    •  Good statistical understanding
      –  Use the right statistical underpinnings
      –  Do it on pen and paper to check understanding
    •  Robust software engineering
  16. [Pie chart]

    •  Building Training Sets: 3%
    •  Cleaning Data: 60%
    •  Collecting Data: 19%
    •  Statistical Analysis: 9%
    •  Refining Algorithms: 4%
    •  Other: 4%
    •  Telling people you shouldn't use pie charts: 1%

    *Data from Forbes
  18. [Code comparison]

    Python:

        import pandas as pd
        data = pd.read_csv('examples.csv', names=['fst','snd'])
        print(data['fst'].max())

    Go:

        f, _ := os.Open("example.csv")
        r := csv.NewReader(bufio.NewReader(f))
        records, _ := r.ReadAll()

        var intMax int
        for _, record := range records {
            intVal, err := strconv.Atoi(record[0])
            if err != nil {
                err = errors.Wrap(err, "Parse failed")
                log.Fatal(err)
            }
            if intVal > intMax {
                intMax = intVal
            }
        }

        fmt.Println(intMax)
  20. [Terminal output]

        $ python ex.py
        2.0
        $ go run ex.go
        Parse failed: strconv.Atoi: parsing "": invalid syntax
        exit status 1
  21. [Terminal output]

        $ python ex.py
        2.0
        $ go run ex.go
        Parse failed: strconv.Atoi: parsing "": invalid syntax
        exit status 1

    WTF?
    •  Suddenly a float?!
    •  Why does it even work??
  22. The Zen of Python

    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.
    Flat is better than nested.
    Sparse is better than dense.
    Readability counts.
    Special cases aren't special enough to break the rules.
    Although practicality beats purity.
    Errors should never pass silently.
    Unless explicitly silenced.
    In the face of ambiguity, refuse the temptation to guess.
    There should be one-- and preferably only one --obvious way to do it.
    Although that way may not be obvious at first unless you're Dutch.
    Now is better than never.
    Although never is often better than *right* now.
    If the implementation is hard to explain, it's a bad idea.
    If the implementation is easy to explain, it may be a good idea.
    Namespaces are one honking great idea -- let's do more of those!
  23. A Closer Look

        f, _ := os.Open("example.csv")
        r := csv.NewReader(bufio.NewReader(f))
        records, _ := r.ReadAll()

        var intMax int
        for _, record := range records {
            intVal, err := strconv.Atoi(record[0])
            if err != nil {
                err = errors.Wrap(err, "Parse failed")
                log.Fatal(err)
            }
            if intVal > intMax {
                intMax = intVal
            }
        }

        fmt.Println(intMax)
  28. Go Proverbs

    gofmt's style is no one's favourite, yet gofmt is everyone's favourite.
    Clear is better than clever.
    Don't just check errors. Handle them gracefully.
  30. A Closer Look

        f, _ := os.Open("example.csv")
        r := csv.NewReader(bufio.NewReader(f))
        records, _ := r.ReadAll()

        var intMax int
        for i, record := range records {
            intVal, err := strconv.Atoi(record[0])
            if err != nil {
                err = errors.Wrapf(err, "Failed at %d", i)
                log.Fatal(err)
            }
            if intVal > intMax {
                intMax = intVal
            }
        }

        fmt.Println(intMax)
  32. Go Proverbs

    gofmt's style is no one's favourite, yet gofmt is everyone's favourite.
    Clear is better than clever.
    Don't just check errors. Handle them gracefully.
    Make the zero value useful.
  34. On Pandas

    •  Pandas is great! I <3 Pandas.
    •  Pandas makes assumptions for you.
  35. On Pandas

    •  Pandas is great! I <3 Pandas.
    •  Pandas makes assumptions for you.
    •  90% of the time, the assumption works 100% of the time.
  36. [Chart] Axes: ad-hocness of work (from longer-lived programs to shorter-lived programs) and complexity/relative effort. Curves shown: Python, Go. Annotation: "Most data science programs are here".

    *Mere visual approximation. No hard data. Purely anecdotal, with plenty of heuristics.
  37. [Chart] Axes: ad-hocness of work (from longer-lived programs to shorter-lived programs) and complexity/relative effort. Curves shown: Python, Go. Annotation: "Pandas+Jupyter services this area".

    *Mere visual approximation. No hard data. Purely anecdotal, with plenty of heuristics.
  38. On Pandas

    •  Pandas is great! I <3 Pandas.
    •  Pandas makes assumptions for you.
    •  90% of the time, the assumption works 100% of the time.
    •  Pandas + Jupyter = match made in heaven.
  39. [Chart] Axes: ad-hocness of work (from longer-lived programs to shorter-lived programs) and complexity/relative effort. Curves shown: Python, Go. Annotation: "What about here?".

    *Mere visual approximation. No hard data. Purely anecdotal, with plenty of heuristics.
  40. Why Go?

    •  Philosophy that drives robust software.
    •  Language that promotes mechanical sympathy.
  41. Why Go?

    •  Philosophy that drives robust software.
    •  Language that promotes mechanical sympathy.
      –  Data structures map closely to machine layout.
  42. Why Go?

    •  Philosophy that drives robust software.
    •  Language that promotes mechanical sympathy.
      –  Data structures map closely to machine layout.
      –  As a result, fast(ish)!
  43. Why Go?

    •  Philosophy that drives robust software.
    •  Language that promotes mechanical sympathy.
  44. Why Go?

    •  Philosophy that drives robust software.
    •  Language that promotes mechanical sympathy.
    •  Right levels of abstraction.
  45. Why Go?

    •  Philosophy that drives robust software.
    •  Language that promotes mechanical sympathy.
    •  Right levels of abstraction.
      –  Encourages users to understand underlying data structures and algorithms.
  46. Why Go?

    •  Philosophy that drives robust software.
    •  Language that promotes mechanical sympathy.
    •  Right levels of abstraction.
      –  Encourages users to understand underlying data structures and algorithms.
      –  High level enough to be productive.
  47. Introducing Go

    There are Go libraries for data science.
    •  Gonum – set of packages for numerical and scientific algorithms
  48. Introducing Go

    There are Go libraries for data science.
    •  Gonum – set of packages for numerical and scientific algorithms
    •  Gophernotes – like Jupyter for Go
  49. Introducing Go

    There are Go libraries for data science.
    •  Gonum – set of packages for numerical and scientific algorithms
    •  Gophernotes – like Jupyter for Go
    •  Gota – data frames for Go
  50. Introducing Go

    There are Go libraries for data science.
    •  Gonum – set of packages for numerical and scientific algorithms
    •  Gophernotes – like Jupyter for Go
    •  Gota – data frames for Go
    •  Gorgonia* – packages for deep learning in Go

    * @chewxy is the author of Gorgonia
  51. Introducing Go

    There are Go libraries for data science.
    •  Gonum – set of packages for numerical and scientific algorithms
    •  Gophernotes – like Jupyter for Go
    •  Gota – data frames for Go
    •  Gorgonia – packages for deep learning in Go