
PyData NYC 2014 Keynote

Andy R. Terrel

November 22, 2014
Transcript

  1. Python in the Hadoop Ecosystem / Andy R. Terrel / PyData NYC 2014

     First, a thank you! Without these two, PyData NYC would have been a lot colder.
     • Leah Silen, Executive Director, NumFOCUS
     • James Powell, blog author, Python enthusiast, code cowboy
  2. You: Who the hell is this on stage?
  3. About Andy

     Andy R. Terrel (@aterrel)
     Chief Scientist, Continuum Analytics
     President, NumFOCUS foundation
     Background:
     • High Performance Computing
     • Computational Mathematics
     Experience with:
     • Finance • Simulations • Web data • Social media
  4. About Continuum Analytics

     http://continuum.io/
     We build technologies that enable analysts and data scientists to answer questions from the data all around us. Committed to open source.
     Areas of focus:
     • Software solutions • Consulting • Training
     • Anaconda: free Python distribution
     • Numba, Conda, Blaze, Bokeh, dynd
  5. "At my company X, we have peta/terabytes of data, just lying around, waiting for someone to explore it" (someone at PyTexas)

     Let's make it easier for users to explore and extract useful insights out of data.
     • Anaconda: free enterprise-ready Python distribution
     • Conda: package manager
     • Blaze: scale
     • Bokeh: interactive data visualizations
     • Numba: power to speed up
     • Wakari: share and deploy
  6. You: Great, some jerk with lots of titles.
  7. Andy used to solve problems like these
  8. Andy used to solve problems like these
  9. Andy used to solve problems like these
  10. Andy used to solve problems like these
  11. Andy now solves problems like these
  12. Andy now solves problems like these
  13. Andy now solves problems like these
  14. Andy now solves problems like these
  15. You: Okay, what are you here to talk about?
  16. About this talk: Python in the Hadoop Ecosystem

      Objective: an introduction to Hadoop and Python tools for large-scale data analytics and interactive visualization.
      Structure:
      1. Discussion of Hadoop and Python
      2. Large-scale data analytics: Blaze
      3. Interactive data visualization: Bokeh
  17. The Ecosystem
  18. Large-scale data analytics - an overview

      Landscape: BI/DB, DM/Stats/ML, Scientific Computing, Distributed Systems (with tools such as Numba, bcolz, RHadoop)
  19. Base Hadoop Stack
  20. Berkeley Data Science Stack
  21. You: And where exactly is Python?
  22. Python as the Driver

      Hadoop Streaming, PySpark, Pig
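Of the driver options above, Hadoop Streaming is the simplest to picture: Hadoop runs any executable that reads lines on stdin and writes tab-separated key/value lines on stdout. A minimal word-count sketch of that contract in plain Python (the function names and sample input are mine, not from the talk):

```python
from itertools import groupby

def mapper(lines):
    """Map step: emit one 'word<TAB>1' record per word, as Streaming expects."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(pairs):
    """Reduce step: sum the counts for each word.

    Hadoop sorts mapper output by key before the reducer sees it;
    sorted() stands in for that shuffle here."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv.split("\t")[0]):
        yield "%s\t%d" % (word, sum(int(kv.split("\t")[1]) for kv in group))

# Locally the two halves compose directly; under Hadoop Streaming each
# would be its own script reading sys.stdin and printing to stdout.
result = list(reducer(mapper(["the quick fox", "the lazy dog"])))
```

Under Hadoop the same pair would be wired up with the streaming jar's -mapper and -reducer options; nothing Python-specific is installed cluster-side.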
  23. Python UDFs
  24. You: But what about that speed layer?
  25. Python Streams
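Plain Python generators already give a serviceable mental model for the speed layer: consume events one at a time and keep incrementally updated state, instead of re-scanning a batch. A toy sketch (the event names and snapshot interval are invented for illustration):

```python
from collections import Counter

def rolling_counts(events, every=3):
    """Consume a (potentially unbounded) event stream, yielding a snapshot
    of the running counts after every `every` events."""
    counts = Counter()
    for i, event in enumerate(events, 1):
        counts[event] += 1
        if i % every == 0:
            yield dict(counts)

stream = iter(["click", "view", "click", "view", "view", "click"])
snapshots = list(rolling_counts(stream))
# snapshots[0] == {"click": 2, "view": 1}
```

The same shape scales up: swap the list for a socket or Kafka consumer and the generator body is unchanged.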
  26. You: Okay, but how do you use all this? I just see pictures.
  27. Me: Genie, give me a Hadoop
  28. conda cluster create \
          demo \
          -p simple_aws \
          -n 3 \
          --size m1.large \
          --id ami-3c994355
  29. You: Great, what do I do with that?
  30. Me: Genie, give me a useful cluster
  31. conda cluster manage \
          demo create \
          -n test_dev \
          python=2.7 \
          numpy=1.6 \
          …
  32. Me: Genie, make Spark faster
  33. Numba on Spark: https://gist.github.com/jlyons871/14c7c12606aec7ff80f9
  34. sc = SparkContext(appName="PythonALS")
      print "Running ALS with M=%d, U=%d, F=%d, iters=%d, partitions=%d\n" % \
          (M, U, F, ITERATIONS, partitions)

      R = matrix(rand(M, F)) * matrix(rand(U, F).T)
      ms = matrix(rand(M, F))
      us = matrix(rand(U, F))

      Rb = sc.broadcast(R)
      msb = sc.broadcast(ms)
      usb = sc.broadcast(us)
  35. ms = sc.parallelize(range(M), partitions) \
             .map(lambda x: update(x, msb.value[x, :], usb.value, Rb.value)) \
             .collect()
      # collect() returns a list, so array ends up being a 3-d array;
      # we take the first 2 dims for the matrix
      ms = matrix(np.array(ms)[:, :, 0])
      msb = sc.broadcast(ms)

      us = sc.parallelize(range(U), partitions) \
             .map(lambda x: update(x, usb.value[x, :], msb.value, Rb.value.T)) \
             .collect()
      us = matrix(np.array(us)[:, :, 0])
  36. from numba import jit  # Numba JIT-compiles the plain-Python loop
      import numpy as np

      def jit_add(xtx, _uu):
          for j in range(xtx.shape[0]):
              xtx[j, j] += LAMBDA * _uu

      def update(i, vec, mat, ratings):
          XtX = mat.T * mat
          Xty = mat.T * ratings[i, :].T
          NAL = jit(jit_add)      # compile the inner loop before running it
          NAL(XtX, mat.shape[0])  # add LAMBDA * n to the diagonal of XtX
          return np.linalg.solve(XtX, Xty)
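The loop that Numba compiles in jit_add is a small regularization step: add LAMBDA times the number of rows to the diagonal of XtX before solving. Stripped of numpy and numba, it is just this (the LAMBDA value and the 2x2 matrix are invented for illustration):

```python
LAMBDA = 0.01  # regularization weight; value is illustrative only

def add_regularization(xtx, n):
    """Pure-Python equivalent of jit_add: add LAMBDA * n to the diagonal
    of a square matrix given as a list of lists."""
    for j in range(len(xtx)):
        xtx[j][j] += LAMBDA * n
    return xtx

m = add_regularization([[1.0, 0.0], [0.0, 1.0]], 10)
# each diagonal entry becomes 1.0 + 0.01 * 10 = 1.1
```

The point of the slide is that Numba turns exactly this kind of scalar Python loop into compiled code without rewriting it.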
  37. Me: Genie, run this guy
  38. conda cluster launch \
          demo \
          spark_script
  39. You: How do sane people use this?
  40. Data Pain

      Dealing with data applications has numerous pain points:
      • Hundreds of data formats
      • Basic programs expect all data to fit in memory
      • Data analysis pipelines constantly changing from one form to another
      • Sharing analysis carries significant overhead to configure systems
      • Parallelizing analysis requires an expert in a particular distributed computing stack
  41. Blaze

      Image source: http://worrydream.com/ABriefRantOnTheFutureOfInteractionDesign/
  42. Blaze: connecting technologies to users, and technologies to each other

      Spans BI/DB, DM/Stats/ML, Scientific Computing, and Distributed Systems (alongside bcolz and hdf5)
  43. Blaze

      • Data storage: csv, json, HDF5, bcolz, DataFrame, HDFS, MongoDB
      • Abstract expressions: selection, filter, group by, join, column-wise operations
      • Computational backends: Pandas, Streaming Python, Spark, SQLAlchemy
  44. Blaze Architecture

      Deferred expressions, compilers, and interpreters, spanning data, compute, and API layers.
      • Flexible architecture to accommodate exploration
      • Uses compilation of deferred expressions to optimize data interactions
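The deferred-expression idea can be sketched in a few lines of plain Python: building an expression only records *what* to compute, and a separate interpreter later decides *how*, per backend. The class names below are illustrative, not the Blaze API:

```python
class Symbol:
    """Placeholder for a dataset; records a name, holds no data."""
    def __init__(self, name):
        self.name = name
    def filter(self, pred):
        return Filter(self, pred)

class Filter:
    """Deferred filter: stores the predicate instead of applying it."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred

def compute(expr, bindings):
    """Interpret the expression tree against concrete data."""
    if isinstance(expr, Symbol):
        return bindings[expr]
    if isinstance(expr, Filter):
        return [row for row in compute(expr.child, bindings) if expr.pred(row)]

t = Symbol('t')
expr = t.filter(lambda row: row > 2)          # nothing computed yet
result = compute(expr, {t: [1, 2, 3, 4]})     # -> [3, 4]
```

Because the expression is data, a different compute() could translate the same tree into a Pandas mask or a Spark job instead of a list comprehension, which is the optimization opening the slide refers to.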
  45. Blaze Example - Counting Weblinks

      Common Blaze code:

      # Expr
      t_idx = TableSymbol('{name: string, node_id: int32}')
      t_arc = TableSymbol('{node_out: int32, node_id: int32}')
      joined = Join(t_arc, t_idx, "node_id")
      t = By(joined, joined['name'], joined['node_id'].count())

      # Data load
      idx, arc = load_data()

      # Computations
      ans = compute(t, {t_arc: arc, t_idx: idx})
      in_deg = dict(ans)
      in_deg[u'blogspot.com']
  46. Blaze Example - Counting Weblinks

      load_data using Spark + HDFS:

      sc = SparkContext("local", "Simple App")
      idx = sc.textFile("hdfs://master.continuum.io/example_index.txt")
      idx = idx.map(lambda x: x.split('\t'))\
               .map(lambda x: [x[0], int(x[1])])
      arc = sc.textFile("hdfs://master.continuum.io/example_arcs.txt")
      arc = arc.map(lambda x: x.split('\t'))\
               .map(lambda x: [int(x[0]), int(x[1])])

      load_data using Pandas + local disk:

      with open("example_index.txt") as f:
          idx = [ln.strip().split('\t') for ln in f.readlines()]
      idx = DataFrame(idx, columns=['name', 'node_id'])
      with open("example_arcs.txt") as f:
          arc = [ln.strip().split('\t') for ln in f.readlines()]
      arc = DataFrame(arc, columns=['node_out', 'node_id'])
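Whichever backend loads the data, the expression from slide 45 computes the same thing: join the index and arc tables on node_id, then count incoming links per site name. Sketched in plain Python with made-up data values:

```python
from collections import Counter

idx = [("blogspot.com", 1), ("example.org", 2)]   # (name, node_id)
arc = [(2, 1), (3, 1), (2, 2)]                    # (node_out, node_id)

# Join on node_id: map each node_id to its site name...
names = dict((node_id, name) for name, node_id in idx)

# ...then group the arcs by name and count, i.e. in-degree per site.
in_deg = Counter(names[node_id] for _, node_id in arc)

print(in_deg["blogspot.com"])   # -> 2
```

This is roughly what the streaming-Python backend does eagerly; the Spark backend expresses the same join/group-by over RDDs.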
  47. Want to learn more about Blaze?

      Free webinar: http://www.continuum.io/webinars/getting-started-with-blaze
      Blog posts:
      http://continuum.io/blog/blaze-expressions
      http://continuum.io/blog/blaze-migrations
      http://continuum.io/blog/blaze-hmda
      Docs and source code:
      http://blaze.pydata.org/
      https://github.com/ContinuumIO/blaze
  48. Data visualization - an overview

      Results presentation vs. visual analytics; static vs. interactive; small datasets vs. large datasets; traditional plots vs. novel graphics.
  49. Bokeh

      • Interactive visualization
      • Novel graphics
      • Streaming, dynamic, large data
      • For the browser, with or without a server
      • Matplotlib compatibility
      • No need to write JavaScript
      http://bokeh.pydata.org/
      https://github.com/ContinuumIO/bokeh
  50. Bokeh - Interactive, visual analytics

      • Tools (e.g. Pan, Wheel Zoom, Save, Resize, Select, Reset View)
  51. Bokeh - Interactive, visual analytics

      • Widgets and dashboards
  52. Bokeh - Interactive, visual analytics

      • Crossfilter
  53. Bokeh - Large datasets

      Server-side downsampling and abstract rendering
  54. Bokeh - No JavaScript