PyDATA NYC 2014 Keynote

Python in the Hadoop Ecosystem Presented by: Andy R. Terrel

PyData NYC 2014 2
First a Thank You!
Without these two, PyData NYC would have been a lot colder.
• Leah Silen: Executive Director, NumFOCUS
• James Powell: blog author, Python enthusiast, code cowboy

PyData NYC 2014 3 You: Who the hell is this on stage?

PyData NYC 2014 About Andy
Andy R. Terrel, @aterrel
Chief Scientist, Continuum Analytics
President, NumFOCUS
Background:
• High Performance Computing
• Computational Mathematics
• President, NumFOCUS foundation
Experience with:
• Finance
• Simulations
• Web data
• Social media

PyData NYC 2014 About Continuum Analytics
http://continuum.io/
We build technologies that enable analysts and data scientists to answer questions from the data all around us.
Areas of Focus
• Software solutions
• Consulting
• Training
Committed to Open Source
• Anaconda: Free Python distribution
• Numba, Conda, Blaze, Bokeh, dynd
• Sponsor

PyData NYC 2014
“At my company X, we have peta/terabytes of data, just lying around, waiting for someone to explore it” - someone at PyTexas
Let’s make it easier for users to explore and extract useful insights out of data.
• Anaconda: Free enterprise-ready Python distribution
• Conda: Package manager
• Blaze: Scale
• Bokeh: Interactive data visualizations
• Numba: Power to speed up
• Wakari: Share and deploy

PyData NYC 2014 7 You: Great, some jerk with lots of titles.

PyData NYC 2014 8 Andy used to solve problems like these

PyData NYC 2014 12 Andy now solves problems like these

PyData NYC 2014 16 You: Okay, what are you here to talk about?

PyData NYC 2014 About this talk
Python in the Hadoop Ecosystem
Objective
Introduction to Hadoop and Python tools for large-scale data analytics and interactive visualization
Structure
1. Discussion of Hadoop and Python
2. Large scale data analytics - Blaze
3. Interactive data visualization - Bokeh

PyData NYC 2014 18 The Ecosystem

PyData NYC 2014 Large scale data analytics - An Overview
[Diagram labels: BI - DB, DM/Stats/ML, Scientific Computing, Distributed Systems; tools shown include Numba, bcolz, RHadoop]

PyData NYC 2014 21 Base Hadoop Stack

PyData NYC 2014 22 Berkeley Data Science Stack

PyData NYC 2014 23 You: And where exactly is Python?

PyData NYC 2014 24 Python as the Driver Hadoop Streaming PySpark Pig
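Of the three drivers named on this slide, Hadoop Streaming is the easiest to sketch in plain Python: Hadoop pipes input lines to a mapper, sorts the mapper's tab-separated key/value records by key, and pipes them to a reducer. The word-count task and the names mapper, reducer, and run_locally below are illustrative assumptions, not code from the talk.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word.lower()


def reducer(sorted_records):
    """Sum the counts for each key; input must already be sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


def run_locally(lines):
    """Stand in for Hadoop's shuffle with sorted(), to try the job without a cluster."""
    return list(reducer(sorted(mapper(lines))))


counts = run_locally(["a b a", "B"])
```

On a cluster, the two functions would live in scripts that read sys.stdin and print each record, and those scripts would be handed to the Hadoop Streaming jar through its -mapper and -reducer options.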

PyData NYC 2014 25 Python UDFs

PyData NYC 2014 26 You: But what about that speed layer?

PyData NYC 2014 27 Python Streams

PyData NYC 2014 28 You: Okay, but how do you use all this? I just see pictures

PyData NYC 2014 29 Me: Genie give me a Hadoop

PyData NYC 2014 31

    conda cluster create \
        demo \
        -p simple_aws \
        -n 3 \
        --size m1.large \
        --id ami-3c994355

PyData NYC 2014 33 You: Great, what do I do with that?

PyData NYC 2014 34 Me: Genie give me a useful cluster

PyData NYC 2014 36

    conda cluster manage \
        demo create \
        -n test_dev \
        python=2.7 \
        numpy=1.6 \
        …

PyData NYC 2014 37 Me: Genie make spark faster

PyData NYC 2014 38 Numba on Spark: https://gist.github.com/jlyons871/14c7c12606aec7ff80f9

PyData NYC 2014 39

    sc = SparkContext(appName="PythonALS")

    print "Running ALS with M=%d, U=%d, F=%d, iters=%d, partitions=%d\n" % \
        (M, U, F, ITERATIONS, partitions)

    R = matrix(rand(M, F)) * matrix(rand(U, F).T)
    ms = matrix(rand(M, F))
    us = matrix(rand(U, F))

    Rb = sc.broadcast(R)
    msb = sc.broadcast(ms)
    usb = sc.broadcast(us)

PyData NYC 2014 40

    ms = sc.parallelize(range(M), partitions) \
           .map(lambda x: update(x, msb.value[x, :], usb.value, Rb.value)) \
           .collect()
    # collect() returns a list, so array ends up being
    # a 3-d array, we take the first 2 dims for the matrix
    ms = matrix(np.array(ms)[:, :, 0])
    msb = sc.broadcast(ms)

    us = sc.parallelize(range(U), partitions) \
           .map(lambda x: update(x, usb.value[x, :], msb.value, Rb.value.T)) \
           .collect()
    us = matrix(np.array(us)[:, :, 0])

PyData NYC 2014 41

    def jit_add(xtx, _uu):
        for j in range(xtx.shape[0]):
            xtx[j, j] += LAMBDA * _uu

    def update(i, vec, mat, ratings):
        XtX = mat.T * mat
        Xty = mat.T * ratings[i, :].T
        NAL = jit(jit_add)
        NAL(XtX, mat.shape[0])
        return np.linalg.solve(XtX, Xty)
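Read as an equation, update solves one regularized least-squares problem per row of a factor matrix. The notation here is introduced for illustration and does not appear on the slides: X stands for mat (n rows, F columns), r_i for row i of ratings, \lambda for LAMBDA, and w_i for the returned vector.

```latex
\left( X^{\top} X + \lambda \, n \, I_F \right) w_i = X^{\top} r_i^{\top}
```

The \lambda n I_F term is what jit_add adds to the diagonal of XtX, with n passed in as mat.shape[0]; that diagonal loop is the only part of the update handed to Numba's jit.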

PyData NYC 2014 42 Me: Genie run this guy

PyData NYC 2014 43

    conda cluster launch \
        demo \
        spark_script

PyData NYC 2014 45 You: How do sane people use this?

PyData NYC 2014 46 Blaze

Data Pain
• Dealing with data applications has numerous pain points
  - Hundreds of data formats
  - Basic programs expect all data to fit in memory
  - Data analysis pipelines constantly change from one form to another
  - Sharing analysis carries significant overhead to configure systems
  - Parallelizing analysis requires an expert in a particular distributed computing stack

PyData NYC 2014 Blaze Source: http://worrydream.com/ABriefRantOnTheFutureOfInteractionDesign/

PyData NYC 2014 Blaze
[Diagram labels: Distributed Systems, Scientific Computing, BI - DB, DM/Stats/ML, Blaze, bcolz, hdf5]
• Connecting technologies to users
• Connecting technologies to each other

PyData NYC 2014 Blaze
Data Storage: csv, HDF5, bcolz, DataFrame, HDFS, json
Abstract expressions: selection, filter, group by, join, column wise
Computational backend: Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy

Blaze Architecture
[Diagram labels: API, Deferred Expr, Compilers, Interpreters, Data, Compute]
• Flexible architecture to accommodate exploration
• Use compilation of deferred expressions to optimize data interactions
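The idea of a deferred expression can be illustrated with a toy interpreter: the expression is built as a tree that holds no data, and data is bound only when compute is called. This is a minimal sketch under stated assumptions, not Blaze's implementation; the classes Symbol, Filter, and Count and this compute function are hypothetical names invented for the example.

```python
class Symbol(object):
    """A named placeholder for data that is bound only at compute time."""
    def __init__(self, name):
        self.name = name


class Filter(object):
    """Deferred 'keep the rows for which predicate(row) is true'."""
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate


class Count(object):
    """Deferred 'number of rows'."""
    def __init__(self, child):
        self.child = child


def compute(expr, bindings):
    """Interpret an expression tree against in-memory Python iterables."""
    if isinstance(expr, Symbol):
        return bindings[expr.name]
    if isinstance(expr, Filter):
        rows = compute(expr.child, bindings)
        return (row for row in rows if expr.predicate(row))
    if isinstance(expr, Count):
        return sum(1 for _ in compute(expr.child, bindings))
    raise TypeError("unknown expression node: %r" % (expr,))


# Build the expression once; no data has been touched yet.
t = Symbol("t")
expr = Count(Filter(t, lambda row: row["amount"] > 10))

# Bind made-up sample data only when the result is needed.
rows = [{"amount": 5}, {"amount": 20}, {"amount": 30}]
result = compute(expr, {"t": rows})
```

Because the tree is separate from the data, a different interpreter or compiler could walk the same expr and target another backend, which is the flexibility the architecture slide describes.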

Blaze Example - Counting Weblinks
Common Blaze Code

    # Expr
    t_idx = TableSymbol('{name: string, node_id: int32}')
    t_arc = TableSymbol('{node_out: int32, node_id: int32}')
    joined = Join(t_arc, t_idx, "node_id")
    t = By(joined, joined['name'], joined['node_id'].count())

    # Data Load
    idx, arc = load_data()

    # Computations
    ans = compute(t, {t_arc: arc, t_idx: idx})

    in_deg = dict(ans)
    in_deg[u'blogspot.com']
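To make the result of the Blaze expression concrete, the same join, group-by, and count can be written in plain Python for small in-memory data. This is a sketch of the computation only, not of how Blaze executes it; the function name in_degree_by_name and the sample rows are made up, and it assumes each node_id appears at most once in the index.

```python
from collections import Counter


def in_degree_by_name(idx, arc):
    """Join arcs to the index on node_id, then count arcs per name.

    idx: iterable of (name, node_id) pairs
    arc: iterable of (node_out, node_id) pairs
    """
    # Assumption: node_id is unique in idx, so a dict is a valid join index.
    name_of = {node_id: name for name, node_id in idx}
    # Inner join: arcs pointing at unknown node_ids are dropped.
    return Counter(
        name_of[node_id] for _node_out, node_id in arc if node_id in name_of
    )


# Made-up sample data in the same column order as the slides.
idx = [("blogspot.com", 1), ("example.org", 2)]
arc = [(2, 1), (3, 1), (1, 2), (1, 99)]

in_deg = in_degree_by_name(idx, arc)
```

Here in_deg maps each site name to the number of incoming links, which is the quantity the slide looks up for blogspot.com.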

Blaze Example - Counting Weblinks
load_data

Using Spark + HDFS

    sc = SparkContext("local", "Simple App")
    idx = sc.textFile("hdfs://master.continuum.io/example_index.txt")
    idx = idx.map(lambda x: x.split('\t'))\
             .map(lambda x: [x[0], int(x[1])])
    arc = sc.textFile("hdfs://master.continuum.io/example_arcs.txt")
    arc = arc.map(lambda x: x.split('\t'))\
             .map(lambda x: [int(x[0]), int(x[1])])

Using Pandas + Local Disc

    with open("example_index.txt") as f:
        idx = [ln.strip().split('\t') for ln in f.readlines()]
    idx = DataFrame(idx, columns=['name', 'node_id'])

    with open("example_arcs.txt") as f:
        arc = [ln.strip().split('\t') for ln in f.readlines()]
    arc = DataFrame(arc, columns=['node_out', 'node_id'])

PyData NYC 2014 Want to learn more about Blaze?
Free Webinar:
http://www.continuum.io/webinars/getting-started-with-blaze
Blogpost:
http://continuum.io/blog/blaze-expressions
http://continuum.io/blog/blaze-migrations
http://continuum.io/blog/blaze-hmda
Docs and source code:
http://blaze.pydata.org/
https://github.com/ContinuumIO/blaze

PyData NYC 2014 Data visualization - An Overview
• Results presentation vs. Visual analytics
• Static vs. Interactive
• Small datasets vs. Large datasets
• Traditional plots vs. Novel graphics

PyData NYC 2014 Bokeh
• Interactive visualization
• Novel graphics
• Streaming, dynamic, large data
• For the browser, with or without a server
• Matplotlib compatibility
• No need to write JavaScript
http://bokeh.pydata.org/
https://github.com/ContinuumIO/bokeh

PyData NYC 2014 Bokeh - Interactive, Visual analytics • Tools (e.g. Pan, Wheel Zoom, Save, Resize, Select, Reset View)

PyData NYC 2014 58 Bokeh - Interactive, Visual analytics • Widgets and dashboards

PyData NYC 2014 59 Bokeh - Interactive, Visual analytics • Crossfilter

PyData NYC 2014 60 Bokeh - Large datasets Server-side downsampling and abstract rendering

PyData NYC 2014 61 Bokeh - No JavaScript

PyData NYC 2014 Thank you! :)
