Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015

Data Science in 2016: Moving Up 2015-10-15 • Madrid •
http://bigdataspain.org/ Paco Nathan, @pacoid  O’Reilly Media

• general patterns • trends and analysis: the discipline, the
jobs • some good examples: moving up into use cases • glimpses ahead: an emerging context • a proposed theme Data Science 2016: Moving Up

Design Patterns

Design Patterns Methodology for cloud-computing architecture  (2008-06-29) http://ceteri.blogspot.com/2008/06/methodology-for-cloud-computing.html

Design Patterns (diagram of “some cloud”; labels: cluster scheduler, data pipes, containers, analytics, search/index, elastic compute, elastic storage)

Design Patterns some cloud

Design Patterns some cloud DataStax $189.7M Confluent $30.9M Databricks $47M
Jupyter $6M Elastic $104M Docker $162M Mesosphere $48.75M

Design Patterns: Issues some cloud • integration could be better
• that implies sharing markets • VCs in Silicon Valley dislike that • customers need integration

some cloud Design Patterns: Where?


Design Patterns: Where? some cloud • that playing field becomes
overly crowded, soon… • what happens at that point?

• so much emphasis on plumbing: `data engineering` • not
enough on domain expertise, which trumps all. Much activity in Big Data seems awkwardly focused at the bottom of the tech stack: infrastructure, not domain. However, that may be changing… Design Patterns: Opinion

Interesting Trends

Interesting Trends There are many possible trends to discuss, but
let’s concentrate on four of these going into 2016: • leveraging multicore and large memory spaces • generalized libraries for frequently repeated work • workflows blend the best of people and computing • framework for a big leap ahead, not just incremental

Original definitions for what became relational databases had less to
do with dedicated SQL products, and more similarity with something like Spark SQL. Interesting Trend #1: Contemporary Hardware A relational model of data for large shared data banks  Edgar Codd  Communications of the ACM (1970)  dl.acm.org/citation.cfm?id=362685

Interesting Trend #1: Contemporary Hardware (diagram from Databricks: “Unified API, One Engine, Automatically Optimized”; language frontend: Python, Java/Scala, R, SQL, …; DataFrame, Logical Plan; Tungsten backend: LLVM, JVM, GPU, NVRAM, …)
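The layering on this slide (several language frontends, one logical plan, one optimizing backend) can be illustrated with a toy sketch in plain Python. This is not Spark's API or implementation: the `Scan`, `Filter`, and `Project` node classes, the single `optimize` rule, and the sample rows are all hypothetical names invented here to show why a shared plan representation lets one optimizer serve every frontend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    rows: tuple  # source rows, each a dict (hypothetical in-memory table)


@dataclass(frozen=True)
class Filter:
    child: object
    predicate: object  # callable: row -> bool


@dataclass(frozen=True)
class Project:
    child: object
    columns: tuple  # column names to keep


def optimize(plan):
    """Toy optimizer with one rule: collapse Filter(Filter(x)) into one Filter."""
    if isinstance(plan, Filter):
        child = optimize(plan.child)
        if isinstance(child, Filter):
            outer, inner = plan.predicate, child.predicate
            return Filter(child.child, lambda r: inner(r) and outer(r))
        return Filter(child, plan.predicate)
    if isinstance(plan, Project):
        return Project(optimize(plan.child), plan.columns)
    return plan


def execute(plan):
    """Toy backend: interpret the plan tree over in-memory rows."""
    if isinstance(plan, Scan):
        return list(plan.rows)
    if isinstance(plan, Filter):
        return [r for r in execute(plan.child) if plan.predicate(r)]
    if isinstance(plan, Project):
        return [{c: r[c] for c in plan.columns} for r in execute(plan.child)]
    raise TypeError(plan)


# Any frontend (SQL text, a DataFrame-style API, ...) would build the same tree.
rows = (
    {"name": "ana", "age": 31, "city": "Madrid"},
    {"name": "luis", "age": 22, "city": "Madrid"},
    {"name": "eva", "age": 40, "city": "Sevilla"},
)
plan = Project(
    Filter(Filter(Scan(rows), lambda r: r["age"] > 25), lambda r: r["city"] == "Madrid"),
    ("name",),
)
optimized = optimize(plan)
result = execute(optimized)
```

The point of the sketch is only the separation of concerns: the plan is data, so rewriting it and executing it are independent of whichever language produced it.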

Deep Dive into Project Tungsten: Bringing Spark Closer to
Bare Metal  Josh Rosen  spark-summit.org/2015/events/deep-dive-into-project-tungsten-bringing-spark-closer-to-bare-metal/ Physical Execution: CPU Efficient Data Structures Keep data closer to CPU cache Interesting Trend #1: Contemporary Hardware from Databricks

Interesting Trend #2: Generalized Libraries Tensors are a good way
to handle time-series, geo-spatially distributed, linked data with lots of N-dimensional attributes. In other words, nearly a general case for handling much of the data that we’re likely to encounter. That’s better than attempting to shoehorn data into matrix representation, then writing lots of custom code to support it.
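As a minimal sketch of the contrast drawn here, the plain-Python example below stores readings as an order-3 tensor (time × station × attribute) and compares that with flattening the same data into a matrix. The axis names, station names, and values are hypothetical, and nested lists stand in for a real tensor library.

```python
# Hypothetical axes for a small order-3 tensor: tensor[time][station][attribute].
TIMES = ["t0", "t1", "t2"]
STATIONS = ["madrid", "sevilla"]
ATTRS = ["temp_c", "humidity"]

tensor = [
    [[21.0, 0.40], [27.0, 0.30]],
    [[22.0, 0.38], [28.0, 0.29]],
    [[23.0, 0.35], [30.0, 0.25]],
]


def shape(t):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)


def series(t, station, attr):
    """Slice along the time axis: one attribute at one station."""
    s = STATIONS.index(station)
    a = ATTRS.index(attr)
    return [step[s][a] for step in t]


def unfold(t):
    """Flatten to a time x (station * attribute) matrix; the axis structure is lost."""
    return [[value for station in step for value in station] for step in t]


sevilla_temps = series(tensor, "sevilla", "temp_c")
matrix = unfold(tensor)
```

With the tensor, a time series for any station and attribute is a direct slice. After `unfold`, the caller has to remember that column 2 means "sevilla, temp_c", which is the kind of custom bookkeeping code the slide warns about.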

Tensor factorization may be problematic, but probabilistic solutions seem to
provide relatively general case solutions: The Tensor Renaissance in Data Science  Anima Anandkumar @UC Irvine  radar.oreilly.com/2015/05/the-tensor-renaissance-in-data-science.html Spacey Random Walks and Higher Order Markov Chains  David Gleich @Purdue  slideshare.net/dgleich/spacey-random-walks-and-higher-order-markov-chains Interesting Trend #2: Generalized Libraries

Interesting Trend #3: Leveraging Workflows (workflow diagram, circa 2010; labels: ETL into cluster/cloud, data, visualize, reporting, Data Prep, Features, Learners, Parameters, Unsupervised Learning, Explore, train set, test set, models, Evaluate, Optimize, Scoring, production data, use cases, data pipelines, actionable results, decisions, feedback, developers, algorithms; facets: representation, optimization, evaluation)
APIs, algorithms, developer-centric template thinking – these only go so far; the overall context is a workflow…

Interesting Trend #3: Leveraging Workflows (same workflow diagram) look beyond an API, beyond a code repo … think of people and machines working together
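The stages named in the workflow diagram (data prep, train/test sets, learner and parameters, evaluate, optimize, scoring) can be sketched end to end in a few lines. This is an illustrative toy under stated assumptions, not a recommended pipeline: the data is synthetic, the "learner" is a hypothetical one-parameter threshold classifier, and all function names are invented for this example.

```python
import random


def prepare(raw):
    """Data prep: drop records whose feature is missing."""
    return [(x, y) for x, y in raw if x is not None]


def split(rows, test_fraction=0.25, seed=0):
    """Shuffle deterministically, then split into train and test sets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def predict(threshold, x):
    """Learner: a one-parameter threshold classifier."""
    return 1 if x >= threshold else 0


def accuracy(threshold, rows):
    """Evaluate: fraction of rows labelled correctly."""
    return sum(predict(threshold, x) == y for x, y in rows) / len(rows)


def optimize(train, candidates):
    """Optimize: pick the candidate threshold with the best training accuracy."""
    return max(candidates, key=lambda t: accuracy(t, train))


# Synthetic raw data: (feature, label); label is 1 when feature >= 5.
raw = [(x, 1 if x >= 5 else 0) for x in range(10)] + [(None, 0)]

rows = prepare(raw)
train, test = split(rows)
best = optimize(train, candidates=range(11))
score = accuracy(best, test)  # scoring on held-out data
```

Even in this toy, the code is the small part. Deciding what counts as a missing value, which candidates to search, and whether the held-out score is good enough for the use case are the human steps that the slide places around the algorithm.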

Chris Ré, @Stanford  https://www.macfound.org/fellows/943/ Drugs, DNA, and Dinosaurs: Building High
Quality Knowledge Bases with DeepDive  Strata CA (2015) The Thorn in the Side of Big Data: too few artists  Strata CA (2014) Interesting Trend #4: A Leap Ahead

Interesting Trend #4: A Leap Ahead cognitive computing “flywheel”: probabilistic reasoning about complex data and predictions together


Data Scientists

William Cleveland  “Data Science: an Action Plan for Expanding
the Technical Areas of the Field of Statistics,” International Statistical Review (2001), 69, 21-26 http://www.stat.purdue.edu/~wsc/papers/datascience.pdf Leo Breiman  “Statistical modeling: the two cultures”, Statistical Science (2001), 16:199-231 http://projecteuclid.org/euclid.ss/1009213726 …also good to mention John Tukey Data Scientists: Primary Sources

Data Scientists: Five Years of Strata Conference

One 2015 report (RJMetrics) tallied a minimum of 11,400
data scientists worldwide by scraping LinkedIn. So many suddenly, really? Perhaps that’s doubtful… Comparing surveys: O’Reilly Media conducts salary surveys for data scientists, along with exploring the tools used: 2013 – tools, trends, not all data is “Big”, coding scripts! 2014 – correlation of tools and skills, rapid evolution 2015 – divide blurring between open source and proprietary Data Scientists: Everywhere, all the time?

http://radar.oreilly.com/2015/09/2015-data-science-salary-survey.html John King, Roger Magoulas Data Scientists: 2015 Survey

Data Scientists: 2015 Survey

Moving Up

Enlitic http://www.enlitic.com/ deep learning to assist doctors treating cancer Moving
Up: Medicine

Moving Up: Medicine “Whatever the models might discover or predict,
Howard isn’t suggesting they’ll do away with a doctor’s judgment. Rather, artificially intelligent computers could provide strong, unbiased second opinions, or perhaps lead a doctor down a path of investigation she otherwise wouldn’t have considered.” With Enlitic, a veteran data scientist plans to fight disease using deep learning  GigaOM (2014-08-22)  https://gigaom.com/2014/08/22/with-enlitic-a-veteran-data-scientist-plans-to-fight-disease-using-deep-learning/

Moving Up: Political Platform http://www.predikon.ch/en/voting-patterns/residents

Moving Up: Political Platform Mining Democracy  Matthias Grossglauser @EPFL  ICT
Labs (2015)  http://ictlabs-summer-school.sics.se/slides/mining%20democracy.pdf What if a political candidate could cluster political positions in a multi-dimensional data space, to optimize for being recommended to voters? http://www.predikon.ch/en/voting-patterns/residents

Moving Up: Government Ethics The White House has a plan
to help society through data analysis  Fortune (2015-09-30)  http://fortune.com/2015/09/30/dj-patil-white-house-data/

Moving Up: Government Ethics The White House has a plan
to help society through data analysis  Fortune (2015-09-30)  http://fortune.com/2015/09/30/dj-patil-white-house-data/ “Opening up government data about child labor to concerned data scientists; recruiting folks to help analyze data about suicide prevention, social injustice and incarceration; a call for mandatory and `intrinsic` ethics instruction in every course teaching students data science; and an effort to help the transgender community create its own census of sorts, so that members and society can get a better grasp on the issues that matter to the group.”

Moving Up: Neuroscience Analytics + Visualization for Neuroscience: Spark, Thunder,
Lightning Jeremy Freeman  2015-01-29 youtu.be/cBQm4LhHn9g?t=28m55s

For excellent examples of Science and Data together see CodeNeuro,
particularly for   use of Jupyter notebooks + Apache Spark Moving Up: Neuroscience

Learning

Learning: What About MOOCs?

Massive Open Online Courses – seven-year trend, beginning
with: Connectivism and Connective Knowledge  George Siemens, Stephen Downes  University of PEI (2008)  http://cck11.mooc.ca/ Learning: What About MOOCs? Adios Ed Tech. Hola something else  George Siemens (2015-09-09)  http://www.elearnspace.org/blog/2015/09/09/adios-ed-tech-hola-something-else/

Online education: MOOCs taken by educated few  Ezekiel Emanuel, Nature
503, 342 (2013-11-21) • 80% students already have an advanced degree • 80% come from the richest 6% of the population Michael Shanks @Stanford: “retrenchment around traditional disciplines will make disparities even more pronounced” An Early Report Card on Massive Open Online Courses  Geoffrey Fowler, WSJ (2013-10-08) Amherst, Duke, etc., have rejected edX Learning: What About MOOCs?

Learning: What About MOOCs? So then, what else works better?

How to Flip a Class  CTL @UT/Austin  http://ctl.utexas.edu/teaching/flipping-a-class/how
1. identify where the flipped classroom model makes the most sense for your course
2. spend class time engaging students in application activities with feedback
3. clarify connections between inside and outside of class learning
4. adapt your materials for students to acquire course content in preparation of class
5. extend learning beyond class through individual and collaborative practice
Learning: Inverted Classroom

Scalable Learning  David Black-Schaffer @Uppsala  Sverker Janson @KTH SICS https://www.scalable-learning.com/
• active learning: Flipped Classroom and Just-in-time Teaching • exams built directly into specific diagrams within videos • metrics for where in video+code that students get stuck • instructor can customize subsequent classroom discussions (active teaching phase) based on stuck/unstuck metrics Learning: Inverted Classroom

Learning programming at scale  Philip Guo  O’Reilly Radar (2015-08-13)
http://radar.oreilly.com/2015/08/learning-programming-at-scale.html • PythonTutor • Codechella Tutors could keep an eye on around 50 learners during a 30-minute session, start 12 chat conversations, and help 3 learners concurrently Learning: Collaborative Learning

Data-driven Education and the Quantified Student Lorena Barba @GWU PyData
Seattle (2015) https://youtu.be/2YIZ2SY9mW4 • keynote talk: abstract, slides • homepage • Open edX Universities Symposium, DC 2015-11-11 Learning: If you study just one link from this talk…

If by some bizarre chance you haven’t used   it
already, go to https://jupyter.org/ • 50+ different language kernels • new funding 2015-07 • UC Berkeley, Cal Poly • nbgrader autograder by Jess Hamrick • jupyterhub multi-user server • curating a list of examples • repeatable science! see also:  Teaching with Jupyter Notebooks  http://tinyurl.com/scipy2015-education Learning: Jupyter Project

Embracing Jupyter Notebooks at O'Reilly  Andrew Odewahn  O’Reilly Media (2015-05-07)
https://beta.oreilly.com/ideas/jupyter-at-oreilly O’Reilly Media is using our Atlas platform to make Jupyter Notebooks a first-class authoring environment for our publishing program Jupyter, Thebe, Atlas, Docker, etc. Learning: O’Reilly Media

Learning: O’Reilly Media https://beta.oreilly.com/

in-person blended on-demand Mostly Synchronous Mostly Asynch Inverted Classroom Subscription
Free Content Learning: Audience Patterns

Is it possible to measure “distance” between   a learner
and a subject community? From Amateurs to Connoisseurs:  Modeling the Evolution of User   Expertise through Online Reviews  Julian McAuley, Jure Leskovec  http://i.stanford.edu/~julian/pdfs/www13.pdf Learning: Machine Learning about People Learning

Learning, Assessment, Team Building, Diversity – these can be accomplished
together, in situ Collective Intelligence in Human Groups  Anita Williams Woolley @CMU  https://youtu.be/Bz1dDiW2mvM • balance of participation (no one dominates) • 2+ women engaging within the group • group size < 9 • diversity of formal backgrounds Learning: Machine Learning about People Learning

People + Automation

Data Science teams apply machine learning (automation) to help arrive
at key insights, to learn what is important in data sets – finding the proverbial needle in the haystack. Cognitive Computing exhibits people + automation as a process, in a learning context. That’s also a basic tenet of workflows in general: people + automation. And a key aspect of the emerging gig economy too… People + Automation

People + Automation: Gig Economy

People + Automation: Gig Economy http://orchestra.unlimitedlabs.com/ “Workflows with humans and
machines”

People + Automation: Gig Economy Workers in a World of
Continuous Partial Employment  Tim O’Reilly  Medium (2015-08-31)  https://medium.com/the-wtf-economy/workers-in-a-world-of-continuous-partial-employment-4d7b53f18f96 http://conferences.oreilly.com/next-economy

Learning is key. Effective use of Data Science in these
new economic conditions requires people + automation, learning together – albeit in different ways. Plus, there’s an excellent framework for that: Autopoiesis and Cognition  Humberto Maturana, Francisco Varela  Springer (1973) https://books.google.es/books?id=nVmcN9Ja68kC People + Automation

I’d like to leave this as a theme for you
to consider about Data Science 2016, Moving Up into use cases… We see an intersection of key points in both the emerging Cognitive Computing context and the Gig Economy in general: systems of people + automation, learning together. It posits an interesting duality for us to leverage. With that I wish you a great conference here at Big Data Spain! People + Automation

Gracias

contact: Just Enough Math  O’Reilly (2014)  justenoughmath.com  preview: youtu.be/TQ58cWgdCpA monthly
newsletter for updates, events, conf summaries, etc.: liber118.com/pxn/ Intro to Apache Spark  O’Reilly (2015)  shop.oreilly.com/product/0636920036807.do
