

Pycon ZA
October 11, 2018

From Idea to Product: Customer Profiling in Apache Zeppelin with PySpark by Sarah Sprich

Zeppelin is a web-based notebook which enables interactive data analytics on big data. Data can easily be ingested from a variety of databases, and analysis can be performed in Python and PySpark. Visualisations can be built and displayed together with the code, using Zeppelin's built-in tool Helium, or Python-specific tools such as Matplotlib and Bokeh. The web-based interface facilitates easy sharing of results and collaboration on projects.

Developing in Zeppelin has changed the way we approach model development. We are able to take a project from an idea to a product all within one tool, using the following process:

1. Come up with an idea. Write some notes in a Zeppelin notebook describing how we would like the idea implemented.
2. Slowly start fleshing out the idea with real code until the solution is built. This stage is great to demo, as the code is in bite-sized chunks and visualisations can be added directly alongside it.
3. Take the code into production. It can be scheduled to run directly in Zeppelin with the built-in cron scheduler, or from a tool such as NiFi. Interactive visualisations can be embedded in a web-based frontend.

This talk is aimed at data scientists, particularly those working with big data. We will demonstrate how we have built a catalogue of subscriber attributes based on customer mobile usage and purchase behaviour using Zeppelin and PySpark. These attributes can be used to profile subscribers, and are the starting point for individualised customer engagement. Anyone who attends this talk will get an introduction to Zeppelin and PySpark and an overview of what can be achieved with these tools.


Transcript

  1. Pycon: From Idea to Product: Building Models in Apache Zeppelin with PySpark.
    Introduction to Zeppelin. This is Zeppelin. Welcome. It's a notebook, much like Jupyter. It is part of Apache, and of the Hortonworks stack (Hive, Spark, Yarn, Kafka etc., all managed by Ambari).
    Contents: 1. Introduction 2. The Notebook Way of Working 3. Zeppelin vs Jupyter 4. Reading and Writing Data with Zeppelin 5. Visualising Data with Zeppelin 6. Using Zeppelin to build Models
  2. Setup. "Must be used before SparkInterpreter (%spark) initialized. Hint: put this paragraph before any Spark code and restart Zeppelin/Interpreter."
    Introduction. About Digitata. About Me:
    - I've worked in the modelling team at Digitata for the last 5 years.
    - Before that, I worked as a MATLAB consultant.
    - I trained as an electrical engineer.
    Technologies: In today's talk, I'll be using the following technologies:
    - Zeppelin
    - Python (including numpy and pandas)
    - PySpark and Spark MLlib
    - Matplotlib and Bokeh
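    The setup hint at the top of this slide is the message Zeppelin shows for configuration paragraphs that have to run before the Spark interpreter starts. As a minimal sketch (the property names and values below are illustrative assumptions, and the %spark.conf paragraph assumes a reasonably recent Zeppelin), such a setup paragraph might look like:

        %spark.conf
        # Illustrative Spark settings; run this paragraph before any %pyspark code,
        # and restart the interpreter if Spark has already been initialised.
        spark.executor.memory 4g
        spark.executor.cores 2
        spark.app.name customer-profiling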
  3. The Notebook Way of Working. Using Zeppelin has completely changed the way that our team works. Our old analysis workflow used to look something like this, and it has now been simplified (workflow diagrams shown on the slide).
    We are using notebooks in three ways:
    1. Personal Environment: each of our team members has a VM with a Zeppelin instance running, to do ad-hoc analysis on small datasets or prototype analysis on data samples.
    2. Test Environment: we have a test cluster with the Hortonworks stack installed. We use the Zeppelin instance here to collaborate on work, and to run larger analyses.
    3. Production Environment: we have Zeppelin installed on customer production servers. Here we can run analysis on large datasets, at the source of the data.
  4. Zeppelin vs Jupyter. Here's a comparison, with a focus on why we chose Zeppelin.
    Reading and Writing Data with Zeppelin. We typically use the following databases:
  5. Postgres and Postgres-XL: used to store Call Data Record (CDR), Event Data Record (EDR) and network statistics data which we collect from our customers, the network operators.
    MongoDB: used to store data required in the real-time components of our solution.
    Hive: used at our larger customers to store CDR and EDR data.
    Although interpreters can be created explicitly for these sources, we typically use the following two approaches to read from and write to them:
    - Python libraries such as psycopg2 and pymongo (for 'small' data)
    - PySpark (for 'big' data)
    Write SQL query. Read data from Postgres using PySpark. Fetched 168124 rows of data.
    Display the data... uh... no: the raw df.show() output is an unreadable wall of text across the 22 columns (msisdn, firstcall, lastcall, totalcallduration, chargedcallduration, zerochargedcallduration, chargedcallamt, voiceactivedays, firstsms, lastsms, totalsmscnt, chargedsmscnt, zerochargedsmscnt, chargedsmsamt, smsactivedays, firstdata, lastdata, totaldatavolume, chargeddatavolume, zerochargeddatavolume, chargeddataamt, dataactivedays).
    Display the data... much better: Zeppelin renders the same DataFrame in its built-in, scrollable table view (output truncated at 102400 bytes).
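    As a rough sketch of the read-and-display step above (hostnames, credentials and table names are placeholders, not the ones used in the talk), a %pyspark paragraph combining the two approaches could look like this:

        %pyspark
        import pandas as pd
        import psycopg2

        # 'Small' data: pull a sample straight into pandas via psycopg2.
        conn = psycopg2.connect(host="db.example.com", dbname="cdr",
                                user="analyst", password="...")
        sample = pd.read_sql("SELECT * FROM subscriber_attributes LIMIT 1000", conn)
        conn.close()

        # 'Big' data: let Spark read the table in parallel over JDBC
        # (assumes the Postgres JDBC driver is on the Spark classpath).
        df = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://db.example.com:5432/cdr")
              .option("dbtable", "subscriber_attributes")
              .option("user", "analyst")
              .option("password", "...")
              .option("driver", "org.postgresql.Driver")
              .load())
        print("Fetched %d rows of data" % df.count())

        # df.show() prints an unreadable wall of text for wide tables;
        # z.show(df) renders it in Zeppelin's interactive table view instead.
        z.show(df)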
  6. Visualising Data with Zeppelin. A big advantage of working with notebooks is that data analysis code can be displayed side by side with visualisations of the results. With libraries such as Bokeh, insightful, interactive and beautiful data visualisations can be built. I'll be showing some visualisations using:
    - Helium (built in to Zeppelin)
    - Matplotlib
    - Bokeh
    Helium: display an attribute distribution chart (Select Attribute: totalcallduration).
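    A minimal sketch of how such a paragraph can be driven by a dynamic form (the DataFrame df and the attribute list are assumptions carried over from the earlier read, not the exact code from the talk):

        %pyspark
        # Zeppelin dynamic form: let the viewer pick which attribute to look at.
        attrs = ["totalcallduration", "totalsmscnt", "totaldatavolume"]
        attr = z.select("Select Attribute", [(a, a) for a in attrs], "totalcallduration")

        # Hand the single column to Zeppelin's display system; the distribution
        # chart itself is then chosen in the paragraph's chart settings.
        z.show(df.select(attr))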
  7. (Helium distribution chart for totalcallduration shown on the slide; axis tick values omitted.)
    Matplotlib: scatter plot comparing two attributes (Select Attribute for x-axis: totalcallduration; Select Attribute for y-axis: totaldatavolume), with both fields log-scaled (a sketch of such a paragraph follows below).
    Define a box-whisker Bokeh function.
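    A sketch of the scatter-plot paragraph mentioned above (df, the sampling fraction and the log transform are assumptions based on the slide, not the talk's exact code):

        %pyspark
        import numpy as np
        import matplotlib.pyplot as plt

        x_attr, y_attr = "totalcallduration", "totaldatavolume"

        # Sample the large Spark DataFrame down to something pandas/matplotlib can hold.
        pdf = df.select(x_attr, y_attr).sample(False, 0.01, seed=42).toPandas()

        plt.figure(figsize=(8, 5))
        plt.scatter(np.log2(pdf[x_attr].astype(float) + 1),
                    np.log2(pdf[y_attr].astype(float) + 1),
                    s=2, alpha=0.3)
        plt.xlabel(x_attr + " (log2)")
        plt.ylabel(y_attr + " (log2)")
        plt.show()  # recent Zeppelin renders this inline; older versions pass the plot to z.show()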
  8. Bokeh: box-and-whisker plot comparing the distribution of similar attributes (rendered with BokehJS, https://bokeh.pydata.org).
    Using Zeppelin to build Models. Once you've got a good data set together, this is almost too easy. Many tools could be used to build models, but I'll demo K-means clustering using the Spark MLlib implementation.
    Prepare data for clustering: applied log scaling to skewed features; features scaled to range [0.000000, 1.000000].
    Build a K-means model: training a k-means model, making predictions on feature data.
    Cluster centers (features: totalcallduration, chargedcallamt, voiceactivedays, totalsmscnt, chargedsmsamt):
    Cluster 0: 0.48  0.65  0.58  0.04  0.27
    Cluster 1: 0.44  0.60  0.48  0.07  0.26
    Cluster 2: 0.22  0.42  0.14  0.02  0.06
    Cluster 3: 0.51  0.65  0.69  0.43  0.58
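    As a rough sketch of those modelling steps (column names are taken from the slides; the log scaling, MinMaxScaler and k=4 are illustrative assumptions), the MLlib pipeline could look like this:

        %pyspark
        from pyspark.sql import functions as F
        from pyspark.ml.feature import VectorAssembler, MinMaxScaler
        from pyspark.ml.clustering import KMeans

        feature_cols = ["totalcallduration", "chargedcallamt", "voiceactivedays",
                        "totalsmscnt", "chargedsmsamt"]

        # Log-scale the skewed usage features, keeping the subscriber key so
        # cluster assignments can be joined back later.
        logged = df.select("msisdn",
                           *[F.log1p(F.col(c)).alias(c) for c in feature_cols])

        # Assemble a feature vector and scale every feature to [0, 1].
        assembled = VectorAssembler(inputCols=feature_cols,
                                    outputCol="raw_features").transform(logged)
        scaled = (MinMaxScaler(inputCol="raw_features", outputCol="features")
                  .fit(assembled).transform(assembled))

        # Train a k-means model and assign each subscriber to a cluster.
        kmeans = KMeans(k=4, seed=1, featuresCol="features", predictionCol="cluster")
        model = kmeans.fit(scaled)
        predictions = model.transform(scaled)

        for i, centre in enumerate(model.clusterCenters()):
            print("Cluster %d: %s" % (i, ["%.2f" % v for v in centre]))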
  9. Conclusion. I have illustrated some of the advantages of working with notebooks, the primary ones being:
    - Repeatable analysis
    - Easy collaboration
    - Easy communication of data analysis
    More specifically, I've shown how Zeppelin can be used to:
    - Read and write data
    - Visualise data
    - Build models
    I hope I have inspired you to investigate Zeppelin further. Thank you.