

Pycon ZA
October 11, 2018

From Idea to Product: Customer Profiling in Apache Zeppelin with PySpark by Sarah Sprich

Zeppelin is a web-based notebook which enables interactive data analytics on big data. Data can easily be ingested from a variety of databases, and analysis can be performed in Python and PySpark. Visualisations can be built and displayed together with the code, using Zeppelin's built-in tool Helium, or Python-specific tools such as Matplotlib and Bokeh. The web-based interface facilitates easy sharing of results and collaboration on projects.

Developing in Zeppelin has changed the way we approach model development. We are able to take a project from an idea to a product all within one tool, using the following process:

1. Come up with an idea. Write some notes in a Zeppelin notebook describing how we would like the idea implemented.
2. Slowly start fleshing out the idea with real code until the solution is built. This stage is great to demo, as the code is in bite-sized chunks and visualisations can be added directly alongside it.
3. Take the code into production. It can be scheduled to run directly in Zeppelin with the built-in cron scheduler, or from a tool such as NiFi. Interactive visualisations can be embedded in a web-based frontend.

This talk is aimed at data scientists, particularly those working with big data. We will demonstrate how we have built a catalogue of subscriber attributes based on customer mobile usage and purchase behaviour using Zeppelin and PySpark. These attributes can be used to profile subscribers, and are the starting point for individualised customer engagement. Anyone who attends this talk will get an introduction to Zeppelin and PySpark and an overview of what can be achieved with these tools.


Transcript

  1. Pycon: From Idea to Product: Building Models in Apache Zeppelin with PySpark.
    Introduction to Zeppelin. This is Zeppelin. Welcome. It's a notebook, much like Jupyter. It is part of Apache, and of the Hortonworks stack (Hive, Spark, Yarn, Kafka etc., all managed by Ambari).
    Contents: 1. Introduction 2. The Notebook Way of Working 3. Zeppelin vs Jupyter 4. Reading and Writing Data with Zeppelin 5. Visualising Data with Zeppelin 6. Using Zeppelin to build Models
  2. Setup. "Must be used before SparkInterpreter (%spark) initialized. Hint: put this paragraph before any Spark code and restart Zeppelin/Interpreter."
    Introduction. About Digitata. About Me:
    - I've worked in the modelling team at Digitata for the last 5 years.
    - Before that, I worked as a MATLAB consultant.
    - I trained as an electrical engineer.
    Technologies: In today's talk, I'll be using the following technologies:
    - Zeppelin
    - Python (including numpy and pandas)
    - PySpark and Spark MLlib
    - Matplotlib and Bokeh
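    The setup hint at the top of this slide is the message Zeppelin shows for configuration paragraphs that have to run before the Spark interpreter starts. As a minimal sketch (the property names and values below are illustrative assumptions, and the %spark.conf paragraph assumes a reasonably recent Zeppelin), such a setup paragraph might look like:

        %spark.conf
        # Illustrative Spark settings; run this paragraph before any %pyspark code,
        # and restart the interpreter if Spark has already been initialised.
        spark.executor.memory 4g
        spark.executor.cores 2
        spark.app.name customer-profiling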
  3. The Notebook Way of Working. Using Zeppelin has completely changed the way that our team works. Our old analysis workflow used to look something like this, and it has now been simplified (workflow diagrams shown on the slide).
    We are using notebooks in three ways:
    1. Personal Environment: each of our team members has a VM with a Zeppelin instance running, to do ad-hoc analysis on small datasets or prototype analysis on data samples.
    2. Test Environment: we have a test cluster with the Hortonworks stack installed. We use the Zeppelin instance here to collaborate on work, and to run larger analyses.
    3. Production Environment: we have Zeppelin installed on customer production servers. Here we can run analysis on large datasets, at the source of the data.
  4. Zeppelin vs Jupyter. Here's a comparison, with a focus on why we chose Zeppelin.
    Reading and Writing Data with Zeppelin. We typically use the following databases:
  5. Postgres and Postgres-XL: used to store Call Data Record (CDR), Event Data Record (EDR) and network statistics data which we collect from our customers, the network operators.
    MongoDB: used to store data required in the real-time components of our solution.
    Hive: used at our larger customers to store CDR and EDR data.
    Although interpreters can be created explicitly for these sources, we typically use the following two approaches to read from and write to them:
    - Python libraries such as psycopg2 and pymongo (for 'small' data)
    - PySpark (for 'big' data)
    Write SQL query. Read data from Postgres using PySpark. Fetched 168124 rows of data.
    Display the data... uh... no: the raw df.show() output is an unreadable wall of text across the 22 columns (msisdn, firstcall, lastcall, totalcallduration, chargedcallduration, zerochargedcallduration, chargedcallamt, voiceactivedays, firstsms, lastsms, totalsmscnt, chargedsmscnt, zerochargedsmscnt, chargedsmsamt, smsactivedays, firstdata, lastdata, totaldatavolume, chargeddatavolume, zerochargeddatavolume, chargeddataamt, dataactivedays).
    Display the data... much better: Zeppelin renders the same DataFrame in its built-in, scrollable table view (output truncated at 102400 bytes).
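    As a rough sketch of the read-and-display step above (hostnames, credentials and table names are placeholders, not the ones used in the talk), a %pyspark paragraph combining the two approaches could look like this:

        %pyspark
        import pandas as pd
        import psycopg2

        # 'Small' data: pull a sample straight into pandas via psycopg2.
        conn = psycopg2.connect(host="db.example.com", dbname="cdr",
                                user="analyst", password="...")
        sample = pd.read_sql("SELECT * FROM subscriber_attributes LIMIT 1000", conn)
        conn.close()

        # 'Big' data: let Spark read the table in parallel over JDBC
        # (assumes the Postgres JDBC driver is on the Spark classpath).
        df = (spark.read.format("jdbc")
              .option("url", "jdbc:postgresql://db.example.com:5432/cdr")
              .option("dbtable", "subscriber_attributes")
              .option("user", "analyst")
              .option("password", "...")
              .option("driver", "org.postgresql.Driver")
              .load())
        print("Fetched %d rows of data" % df.count())

        # df.show() prints an unreadable wall of text for wide tables;
        # z.show(df) renders it in Zeppelin's interactive table view instead.
        z.show(df)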
  6. Visualising Data with Zeppelin. A big advantage of working with notebooks is that data analysis code can be displayed side by side with visualisations of the results. With libraries such as Bokeh, insightful, interactive and beautiful data visualisations can be built. I'll be showing some visualisations using:
    - Helium (built in to Zeppelin)
    - Matplotlib
    - Bokeh
    Helium: display an attribute distribution chart (Select Attribute: totalcallduration).
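    A minimal sketch of how such a paragraph can be driven by a dynamic form (the DataFrame df and the attribute list are assumptions carried over from the earlier read, not the exact code from the talk):

        %pyspark
        # Zeppelin dynamic form: let the viewer pick which attribute to look at.
        attrs = ["totalcallduration", "totalsmscnt", "totaldatavolume"]
        attr = z.select("Select Attribute", [(a, a) for a in attrs], "totalcallduration")

        # Hand the single column to Zeppelin's display system; the distribution
        # chart itself is then chosen in the paragraph's chart settings.
        z.show(df.select(attr))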
  7. (Helium distribution chart for totalcallduration shown on the slide; axis tick values omitted.)
    Matplotlib: scatter plot comparing two attributes (Select Attribute for x-axis: totalcallduration; Select Attribute for y-axis: totaldatavolume), with both fields log-scaled (a sketch of such a paragraph follows below).
    Define a box-whisker Bokeh function.
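    A sketch of the scatter-plot paragraph mentioned above (df, the sampling fraction and the log transform are assumptions based on the slide, not the talk's exact code):

        %pyspark
        import numpy as np
        import matplotlib.pyplot as plt

        x_attr, y_attr = "totalcallduration", "totaldatavolume"

        # Sample the large Spark DataFrame down to something pandas/matplotlib can hold.
        pdf = df.select(x_attr, y_attr).sample(False, 0.01, seed=42).toPandas()

        plt.figure(figsize=(8, 5))
        plt.scatter(np.log2(pdf[x_attr].astype(float) + 1),
                    np.log2(pdf[y_attr].astype(float) + 1),
                    s=2, alpha=0.3)
        plt.xlabel(x_attr + " (log2)")
        plt.ylabel(y_attr + " (log2)")
        plt.show()  # recent Zeppelin renders this inline; older versions pass the plot to z.show()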
  8. Bokeh: box-and-whisker plot comparing the distribution of similar attributes (rendered with BokehJS, https://bokeh.pydata.org).
    Using Zeppelin to build Models. Once you've got a good data set together, this is almost too easy. Many tools could be used to build models, but I'll demo K-means clustering using the Spark MLlib implementation.
    Prepare data for clustering: applied log scaling to skewed features; features scaled to range [0.000000, 1.000000].
    Build a K-means model: training a k-means model, making predictions on feature data.
    Cluster centers (features: totalcallduration, chargedcallamt, voiceactivedays, totalsmscnt, chargedsmsamt):
    Cluster 0: 0.48  0.65  0.58  0.04  0.27
    Cluster 1: 0.44  0.60  0.48  0.07  0.26
    Cluster 2: 0.22  0.42  0.14  0.02  0.06
    Cluster 3: 0.51  0.65  0.69  0.43  0.58
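    As a rough sketch of those modelling steps (column names are taken from the slides; the log scaling, MinMaxScaler and k=4 are illustrative assumptions), the MLlib pipeline could look like this:

        %pyspark
        from pyspark.sql import functions as F
        from pyspark.ml.feature import VectorAssembler, MinMaxScaler
        from pyspark.ml.clustering import KMeans

        feature_cols = ["totalcallduration", "chargedcallamt", "voiceactivedays",
                        "totalsmscnt", "chargedsmsamt"]

        # Log-scale the skewed usage features, keeping the subscriber key so
        # cluster assignments can be joined back later.
        logged = df.select("msisdn",
                           *[F.log1p(F.col(c)).alias(c) for c in feature_cols])

        # Assemble a feature vector and scale every feature to [0, 1].
        assembled = VectorAssembler(inputCols=feature_cols,
                                    outputCol="raw_features").transform(logged)
        scaled = (MinMaxScaler(inputCol="raw_features", outputCol="features")
                  .fit(assembled).transform(assembled))

        # Train a k-means model and assign each subscriber to a cluster.
        kmeans = KMeans(k=4, seed=1, featuresCol="features", predictionCol="cluster")
        model = kmeans.fit(scaled)
        predictions = model.transform(scaled)

        for i, centre in enumerate(model.clusterCenters()):
            print("Cluster %d: %s" % (i, ["%.2f" % v for v in centre]))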
  9. Conclusion. I have illustrated some of the advantages of working with notebooks, the primary ones being:
    - Repeatable analysis
    - Easy collaboration
    - Easy communication of data analysis
    More specifically, I've shown how Zeppelin can be used to:
    - Read and write data
    - Visualise data
    - Build models
    I hope I have inspired you to investigate Zeppelin further. Thank you.