Slide 1

Slide 1 text

GoDataDriven PROUDLY PART OF THE XEBIA GROUP Real time data driven applications Giovanni Lanzani Data Whisperer Using Python + pandas as back end

Slide 2

Slide 2 text

Who am I?

Slide 3

Slide 3 text

Who am I? 2008-2012: PhD Theoretical Physics 2012-2013: KPMG 2013-Now: GoDataDriven

Slide 4

Slide 4 text

Feedback @gglanzani

Slide 5

Slide 5 text

Real-time, data driven app? •No store and retrieve; •Store, {transform, enrich, analyse} and retrieve; •Real-time: retrieve is not a batch process; •App: something your mother could use: SELECT attendees ! FROM!pydataberlin2014 ! WHERE password = '1234';

Slide 6

Slide 6 text

Get insight about event impact

Slide 7

Slide 7 text

Get insight about event impact

Slide 8

Slide 8 text

Get insight about event impact

Slide 9

Slide 9 text

Get insight about event impact

Slide 10

Slide 10 text

Get insight about event impact

Slide 11

Slide 11 text

Is it Big Data?

Slide 12

Slide 12 text

Is it Big Data? •Raw logs are in the order of 40TB; •We use Hadoop for storing, enriching and pre- processing; •(10 nodes, 24TB per nodes)

Slide 13

Slide 13 text

Challenges 1. Privacy; 2. Huge pile of data; 3. Real-time retrieval; 4. Some real-time analysis.

Slide 14

Slide 14 text

1. Privacy

Slide 15

Slide 15 text

1. Privacy

Slide 16

Slide 16 text

3. Real-time retrieval •Harder than it looks; •Large data; •Retrieval is by giving date, center location + radius.

Slide 17

Slide 17 text

4. (Some) real-time analysis

Slide 18

Slide 18 text

Architecture AngularJS app.py helper.py REST Front-end Back-end JSON

Slide 19

Slide 19 text

JS - 1

Slide 20

Slide 20 text

JS - 2

Slide 21

Slide 21 text

Flask from flask import Flask! app = Flask(__name__)! ! @app.route("/hello")! def hello():! return "Hello World!"! ! if __name__ == "__main__":! app.run()

Slide 22

Slide 22 text

Flask from flask import Flask! app = Flask(__name__)! ! @app.route("/hello")! def hello():! return "Hello World!"! ! if __name__ == "__main__":! app.run()

Slide 23

Slide 23 text

app.py example @app.route('/api///', methods=['GET'])! @app.route('/api//', methods=['GET'])! def datapoints(postcode, date, radius=1.0):! ...! stats, timeline, points = helper.get_json(postcode, date, radius)! return … # returns a JSON object for AngularJS

Slide 24

Slide 24 text

data example date hour id_activity postcode hits delta sbi 2013-01-01 12 1234 1234AB 35 22 1 2013-01-08 12 1234 1234AB 45 35 1 2013-01-01 11 2345 5555ZB 2 1 2 2013-01-08 11 2345 5555ZB 55 2 2

Slide 25

Slide 25 text

helper.py example def get_json(postcode, date, radius):! ...! ! lat, lon = get_lat_lon(postcode)! postcodes = get_postcodes(postcode, radius)! ! data = get_data(postcodes, dates)! ! stats = get_statistics(data, sbi)! timeline = get_timeline(data, sbi)! ! return stats, timeline, data.to_json(orient='records')

Slide 26

Slide 26 text

helper.py example def get_json(postcode, date, radius):! ...! ! lat, lon = get_lat_lon(postcode)! postcodes = get_postcodes(postcode, radius)! ! data = get_data(postcodes, dates)! ! stats = get_statistics(data, sbi)! timeline = get_timeline(data, sbi)! ! return stats, timeline, data.to_json(orient='records')

Slide 27

Slide 27 text

helper.py example def get_statistics(data, sbi):! sbi_df = data[data.sbi == sbi] # filter by sbi! ! hits = sbi_df.hits.sum() # sum the hits ! delta_hits = sbi_df.delta.sum() # sum the delta hits! ! if delta_hits:! percentage = (hits - delta_hits) / delta_hits! else:! percentage = 0! ! return {"sbi": sbi, "total": hits, "percentage": percentage}

Slide 28

Slide 28 text

helper.py example def get_statistics(data, sbi):! sbi_df = data[data.sbi == sbi] # filter by sbi! ! hits = sbi_df.hits.sum() # sum the hits ! delta_hits = sbi_df.delta.sum() # sum the delta hits! ! if delta_hits:! percentage = (hits - delta_hits) / delta_hits! else:! percentage = 0! ! return {"sbi": sbi, "total": hits, "percentage": percentage}

Slide 29

Slide 29 text

helper.py example def get_timeline(data, sbi):! df_sbi = data.groupby([“date”, “hour", "sbi"]).! aggregate(sum)! return df_sbi

Slide 30

Slide 30 text

helper.py example def get_timeline(data, sbi):! df_sbi = data.groupby([“date”, “hour", "sbi"]).! aggregate(sum)! return df_sbi

Slide 31

Slide 31 text

helper.py example def get_json(postcode, date, radius):! ...! ! lat, lon = get_lat_lon(postcode)! postcodes = get_postcodes(postcode, radius)! dates = date.split(';')! ! data = get_data(postcodes, dates)! ! stats = get_statistics(data)! timeline = get_timeline(data, dates)! ! return stats, timeline, data.to_json(orient='records')

Slide 32

Slide 32 text

Who has my data?

Slide 33

Slide 33 text

Who has my data? •First iteration was a (pre)-POC, less data (3GB vs 500GB); •Time constraints; •Oeps:

Slide 34

Slide 34 text

Who has my data? •First iteration was a (pre)-POC, less data (3GB vs 500GB); •Time constraints; •Oeps: import pandas as pd! ...! source_data = pd.read_csv("data.csv", …)! ...! def get_data(postcodes, dates):! result = filter_data(source_data, postcodes, dates)! return result

Slide 35

Slide 35 text

Advantage of “everything is a df” Pro: •Fast!! •Use what you know •NO DBA’s! •We all love CSV’s! ! ! !

Slide 36

Slide 36 text

Advantage of “everything is a df” Pro: •Fast!! •Use what you know •NO DBA’s! •We all love CSV’s! ! ! ! Contra: •Doesn’t scale; •Huge startup time; •NO DBA’s! •We all hate CSV’s!

Slide 37

Slide 37 text

If you want to go down this path •Set the dataframe index wisely; •Align the data to the index: ! •Beware of modifications of the original dataframe! source_data.sort_index(inplace=True)

Slide 38

Slide 38 text

If you want to go down this path

Slide 39

Slide 39 text

If you want to go down this path

Slide 40

Slide 40 text

If you want to go down this path

Slide 41

Slide 41 text

If you want to go down this path The reason pandas is faster is because I came up with a better algorithm

Slide 42

Slide 42 text

If you don’t… data = get_data(postcodes, dates) AngularJS app.py helper.py REST Front-end Back-end JSON

Slide 43

Slide 43 text

If you don’t… data = get_data(postcodes, dates) database.py Data psycopg2 AngularJS app.py helper.py REST Front-end Back-end JSON

Slide 44

Slide 44 text

If you don’t… data = get_data(db, postcodes, dates) database.py Data psycopg2 AngularJS app.py helper.py REST Front-end Back-end JSON

Slide 45

Slide 45 text

Handling geo-data def get_json(postcode, date, radius):! """! ...! """! lat, lon = get_lat_lon(postcode)! postcodes = get_postcodes(postcode, radius)! dates = date.split(';')! ! data = get_data(postcodes, dates)! ! stats = get_statistics(data)! timeline = get_timeline(data, dates)! ! return stats, timeline, data.to_json(orient='records')

Slide 46

Slide 46 text

Issues?! •With a radius of 10km, in Amsterdam, you get 10k postcodes. You need to do this in your SQL: ! ! ! •Index on date and postcode, but single queries running more than 20 minutes. SELECT * FROM datapoints ! WHERE ! date IN date_array ! ! ! AND ! ! ! ! postcode IN postcode_array;

Slide 47

Slide 47 text

Postgres + Postgis (2.x) PostGIS is a spatial database extender for PostgreSQL. Supports geographic objects allowing location queries SELECT *! FROM datapoints! WHERE ST_DWithin(lon, lat, 1500)! AND dates IN ('2013-02-30', '2013-02-31');! -- every point within 1.5km ! -- from (lat, lon) on imaginary dates

Slide 48

Slide 48 text

Other db’s?

Slide 49

Slide 49 text

Steps to solve it 1. Align data on disk by date; 2. Use the temporary table trick: ! ! ! ! 3. Lose precision: 1234AB→1234 4. (Compression) CREATE TEMPORARY TABLE tmp (postcodes STRING NOT NULL PRIMARY KEY);! INSERT INTO tmp (postcodes) VALUES postcode_array;! ! SELECT * FROM tmp! JOIN datapoints d! ON d.postcode = tmp.postcodes! WHERE! d.dt IN dates_array;

Slide 50

Slide 50 text

GoDataDriven We’re hiring / Questions? / Thank you! @gglanzani [email protected] Giovanni Lanzani Data Whisperer