Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Real time data driven applications

E9f5d8d804cef5201511a3c1654df1f1?s=47 lanzani
July 26, 2014

Real time data driven applications

This is the talk I delivered at PyData Berlin 2014 about building real time data driven applications using Python and pandas.

Video available soon.

E9f5d8d804cef5201511a3c1654df1f1?s=128

lanzani

July 26, 2014
Tweet

Transcript

  1. GoDataDriven PROUDLY PART OF THE XEBIA GROUP Real time data

    driven applications Giovanni Lanzani Data Whisperer Using Python + pandas as back end
  2. Who am I?

  3. Who am I? 2008-2012: PhD Theoretical Physics 2012-2013: KPMG 2013-Now:

    GoDataDriven
  4. Feedback @gglanzani

  5. Real-time, data driven app? •No store and retrieve; •Store, {transform,

    enrich, analyse} and retrieve; •Real-time: retrieve is not a batch process; •App: something your mother could use: SELECT attendees ! FROM!pydataberlin2014 ! WHERE password = '1234';
  6. Get insight about event impact

  7. Get insight about event impact

  8. Get insight about event impact

  9. Get insight about event impact

  10. Get insight about event impact

  11. Is it Big Data?

  12. Is it Big Data? •Raw logs are in the order

    of 40TB; •We use Hadoop for storing, enriching and pre- processing; •(10 nodes, 24TB per nodes)
  13. Challenges 1. Privacy; 2. Huge pile of data; 3. Real-time

    retrieval; 4. Some real-time analysis.
  14. 1. Privacy

  15. 1. Privacy

  16. 3. Real-time retrieval •Harder than it looks; •Large data; •Retrieval

    is by giving date, center location + radius.
  17. 4. (Some) real-time analysis

  18. Architecture AngularJS app.py helper.py REST Front-end Back-end JSON

  19. JS - 1

  20. JS - 2

  21. Flask from flask import Flask! app = Flask(__name__)! ! @app.route("/hello")!

    def hello():! return "Hello World!"! ! if __name__ == "__main__":! app.run()
  22. Flask from flask import Flask! app = Flask(__name__)! ! @app.route("/hello")!

    def hello():! return "Hello World!"! ! if __name__ == "__main__":! app.run()
  23. app.py example @app.route('/api/<postcode>/<date>/<radius>', methods=['GET'])! @app.route('/api/<postcode>/<date>', methods=['GET'])! def datapoints(postcode, date, radius=1.0):!

    ...! stats, timeline, points = helper.get_json(postcode, date, radius)! return … # returns a JSON object for AngularJS
  24. data example date hour id_activity postcode hits delta sbi 2013-01-01

    12 1234 1234AB 35 22 1 2013-01-08 12 1234 1234AB 45 35 1 2013-01-01 11 2345 5555ZB 2 1 2 2013-01-08 11 2345 5555ZB 55 2 2
  25. helper.py example def get_json(postcode, date, radius):! ...! ! lat, lon

    = get_lat_lon(postcode)! postcodes = get_postcodes(postcode, radius)! ! data = get_data(postcodes, dates)! ! stats = get_statistics(data, sbi)! timeline = get_timeline(data, sbi)! ! return stats, timeline, data.to_json(orient='records')
  26. helper.py example def get_json(postcode, date, radius):! ...! ! lat, lon

    = get_lat_lon(postcode)! postcodes = get_postcodes(postcode, radius)! ! data = get_data(postcodes, dates)! ! stats = get_statistics(data, sbi)! timeline = get_timeline(data, sbi)! ! return stats, timeline, data.to_json(orient='records')
  27. helper.py example def get_statistics(data, sbi):! sbi_df = data[data.sbi == sbi]

    # filter by sbi! ! hits = sbi_df.hits.sum() # sum the hits ! delta_hits = sbi_df.delta.sum() # sum the delta hits! ! if delta_hits:! percentage = (hits - delta_hits) / delta_hits! else:! percentage = 0! ! return {"sbi": sbi, "total": hits, "percentage": percentage}
  28. helper.py example def get_statistics(data, sbi):! sbi_df = data[data.sbi == sbi]

    # filter by sbi! ! hits = sbi_df.hits.sum() # sum the hits ! delta_hits = sbi_df.delta.sum() # sum the delta hits! ! if delta_hits:! percentage = (hits - delta_hits) / delta_hits! else:! percentage = 0! ! return {"sbi": sbi, "total": hits, "percentage": percentage}
  29. helper.py example def get_timeline(data, sbi):! df_sbi = data.groupby([“date”, “hour", "sbi"]).!

    aggregate(sum)! return df_sbi
  30. helper.py example def get_timeline(data, sbi):! df_sbi = data.groupby([“date”, “hour", "sbi"]).!

    aggregate(sum)! return df_sbi
  31. helper.py example def get_json(postcode, date, radius):! ...! ! lat, lon

    = get_lat_lon(postcode)! postcodes = get_postcodes(postcode, radius)! dates = date.split(';')! ! data = get_data(postcodes, dates)! ! stats = get_statistics(data)! timeline = get_timeline(data, dates)! ! return stats, timeline, data.to_json(orient='records')
  32. Who has my data?

  33. Who has my data? •First iteration was a (pre)-POC, less

    data (3GB vs 500GB); •Time constraints; •Oeps:
  34. Who has my data? •First iteration was a (pre)-POC, less

    data (3GB vs 500GB); •Time constraints; •Oeps: import pandas as pd! ...! source_data = pd.read_csv("data.csv", …)! ...! def get_data(postcodes, dates):! result = filter_data(source_data, postcodes, dates)! return result
  35. Advantage of “everything is a df” Pro: •Fast!! •Use what

    you know •NO DBA’s! •We all love CSV’s! ! ! !
  36. Advantage of “everything is a df” Pro: •Fast!! •Use what

    you know •NO DBA’s! •We all love CSV’s! ! ! ! Contra: •Doesn’t scale; •Huge startup time; •NO DBA’s! •We all hate CSV’s!
  37. If you want to go down this path •Set the

    dataframe index wisely; •Align the data to the index: ! •Beware of modifications of the original dataframe! source_data.sort_index(inplace=True)
  38. If you want to go down this path

  39. If you want to go down this path

  40. If you want to go down this path

  41. If you want to go down this path The reason

    pandas is faster is because I came up with a better algorithm
  42. If you don’t… data = get_data(postcodes, dates) AngularJS app.py helper.py

    REST Front-end Back-end JSON
  43. If you don’t… data = get_data(postcodes, dates) database.py Data psycopg2

    AngularJS app.py helper.py REST Front-end Back-end JSON
  44. If you don’t… data = get_data(db, postcodes, dates) database.py Data

    psycopg2 AngularJS app.py helper.py REST Front-end Back-end JSON
  45. Handling geo-data def get_json(postcode, date, radius):! """! ...! """! lat,

    lon = get_lat_lon(postcode)! postcodes = get_postcodes(postcode, radius)! dates = date.split(';')! ! data = get_data(postcodes, dates)! ! stats = get_statistics(data)! timeline = get_timeline(data, dates)! ! return stats, timeline, data.to_json(orient='records')
  46. Issues?! •With a radius of 10km, in Amsterdam, you get

    10k postcodes. You need to do this in your SQL: ! ! ! •Index on date and postcode, but single queries running more than 20 minutes. SELECT * FROM datapoints ! WHERE ! date IN date_array ! ! ! AND ! ! ! ! postcode IN postcode_array;
  47. Postgres + Postgis (2.x) PostGIS is a spatial database extender

    for PostgreSQL. Supports geographic objects allowing location queries SELECT *! FROM datapoints! WHERE ST_DWithin(lon, lat, 1500)! AND dates IN ('2013-02-30', '2013-02-31');! -- every point within 1.5km ! -- from (lat, lon) on imaginary dates
  48. Other db’s?

  49. Steps to solve it 1. Align data on disk by

    date; 2. Use the temporary table trick: ! ! ! ! 3. Lose precision: 1234AB→1234 4. (Compression) CREATE TEMPORARY TABLE tmp (postcodes STRING NOT NULL PRIMARY KEY);! INSERT INTO tmp (postcodes) VALUES postcode_array;! ! SELECT * FROM tmp! JOIN datapoints d! ON d.postcode = tmp.postcodes! WHERE! d.dt IN dates_array;
  50. GoDataDriven We’re hiring / Questions? / Thank you! @gglanzani giovannilanzani@godatadriven.com

    Giovanni Lanzani Data Whisperer