
Real-time, data-driven applications (and SQL vs NoSQL databases)

lanzani
November 22, 2014


These are the slides of the talk I gave at NoSQL Barcelona (November 2014)


Transcript

  1. GoDataDriven, proudly part of the Xebia Group. Real-time, data-driven
     applications, and SQL vs NoSQL databases. Giovanni Lanzani, Data Whisperer.
  2. Real-time, data-driven app?

     • No store and retrieve;
     • Store, {transform, enrich, analyse} and retrieve;
     • Real-time: retrieve is not a batch process;
     • App: something your mother could use:

       SELECT attendees FROM NoSQLMatters WHERE password = '1234';
  3. Is it Big Data? "Everybody talks about it. Nobody knows how to do it.
     Everyone thinks everyone else is doing it, so everyone claims they're
     doing it." (Dan Ariely)
  4. Is it Big Data?

     • Raw logs are in the order of 40 TB;
     • We use Hadoop for storing, enriching and pre-processing.
  5. Real-Time Retrieval

     • Harder than it looks;
     • Large data;
     • Retrieval is by giving date, center location + radius.
  6. Data Example

     date        hour  id_activity  postcode  hits  delta  sbi
     2013-01-01  12    1234         1234AB    35    22     1
     2013-01-08  12    1234         1234AB    45    35     1
     2013-01-01  11    2345         5555ZB    2     1      2
     2013-01-08  11    2345         5555ZB    55    2      2
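The table above can be reconstructed as a small pandas DataFrame to follow along with the later helper functions (a sketch; column names are taken from the slide's header row):

```python
import pandas as pd

# Rows from the "Data Example" slide; the meaning of "delta" and "sbi"
# is not spelled out on the slide, so they are kept as opaque columns.
data = pd.DataFrame(
    [
        ("2013-01-01", 12, 1234, "1234AB", 35, 22, 1),
        ("2013-01-08", 12, 1234, "1234AB", 45, 35, 1),
        ("2013-01-01", 11, 2345, "5555ZB", 2, 1, 2),
        ("2013-01-08", 11, 2345, "5555ZB", 55, 2, 2),
    ],
    columns=["date", "hour", "id_activity", "postcode", "hits", "delta", "sbi"],
)
print(data.shape)  # (4, 7)
```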
  7. Data Example (the same table as the previous slide, repeated)
  8. helper.py example

     def get_statistics(data, sbi):
         sbi_df = data[data.sbi == sbi]       # select * from data where sbi = sbi
         hits = sbi_df.hits.sum()             # select sum(hits) from ...
         delta_hits = sbi_df.delta.sum()      # select sum(delta) from ...
         if delta_hits:
             percentage = (hits - delta_hits) / delta_hits
         else:
             percentage = 0
         return {"sbi": sbi, "total": hits, "percentage": percentage}
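To see the function in action, here is a runnable restatement with a toy DataFrame (the toy rows are made up for illustration; they are not the production data):

```python
import pandas as pd

def get_statistics(data, sbi):
    """Pure-pandas equivalents of simple SQL aggregates (from the slide)."""
    sbi_df = data[data.sbi == sbi]        # select * from data where sbi = sbi
    hits = sbi_df.hits.sum()              # select sum(hits) from ...
    delta_hits = sbi_df.delta.sum()       # select sum(delta) from ...
    if delta_hits:
        percentage = (hits - delta_hits) / delta_hits
    else:
        percentage = 0
    return {"sbi": sbi, "total": hits, "percentage": percentage}

# Hypothetical rows, just to exercise the function.
toy = pd.DataFrame({"sbi": [1, 1, 2], "hits": [35, 45, 2], "delta": [22, 35, 1]})
stats = get_statistics(toy, 1)
print(stats["total"])  # 80
```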
  9. helper.py example

     def get_timeline(data, sbi):
         df_sbi = data.groupby(["date", "hour", "sbi"]).aggregate(sum)
         # select sum(hits), sum(delta) from data group by date, hour, sbi
         return df_sbi
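A minimal sketch of what that groupby translates to (sample rows are invented; `.sum()` on the selected columns is the modern spelling of `.aggregate(sum)`):

```python
import pandas as pd

data = pd.DataFrame({
    "date":  ["2013-01-01", "2013-01-01", "2013-01-08"],
    "hour":  [12, 12, 12],
    "sbi":   [1, 1, 1],
    "hits":  [35, 10, 45],
    "delta": [22, 5, 35],
})

# select sum(hits), sum(delta) from data group by date, hour, sbi
timeline = data.groupby(["date", "hour", "sbi"])[["hits", "delta"]].sum()
print(timeline)
```

The result is a DataFrame indexed by (date, hour, sbi), one row per group.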
  10. Who has my data?

      • First iteration was a (pre-)POC, with less data (3 GB vs 500 GB);
      • Time constraints;
      • Oops: everything is a pandas df!
  11. Advantage of "everything is a df"

      Pro:
      • Fast!!
      • Use what you know;
      • No DBAs!
      • We all love CSVs!

      Contra:
      • Doesn't scale;
      • Huge startup time;
      • No DBAs!
      • We all hate CSVs!
  12. If you want to go down this path

      • Set the dataframe index wisely;
      • Align the data to the index:

        source_data.sort_index(inplace=True)

      • Beware of modifications of the original dataframe!
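A sketch of what those bullets mean in practice (toy data; the point is that a sorted index makes label-based lookups fast, and that `inplace=True` mutates the original dataframe):

```python
import pandas as pd

source_data = pd.DataFrame(
    {"postcode": ["5555ZB", "1234AB", "1234AB"], "hits": [2, 35, 45]}
).set_index("postcode")  # choose the index you will query on

# Sorting the index enables fast label-based slicing on it.
# Note: inplace=True modifies source_data itself, not a copy.
source_data.sort_index(inplace=True)

rows = source_data.loc["1234AB"]  # all rows for that postcode
print(len(rows))  # 2
```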
  13. If you want to go down this path

      "The reason pandas is faster is because I came up with a better
      algorithm."
  14. Issues?!

      • With a radius of 10 km, in Amsterdam, you get 10k postcodes. You need
        to do this in your SQL:

        SELECT * FROM datapoints
        WHERE date IN date_array
        AND postcode IN postcode_array;

      • Index on date and postcode, but single queries still ran for more
        than 20 minutes.
  15. Postgres + PostGIS (2.x)

      PostGIS is a spatial database extender for PostgreSQL. It supports
      geographic objects, allowing location queries:

      SELECT * FROM datapoints
      WHERE ST_DWithin(lon, lat, 1500)            -- every point within 1.5 km
      AND dates IN ('2013-02-30', '2013-02-31');  -- from (lat, lon), on imaginary dates
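PostGIS evaluates that radius filter inside the database. Outside Postgres, the same "every point within 1.5 km" check can be sketched with a haversine great-circle distance in plain Python (the coordinates below are made up; this is an approximation, not a PostGIS replacement):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))  # mean Earth radius ~6371 km

center = (52.370, 4.895)                      # roughly central Amsterdam
points = [(52.372, 4.900), (52.520, 13.405)]  # a nearby point, and Berlin

within = [p for p in points if haversine_m(*center, *p) <= 1500]
print(within)  # only the nearby point survives the 1.5 km filter
```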
  16. How we solved it

      1. Align data on disk by date;
      2. Use the temporary-table trick:

         CREATE TEMPORARY TABLE tmp (postcodes STRING NOT NULL PRIMARY KEY);
         INSERT INTO tmp (postcodes) VALUES postcode_array;
         SELECT * FROM tmp
         JOIN datapoints d ON d.postcode = tmp.postcodes
         WHERE d.dt IN dates_array;

      3. Lose precision: 1234AB → 1234.
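The temporary-table trick can be demonstrated end to end with Python's built-in sqlite3 as a stand-in for Postgres (table contents and the single-postcode list are invented; the point is joining against a temp table instead of a huge `IN (...)` list):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE datapoints (dt TEXT, postcode TEXT, hits INT)")
conn.executemany(
    "INSERT INTO datapoints VALUES (?, ?, ?)",
    [("2013-01-01", "1234", 35), ("2013-01-01", "5555", 2),
     ("2013-01-08", "1234", 45)],
)

# Instead of "postcode IN (<10k literals>)", load the postcodes into a
# temporary table and JOIN against it; the primary key gives an index.
conn.execute("CREATE TEMPORARY TABLE tmp (postcodes TEXT NOT NULL PRIMARY KEY)")
conn.executemany("INSERT INTO tmp (postcodes) VALUES (?)", [("1234",)])

rows = conn.execute(
    "SELECT d.* FROM tmp "
    "JOIN datapoints d ON d.postcode = tmp.postcodes "
    "WHERE d.dt IN ('2013-01-01', '2013-01-08')"
).fetchall()
print(rows)  # only the rows for postcode 1234
```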
  17. Take-home messages

      1. Geospatial problems are "hard" and can kill your queries;
      2. Not everybody has infinite resources: be smart and KISS!
      3. SQL or NoSQL? (Size, schema)