$30 off During Our Annual Pro Sale. View Details »

Real time data driven applications (and SQL vs NoSQL databases)

lanzani
September 04, 2014

Real time data driven applications (and SQL vs NoSQL databases)

These are the slides of my No-SQL Matters talk I gave in Dublin in September 2014.

I cover what real time data driven applications are, present one of the app build for one of GoDataDriven customers, what challenges arose and what database helped us achieve the level of performance we wanted.

lanzani

September 04, 2014
Tweet

More Decks by lanzani

Other Decks in Technology

Transcript

  1. GoDataDriven
    PROUDLY PART OF THE XEBIA GROUP
    Real time data driven applications
    Giovanni Lanzani



    Data Whisperer
    and SQL vs NoSQL databases

    View Slide

  2. Feedback
    @gglanzani

    View Slide

  3. Real-time, data driven app?
    •No store and retrieve;

    •Store, {transform, enrich, analyse} and retrieve;

    •Real-time: retrieve is not a batch process;

    •App: something your mother could use:

    SELECT attendees
    FROM NoSQLMatters
    WHERE password = '1234';

    View Slide

  4. Get insight about event impact

    View Slide

  5. Get insight about event impact

    View Slide

  6. Get insight about event impact

    View Slide

  7. Get insight about event impact

    View Slide

  8. Get insight about event impact

    View Slide

  9. Challenges
    1. Big Data

    2. Privacy;

    3. Some real-time analysis;

    4. Real-time retrieval.

    View Slide

  10. Is it Big Data?

    View Slide

  11. Is it Big Data?
    Everybody talks about it

    Nobody knows how to do it

    Everyone thinks everyone else is doing it, so everyone
    claims they’re doing it…

    Dan Ariely

    View Slide

  12. 2. Privacy

    View Slide

  13. 2. Privacy

    View Slide

  14. 3. (Some) real-time analysis

    View Slide

  15. •Harder than it looks;

    •Large data;

    •Retrieval is by giving date, center location +
    radius.
    4. Real-Time Retrieval

    View Slide

  16. AngularJS python app
    REST
    Front-end Back-end
    JSON
    Architecture

    View Slide

  17. JS-1

    View Slide

  18. JS-2

    View Slide

  19. date hour id_activity postcode hits delta sbi
    2013-01-01 12 1234 1234AB 35 22 1
    2013-01-08 12 1234 1234AB 45 35 1
    2013-01-01 11 2345 5555ZB 2 1 2
    2013-01-08 11 2345 5555ZB 55 2 2
    Data Example

    View Slide

  20. Who has my data?

    View Slide

  21. Who has my data?
    •First iteration was a (pre)-POC, less data (3GB vs
    500GB);

    •Time constraints;

    •Oeps: everything is a pandas df!

    View Slide

  22. Advantage of “everything is a df”
    Pro:

    •Fast!!

    •Use what you know

    •NO DBA’s!

    •We all love CSV’s!

    View Slide

  23. Advantage of “everything is a df”
    Pro:

    •Fast!!

    •Use what you know

    •NO DBA’s!

    •We all love CSV’s!

    Contra:

    •Doesn’t scale;

    •Huge startup time;

    •NO DBA’s!

    •We all hate CSV’s!

    View Slide

  24. AngularJS python app
    REST
    Front-end Back-end Database
    JSON
    ?
    If you don’t

    View Slide

  25. Issues?!
    •With a radius of 10km, in Amsterdam, you get
    10k postcodes. You need to do this in your SQL:

    !
    !
    !
    •Index on date and postcode, but single queries
    running more than 20 minutes.
    SELECT * FROM datapoints
    WHERE
    date IN date_array
    AND
    postcode IN postcode_array;

    View Slide

  26. PostGIS is a spatial database extender for PostgreSQL.
    Supports geographic objects allowing location queries:

    SELECT *
    FROM datapoints
    WHERE ST_DWithin(lon, lat, 1500)
    AND dates IN ('2013-02-30', '2013-02-31');
    -- every point within 1.5km
    -- from (lat, lon) on imaginary dates
    Postgres + Postgis (2.x)

    View Slide

  27. Other db’s?

    View Slide

  28. How we solved it
    1. Align data on disk by date;

    2. Use the temporary table trick:

    !
    !
    !
    !
    3. Lose precision: 1234AB→1234
    CREATE TEMPORARY TABLE tmp (postcodes STRING NOT NULL
    PRIMARY KEY);
    INSERT INTO tmp (postcodes) VALUES postcode_array;
    !
    SELECT * FROM tmp
    JOIN datapoints d
    ON d.postcode = tmp.postcodes
    WHERE
    d.dt IN dates_array;

    View Slide

  29. Take home messages
    1. Geospatial problems are hard and queries can be
    really slow;

    2. Not everybody has infinite resources: be smart
    and KISS!

    3. SQL or NoSQL? (Size, schema)

    View Slide

  30. GoDataDriven
    We’re hiring / Questions? / Thank you!
    @gglanzani

    [email protected]
    Giovanni Lanzani

    Data Whisperer

    View Slide