Treading Water in a Stream of Data

A talk given at Big Ruby 2013 on some fundamental concepts of data acquisition.

Jeremy Hinegardner

March 01, 2013

Transcript

  1. Treading Water In a Stream of Data
    Jeremy Hinegardner
    @copiousfreetime
    [email protected]
    Monday, March 4, 13

  2. Data Junkie

  3. Survey

  4. Streaming?

  5. Wikipedia Says ...
    "A sequence of data elements made available over time. ... allows items to be
    processed one at a time rather than in large batches."
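
The contrast in the quote above, handling items one at a time rather than in large batches, can be sketched in a few lines of Python. This is an illustration only and does not come from the talk; `event_source`, `process_batch`, and `process_stream` are hypothetical names, and the events are made-up dictionaries.

```python
from typing import Iterable, Iterator


def event_source(n: int) -> Iterator[dict]:
    """Hypothetical source: yields events one by one, as if they arrived over time."""
    for i in range(n):
        yield {"id": i, "size": i * 10}


def process_batch(events: Iterable[dict]) -> int:
    """Batch style: collect everything first, then process the whole list."""
    collected = list(events)  # the entire data set is held in memory
    return sum(e["size"] for e in collected)


def process_stream(events: Iterable[dict]) -> Iterator[int]:
    """Streaming style: handle each item on arrival, keeping only a running total."""
    total = 0
    for event in events:
        total += event["size"]
        yield total  # a result is available after every item


batch_total = process_batch(event_source(4))
running_totals = list(process_stream(event_source(4)))
```

Both styles arrive at the same final total. The streaming version never holds more than one event at a time and has an intermediate answer after each item.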

  6. Big Data == Streaming?

  7. Big Data

  8. Wikipedia Says ...
    "A collection of data sets so large and complex that it becomes difficult to
    process using on-hand database management tools or traditional data
    processing applications."

  9. A LOT of Data

  10. Heading towards you FAST!

  11. All of it needs to be processed

  12. Keep it around forever

  13. Copious’s Definition
    An amount of data and the processing of it that makes you feel uncomfortable.

  14. Wikipedia Says ...
    "A collection of data sets so large and complex that it becomes difficult to
    process using on-hand database management tools or traditional data
    processing applications."

  15. This

  16. Here
    This

  17. Here
    This
    That

  18. Here
    There
    This
    That

  19. Here
    There
    This
    That
    Other
    +

  20. Here
    There
    This
    That
    Every
    Where
    Other
    +

  21. Here
    There
    This
    That
    Every
    Where
    Other
    +
    ⬇ $

  22. First things First

  23. Get This Data

  24. Or ... Getting the “Sequence of Data Elements”

  25. Polling

  26. Notification / Web Hook

  27. Payload

  28. Push

  29. Poll vs. Notify vs. Payload vs. Push
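
The slides name these acquisition styles without defining them, so the sketch below is one possible reading rather than the speaker's own definitions: polling means the consumer asks on its own schedule, a notification or web hook only signals that something is new and the consumer then fetches it, and a payload delivers the item itself in the callback. Push, read here as a long-lived connection over which the source streams items, is not modeled. `FakeSource` and every method on it are hypothetical and stand in for a remote service.

```python
from typing import Callable, List


class FakeSource:
    """Hypothetical in-memory stand-in for a remote data source."""

    def __init__(self) -> None:
        self.events: List[dict] = []
        self.notify_hooks: List[Callable[[int], None]] = []
        self.payload_hooks: List[Callable[[dict], None]] = []

    def publish(self, event: dict) -> None:
        self.events.append(event)
        index = len(self.events) - 1
        for hook in self.notify_hooks:
            hook(index)  # notification: only says which item is new
        for hook in self.payload_hooks:
            hook(event)  # payload: the item itself is delivered

    def fetch(self, index: int) -> dict:
        return self.events[index]

    def fetch_since(self, cursor: int) -> List[dict]:
        return self.events[cursor:]


source = FakeSource()
via_notify: List[dict] = []
via_payload: List[dict] = []

# Notification / web hook: told that something changed, then fetch it.
source.notify_hooks.append(lambda index: via_notify.append(source.fetch(index)))
# Payload: the data arrives in the callback, so no second request is needed.
source.payload_hooks.append(via_payload.append)

source.publish({"type": "watch", "repo": "a"})
source.publish({"type": "fork", "repo": "b"})

# Polling: the consumer asks when it chooses to, remembering a cursor.
cursor = 0
via_poll = source.fetch_since(cursor)
cursor += len(via_poll)
```

All three consumers end up with the same two events. What differs is who initiates the exchange and how many round trips each item costs.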

  30. My Ideal

  31. GitHub Events

  32. GitHub Archive

  33. Store This Data

  34. Pre-Storage Processing?

  35. Physical Location

  36. Hadoop

  37. Avro

  38. Why all this trouble?

  39. Fundamental Truth

  40. Future Discovery

  41. Paranoia

  42. https://github.com/copiousfreetime/ghent

  43. Thanks!
    Jeremy Hinegardner
    @copiousfreetime
    [email protected]

  44. What is Old is New Again
    Bonus Track!!

  45. [slide without text]

  46. "Nearly every large dataset has unanticipated value within it."
    "Ultimately you can't discover interesting things with your data unless
    you can ask arbitrary questions of it"
    - Big Data, Nathan Marz (2013)
    "The granular data found in the data warehouse is the key to reusability,
    because it can be used by many people in different ways"
    "But perhaps the largest benefit of a data warehouse foundation is that
    future unknown requirements can be accommodated"
    - Building the Data Warehouse, 3rd Edition, W. H. Inmon (2002)

  47. Real Time ETL
    “... refers to software that moves data asynchronously into a data warehouse
    with some urgency -- within minutes of the execution of the business
    transaction”
    - The Data Warehouse ETL Toolkit, Ralph Kimball (2004)

  48. Big Data
    ALL NEW
    ETL and Data Warehousing