Bitdeli at PyData 2013

Ville Tuulos

April 11, 2013

Transcript

  1. Everybody (Click & Play); Business Analysts (Excel); IT / DBAs (SQL, Python); Data Hackers (MapReduce); People who implement their own infrastructure
  2. Everybody (Click & Play); Business Analysts (Excel); IT / DBAs (SQL, Python); Data Hackers (MapReduce); People who implement their own infrastructure; Disco
  3. Everybody (Click & Play); Business Analysts (Excel); IT / DBAs (SQL, Python); Data Hackers (MapReduce); People who implement their own infrastructure
  4. Python is great; MapReduce is hard; Servers are annoying (cloud or not); Everybody likes real-time; Support healthy workflows
  5. What makes some users very active? How to reduce churn? Why do some users return? (Three charts, for Customer A, Customer B, and Customer C, each labeled Daily Activity and Users.)
  6. Simple Complex Discover Explore Infographics Basic Statistics Reports Query Segments

    Funnels Slice & Dice Descriptive Models Visualizations
  7. Simple Complex Discover Explore Infographics Basic Statistics Reports Query Segments

    Funnels Clustering Slice & Dice Descriptive Models Visualizations Predictive Models
  8. DiscoDB: a persistent, immutable, compressed, lightning-fast key-value(s) mapping that supports lazy Boolean queries. Code: https://github.com/discoproject/discodb Docs: http://discoproject.org/doc/discodb/
  9.     from discodb import DiscoDB

         FILES = ['a.txt', 'b.txt', 'c.txt']

         def extract_words():
             for fname in FILES:
                 with open(fname) as f:
                     for word in f.read().split():
                         yield word, fname

         db = DiscoDB(extract_words())

         db['dog']            # files that mention 'dog'
         db.keys()            # all distinct words
         db.unique_values()   # all distinct filenames
         db.items()           # all (word, iter(fname)) pairs
  10. DiscoDB Chunk. Hash Map: hash(Key) → Key ID; Value Map: Key ID → [Value ID, ...]; Keys: Key ID → Key; Values: Value ID → Value
  11. DiscoDB Chunk. Hash Map: hash(Key) → Key ID; Value Map: Key ID → [Value ID, ...]; Keys: Key ID → Key; Values: Value ID → Value. Perfect hashing by CMPH, guaranteed O(1). The list of Value IDs is delta-encoded. Values are compressed with a global Huffman codebook.
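The delta-encoding idea mentioned on this slide can be illustrated with a minimal sketch. It is a generic illustration only: the slide does not describe DiscoDB's actual on-disk layout, and the function names `delta_encode` and `delta_decode` are hypothetical. The point is that a sorted list of IDs turns into a list of small gaps, which are cheaper to store than the full IDs.

```python
def delta_encode(value_ids):
    """Turn a collection of integer IDs into a list of gaps between sorted IDs."""
    ids = sorted(set(value_ids))
    if not ids:
        return []
    # The first ID is kept as-is; every later entry is the gap to its predecessor.
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]


def delta_decode(deltas):
    """Rebuild the sorted ID list by taking a running sum of the gaps."""
    ids, total = [], 0
    for gap in deltas:
        total += gap
        ids.append(total)
    return ids


# Hypothetical Value IDs for one key.
encoded = delta_encode([1000, 1003, 1004, 1010])  # [1000, 3, 1, 6]
decoded = delta_decode(encoded)                   # [1000, 1003, 1004, 1010]
```

After the first entry, the encoded list holds only small numbers, which is what makes a subsequent variable-length or bit-packed encoding effective.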
  12. (Diagram: Node 1, Node 2, ..., Node N. Each is a Disco Node with a Python Worker; the nodes share DDFS, which holds the DiscoDB Chunks.)
  13. DiscoDB: Querying with Conjunctive Normal Form. from discodb.query import Q. Data: A → [Apple, Orange, Banana]; B → [Apple, Banana]; C → [Banana, Melon]. Queries: Q("A & B") → Apple, Banana; Q("A | B") → Apple, Orange, Banana; Q("(A | B) & C") → Banana
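To make the semantics of a conjunctive-normal-form query concrete, here is a small pure-Python stand-in that uses plain dicts and sets rather than the discodb library. The helper `eval_cnf` and the list-of-lists query representation are assumptions made for illustration, not DiscoDB's API, and the sketch evaluates eagerly whereas DiscoDB is described as lazy.

```python
# The key-to-values mapping from the slide, held as plain Python sets.
index = {
    "A": {"Apple", "Orange", "Banana"},
    "B": {"Apple", "Banana"},
    "C": {"Banana", "Melon"},
}


def eval_cnf(index, clauses):
    """Evaluate a CNF query given as a list of clauses, each a list of keys.

    Keys inside one clause are OR-ed (set union);
    the clauses themselves are AND-ed (set intersection).
    """
    result = None
    for clause in clauses:
        # Union of the value sets of every key in this clause.
        matched = set().union(*(index.get(key, set()) for key in clause))
        result = matched if result is None else result & matched
    return result if result is not None else set()


a_and_b = eval_cnf(index, [["A"], ["B"]])            # == {"Apple", "Banana"}
a_or_b = eval_cnf(index, [["A", "B"]])               # == {"Apple", "Orange", "Banana"}
a_or_b_and_c = eval_cnf(index, [["A", "B"], ["C"]])  # == {"Banana"}
```

The third call corresponds to the CNF query (A | B) & C: one OR-clause intersected with a single-key clause.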
  14. Funnel. Model: Event → Users. Query (sequence of events): Q("Event A & Event B & ..."). https://github.com/tuulos/bd3-mixpanel-funnel
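The funnel model above (each event maps to the set of users who triggered it) can be sketched with plain sets. This is an illustration under stated assumptions, not code from the linked repository: the event names, user IDs, and the `funnel` helper are all hypothetical. Like a plain AND query, it checks only that a user has every event so far, not the order in which the events occurred.

```python
# Hypothetical Event -> Users mapping.
events = {
    "signup": {"u1", "u2", "u3", "u4"},
    "add_to_cart": {"u1", "u2", "u3"},
    "purchase": {"u2", "u3", "u5"},
}


def funnel(events, steps):
    """Return (step, user count) pairs for users who have every step so far."""
    remaining = None
    counts = []
    for step in steps:
        users = events.get(step, set())
        # AND the current step with all previous ones.
        remaining = set(users) if remaining is None else remaining & users
        counts.append((step, len(remaining)))
    return counts


result = funnel(events, ["signup", "add_to_cart", "purchase"])
# [("signup", 4), ("add_to_cart", 3), ("purchase", 2)]
```

User u5 has a purchase but no signup, so the intersection drops that user at the last step.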
  15. Cohort Analysis. Model: Day N → Users. Query (weekly cohorts): Q("(dayN | dayN+1) & (dayM | dayM+1...)"). https://github.com/tuulos/bd3-mixpanel-cohort
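The cohort query has the same CNF shape: OR over the days of one window, AND across windows. The sketch below is a pure-Python illustration, not code from the linked repository. The `day<N>` key naming, the sample data, and the helpers `cnf_string`, `active_in_window`, and `retained` are assumptions.

```python
# Hypothetical Day N -> Users mapping.
daily_users = {
    0: {"u1", "u2"},
    1: {"u3"},
    7: {"u1"},
    8: {"u3", "u4"},
}


def cnf_string(cohort_start, later_start, length):
    """Build a query string of the form (dayN | dayN+1) & (dayM | dayM+1)."""
    windows = [
        range(cohort_start, cohort_start + length),
        range(later_start, later_start + length),
    ]
    return " & ".join(
        "(" + " | ".join(f"day{d}" for d in window) + ")" for window in windows
    )


def active_in_window(daily_users, first_day, length):
    """Union of users seen on any day in [first_day, first_day + length)."""
    users = set()
    for day in range(first_day, first_day + length):
        users |= daily_users.get(day, set())
    return users


def retained(daily_users, cohort_start, later_start, length=7):
    """Return (cohort, members of the cohort seen again in the later window)."""
    cohort = active_in_window(daily_users, cohort_start, length)
    returned = cohort & active_in_window(daily_users, later_start, length)
    return cohort, returned


query = cnf_string(0, 7, 2)  # "(day0 | day1) & (day7 | day8)"
cohort, returned = retained(daily_users, 0, 7, length=2)
# cohort == {"u1", "u2", "u3"}, returned == {"u1", "u3"}
```

A two-day window keeps the example short; a weekly cohort would use `length=7`.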
  16. Time Series. Model: Day N → Users. Query (one time series): [Q(Day K) for K in range(start, end)]. https://github.com/tuulos/bd3-mixpanel-trends
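The time-series query is one lookup per day. As a rough pure-Python analogue (not code from the linked repository; the sample data and the `time_series` helper are hypothetical), counting the users returned for each day gives one point of the series:

```python
# Hypothetical Day N -> Users mapping; day 2 has no recorded activity.
daily_users = {
    0: {"u1", "u2"},
    1: {"u2"},
    3: {"u1", "u2", "u3"},
}


def time_series(daily_users, start, end):
    """Return (day, active user count) for every day in [start, end)."""
    return [(day, len(daily_users.get(day, set()))) for day in range(start, end)]


series = time_series(daily_users, 0, 4)
# [(0, 2), (1, 1), (2, 0), (3, 3)]
```

Days without data show up as zero rather than being skipped, so the series has no gaps.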