Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Bitdeli at PyData 2013

Bitdeli at PyData 2013

Avatar for Ville Tuulos

Ville Tuulos

April 11, 2013
Tweet

Other Decks in Technology

Transcript

  1. Everybody (Click & Play) Business Analysts (Excel) IT / DBAs

    (SQL, Python) Data Hackers (MapReduce) People who implement their own infrastructure
  2. Everybody (Click & Play) Business Analysts (Excel) IT / DBAs

    (SQL, Python) Data Hackers (MapReduce) People who implement their own infrastructure Disco
  3. Everybody (Click & Play) Business Analysts (Excel) IT / DBAs

    (SQL, Python) Data Hackers (MapReduce) People who implement their own infrastructure
  4. Python is great MapReduce is hard Servers are annoying (cloud

    or not) Everybody likes real-time Support healthy workflows
  5. what makes some users very active? Customer C Customer B

    how to reduce churn? Customer A why some users return? Daily Activity Daily Activity Daily Activity Users Users Users
  6. Simple Complex Discover Explore Infographics Basic Statistics Reports Query Segments

    Funnels Slice & Dice Descriptive Models Visualizations
  7. Simple Complex Discover Explore Infographics Basic Statistics Reports Query Segments

    Funnels Clustering Slice & Dice Descriptive Models Visualizations Predictive Models
  8. DiscoDB persistent, immutable, compressed, lightning fast, key-value(s) mapping that supports

    lazy boolean queries. Code https://github.com/discoproject/discodb Docs http://discoproject.org/doc/discodb/
  9. from discodb import DiscoDB FILES = [‘a.txt’, ‘b.txt’, ‘c.txt’] def

    extract_words(): for fname in FILES: for word in open(fname).read().split(): yield word, fname db = DiscoDB(extract_words()) db[‘dog’] db.keys() db.unique_values() db.items() # files that mention ‘dog’ # all distinct word # all distinct filenames # all (word, iter(fname)) pairs
  10. Hash Map: hash(Key) → Key ID Value Map: Key ID

    → [Value ID, ...] Keys: Key ID → Key Values: Value ID → Value DiscoDB Chunk
  11. Hash Map: hash(Key) → Key ID Value Map: Key ID

    → [Value ID, ...] Keys: Key ID → Key Values: Value ID → Value DiscoDB Chunk Perfect hashing by CMPH, guaranteed O(1) The list of Value IDs is delta-encoded Values are compressed with a global Huffman codebook
  12. DiscoDB Chunk Node 1 Node 2 Node N Disco Node

    Python Worker DDFS Disco Node Python Worker Disco Node Python Worker DiscoDB Chunk DiscoDB Chunk DiscoDB Chunk DiscoDB Chunk DiscoDB Chunk DiscoDB Chunk DiscoDB Chunk DiscoDB Chunk
  13. A → [Apple, Orange, Banana] B → [Apple, Banana] C

    → [Banana, Melon] Q(“A & B”) Apple Banana Q(“A | B”) Apple Orange Banana Q(“(A & B) | C”) Banana DiscoDB from discodb.query import Q Querying with Conjunctive Normal Form
  14. Model: Event → Users Query (sequence of events): Q(“Event A

    & Event B & ...”) Funnel https://github.com/tuulos/bd3-mixpanel-funnel
  15. Model: Day N → Users Query (weekly cohorts): Q(“(dayN |

    dayN+1) & (dayM | dayM+1...)”) Cohort Analysis https://github.com/tuulos/bd3-mixpanel-cohort
  16. Model: Day N → Users Query (one time series): [Q(Day

    K) for K in range(start, end)] Time Series https://github.com/tuulos/bd3-mixpanel-trends