Ivory - An Introduction

Ambiata
July 11, 2014

An introduction to the Ivory feature store.

Transcript

  1. Ivory
    http://github.com/ambiata/ivory
    © Ambiata 2014

  2. Ivory
    A data store for features

  3. Data set A
    Data set B
    Data set C
    Data set D
    Data set E
    Feature Eng 1
    Feature Eng 2
    Feature Eng 3
    Train Model 1
    Score
    Train Model 2
    Score
    Train Model 3
    Score
    Modelling in silos - models built and deployed in isolation
    Feature engineering is in a silo - no reuse between model builds

  4. Data set A
    Data set B
    Data set C
    Data set D
    Data set E
    Feature Eng A
    Train Model 1
    Score
    Train Model 2
    Score
    Train Model 3
    Score
    Modelling in silos - models built and deployed in isolation
    Feature Store (Ivory)
    Feature Eng B
    Feature Eng C
    Feature Eng D
    Feature Eng E
    Feature engineering is done once
    Features are reused across model builds

  5. Motivations
    • Features for model training + scoring
    • Scalability - lots of features
    • Extensible feature sets
    • Feature sparseness
    • Feature integrity in a messy “big data” world
    • Recovering from operational errors

  6. Key ideas
    • A sparse fact data model
    • Specific entity views - snapshots + chords
    • No moving parts - just files + tools
    • Immutability built-in

  7. “Fact” data model

  8. customer-1
    sav.balance
    634 @ 2014-02-01
    single “fact”
    Fact: Entity - Attribute - Value - Time
    The value of a feature (attribute) for a given entity, known to be valid from a certain point in time.
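
The Entity - Attribute - Value - Time shape above can be sketched as a small record type. This is an illustrative model only, not Ivory's own representation; the `Fact` class and its field names are assumptions made for the example, and the values are taken from the slides.

```python
from datetime import date
from typing import NamedTuple, Union

# Hypothetical value type: the slides show strings and numbers.
Value = Union[str, int, float]


class Fact(NamedTuple):
    """One fact: the value of an attribute for an entity, valid from `time`."""
    entity: str
    attribute: str
    value: Value
    time: date


# The single fact shown on the slide.
fact = Fact("customer-1", "sav.balance", 634, date(2014, 2, 1))

# A sparse data set is just a collection of facts: an entity that has no
# value for an attribute simply has no fact for it.
facts = [
    fact,
    Fact("customer-3", "gender", "F", date(2007, 4, 1)),
    Fact("customer-4", "postcode", "3001", date(2011, 9, 14)),
]
```

Storing one record per known value, rather than one wide row per entity, is what lets the model stay sparse as attributes are added.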

  9. customer-1
    sav.balance
    634 @ 2014-02-01
    scalable
    customer-2
    customer-3
    customer-4
    469 @ 2014-02-01
    276 @ 2014-04-01
    1966 @ 2014-03-01

  10. customer-2
    customer-3
    customer-4
    customer-1
    gender sav.balance cc.purchases postcode
    634 @ 2014-02-01
    extensible
    469 @ 2014-02-01
    276 @ 2014-04-01
    1966 @ 2014-03-01
    ‘M’ @ 2012-01-01 3 @ 2014-03-27 ‘4670’ @ 2009-05-13

  11. 736 @ 2014-01-01
    3 @ 2014-02-19
    184 @ 2014-02-01
    312 @ 2014-03-01
    customer-1
    customer-2
    customer-3
    customer-4
    gender sav.balance cc.purchases postcode
    ‘M’ @ 2012-01-01 276 @ 2014-04-01
    4 @ 2014-04-04
    2 @ 2014-03-12
    3 @ 2014-03-27
    ‘2381’ @ 2004-08-19
    ‘4670’ @ 2009-05-13
    ‘F’ @ 2007-04-01
    ‘3001’ @ 2011-09-14
    1876 @ 2014-02-01
    1966 @ 2014-03-01
    634 @ 2014-02-01
    Sparse
    469 @ 2014-02-01

  12. Entity views

  13. Snapshot views
    • Attribute values for entities at a point in time
    • Same time for all entities
    • Select latest attribute values with respect to time
    • Used in preparing instances for scoring
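
The snapshot rule above (one date for all entities, latest value at or before that date) can be sketched as follows. This is a minimal in-memory illustration, not how Ivory computes snapshots; `Fact` and `snapshot` are hypothetical names, and treating the snapshot date as inclusive is an assumption based on the example slides.

```python
from datetime import date
from typing import Dict, Iterable, NamedTuple, Tuple, Union

Value = Union[str, int, float]


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: Value
    time: date


def snapshot(facts: Iterable[Fact], at: date) -> Dict[Tuple[str, str], Value]:
    """Latest value per (entity, attribute) among facts with time <= at."""
    latest: Dict[Tuple[str, str], Fact] = {}
    for f in facts:
        if f.time > at:
            continue  # not yet known at the snapshot date
        key = (f.entity, f.attribute)
        # Keep the most recent fact; on equal times the first one seen wins.
        if key not in latest or f.time > latest[key].time:
            latest[key] = f
    return {key: f.value for key, f in latest.items()}


facts = [
    Fact("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    Fact("customer-3", "gender", "F", date(2007, 4, 1)),
]

view = snapshot(facts, date(2014, 3, 1))
```

Here `view` holds 312 for customer-2's `sav.balance`, while a snapshot taken at 2014-02-15 would hold 184.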

  14. 736 @ 2014-01-01
    6 @ 2014-02-19
    184 @ 2014-02-01
    312 @ 2014-03-01
    customer-1
    customer-2
    customer-3
    customer-4
    gender sav.balance cc.purchases postcode
    ‘M’ @ 2012-01-01 276 @ 2014-04-01
    4 @ 2014-02-04
    2 @ 2014-03-12
    3 @ 2014-03-27
    ‘2381’ @ 2004-08-19
    ‘4670’ @ 2009-05-13
    ‘F’ @ 2007-04-01
    ‘3001’ @ 2011-09-14
    1876 @ 2014-02-01
    1966 @ 2014-03-01
    634 @ 2014-02-01
    snapshot @ 2014-03-01
    469 @ 2014-02-01

  15. customer-1
    customer-2
    customer-3
    customer-4
    gender sav.balance cc.purchases postcode
    ‘M’ 312 4 ‘4670’
    ‘F’
    ‘3001’
    1966
    634
    469

  16. Chord views
    • Attribute values for entities at a point in time
    • Different times for different entities
    • Select latest attribute values with respect to time
    • Used in preparing instances for training
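
A chord differs from a snapshot only in that each entity carries its own date. The sketch below illustrates that rule in memory; it is not Ivory's implementation. `Fact` and `chord` are hypothetical names, the cut-off is assumed to be inclusive, and entities without a chord date are assumed to be left out of the view.

```python
from datetime import date
from typing import Dict, Iterable, Mapping, NamedTuple, Tuple, Union

Value = Union[str, int, float]


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: Value
    time: date


def chord(facts: Iterable[Fact],
          times: Mapping[str, date]) -> Dict[Tuple[str, str], Value]:
    """Latest value per (entity, attribute) as of each entity's own date."""
    latest: Dict[Tuple[str, str], Fact] = {}
    for f in facts:
        at = times.get(f.entity)
        if at is None or f.time > at:
            continue  # entity not requested, or fact is after its date
        key = (f.entity, f.attribute)
        if key not in latest or f.time > latest[key].time:
            latest[key] = f
    return {key: f.value for key, f in latest.items()}


facts = [
    Fact("customer-1", "sav.balance", 634, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 312, date(2014, 3, 1)),
    Fact("customer-4", "postcode", "3001", date(2011, 3, 14)),
]

# One date per entity, as in the chord example on the slides.
view = chord(facts, {
    "customer-2": date(2014, 3, 1),
    "customer-4": date(2014, 1, 1),
})
```

Because each training instance is extracted as of its own date, later facts for that entity cannot leak into the training features.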

  17. 736 @ 2014-01-01
    6 @ 2014-02-19
    184 @ 2014-02-01
    312 @ 2014-03-01
    customer-1
    customer-2
    customer-3
    customer-4
    gender sav.balance cc.purchases postcode
    ‘M’ @ 2012-01-01 276 @ 2014-04-01
    4 @ 2014-04-04
    2 @ 2014-03-12
    3 @ 2014-03-27
    ‘2381’ @ 2004-08-19
    ‘4670’ @ 2009-05-13
    ‘F’ @ 2007-04-01
    ‘3001’ @ 2011-09-14
    1876 @ 2014-02-01
    1966 @ 2014-03-01
    634 @ 2014-02-01
    469 @ 2014-02-01
    customer-2 @ 2014-03-01
    customer-4 @ 2014-01-01

  18. customer-2
    @ 2014-03-01
    customer-4
    @ 2014-01-01
    gender sav.balance cc.purchases postcode
    ‘M’ 312 6 ‘4670’
    ‘3001’
    1876

  19. Ivory concepts

  20. Repository
    • A single class of “entity” (e.g. customer)
    • Single on-disk hierarchy (e.g. HDFS)
    • Stores facts
    • Stores dictionary of fact attributes
    • Tools to get data in and out (e.g. MapReduce)
    • Versioned

  21. - - -
    hdfs:///prod/ivory/customers
    single on-disk hierarchy

  22. Dictionary
    • Description and metadata of all attributes that can be ascribed to facts in the repository
    • Versioned

  23. namespace | name | encoding | type | description
    demographics | gender | string | categorical | Gender
    demographics | postcode | string | categorical | Post-code, zip-code
    accounts | sav.balance | double | numerical | Balance of savings account
    accounts | cc.purchases | int | numerical | Number of credit-card purchases
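
One way to picture what a dictionary is for: it tells a reader how to interpret the raw value of each fact. The sketch below holds the four entries from the table in memory and uses the `encoding` column to parse a raw string. It is an assumption-laden illustration, not Ivory's dictionary format or API; `Definition`, `DICTIONARY`, `ENCODINGS` and `parse_value` are hypothetical names, and the mapping of encodings to Python types is a guess.

```python
from typing import NamedTuple, Union


class Definition(NamedTuple):
    encoding: str
    type: str
    description: str


# The entries shown in the table, keyed by (namespace, name).
DICTIONARY = {
    ("demographics", "gender"): Definition("string", "categorical", "Gender"),
    ("demographics", "postcode"): Definition("string", "categorical", "Post-code, zip-code"),
    ("accounts", "sav.balance"): Definition("double", "numerical", "Balance of savings account"),
    ("accounts", "cc.purchases"): Definition("int", "numerical", "Number of credit-card purchases"),
}

# Assumed mapping from encoding names to Python constructors.
ENCODINGS = {"string": str, "int": int, "double": float}


def parse_value(namespace: str, name: str, raw: str) -> Union[str, int, float]:
    """Parse a raw value using the attribute's declared encoding.

    Raises KeyError for an attribute missing from the dictionary and
    ValueError for a value that does not fit the encoding.
    """
    definition = DICTIONARY[(namespace, name)]
    return ENCODINGS[definition.encoding](raw)
```

For example, `parse_value("accounts", "cc.purchases", "4")` yields the integer 4, while a fact for an attribute that is not in the dictionary is rejected.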

  24. - - -
    hdfs:///prod/ivory/customers
    dict
    import dictionary

  25. 0 0 -
    hdfs:///prod/ivory/customers
    dict[0]
    dictionary version
    repository version

  26. 0 0 -
    hdfs:///prod/ivory/customers
    dict[0]
    dict
    update dictionary

  27. 1 1 -
    hdfs:///prod/ivory/customers
    dict[1] dict[0]
    latest dictionary version

  28. Factsets
    • A collection of facts
    • Treated as a single atomic unit
    • The unit of inclusion (ingestion)
    • The unit of exclusion

  29. customer-1 accounts:sav.balance 634 2014-02-01
    customer-2 accounts:sav.balance 184 2014-02-01
    customer-2 accounts:sav.balance 312 2014-03-01
    customer-2 accounts:cc.purchases 4 2014-02-04
    customer-3 demographics:gender F 2007-04-01
    customer-4 demographics:postcode 3001 2011-03-14
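
The rows above read as entity, `namespace:name`, value and date. As a sketch, lines in that layout can be parsed into records as below. The whitespace-separated layout is taken from the slide and is not claimed to be Ivory's actual ingestion format; `Fact` and `parse_fact` are hypothetical names, and values containing spaces are not handled.

```python
from datetime import date
from typing import List, NamedTuple


class Fact(NamedTuple):
    entity: str
    namespace: str
    name: str
    value: str  # left as raw text; the dictionary decides how to decode it
    time: date


def parse_fact(line: str) -> Fact:
    """Parse 'entity namespace:name value yyyy-mm-dd' into a Fact."""
    entity, attribute, value, day = line.split()
    namespace, name = attribute.split(":", 1)
    return Fact(entity, namespace, name, value, date.fromisoformat(day))


lines = [
    "customer-1 accounts:sav.balance 634 2014-02-01",
    "customer-2 accounts:cc.purchases 4 2014-02-04",
    "customer-3 demographics:gender F 2007-04-01",
]

# A factset is handled as one unit, so parse all lines or fail as a whole.
factset: List[Fact] = [parse_fact(line) for line in lines]
```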

  30. 1 1 -
    hdfs:///prod/ivory/customers
    dict[1]
    factset
    dict[0]
    ingest factset

  31. 1 1 -
    hdfs:///prod/ivory/customers
    dict[1]
    factset-a
    dict[0]

  32. Stores
    • A prioritised sequence of factsets
    • Versioned

  33. 2 1 0
    hdfs:///prod/ivory/customers
    dict[1]
    store[0]
    { a }
    factset-a
    dict[0]
    store version
    store references factset-a

  34. 2 1 0
    hdfs:///prod/ivory/customers
    dict[1]
    store[0]
    { a }
    factset-a
    dict[0]
    factset
    ingest factset

  35. 3 1 1
    hdfs:///prod/ivory/customers
    dict[1]
    store[1]
    { b, a }
    factset-b
    dict[0]
    factset-a
    store[0]
    { a }

  36. 3 1 1
    hdfs:///prod/ivory/customers
    dict[1]
    store[1]
    { b, a }
    factset-b
    dict[0]
    factset-a
    store[0]
    { a }
    factset
    ingest factset

  37. 4 1 2
    hdfs:///prod/ivory/customers
    dict[1]
    store[2]
    { c, b, a }
    factset-c
    dict[0]
    factset-b
    store[1]
    { b, a }
    factset-a
    store[0]
    { a }

  38. 4 1 2
    hdfs:///prod/ivory/customers
    dict[1]
    store[2]
    { c, b, a }
    dict[0]
    factset-b
    store[1]
    { b, a }
    factset-a
    store[0]
    { a }
    factset-c
    remove factset-b

  39. 5 1 3
    hdfs:///prod/ivory/customers
    dict[1]
    store[3]
    { c, a }
    factset-b factset-a
    store[2]
    { c, b, a }
    store[1]
    { b, a }
    store[0]
    { a }
    dict[0]
    factset-c
    factset-b removed
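
The sequence of diagrams from the dictionary import to the removal of factset-b can be replayed with a small immutable model. This is only a sketch of the versioning idea: it reads the three counters on each slide as repository, dictionary and store version (an interpretation, since only some slides label them), and `Repository`, `set_dictionary`, `ingest`, `remove` and `versions` are hypothetical names, not Ivory commands.

```python
from typing import NamedTuple, Optional, Tuple

Store = Tuple[str, ...]  # factset names, highest priority first


class Repository(NamedTuple):
    version: Optional[int] = None          # None plays the role of "-"
    dictionaries: Tuple[dict, ...] = ()
    stores: Tuple[Store, ...] = ()


def _next(version: Optional[int]) -> int:
    return 0 if version is None else version + 1


def set_dictionary(repo: Repository, dictionary: dict) -> Repository:
    """Import or update the dictionary; older versions are kept."""
    return repo._replace(version=_next(repo.version),
                         dictionaries=repo.dictionaries + (dictionary,))


def ingest(repo: Repository, factset: str) -> Repository:
    """Add a factset by creating a new store with it at the front."""
    latest = repo.stores[-1] if repo.stores else ()
    return repo._replace(version=_next(repo.version),
                         stores=repo.stores + ((factset,) + latest,))


def remove(repo: Repository, factset: str) -> Repository:
    """Exclude a factset by creating a new store without it."""
    latest = repo.stores[-1] if repo.stores else ()
    kept = tuple(name for name in latest if name != factset)
    return repo._replace(version=_next(repo.version),
                         stores=repo.stores + (kept,))


def versions(repo: Repository) -> Tuple[Optional[int], Optional[int], Optional[int]]:
    """(repository, dictionary, store) versions, None where nothing exists."""
    def last(items: tuple) -> Optional[int]:
        return len(items) - 1 if items else None
    return (repo.version, last(repo.dictionaries), last(repo.stores))


repo = Repository()
repo = set_dictionary(repo, {"gender": "string"})
repo = set_dictionary(repo, {"gender": "string", "postcode": "string"})
for name in ("a", "b", "c"):
    repo = ingest(repo, name)
repo = remove(repo, "b")
```

Nothing is overwritten in this model: removing a factset only adds a store that no longer references it, so earlier stores can still be read.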

  40. Internals

  41. • On-disk hierarchy
    • On-disk EAVT representation incl. compression
    • Internal snapshots

  42. User experience
    • Consistent command line tooling
    • Version metadata
    • Workflows
    • End-to-end example system

  43. Extensible dictionary
    • Support for rich attribute types (e.g. structs, arrays)
    • Arbitrary attribute metadata
    • Specification of valid attribute values
    • e.g. ‘M’ and ‘F’ only for gender
    • Improved validation
    • Improved on-disk representation
    • Useful for downstream applications, e.g. plots

  44. Lazy feature generation
    • Lazily generate features derived from existing facts on extract (chord/snapshot)
    • Derived “meta” features (i.e. ‘select’)
    • Windowing functions (e.g. “average over last 3 months”)
    • Row-level features
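
As an illustration of a windowing function such as "average over last 3 months", the sketch below derives a value from stored facts at extraction time instead of storing it. It is a guess at the idea, not a description of Ivory: `Fact` and `windowed_mean` are hypothetical names, three months is approximated as 90 days, and the mean is taken over the facts in the window without any time weighting.

```python
from datetime import date, timedelta
from typing import Iterable, NamedTuple, Optional


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: float
    time: date


def windowed_mean(facts: Iterable[Fact], entity: str, attribute: str,
                  at: date, window_days: int = 90) -> Optional[float]:
    """Mean of an attribute's values in the window (at - window_days, at]."""
    start = at - timedelta(days=window_days)
    values = [f.value for f in facts
              if f.entity == entity
              and f.attribute == attribute
              and start < f.time <= at]
    # No facts in the window means the derived feature is absent (sparse).
    return sum(values) / len(values) if values else None


facts = [
    Fact("customer-2", "sav.balance", 184, date(2014, 2, 1)),
    Fact("customer-2", "sav.balance", 312, date(2014, 3, 1)),
]

average = windowed_mean(facts, "customer-2", "sav.balance", date(2014, 3, 1))
```

With the two balances above, `average` is 248.0; the same call for an entity with no facts in the window returns `None`.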

  45. Data set A
    Data set B
    Data set C
    Data set D
    Data set E
    Train Model 1
    Score
    Train Model 2
    Score
    Train Model 3
    Score
    Modelling in silos - models built and deployed in isolation
    Feature Store (Ivory)
    Feature engineering is integrated and lazily generated on extraction
    Source data loaded directly
    Feature Eng (Ivory)

  46. Repository forking
    • Low-cost cloning/forking of repositories:
    • “master” production repo
    • “experimental” cloned repo
    • Allow a data scientist to join production features with their own without affecting production operations

  47. Other filesystems
    • Support for repository metadata and fact sets to be on different file systems
    • Support HDFS, S3 and POSIX
