2016 - Maxime Beauchemin - Caravel - A data visualization, exploration and dashboarding platform

PyBay
August 21, 2016

Description
Airbnb developed Caravel to give all employees interactive access to data with minimal friction. Caravel's main goal is to make it easy to slice, dice and visualize data. It empowers everyone to perform analytics at the speed of thought.

Abstract
Topics include:
* Intuitively visualizing datasets while filtering, pivoting, and changing views
* Creating and sharing simple dashboards
* Caravel's rich set of visualizations
* Caravel's extensible, high-granularity security/permission model allowing intricate rules on who can access individual features and the dataset
* Caravel's enterprise-ready authentication, which integrates with major authentication providers (database, OpenID, LDAP, OAuth, and REMOTE_USER through Flask AppBuilder)
* Caravel's simple semantic layer, allowing users to control how data sources are displayed in the UI by defining which fields should show up in which drop-down and which aggregation and function metrics are made available to the user
* Caravel’s deep integration with Druid
* Caravel’s integration with most RDBMS through SQLAlchemy
* How JavaScript/Node/D3/React can cohabit and work well alongside Python/PyPI/Flask

Bio
Maxime Beauchemin works at Airbnb as part of the Data Tools team, developing open source products that reduce friction and help generate insight from data. He is the creator and a leading maintainer of Apache Airflow [incubating] (a workflow engine) and Caravel (a data visualization platform). Before Airbnb, Maxime worked at Facebook on computation frameworks around engagement and growth analytics, at Yahoo! on social properties analytics, and at Ubisoft as a data warehouse architect.

https://youtu.be/Wt1xH41gXhs

Transcript

  1. * Data is too strategic to depend on vendors!
     * Tableau doesn’t support Presto & Druid
     * Tableau extracts don’t scale well
     * Buying means lock-in and increasing costs
     * We need deep integration with our stack
     * We’re builders not buyers!
  2. Flask App Builder
     * Authentication / permission / role management
     * CRUD!
     * Babel (translation framework)
     * Bootstrap / dynamic navbar / font-awesome
  3. Flask [App Builder] vs Django
     Pros
     * Airbnb was already using Flask / SQLAlchemy
     * FAB’s CRUD doesn’t require an Admin, all models ship with a `show` permission
     * FAB has a lean codebase, it’s easy to contribute to it
     Cons
     * Few guarantees on quality / security / support
     * A lot less features than Django has to grow into
     * Not much of a community (yet)
  4. Pandas
     * Crafting the right JSON for the visualization:
       * pivot_table
       * groupby
       * multi-sorts
       * …
     * Time series transforms:
       * rolling functions
       * resampling
       * period shifts
       * period ratios
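
As a rough illustration of the transforms listed on this slide, the sketch below pivots a small frame, applies a few time series transforms, and shapes the result into key/values JSON of the kind nvd3-style charts commonly take. The booking data, the column names and the `to_series` helper are assumptions made for the example, not Caravel's code.

```python
import pandas as pd

# Hypothetical daily bookings per city (illustrative data only).
df = pd.DataFrame({
    "ds": pd.to_datetime([
        "2016-08-01", "2016-08-01",
        "2016-08-02", "2016-08-02",
        "2016-08-03", "2016-08-03",
    ]),
    "city": ["SF", "NYC", "SF", "NYC", "SF", "NYC"],
    "bookings": [10, 20, 12, 18, 15, 30],
})

# Pivot: one row per day, one column per city.
pivoted = df.pivot_table(
    index="ds", columns="city", values="bookings", aggfunc="sum"
)

# Time series transforms named on the slide.
rolling_mean = pivoted.rolling(window=2).mean()  # rolling function
resampled = pivoted.resample("2D").sum()         # resampling into 2-day bins
shifted = pivoted.shift(1)                       # period shift
ratio = pivoted / shifted                        # period-over-period ratio


def to_series(frame):
    """Turn a pivoted frame into a list of {key, values} dicts (hypothetical helper)."""
    return [
        {
            "key": col,
            "values": [
                {"x": ts.isoformat(), "y": float(v)}
                for ts, v in frame[col].dropna().items()
            ],
        }
        for col in frame.columns
    ]


payload = to_series(pivoted)
```

Calling `json.dumps(payload)` would then give a string that a frontend chart could consume; the first row of `rolling_mean` and `ratio` is NaN because there is no earlier period to compare against.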
  5. Package and distribute
     * setuptools
     * nose / coverage (tests)
     * alembic (db migrations)
     * requires.io (dependency tracking)
     * coveralls.io (coverage reporting)
     * landscape.io (code quality reporting)
     * sphinx (documentation)
     * pypi (`pip install caravel`)
  6. The Frontend Stack
     * JavaScript frontend
     * npm / ES6 / webpack / React
     * d3.js!
     * nvd3.org
  7. Security
     * Provided by Flask AppBuilder (python web framework)
     * Easily integrate with: OpenID, LDAP, REMOTE_USER, OAUTH, or use the builtin database
     * Ships with 3 roles:
       * Admin (all access)
       * Alpha (all access but cannot alter permissions)
       * Gamma (per-datasource / table access)
     * Fine grain controls to create new roles
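
The three built-in roles can be pictured with a small, self-contained sketch. It does not use Flask AppBuilder's API; the `ROLE_RULES` table and both helper functions are hypothetical names that only encode the behaviour described on the slide.

```python
# Hypothetical encoding of the three roles described on the slide.
# This is an illustration, not Flask AppBuilder's or Caravel's data model.
ROLE_RULES = {
    "Admin": {"all_datasources": True, "alter_permissions": True},
    "Alpha": {"all_datasources": True, "alter_permissions": False},
    "Gamma": {"all_datasources": False, "alter_permissions": False},
}


def can_access_datasource(role, datasource, grants=frozenset()):
    """Admin and Alpha see everything; Gamma needs an explicit per-datasource grant."""
    rules = ROLE_RULES[role]
    return rules["all_datasources"] or datasource in grants


def can_alter_permissions(role):
    """Only Admin may change who is allowed to do what."""
    return ROLE_RULES[role]["alter_permissions"]
```

For instance, `can_access_datasource("Gamma", "bookings", grants={"bookings"})` is true, while the same call without the grant is false. A custom role would amount to another entry in the table with its own combination of flags and grants.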
  8. A thin Semantic Layer
     * Verbose names and long descriptions for columns and metrics
     * Add calculated fields and metrics as SQL expression
     * Set how individual columns are exposed
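
One way to picture calculated fields and metrics defined as SQL expressions is the minimal sketch below, which uses the standard library's `sqlite3` rather than Caravel itself. The `METRICS` and `CALCULATED_COLUMNS` dictionaries, the `build_query` helper and the `listings` table are all assumptions made for the example.

```python
import sqlite3

# Hypothetical semantic-layer definitions: names mapped to SQL expressions.
# The expressions are interpolated into SQL, so this sketch assumes they are
# written by trusted administrators, never taken from end-user input.
METRICS = {
    "total_bookings": "SUM(bookings)",
    "avg_price": "AVG(price)",
}
CALCULATED_COLUMNS = {
    "price_band": "CASE WHEN price >= 100 THEN 'high' ELSE 'low' END",
}


def build_query(table, groupby, metric):
    """Expand a dimension and a metric name into a GROUP BY query."""
    # A calculated column expands to its expression; a plain column is used as-is.
    dim_expr = CALCULATED_COLUMNS.get(groupby, groupby)
    metric_expr = METRICS[metric]
    return (
        f"SELECT {dim_expr} AS {groupby}, {metric_expr} AS {metric} "
        f"FROM {table} GROUP BY {dim_expr} ORDER BY {dim_expr}"
    )


# Illustrative in-memory data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (city TEXT, price REAL, bookings INTEGER)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [("SF", 50, 2), ("SF", 150, 3), ("NYC", 120, 1), ("NYC", 80, 4)],
)

rows_by_band = conn.execute(
    build_query("listings", "price_band", "total_bookings")
).fetchall()
rows_by_city = conn.execute(
    build_query("listings", "city", "total_bookings")
).fetchall()
```

The user only picks `price_band` and `total_bookings` from drop-downs; the expressions behind those names stay in one place and can be changed without touching any chart.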
  9. [Architecture diagram. Labels: Event Logs, MySQL Dumps, Gold Hive Cluster, HDFS, Spark Cluster, Airpal, Airflow Scheduling, Presto Cluster, Silver Hive Cluster, HDFS, Replication, Kafka, Sqoop, Tableau, S3, Caravel, Druid]
  10. Caching!
      * Provided by flask-cache
      * Backends: memcache, redis, filesystem, memory, …
      * cascading timeout configuration
      * UI is upfront about staleness
      * allows to force-refresh
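
The last three bullets can be sketched in plain Python without flask-cache. The precedence order (slice, then datasource, then database, then a global default), the `DEFAULT_CACHE_TIMEOUT` value and the `TimedCache` class are assumptions for illustration; the slide does not spell out how Caravel implements them.

```python
import time

DEFAULT_CACHE_TIMEOUT = 3600  # hypothetical global default, in seconds


def resolve_cache_timeout(slice_timeout=None, datasource_timeout=None,
                          database_timeout=None, default=DEFAULT_CACHE_TIMEOUT):
    """Return the most specific timeout that is set (assumed precedence order)."""
    for candidate in (slice_timeout, datasource_timeout, database_timeout):
        if candidate is not None:
            return candidate
    return default


class TimedCache:
    """In-memory stand-in for a real backend such as memcache or redis."""

    def __init__(self, clock=time.time):
        self._store = {}     # key -> (cached_at, value)
        self._clock = clock  # injectable so the behaviour can be tested

    def get_or_compute(self, key, compute, timeout, force=False):
        """Return (value, age_in_seconds); an age of 0 means freshly computed."""
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and not force and now - entry[0] < timeout:
            cached_at, value = entry
            return value, now - cached_at
        # Cache miss, expired entry, or explicit force-refresh.
        value = compute()
        self._store[key] = (now, value)
        return value, 0
```

Returning the age alongside the value is what lets a UI state how stale a chart is, and `force=True` corresponds to the force-refresh action.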
  11. What’s next?
      * Grow a community!
      * Ship “SQL Lab”
      * Ship visualizations & controls as React.js components
      * Reactify the whole app
      * DSL for the semantic layer
      * UX -> smoothen common flows
  12. Q?