
BI on Big Data: What Are Your Options?

Dremio
June 02, 2016

There are (too?) many options for BI on Hadoop. Some are great at exploration, some are great at OLAP, some are fast, and some are flexible. Understanding the options and how they work with Hadoop systems is a key challenge for many organizations. Tomer Shiran provides a survey of the main options, both traditional (Tableau, Qlik, etc.) and new (Platfora, Datameer, etc.).

Tomer covers the key use cases for using BI with Hadoop and NoSQL systems and discusses the types of data as well as the scale of data and ingestion/mutation rates for each. He then explains the main categories of BI on Hadoop, including general-purpose BI (Tableau, Qlik, MicroStrategy, etc.) combined with interactive SQL-on-Hadoop (Drill, Impala, Spark SQL) and on-Hadoop BI (Platfora, Datameer, Arcadia Data, etc.), highlighting the strengths and weaknesses of each category as well as the main options within the category based on the desired use case. You’ll leave with a solid understanding of the Hadoop BI landscape and an approach for structuring your own system evaluation.

Transcript

  1. BI on Big Data: What are your options?

    Tomer Shiran - @tshiran, Co-Founder & CEO, Dremio
    Strata + Hadoop World London, June 3, 2016
  2. 2 BI on Hadoop: What are your Options Dremio Company

    Background

    Jacques Nadeau, Founder & CTO
    • Recognized SQL & NoSQL expert
    • Founder of Apache Arrow & Drill
    • Previously Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)

    Tomer Shiran, Founder & CEO
    • Previously MapR (VP Product & employee #5); Microsoft; IBM Research
    • Carnegie Mellon, Technion

    Julien Le Dem, Architect
    • Founder of Apache Parquet
    • Apache Pig PMC Member
    • Previously Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)

    Top Silicon Valley VCs
    • Stealth data analytics startup
    • Founded in 2015
    • Led by experts in Big Data and open source including the creators of Apache Arrow & Apache Parquet
  3. Recent changes to the BI landscape

    Good ol’ Days
    • Only a few databases (e.g. Oracle, Teradata, SQL Server)
    • A few BI tools (MicroStrategy, Cognos)
    • Everything worked with everything
    • Things were easy!

    Modern Reality
    • Larger scale, less control and less structure
    • Lots of databases!
    • Data Lake, not database
    • HDFS: It’s a file system, folks!
    • NoSQL: Let’s put the schema in the application
    • It can feel like the wild west!
  4. Major Approaches to BI on Big Data

    ETL to RDBMS
    o “Make the new world look like the old world!”
    o Load a transformed set of data into a relational database

    Monolithic (all-in-one) solutions
    o Use BI tools that connect directly to Big Data

    SQL-on-Big-Data
    o Connect BI tools to a query engine sitting on top of Big Data
    o Three main sub-categories
      • Native SQL
      • Batch SQL
      • OLAP Cubes
  5. So how do we bring BI to Big Data?

    [Diagram: three approaches, with the components shown for each]
    • ETL to Data Warehouse: Big Data, ETL tool, RDBMS, BI options
    • SQL-on-Big-Data: Big Data, SQL Engine, BI options
    • Monolithic All-in-one Solutions: Big Data, Monolithic tool with built-in BI
  6. ETL to RDBMS: Introduction

    • ETL (Extract, Transform, and Load) a subset of the data into a relational database
      o Oracle, PostgreSQL, Teradata, Redshift, Vertica, …
    • Connect any desired BI tool to the RDBMS
      o Tableau, Qlik, …
    • Two options:
      o Commercial tools (Informatica, Talend, Pentaho, …)
      o Custom development, scripts, etc.

    [Diagram labels: Big Data, ETL tool, RDBMS, BI options]
  7. ETL to RDBMS: Example

    • Load web server logs from HDFS into RDBMS
    • ETL software: Pentaho Data Integration (aka ‘Kettle’)
    • RDBMS: MySQL

    Steps shown on the slide:
    • Connect ETL to RDBMS
    • Add and Configure Input/Output
    • Connect Input and Output
    • Create and fill RDBMS table
    • Connect BI tool to RDBMS

    [Chart by month: April, May, June, July]

    Source: http://wiki.pentaho.com/display/BAD/Extracting+Data+from+HDFS+to+Load+an+RDBMS
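The slide uses Pentaho and MySQL; the "custom development, scripts" alternative mentioned on the previous slide could look roughly like the sketch below. It is only an illustration under stated assumptions: the sample log lines, the `weblogs` table and all function names are hypothetical, an in-memory SQLite database stands in for MySQL, and an inline list stands in for files read from HDFS.

```python
import re
import sqlite3
from datetime import datetime

# Hypothetical sample lines in Common Log Format, standing in for files read from HDFS.
SAMPLE_LOGS = [
    '10.0.0.1 - - [03/Apr/2016:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.2 - - [17/May/2016:08:01:05 +0000] "GET /pricing HTTP/1.1" 200 2210',
    '10.0.0.1 - - [21/May/2016:19:44:10 +0000] "GET /missing HTTP/1.1" 404 312',
    'this line is malformed and should be skipped',
]

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def extract(lines):
    """Extract: parse each raw line, silently skipping lines that do not match."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            yield match.groupdict()

def transform(record):
    """Transform: convert types and derive a month column for reporting."""
    ts = datetime.strptime(record["ts"], "%d/%b/%Y:%H:%M:%S %z")
    size = 0 if record["size"] == "-" else int(record["size"])
    return (record["ip"], ts.strftime("%Y-%m"), record["path"], int(record["status"]), size)

def load(conn, rows):
    """Load: create the target table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weblogs "
        "(ip TEXT, month TEXT, path TEXT, status INTEGER, bytes INTEGER)"
    )
    conn.executemany("INSERT INTO weblogs VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()

# SQLite in memory is an assumption made so the sketch runs anywhere; the slide uses MySQL.
conn = sqlite3.connect(":memory:")
load(conn, [transform(record) for record in extract(SAMPLE_LOGS)])

# The kind of query a BI tool might issue once connected to the RDBMS.
hits_per_month = conn.execute(
    "SELECT month, COUNT(*) FROM weblogs GROUP BY month ORDER BY month"
).fetchall()
print(hits_per_month)
```

Note that the malformed line is dropped without any record of it, which hints at why hand-coded ETL tends to need extra care around changing or messy input.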
  8. ETL to RDBMS: Pros and Cons

    Pros
    • Relational databases and their BI integrations are very mature
    • Use your favorite tools
      o Tableau, Excel, R, …

    Cons
    • Traditional ETL tools don’t work well with modern data
      o Changing schemas, complex or semi-structured data, …
      o Hand-coded scripts are a common substitute
    • Data freshness
      o How often do you replicate/synchronize?
    • Data resolution
      o Can’t store all the raw data in the RDBMS (due to scalability and/or cost)
      o Need to sample, aggregate or time-constrain the data

    …and really, who wants to ETL?
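The "data resolution" point names three ways to shrink raw data before it reaches the RDBMS: sampling, aggregating and time-constraining. A minimal sketch of the three follows; the event data, the function names and the 10% fraction are all hypothetical and only meant to show what each reduction gives up.

```python
import random
from collections import Counter
from datetime import date

# Hypothetical raw events (event_date, page); in practice these stay in the Big Data store.
RAW_EVENTS = [
    (date(2016, 1 + i % 6, 1 + i % 28), "/pricing" if i % 3 == 0 else "/home")
    for i in range(1000)
]

def sample(events, fraction, seed=0):
    """Keep a random fraction of the rows (seeded so the run is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(events, int(len(events) * fraction))

def aggregate(events):
    """Collapse rows to one count per (month, page); row-level detail is lost."""
    return Counter((day.strftime("%Y-%m"), page) for day, page in events)

def time_constrain(events, start):
    """Keep only rows on or after the start date; older history is lost."""
    return [event for event in events if event[0] >= start]

sampled = sample(RAW_EVENTS, 0.1)
monthly = aggregate(RAW_EVENTS)
recent = time_constrain(RAW_EVENTS, date(2016, 4, 1))
print(len(RAW_EVENTS), len(sampled), len(monthly), len(recent))
```

Each reduction makes the data small enough to load, but none of them lets an analyst get back to the individual raw rows that were left behind.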
  9. Monolithic (or All-in-One) Solutions: Introduction

    • Single piece of software on top of Big Data
    • Performs both data visualization (BI) and execution
    • Utilize sampling or manual pre-aggregation to reduce the data volume that the user is interacting with
    • Examples:
      o Datameer
      o Platfora
      o Zoomdata

    [Diagram labels: Big Data, Monolithic system with built-in BI, Monolithic Solutions]
  10. Platfora Architecture Overview

    • Constructs aggregates that are loaded into an external database
      o Aggregates provide fast visualizations
      o Aggregations must be created before consumption

    [Diagram labels: Hadoop Cluster, MapReduce/Spark, HDFS, Hadoop, Platfora Cluster, Proprietary DB, Aggregates]
  11. Datameer Architecture Overview

    • Users interact with samples of the data in an Excel-like interface
    • Finished designs use the whole dataset
    • Query router determines execution engine based on data size

    [Diagram labels: Hadoop Cluster, Datameer Nodes, Query Router, Sampling, Single Node Custom Execution, Tez, MapReduce, Hadoop, HDFS]
  12. Zoomdata Architecture Overview

    • Queries on historical (i.e., non-streaming) data are split into many sampling queries
    • This sampling provides a view of the data that converges toward an accurate picture
      o But adds load on the data source…
    • Can handle streaming data sources

    [Diagram labels: Zoomdata Server, Stream Processing Engine, Spark-based cache, Incremental Sampling, Streaming Data Source, HDFS / MongoDB, Multiple Data Clusters]
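The idea of splitting one historical query into many sampling queries whose combined result converges can be illustrated with a running estimate. The sketch below is a generic illustration, not Zoomdata's implementation: the function name, the metric column and the choice of ten micro-queries are assumptions.

```python
import random

def incremental_estimates(values, num_queries, seed=0):
    """Yield (rows_seen, running_mean) after each sampling micro-query."""
    if not values:
        return
    rng = random.Random(seed)
    shuffled = values[:]
    rng.shuffle(shuffled)  # random order so each chunk is an unbiased sample
    chunk = -(-len(shuffled) // num_queries)  # ceiling division
    seen_sum, seen_count = 0.0, 0
    for start in range(0, len(shuffled), chunk):
        part = shuffled[start:start + chunk]  # one micro-query against the source
        seen_sum += sum(part)
        seen_count += len(part)
        yield seen_count, seen_sum / seen_count

# Hypothetical metric column; a real deployment would query the data source instead.
values = [float(v) for v in range(1, 10001)]
exact = sum(values) / len(values)
estimates = list(incremental_estimates(values, num_queries=10))

for rows_seen, estimate in estimates:
    print(rows_seen, round(estimate, 2), "error:", round(abs(estimate - exact), 2))
```

The user sees an approximate answer after the first micro-query and the exact one after the last, but the source has served ten requests instead of one, which is the extra load the slide points out.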
  13. Monolithic Solutions: Pros and Cons

    Pros
    • Only one tool to learn and operate
    • Easier than building and maintaining an ETL-to-RDBMS pipeline
    • Integrated data preparation in some solutions

    Cons
    • Can’t analyze the raw data
      o Rely on aggregation or sampling before primary analysis
    • Can’t use your existing BI or analytics tools (Tableau, Qlik, R, …)
    • Can’t run arbitrary SQL queries
  14. SQL-on-Big-Data: Introduction

    • SQL queries against Big Data
      o Hadoop
      o NoSQL
        • MongoDB, HBase, ...
      o Cloud Storage
        • S3, Azure Data Lake, GCS, …
    • Use your existing BI tools
      o Leverage standard ODBC/JDBC drivers

    [Diagram labels: Tableau, Qlik, R, …; SQL Engine; Hadoop & NoSQL]
  15. SQL-on-Big-Data: Introduction

    Three major design philosophies:
    • Native SQL
    • Batch & Data Science SQL
    • OLAP Cubes on Hadoop
  16. Native SQL

    • Apache Drill
      o Queries Hadoop, RDBMS, Files, NoSQL, Cloud (S3)
      o Based on Apache Arrow
      o Columnar in-memory execution
    • Apache Impala (incubating)
      o Utilizes the Hive metastore
      o Focused on data in HDFS
    • Presto
      o Queries Hadoop, RDBMS, Files, NoSQL, Cloud (S3)
  17. Native SQL: Pros and Cons

    Pros
    • Highest performance for Big Data workloads
    • Connect to Hadoop and also NoSQL systems
    • Make Hadoop “look like a database”

    Cons
    • Queries may still be too slow for interactive analysis on many TB/PB
    • Can’t defeat physics
  18. Batch & Data Science SQL

    • Hive
      o Enables SQL queries to be translated to MapReduce/Tez
      o Most commonly used for batch processing and ETL workloads
    • Spark SQL
      o Provides a way to deliver SQL queries in Spark programs (Scala/Java/Python)
      o Excellent interleaving with data science work
  19. Batch & Data Science SQL: Pros and Cons

    Pros
    o Potentially simpler deployment (no daemons)
      • New YARN job (MapReduce/Spark) for each query
    o Check-pointing support enables very long-running queries
      • Days to weeks (ETL work)
    o Works well in tandem with machine learning (Spark)

    Cons
    o Latency prohibitive for interactive analytics
      • Tableau, Qlik Sense, …
    o Slower than native SQL engines
  20. OLAP Cubes on Hadoop

    • Kylin
      o Hadoop-only
      o Stores OLAP cubes in HBase
      o Queries fail if not satisfied by cubes
      o Open source
    • AtScale
      o Hadoop-only
      o Leverages external SQL engine
        • Hive, Impala, SparkSQL
      o Collaborative cube creation
      o Closed source
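To make "queries fail if not satisfied by cubes" concrete, here is a toy cube in plain Python. It is a sketch of the general pre-aggregation idea only, not of how Kylin or AtScale store or query cubes; the `Cube` class, the sample rows and the column names are hypothetical.

```python
from collections import defaultdict

class Cube:
    """Toy pre-aggregated cube: one summed measure per combination of dimensions."""

    def __init__(self, rows, dimensions, measure):
        self.dimensions = tuple(dimensions)
        self.cells = defaultdict(float)
        for row in rows:
            key = tuple(row[d] for d in self.dimensions)
            self.cells[key] += row[measure]  # raw rows are not kept after this point

    def query(self, group_by):
        """Roll the cube up to the requested dimensions, or fail if it cannot."""
        missing = [d for d in group_by if d not in self.dimensions]
        if missing:
            raise LookupError(f"cube was not built with dimensions {missing}")
        positions = [self.dimensions.index(d) for d in group_by]
        result = defaultdict(float)
        for key, value in self.cells.items():
            result[tuple(key[i] for i in positions)] += value
        return dict(result)

# Hypothetical raw data; "customer" is deliberately left out of the cube definition.
RAW_ROWS = [
    {"region": "EU", "product": "A", "customer": "c1", "revenue": 10.0},
    {"region": "EU", "product": "B", "customer": "c2", "revenue": 5.0},
    {"region": "US", "product": "A", "customer": "c3", "revenue": 7.0},
    {"region": "US", "product": "A", "customer": "c1", "revenue": 3.0},
]

cube = Cube(RAW_ROWS, dimensions=["region", "product"], measure="revenue")
by_region = cube.query(["region"])
print(by_region)

try:
    cube.query(["customer"])
except LookupError as error:
    print("query failed:", error)
```

Queries over modeled dimensions are answered from a handful of cells rather than the raw rows, while a question about an unmodeled dimension cannot be answered until someone redefines and rebuilds the cube.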
  21. OLAP Cubes on Hadoop: Pros and Cons

    Pros
    o Fast queries on pre-aggregated data
    o Can use SQL and MDX tools

    Cons
    o Explicit cube definition/modeling phase
      • Not “self-service”
      • Frequent updates required due to dependency on business logic
    o Aggregation creation and maintenance can be long (and large)
    o User connects to and interacts with the cube
      • Can’t interact with the raw data
  22. SQL-on-Big-Data: Solution Comparison

    Columns: Native SQL | Batch & DS SQL | OLAP Cubes
    • Technologies: Drill, Impala, Presto | Hive, Spark SQL | Kylin, AtScale
    • Connectivity: SQL and NoSQL | SQL and NoSQL | Hadoop-only
    • Primary Use Case: Interactive | ETL or data-science focused | Constrained Interactive
    • Query Capability: Raw data | Raw data | Aggregated data
    • Deployment Model: New daemons collocated with existing services | New MapReduce and/or Spark job for each query | Varies
  23. SQL-on-Big-Data: General Pros and Cons

    Pros
    • Continue using your favorite BI tools and SQL-based clients
      o Tableau, Qlik, Power BI, Excel, R, SAS, …
    • Technical analysts can write custom SQL queries

    Cons
    • Another layer in your data stack
    • May need to pre-aggregate the data depending on your scale
    • Need a separate data preparation tool (or custom scripts)
  24. BI on Big Data: Heuristic

    [Decision flowchart; the Yes/No branch structure did not survive extraction]

    Questions asked:
    • Do you already have a favorite BI tool?
    • Is an external cluster okay?
    • Does your schema change frequently?
    • Do you want to be able to write SQL?
    • Do you like the Excel metaphor?
    • Is your working data relatively small & static?
    • Do you have very predictable analysis needs?
    • Are you focused on interactive BI?
    • Do you need to query NoSQL?
    • Do you want to combine ML with SQL?

    Outcomes:
    • ETL to RDBMS
    • Monolithic/All-in-one Solutions (Platfora, Zoomdata, Datameer)
    • OLAP Cubes on Hadoop
    • Native SQL
    • Hive
    • SparkSQL
  25. Q&A

    Tomer Shiran
    [email protected]
    @tshiran

    Reach out to learn what we’re up to at Dremio (or to join the private beta…)