Women in Big Data Meetup: An introduction to Presto, an open source distributed SQL engine

Ahana
August 27, 2020

Presto is a widely adopted distributed SQL engine for federated querying across multiple data sources. With Presto, you can run ad hoc queries on data in place, which shortens both the time to discover data and the time it takes to complete an ad hoc analysis.

In this session, Dipti will introduce the Presto technology and share why it’s becoming so popular. Companies like Facebook, Uber, Twitter, Alibaba and many more use Presto for interactive ad hoc queries, reporting and dashboarding, data lake analytics, and much more. She’ll also touch on job opportunities around Presto.

She’ll also share her career journey and how she arrived in her current role as co-founder and Chief Product Officer at Ahana, a Presto company. Dipti started her career at IBM as a software engineer and development manager for the DB2 server team, and has since worked at deep database technology companies including Couchbase (holding leadership positions in solutions engineering, product, and product marketing), Kinetica (VP Product Marketing), and Alluxio (VP of Product).

Dipti Borkar is the Co-Founder & Chief Product Officer at Ahana, the Presto company. She has over 15 years of experience in data and database technology, across both relational and non-relational systems. Prior to Ahana, Dipti was VP of Product & Marketing at Alluxio, and VP Product Marketing at Kinetica and Couchbase. At Couchbase she held several leadership positions, including Head of Global Technical Sales and Head of Product Management. Earlier in her career, Dipti managed development teams at IBM DB2, where she started out as a database software engineer. Dipti holds an M.S. in Computer Science from UC San Diego and an MBA from the Haas School of Business at UC Berkeley.

Transcript

  1. Dipti Borkar, Co-Founder & CPO | Ahana
    An introduction to Presto, an open source distributed SQL engine
  2. Founder Mom Immigrant Girl data geek (DB) Engineer always Product techie Team builder Open source believer Mixologist
  3. Agenda
    • What is Presto?
    • History of federation
    • Introduction to Presto
    • What made Presto different?
    • Scalable architecture
    • Flexible Connectors
    • Performance
    • The life of a query
  4. Technology Cycles Rhyme: Data Federation
    FDBMS Challenges
    RDBMS
    FDBMS Paper by McLeod / Heimbigner (1985)
    FDBMS Paper by Sheth / Larson (1990)
    OLTP to DW Wins
    Data Warehouse becomes the source of truth
    Star schema becomes sacred
    Cloud & Big Data
    Composite Software (founded 2001)
    Garlic Paper by Laura Haas (2002) → DB2 Federated
    Google File System Paper (2003)
    MapReduce paper (2006)
    Spark Paper (2010)
    Too many Data Sources, No one uber schema
    New Cloud DW w/ Data Lakes Based on SQL
    Self Service Platforms which enable Self-Service Analytics
    SQL Federation Makes Comeback
    Dremel Paper (2010) → Drill paper (2012)
    SQL++ paper (2014) → Couchbase SQL++ engine (2018)
    Presto paper (2019), PartiQL (2019)
    Timeline: 80’s, 90’s, 2000’s, 2010’s, 2020’s
  5. Presto: One of the Fastest Growing Open Source Projects in Data Analytics
    Business Needs
    • Data-driven decision making
    • Businesses need more data to iterate over
    Technology Trends
    • Disaggregation of Storage and Compute
    • The rise of data lakes
  6. What is Presto?
    • Distributed SQL query engine
    • ANSI SQL on databases and data lakes
    • Designed to be interactive
    • Access to petabytes of data
    • Open source, hosted on GitHub
    • https://github.com/prestodb
  7. Common Questions?
    • Is Presto a database?
    • How is it related to Hadoop?
    • How is it different from a data warehouse?
  8. Sample Presto deployment stack & use cases
    • Ad hoc
    • BI tools
    • Dashboard
    • A/B testing
    • ETL/scheduled job
    • Online service
  9. Scalable Architecture
    • Two roles: coordinator and worker
    • Easy scale up and scale down
    • Scale up to 1000 workers
    • Validated at web-scale companies
  10. Presto Connector Data Model
    • Connector: Driver for a data source.
    • Examples: HDFS, AWS S3, Cassandra, MySQL, SQL Server, Kafka
    • Catalog: Contains schemas from a data source specified by the connector.
    • Schemas: Namespace to organize tables.
    • Tables: Set of unordered rows organized into columns with types.
  11. Presto Hive Connector – Data File Types
    • Supported File Types
      • ORC
      • Parquet
      • Avro
      • RCFile
      • SequenceFile
      • JSON
      • Text
    • No data ingestion needed
  12. Why Presto is Fast
    • In-Memory processing
    • Pull model
    • Columnar storage and execution
  13. The Life of a Query – Join and Aggregation
    SELECT orders.orderkey, SUM(tax)
    FROM orders
    LEFT JOIN lineitem ON orders.orderkey = lineitem.orderkey
    WHERE discount = 0
    GROUP BY orders.orderkey
    This example is from Presto: SQL on Everything
    https://research.fb.com/publications/presto-sql-on-everything/
  14. Ahana
    • SQL analytics company based on Presto
    • Team of experts in cloud, database, and Presto
    • Investment from Google Ventures
    • Named CRN Top 10 Big Data Startup of 2020
    • Premier member of
    “[Ahana founders] have been strong supporters of the Presto Foundation since its launch in September 2019”
    “We are excited to welcome Ahana, as the first and only company focused on supporting Presto of the Presto Foundation”
  15. Join the Presto Community
    • Request a new feature or file a bug: github.com/prestodb/presto
    • Slack: prestodb.slack.com
    • Twitter: @prestodb
    Stay Up-to-Date with Ahana
    • URL: ahana.io
    • Twitter: @ahanaio
  16. Data-Driven Companies need Low Data Latency
    Analysts and Scientists need to answer questions
    “Data Latency” = the time it takes from a user having a question to the time they can actually answer it
    1. User wants to track or explore some new data
    2. User meets with Data Eng team to make a plan
    3. Data team acquires data and checks access permissions
    4. Build and test the ETLs and make tables available to user
    5. Notify the user so they can ask their questions
    ! Can be days or weeks of time
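
The join-and-aggregation query from slide 13 can be tried out locally. The following is a minimal sketch rather than the speaker's demo: it uses Python's built-in sqlite3 module as a stand-in engine (not Presto), and the table definitions and sample rows are made-up assumptions. Only the query text comes from the slide.

```python
import sqlite3

# Query text as shown on slide 13 (credited there to "Presto: SQL on Everything").
QUERY = """
SELECT orders.orderkey, SUM(tax)
FROM orders
LEFT JOIN lineitem ON orders.orderkey = lineitem.orderkey
WHERE discount = 0
GROUP BY orders.orderkey
"""


def run_query():
    """Run the slide's query on hypothetical sample data in an in-memory SQLite DB."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical minimal schemas: only the columns the query touches.
    conn.execute("CREATE TABLE orders (orderkey INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE lineitem (orderkey INTEGER, tax REAL, discount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany(
        "INSERT INTO lineitem VALUES (?, ?, ?)",
        [
            (1, 0.5, 0.0),    # kept: discount = 0
            (1, 0.25, 0.0),   # kept: discount = 0
            (1, 1.0, 0.1),    # dropped by the WHERE clause
            (2, 2.0, 0.05),   # dropped by the WHERE clause
            (2, 0.125, 0.0),  # kept: discount = 0
        ],
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    # The query has no ORDER BY, so sort here for a stable result.
    return sorted(rows)


if __name__ == "__main__":
    for orderkey, total_tax in run_query():
        print(orderkey, total_tax)
```

With this sample data, order 3 has no line items, so the LEFT JOIN yields a NULL discount for it and the WHERE clause filters it out. Because the filter is on a column from the right-hand table, the query returns only orders that have at least one undiscounted line item.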