Slide 1

Slide 1 text

localstack localstack-cloud localstack localstack.cloud localstack.cloud LocalStack Snowflake Emulator Waldemar Hummer San Francisco June 5th 2024

Slide 2

Slide 2 text

What is Snowflake?
● A cloud data platform that allows for scalable data processing
  ○ Uploading data files (CSV/JSON/Parquet) to stages
  ○ Running SQL statements to create databases/tables/views/…
  ○ Running SELECT queries to query data from files and tables
  ○ Running scheduled jobs to create ETL pipelines
  ○ …
● Lots of native integrations and SDKs/tools to interact with the platform
  ○ Python Pandas dataframes, Snowpark libraries, JDBC driver, …
● Some similarities to the Data/BigData services in AWS:
  ○ Athena, Redshift, EMR, Managed Airflow, etc.

Slide 3

Slide 3 text

Developing Data Pipelines Locally
● Developing for Snowflake requires connectivity to the remote cloud at all times
  ○ → How does that fit into the dev lifecycle?
  ○ → Is there a local development story?
● Often-requested feature, even in Snowflake forums, as well as on StackOverflow, Reddit, etc.
● Similar challenges as for AWS cloud
  ○ Speed up development cycles; avoid resource conflicts; test reproducibility; costs …

Slide 4

Slide 4 text

LocalStack for Snowflake
● Available as a Docker image
  ○ Can be easily installed locally
● Emulates the actual Snowflake API surface
  ○ → Integrates natively with all tooling
  ○ JDBC, DB visualization tools, etc. work out of the box
● Easy to extend from local into CI pipelines - running tests in CI
● Recent announcement: https://blog.localstack.cloud/2024-05-22-introducing-localstack-for-snowflake/

Slide 5

Slide 5 text

Supported Feature Set (Excerpt)
● Some of the key features are already available today, including:
  ○ Basic operations on warehouses, databases, schemas, and tables (e.g., Using the Python Connector)
  ○ Storing files in user/data/named stages (Choosing an Internal Stage for Local Files)
  ○ Snowpark libraries (e.g., Snowpark Developer Guide for Python)
  ○ Snowpipe streaming with Kafka connector (Using Snowflake Connector for Kafka with Snowpipe Streaming)
  ○ JavaScript and Python UDFs (Introduction to JavaScript UDFs)
  ○ Tasks for scheduled execution
  ○ Table streams for change data capture and audit logs
  ○ … and quite a bit more!

Slide 6

Slide 6 text

Seamless integration with DB viz tools (e.g., DBeaver)
Source: https://www.youtube.com/watch?v=1l9i_755MlA

Slide 7

Slide 7 text

Demo Time!

Slide 8

Slide 8 text

Starting Up
● Configure your auth token, then use the localstack CLI to start up:
    $ export LOCALSTACK_AUTH_TOKEN=
    $ IMAGE_NAME=localstack/snowflake localstack start
● Configure your client app to connect to the local endpoint:
    import snowflake.connector as sf
    # Point the regular Python connector at the local emulator endpoint
    conn = sf.connect(
        user="test",
        password="test",
        account="test",
        database="test",
        host="snowflake.localhost.localstack.cloud",
    )
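A minimal sketch of how the connection settings above could be used to run a first query. The helper names `local_connection_params` and `run_query` are made up for illustration and are not part of the talk; running a query assumes the `snowflake-connector-python` package is installed and the emulator is up.

```python
def local_connection_params(host="snowflake.localhost.localstack.cloud"):
    """Build connect() keyword arguments for the local emulator (hypothetical helper)."""
    return {
        # Dummy credentials and host exactly as shown on the slide.
        "user": "test",
        "password": "test",
        "account": "test",
        "database": "test",
        "host": host,
    }


def run_query(sql):
    """Run one statement against the emulator and return all result rows."""
    # Imported lazily so the helper above can be used without the connector installed.
    import snowflake.connector as sf

    conn = sf.connect(**local_connection_params())
    try:
        cur = conn.cursor()
        try:
            cur.execute(sql)
            return cur.fetchall()
        finally:
            cur.close()
    finally:
        conn.close()
```

With the emulator running, `run_query("SELECT 1")` would be a quick way to check that the client reaches the local endpoint rather than the real cloud.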

Slide 9

Slide 9 text

Demo 1: Basic Queries

Slide 10

Slide 10 text

Demo 2: Covid19 Vaccine Dataset

Slide 11

Slide 11 text

Sample: Queries over Covid19 dataset
● Taken from Snowflake “Getting Started” Guide
  ○ https://quickstarts.snowflake.com/guide/data_science_with_dataiku/index.html#0
● Data set contains a lot of different data points
  ○ Mobility data
  ○ Vaccination data
  ○ …
● For this sample, we’ll focus on:
  ○ Putting CSV files to a local S3 stage
  ○ Loading the CSV data into a table
  ○ Running some simple SELECT queries
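The three steps in this sample could look roughly like the statement sequence below. This is a sketch only: the stage, table, and column names are hypothetical (the real dataset has a different schema), and it uses a plain named stage with PUT, which may differ from the S3-backed stage used in the demo.

```python
def covid_load_statements(csv_path, stage="covid_stage", table="vaccinations"):
    """Return the SQL statements for the sample steps, in execution order.

    All object and column names are placeholders, not taken from the real dataset.
    """
    return [
        # Step 1: create a stage and upload the local CSV file into it.
        f"CREATE OR REPLACE STAGE {stage}",
        f"PUT file://{csv_path} @{stage}",
        # Step 2: create the target table and load the staged CSV data.
        f"CREATE OR REPLACE TABLE {table} (country VARCHAR, report_date DATE, doses NUMBER)",
        f"COPY INTO {table} FROM @{stage} FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)",
        # Step 3: run a simple aggregate query over the loaded rows.
        f"SELECT country, SUM(doses) AS total_doses FROM {table} "
        "GROUP BY country ORDER BY total_doses DESC LIMIT 10",
    ]
```

Each statement can be passed to `cursor.execute()` on a connector connection, one after the other.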

Slide 12

Slide 12 text

Demo 3: Citybike Trips

Slide 13

Slide 13 text

Sample App: NYC Citybike Trips
● Taken from Snowflake “Getting Started” Guide
● Contains trips and weather information over several years
● Data available in a public S3 bucket
  ○ → Can be integrated in a local Snowflake stage directly!
● Web app displays the data in simple charts

Slide 14

Slide 14 text

Demo 4: Table Streams

Slide 15

Slide 15 text

Table Streams
● See https://docs.snowflake.com/en/user-guide/streams-intro
● Enables Change Data Capture (CDC) for Snowflake tables
● Stream = minimal set of changes from its current offset to the current version of the table
● Streams can be “consumed” via DML queries, e.g.:
    INSERT INTO target … SELECT * FROM stream …
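To make the consume pattern concrete, here is a small sketch of a statement sequence that creates a stream, changes the source table, and then consumes the stream into an audit table. The table and stream names are hypothetical; `METADATA$ACTION` is the stream metadata column that records whether a change row is an insert or a delete.

```python
def stream_demo_statements(source="orders", stream="orders_stream", target="orders_audit"):
    """Return SQL statements that create, fill, and consume a table stream.

    Object names are placeholders chosen for this illustration.
    """
    return [
        f"CREATE OR REPLACE TABLE {source} (id INT, amount NUMBER)",
        f"CREATE OR REPLACE TABLE {target} (id INT, amount NUMBER, action VARCHAR)",
        # The stream starts tracking changes to the source table from this point on.
        f"CREATE OR REPLACE STREAM {stream} ON TABLE {source}",
        f"INSERT INTO {source} VALUES (1, 10), (2, 20)",
        # Consuming the stream in a DML statement advances its offset.
        f"INSERT INTO {target} SELECT id, amount, METADATA$ACTION FROM {stream}",
        # After consumption, the stream should report no pending changes.
        f"SELECT COUNT(*) FROM {stream}",
    ]
```

Because the offset advances only when the stream is read inside a DML statement, a plain SELECT on the stream can be repeated without consuming the pending changes.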

Slide 16

Slide 16 text

Demo 5: Streamlit Applications

Slide 17

Slide 17 text

Streamlit Apps
● Streamlit = Python UI framework
  ○ https://streamlit.io
● Integrates natively with Snowflake
● Lots of UI components available
  ○ Charts
  ○ Widgets
  ○ Maps
  ○ Graphs
  ○ …
● → Easy way to create data apps!

Slide 18

Slide 18 text

State Management

Slide 19

Slide 19 text

Cloud Pods
Persistent Shareable Sandboxes
Cloud Pods are a mechanism that allows you to take a snapshot of the state in your current LocalStack instance, persist it to a storage backend, and easily share it with your team members.

Slide 20

Slide 20 text

Demo 6: Integration with Cloud Pods

Slide 21

Slide 21 text

Cloud Pods: Save & Load DB Snapshots
● Cloud pods can be saved and loaded from the CLI
    $ localstack pod save my-pod-123
    $ localstack pod load my-pod-123
● We’ve prepared a cloud pod with a table named “test” - load it like this:
    $ localstack pod load pod-snowflake
    $ snow sql -c local --query 'select * from test'
    +----------------------------------+
    | MESSAGE                          |
    |----------------------------------|
    | Hello from LocalStack Snowflake! |
    +----------------------------------+

Slide 22

Slide 22 text

How are we building this?

Slide 23

Slide 23 text

Implementation
● High-level architecture
  ○ Query processors
  ○ Core DB engine
    ■ Current: Postgres
    ■ Alternative: DuckDB
  ○ Auxiliary services
    ■ Streams, stages, …
● Written in Python
● Running in Docker
https://blog.localstack.cloud/2024-05-22-introducing-localstack-for-snowflake

Slide 24

Slide 24 text

Challenges: Query Transpilation, Data Types
● Snowflake and Postgres SQL are similar, yet there are many subtle differences
● Query parsing using sqlglot
  ○ Allows us to create a query AST, and perform modifications on it
  ○ Big shout-out to Tobiko Data for providing this library! 🙌
  ○ We’ve also been able to contribute a few upstream PRs :) (#2989, #3510, #3519)
● Challenge: high-fidelity support for Snowflake data types
  ○ Often either no direct mapping, or different semantics in Postgres
  ○ Example: timestamps (TIMESTAMP_LTZ, TIMESTAMP_NTZ, etc.)
  ○ Advanced data types in Snowflake like the generic VARIANT type
  ○ Needed to introduce a custom VARIANT data type in the core DB engine
https://blog.localstack.cloud/2024-05-22-introducing-localstack-for-snowflake
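As a rough illustration of the parse-and-rewrite idea, the sketch below uses sqlglot's public `transpile` and `parse_one` functions to render a Snowflake query in the Postgres dialect and to inspect its AST. The sample query and helper names are invented for this example; the emulator's actual query processors are not shown in the talk and are certainly more involved.

```python
# Sample Snowflake query (made up); IFF is a Snowflake function without a Postgres equivalent.
SNOWFLAKE_QUERY = "SELECT IFF(amount > 100, 'high', 'low') AS bucket FROM orders"


def transpile_to_postgres(sql):
    """Parse Snowflake SQL and re-render it in the Postgres dialect."""
    import sqlglot  # third-party dependency: pip install sqlglot

    # transpile() returns one output string per input statement.
    return sqlglot.transpile(sql, read="snowflake", write="postgres")[0]


def referenced_tables(sql):
    """Walk the query AST and collect the names of all referenced tables."""
    import sqlglot
    from sqlglot import exp

    tree = sqlglot.parse_one(sql, read="snowflake")
    return sorted({table.name for table in tree.find_all(exp.Table)})
```

Having the query as a tree rather than a string is what makes targeted modifications possible, for example swapping a function node or a data type node before handing the statement to the core engine.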

Slide 25

Slide 25 text

Bridging Local & Remote: Connection Proxy
● Easily flip the switch between local and remote execution
● Can be configured with real Snowflake credentials (see screenshot below)
  ○ Calls will be forwarded to the real cloud and returned to the client
● Enables a lot of exciting use cases:
  ○ Duplex mode - running queries against local AND remote (local mirror)
  ○ Route only requests for certain tables to upstream: JOIN local & remote
[Diagram labels: Local Machine; Client (e.g., Python connector); LocalStack Snowflake; Connection Proxy; Core Engine; Real Snowflake Cloud Account]
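The table-based routing use case can be pictured with a toy decision function. This is purely illustrative and not how the LocalStack connection proxy is implemented; the function name, the return labels, and the idea of a configured set of remote tables are all assumptions.

```python
def route_for(tables, remote_tables):
    """Decide where a query would run, given the tables it references.

    Returns "local", "remote", or "mixed" (a query joining local and remote tables).
    Hypothetical helper; table names are compared case-insensitively.
    """
    referenced = {t.lower() for t in tables}
    remote = referenced & {t.lower() for t in remote_tables}
    if not remote:
        return "local"   # nothing needs the real cloud account
    if remote == referenced:
        return "remote"  # forward the whole query upstream
    return "mixed"       # some tables are local, some upstream
```

For example, `route_for(["orders", "customers"], ["customers"])` yields `"mixed"`, the case the slide describes as joining local and remote data.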

Slide 26

Slide 26 text

The Road Ahead

Slide 27

Slide 27 text

Roadmap
● LocalStack “vNext”: Expanding our focus into the data engineering space
  ○ Based on our learnings and the foundation of the LocalStack AWS emulator
  ○ Turns out that there is a need for better local testing of data pipelines as well!
● The Snowflake emulator is still early stage, but the direction looks very promising
● Nicely integrates with our existing LocalStack AWS features
  ○ Using local S3; soon: AWS<>Snowflake integrations (e.g., Kinesis Firehose, …)
  ○ LS Cloud Platform: saving/loading of Cloud Pods to manage persistent state
● Exploring interesting challenges related to data testing
  ○ Test data management; testing data/ETL in CI pipelines; …
● Most of all: We’d love to LEARN about YOUR use cases!
  ○ Get in touch to participate in the LocalStack Snowflake preview!

Slide 28

Slide 28 text

Get in touch!
info@localstack.cloud