• Cloud data platform that allows for scalable data processing
  ◦ Uploading data files (CSV/JSON/Parquet) to stages
  ◦ Running SQL statements to create databases/tables/views/…
  ◦ Running SELECT queries against data in files and tables
  ◦ Running scheduled jobs to create ETL pipelines
  ◦ …
• Lots of native integrations and SDKs/tools to interact with the platform
  ◦ Python Pandas dataframes, Snowpark libraries, JDBC driver, …
• Some similarities to the data/big data services in AWS:
  ◦ Athena, Redshift, EMR, Managed Airflow, etc.
• Developing for Snowflake requires connectivity to the remote cloud at all times
  ◦ → How does that fit into the dev lifecycle?
  ◦ → Is there a local development story?
• An often-requested feature, even in Snowflake forums, as well as on StackOverflow, Reddit, etc.
• Similar challenges as for the AWS cloud
  ◦ Speed up development cycles; avoid resource conflicts; test reproducibility; costs; …
• Available as a Docker image
  ◦ Can be easily installed locally
• Emulates the actual Snowflake API surface
  ◦ → integrates natively with all tooling
  ◦ JDBC, DB visualization tools, etc. work out of the box
• Easy to extend from local into CI pipelines - running tests in CI (see the sketch below)
• Recent announcement: https://blog.localstack.cloud/2024-05-22-introducing-localstack-for-snowflake/
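To give an idea of the CI angle, here is a minimal sketch of a pytest test that runs against the local endpoint; the test file, table name, and assertion are hypothetical, and a LocalStack Snowflake container is assumed to be already running in the CI job:

    # test_snowflake_local.py - hypothetical pytest sketch;
    # assumes LocalStack Snowflake is reachable at the local endpoint
    import snowflake.connector as sf

    def test_create_and_query_table():
        conn = sf.connect(
            user="test", password="test", account="test", database="test",
            host="snowflake.localhost.localstack.cloud",
        )
        cur = conn.cursor()
        cur.execute("CREATE OR REPLACE TABLE ci_demo (id INT)")
        cur.execute("INSERT INTO ci_demo VALUES (1), (2)")
        cur.execute("SELECT COUNT(*) FROM ci_demo")
        assert cur.fetchone()[0] == 2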
• Some of the key features are already available today, including:
  ◦ Basic operations on warehouses, databases, schemas, and tables (e.g., Using the Python Connector)
  ◦ Storing files in user/data/named stages (Choosing an Internal Stage for Local Files)
  ◦ Snowpark libraries (e.g., Snowpark Developer Guide for Python)
  ◦ Snowpipe streaming with the Kafka connector (Using Snowflake Connector for Kafka with Snowpipe Streaming)
  ◦ JavaScript and Python UDFs (Introduction to JavaScript UDFs) - see the sketch below
  ◦ Tasks for scheduled execution
  ◦ Table streams for change data capture and audit logs
  ◦ … and quite a bit more!
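As a taste of the UDF support, here is a minimal sketch that defines and calls a Python UDF through the Python connector; the function name is illustrative, and the emulator is assumed to be running locally (the CREATE FUNCTION statement itself is standard Snowflake SQL):

    import snowflake.connector as sf

    conn = sf.connect(user="test", password="test", account="test",
                      database="test", host="snowflake.localhost.localstack.cloud")
    cur = conn.cursor()
    # define a simple Python UDF
    cur.execute("""
    CREATE OR REPLACE FUNCTION add_one(x INT)
    RETURNS INT
    LANGUAGE PYTHON
    RUNTIME_VERSION = '3.8'
    HANDLER = 'add_one'
    AS $$
    def add_one(x):
        return x + 1
    $$""")
    # call the UDF like any built-in function
    cur.execute("SELECT add_one(41)")
    print(cur.fetchone())  # expected: (42,)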
• Configure your auth token, then use the localstack CLI to start up:

    $ export LOCALSTACK_AUTH_TOKEN=<your-auth-token>
    $ IMAGE_NAME=localstack/snowflake localstack start

• Configure your client app to connect to the local endpoint:

    import snowflake.connector as sf
    conn = sf.connect(
        user="test",
        password="test",
        account="test",
        database="test",
        host="snowflake.localhost.localstack.cloud",
    )
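Once connected, the object behaves like a regular Snowflake connection; a quick smoke test (a minimal sketch, reusing the conn object from above) could look like this:

    cur = conn.cursor()
    cur.execute("SELECT 1")  # any SQL statement can be issued here
    print(cur.fetchone())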
• Taken from the Snowflake “Getting Started” guide
  ◦ https://quickstarts.snowflake.com/guide/data_science_with_dataiku/index.html#0
• Data set contains a lot of different data points
  ◦ mobility data
  ◦ vaccination data
  ◦ …
• For this sample, we’ll focus on (sketched below):
  ◦ Putting CSV files to a local S3 stage
  ◦ Loading the CSV data into a table
  ◦ Running some simple SELECT queries
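A minimal sketch of those three steps via the Python connector; the file path, stage name, and table schema are hypothetical placeholders rather than the actual columns from the guide:

    import snowflake.connector as sf

    conn = sf.connect(user="test", password="test", account="test",
                      database="test", host="snowflake.localhost.localstack.cloud")
    cur = conn.cursor()
    # create an internal stage (backed by local S3) and upload a CSV file into it
    cur.execute("CREATE OR REPLACE STAGE covid_stage")
    cur.execute("PUT file:///tmp/mobility.csv @covid_stage")
    # load the staged CSV data into a table
    cur.execute("CREATE OR REPLACE TABLE mobility (region VARCHAR, day DATE, score FLOAT)")
    cur.execute("COPY INTO mobility FROM @covid_stage FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)")
    # run a simple SELECT query
    cur.execute("SELECT region, AVG(score) FROM mobility GROUP BY region")
    print(cur.fetchall())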
• Taken from the Snowflake “Getting Started” guide
• Contains trips and weather information over several years
• Data is available in a public S3 bucket
  ◦ → can be integrated into a local Snowflake stage directly (see the sketch below)
• Web app displays the data in simple charts
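Creating a stage over a public S3 bucket could look like the following sketch; the bucket URL is a placeholder, not the actual bucket used in the guide:

    import snowflake.connector as sf

    conn = sf.connect(user="test", password="test", account="test",
                      database="test", host="snowflake.localhost.localstack.cloud")
    cur = conn.cursor()
    # external stage over a public S3 bucket (no credentials needed for public data)
    cur.execute("CREATE OR REPLACE STAGE trips_stage URL='s3://<public-bucket>/trips/'")
    cur.execute("LIST @trips_stage")  # inspect the staged files
    print(cur.fetchall())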
• Enables Change Data Capture (CDC) for Snowflake tables
• Stream = minimal set of changes from its current offset to the current version of the table
• Streams can be “consumed” via DML queries, e.g.:

    INSERT INTO target … SELECT * FROM stream …
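A minimal end-to-end sketch of a table stream against the local emulator; the table and stream names are illustrative:

    import snowflake.connector as sf

    conn = sf.connect(user="test", password="test", account="test",
                      database="test", host="snowflake.localhost.localstack.cloud")
    cur = conn.cursor()
    cur.execute("CREATE OR REPLACE TABLE source_t (id INT, name VARCHAR)")
    # the stream records changes to source_t from its current offset onwards
    cur.execute("CREATE OR REPLACE STREAM source_stream ON TABLE source_t")
    cur.execute("INSERT INTO source_t VALUES (1, 'foo'), (2, 'bar')")
    cur.execute("CREATE OR REPLACE TABLE target_t (id INT, name VARCHAR)")
    # consuming the stream in a DML statement advances its offset
    cur.execute("INSERT INTO target_t SELECT id, name FROM source_stream")
    cur.execute("SELECT * FROM target_t")
    print(cur.fetchall())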
Cloud Pods are a mechanism that allows you to take a snapshot of the state in your current LocalStack instance, persist it to a storage backend, and easily share it with your team members.
DB Snapshots
• Cloud pods can be saved and loaded from the CLI:

    $ localstack pod save my-pod-123
    $ localstack pod load my-pod-123

• We’ve prepared a cloud pod with a table named “test” - load it like this:

    $ localstack pod load pod-snowflake
    $ snow sql -c local --query 'select * from test'
    +----------------------------------+
    | MESSAGE                          |
    |----------------------------------|
    | Hello from LocalStack Snowflake! |
    +----------------------------------+
• Snowflake and Postgres SQL are similar, yet have many subtle differences
• Query parsing using sqlglot (see the sketch below)
  ◦ Allows us to create a query AST, and perform modifications on it
  ◦ Big shout-out to Tobiko Data for providing this library! 🙌
  ◦ We’ve also been able to contribute a few upstream PRs :) (#2989, #3510, #3519)
• Challenge: high-fidelity support for Snowflake data types
  ◦ Often either no direct mapping, or different semantics in Postgres
  ◦ Example: timestamps (TIMESTAMP_LTZ, TIMESTAMP_NTZ, etc.)
  ◦ Advanced data types in Snowflake, like the generic VARIANT type
  ◦ Needed to introduce a custom VARIANT data type in the core DB engine
https://blog.localstack.cloud/2024-05-22-introducing-localstack-for-snowflake
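For illustration, a couple of sqlglot snippets showing the transpile and AST primitives this builds on (this is not LocalStack's actual rewriting code; the queries are made up):

    import sqlglot
    from sqlglot import exp

    # transpile a Snowflake query into the Postgres dialect
    print(sqlglot.transpile("SELECT IFF(x > 0, 'pos', 'neg') FROM t",
                            read="snowflake", write="postgres")[0])

    # parse a query into an AST and inspect/modify its nodes
    tree = sqlglot.parse_one("SELECT a FROM my_table", read="snowflake")
    for table in tree.find_all(exp.Table):
        print(table.name)  # -> my_table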
Connection Proxy
• Easily flip the switch between local and remote execution
• Can be configured with real Snowflake credentials (see screenshot below)
  ◦ Calls will be forwarded to the real cloud and results returned to the client
• Enables a lot of exciting use cases:
  ◦ Duplex mode: running queries against local AND remote (local mirror)
  ◦ Routing only requests for certain tables upstream: JOIN local & remote

[Diagram: Client (e.g., Python connector) → LocalStack Snowflake (Connection Proxy + Core Engine) → Real Snowflake Cloud Account]
• Expanding our focus into the data engineering space
  ◦ Based on our learnings and the foundation of the LocalStack AWS emulator
  ◦ Turns out there is a need for better local testing of data pipelines as well!
• The Snowflake emulator is still early-stage, but the direction looks very promising
• Integrates nicely with our existing LocalStack AWS features
  ◦ Using local S3; soon: AWS<>Snowflake integrations (e.g., Kinesis Firehose, …)
  ◦ LS Cloud Platform: saving/loading of Cloud Pods to manage persistent state
• Exploring interesting challenges related to data testing
  ◦ Test data management; testing data/ETL in CI pipelines; …
• Most of all: we’d love to LEARN about YOUR use cases!
  ◦ Get in touch to participate in the LocalStack Snowflake preview!