Party Like it's ANSI 1999: Using SQL from Python Applications

Slide 1

Slide 1 text

Party Like it’s ANSI 1999 1

Slide 2

Slide 2 text

I used to work at an Investment Bank 2

Slide 3

Slide 3 text

...definitely not this! 3

Slide 4

Slide 4 text

...more like this 4

Slide 5

Slide 5 text

...specifically 5

Slide 6

Slide 6 text

Hi I’m Huss Co-founder & CTO @ 9fin Investment Banking Technology ● Equities Trading ● Bespoke Trader Desk tools ● Risk and Control systems Aeronautical Engineering ● Lots of aeroplane stuff 9fin.com @9finHQ 6 @HUSSTECH

Slide 7

Slide 7 text

American National Standards Institute We’re at SQL: 2016… but also, I like puns This better be fun…. 7 Party Like it’s ANSI 1999

Slide 8

Slide 8 text

Let’s talk about Databases ● Relational DBMS specifically ● A long history since the early 1970s 8 Berkley Ingress Sybase SQL Server - fork - PostgreSQL MS SQL Server MySQL mSQL 1970s 1980s 1990s PEP 248: DB 1.0 API (1996)

Slide 9

Slide 9 text

Software Archeology: BigCo EE ● Silo groups with big databases ● Technology functions replicated for each system ● System of record and final destination for all data ○ So was treated very carefully ● Specific SQL developers and DBAs (Database Administrators aka wizards) 9 Sales Finance Inventory

Slide 10

Slide 10 text

Software Archeology: The “fat client” ● Direct connection from UI to Database ● Tightly coupled ○ Roles/Permissions were on the Database ○ Database down = application down ● Not just a pattern in the enterprise ● Web Servers were also multi-tasking ○ Hosting, executing app code, database calls 10 Application

Slide 11

Slide 11 text

Software Archeology: Three Tier ● Intermediate layer between UI/Client and Datastore ○ Server, Cache, Proxy, another client ○ Typically endpoints on enterprise message bus or direct HTTP ○ Decoupled the data from the application layer ● Evolved further today with microservices and mesh like layers ● Decoupling helped an explosion in dynamic languages with rich tooling for web development 11 Application Server

Slide 12

Slide 12 text

The Problem with State ● Application code has been living in source control for decades now ○ Code is stateless ○ Data only passes through it ○ Arbitrary progression backwards and forwards through time ○ Amazing tooling and support ● Source control of a database still unsolved problem ○ Database holds code as well as data ○ Data is resident in the database ○ Points in time are linked to a database state ○ Some good solutions which are source control for the structure of the database exist ● So it’s best to assume nothing of the database and treat it as a dumb container? 12 Application nope! Source Control Source Control

Slide 13

Slide 13 text

SQL and Apps Today: Object Relational Mapping ● Model your entities using an ORM library for your application ● Takes care of data retrieval and persistence ● Great for bootstrapping ○ Focus on app code not storage ○ Strong control of your business objects ● Rich libraries in many languages 13 Application runtime Models ORM Backend DB Driver

Slide 14

Slide 14 text

SQL and Apps Today: We can do more! ● It is a high performance machine...why not use it? ○ Make use of advanced data structures ■ JSON, XML, Hstore, IP, Geo… ○ Do your computation where the data is ■ Aggregates, pivots, windowing ■ Save bandwidth ○ Enforce business rules ■ Guarantee the integrity of your data ■ Asynchronously act on events ● How can we use all of this in a simple and easy way from our Python code? ● How do we solve the source control problem? 14

Slide 15

Slide 15 text

SQL and Apps Today: JSON Example 15 { "forms": [ { "url": "...api/v2/pokemon-form/25/", "name": "pikachu" } ], "abilities": [ { "slot": 3, "is_hidden": true, "ability": { "url": ".../api/v2/ability/31/", "name": "lightning-rod" } }, ... SELECT jsonb_array_elements( data->'abilities' ) ->'ability' ->>'name' FROM pokemon ability | --------------| lightning-rod | static |

Slide 16

Slide 16 text

SQLpy ● Write SQL *next* to your application code ● Prepares a ready to execute functools.partial object for you ● No query strings dotted all over source files ● A light wrapper around your DB2.0 API ● Genesis from YeSQL 16 In source control! queries.sql Application runtime Whatever you want DB Driver Source Control

Slide 17

Slide 17 text

That awkward moment when you need SQL 17 # set up 2 query objects one for subquery one for window function subquery = db.session.query(func.DATE(Payment.payment_date).label('payme nt_date'), func.sum(Payment.amount).label('amount') ).group_by(func.DATE(Payment.payment_date))\ .order_by(func.DATE(Payment.payment_date))\ .subquery('s') window = db.session.query(subquery.c.payment_date, subquery.c.amount, func.sum(subquery.c.amount).over( order_by=subquery.c.payment_date) ).select_entity_from(subquery).order_by(subquery.c.payment_date) ● A nested data question ● 10+ (readable) lines ● This is quite nice syntax actually ● But the semantics are halfway between Python object orientation and SQL declarative style How to do the same in SQL?

Slide 18

Slide 18 text

That awkward moment when you need SQL 18 How to do the same in SQL? SELECT payment_date, amount, sum(amount) OVER (ORDER BY payment_date) FROM ( SELECT CAST(payment_date AS DATE) AS payment_date, SUM(amount) AS amount FROM payment GROUP BY CAST(payment_date AS DATE)) p ORDER BY payment_date; from itertools import accumulate # List of dates and amount pairs payments = [(payment_date, amount)] for (pay_dt, amt), cum_pay in zip(payments, accumulate(payments)): print(f'{pay_dt} {amt} {cum_pay}') Easier in Python!

Slide 19

Slide 19 text

SQLpy + Models ● Still like linking behaviours and actions via objects? ● Use the DAO pattern - define the models how you want ○ Add extra methods to serialize or validate eg. Marshmallow 19 class Hello: def __init__(self, queryfunc, cur, args): self._query = queryfunc self._result = self.result(cur, args) @property def result(self): return self._result @result.setter def result(self, cur, args): self._result = self._query(cur, args) def refresh(self, cur, args): return self._query(cur, args) def jsonout(self): return json.dumps(self.result)

Slide 20

Slide 20 text

SQLpy + Composition ● Constructing SQL queries is legitimately dangerous! ● How to do this in SQLpy, but not destroy the valid SQL syntax in queries file? ● Answer: BUILT SQL ○ Clauses added based on parameters passed to the wrapping function ○ Transparent and lightweight ● Query still passes through the DB2 .0 API library sanitiser for safe execution 20

Slide 21

Slide 21 text

SQLpy + Composition 21 -- name: built_sql_query$ SELECT * FROM messages WHERE 1=1 AND id = %(id_low)s OR id = %(id_high)s AND message = %(msg)s; sql = sqlpy.Queries('queries.sql') .... args = {'id_low': 1} results = sql.BUILT_SQL_QUERY(cur, args) SELECT * FROM messages WHERE 1=1 AND id = 1; id message next_message 1 hello there 2 SQLpy NULL 3 databases rule! 4 hello friend

Slide 22

Slide 22 text

Beyond web apps: Python taking over! ● Multitude of tools have made Python the interface language of choice ● But the glue has been clunky bash scripts, python scripts, CSV dumps and loads ● Why not use use Python all the way through and really integrate your systems and pipelines with one universal language? 22 Database Object Store --> Log Streaming --> Distributed Computation

Slide 23

Slide 23 text

Data Science ● Moving data from A to B is a very frequent task ○ Data cleaning and transformation takes up a much higher portion of our time than we’d like :( ○ Any efficiency gain on the ingestion and processing pipeline is a win ● ORMs sometimes do not fit very well here as there are no “entities” to model ○ We’re experimenting: cutting, pivoting, adding and dropping attributes of interest on the fly ● Use SQLpy to have your data modelling language next to your computation language 23

Slide 24

Slide 24 text

Data Science - SQL is coming for you ● Next generation datastores are cloud native, distributed and big in scale ● They’re not traditionally relational ○ But many are converging on SQL as their data modelling language ○ Not full SQL standard, but most common operations present ○ No traditional relational Database ORM support ● All provide Python libraries that expose their query language interfaces ○ kSQL (kafka) ○ sparkSQL (Apache Spark) ○ BigQuery (Google Cloud) ○ Athena (AWS S3) ● Let’s modify SQLpy to work with these too 24

Slide 25

Slide 25 text

Google BigQuery API with SQL ● Since SQLpy has no opinion on the backend datastore interface mechanism, it just expects a DB 2.0 API compliant cursor object ○ Make an object defining execute, fetchone, fetchall etc… methods 25 https://github.com/HUSSTECH/talks-code/tree/master/party-like-its-ansi-1999

Slide 26

Slide 26 text

sqlpy.readthedocs.io ● github.com/9fin/sqlpy (MIT) ● Made for PostgreSQL first ○ Other DBs working in theory, but not tested! ● Still

Slide 27

Slide 27 text

Thank You! References and Credits [slide#] [3] image: By Own Oil Industry News - Own Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=8266714 [4] image: CC0 license [5] image: An unnamed colleague [8] Database timeline based on: https://hpi.de/naumann/projects/rdbms-genealogy.html [15] PokeAPI for json data: https://pokeapi.co/docsv2/ [24] ksql: https://www.confluent.io/product/ksql/ [24] SparkSQL: https://spark.apache.org/sql/ [24] BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators [24] Athena: https://aws.amazon.com/athena/ [25] Google BigQuery sample applciation: https://cloud.google.com/bigquery/create-simple-app-api 27