Upgrade to Pro — share decks privately, control downloads, hide ads and more …

pd.{read/to}_sql is simple but not fast

Uwe L. Korn
November 13, 2020

pd.{read/to}_sql is simple but not fast

Pandas provides convenience methods to read and write to databases using to_sql and read_sql. They provide great usability and a uniform interface for all databases that support an SQL Alchemy connection. Sadly, the layer of convenience also introduces a performance loss. Luckily, for a lot of databases, a performant access layer is available.

Uwe L. Korn

November 13, 2020

More Decks by Uwe L. Korn

Other Decks in Programming


  1. pd.{read/to}_sql is simple but
    not fast
    Uwe Korn – QuantCo – November 2020

    View Slide

  2. About me
    • Engineering at QuantCo

    • Apache {Arrow, Parquet} PMC

    • Turbodbc Maintainer

    • Other OSS stuff
    [email protected]

    View Slide

  3. Our setting
    • We like tabular data

    • Thus we use pandas

    • We want large amounts of this data in pandas
    • The traditional storage for it is SQL databases

    • How do we get from one to another?

    View Slide

  4. SQL
    • Very very brief intro:

    • „domain-specific language for accessing data held in a relational
    database management system“

    • The one language in data systems that precedes all the Python, R,
    Julia, … we use as our „main“ language, also much wider user

    • SELECT * FROM table

    INSERT INTO table

    View Slide

  5. • Two main arguments:

    • sql: SQL query to be executed or a table name.

    • con: SQLAlchemy connectable, str, or sqlite3 connection

    View Slide

  6. • Two main arguments:

    • name: Name of SQL table.

    • con: SQLAlchemy connectable, str, or sqlite3 connection

    View Slide

  7. • Let’s look at the other nice bits („additional arguments“)

    • if_exists: „What should we do when the target already exists?“

    • fail

    • replace

    • append

    View Slide

  8. • index: „What should we with this one magical column?“ (bool)

    • index_label

    • chunksize: „Write less data at once“

    • dtype: „What should we with this one magical column?“ (bool)

    • method: „Supply some magic insertion hook“ (callable)

    View Slide

  9. SQLAlchemy
    • SQLAlchemy is a Python SQL toolkit and Object Relational Mapper

    • We only use the toolkit part for:

    • Metadata about schema and tables (incl. creation)

    • Engine for connecting to various databases using a uniform

    View Slide

  10. Under the bonnet

    View Slide

  11. View Slide

  12. View Slide

  13. View Slide

  14. View Slide

  15. How does it work (read_sql)?
    • pandas.read_sql [1] calls SQLDatabase.read_query [2]

    • This then does

    • Depending on whether a chunksize was given, this fetches all or
    parts of the result
    [1] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L509-L516
    [2] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1243

    View Slide

  16. How does it work (read_sql)?
    • Passes in the data into the from_records constructor

    • Optionally parses dates and sets an index

    View Slide

  17. How does it work (to_sql)?
    • This is more tricky as we modify the database.

    • to_sql [1] may need to create the target

    • If not existing, it will call CREATE TABLE [2]

    • Afterwards, we INSERT [3] into the (new) table

    • The insertion step is where we convert from DataFrame back into
    records [4]

    [1] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1320
    [2] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1383-L1393
    [3] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1398
    [4] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L734-L747

    View Slide

  18. Why is it slow?
    No benchmarks yet, theory first.

    View Slide

  19. Why is it slow?

    View Slide

  20. Thanks
    Slides will come after PyData Global

    Follow me on Twitter: @xhochy
    How to get fast?

    View Slide

  21. ODBC
    • Open Database Connectivity (ODBC) is a standard API for accessing

    • Most databases provide an ODBC interface, some of them are

    • Two popular Python libraries for that:

    • https://github.com/mkleehammer/pyodbc

    • https://github.com/blue-yonder/turbodbc

    View Slide

  22. ODBC
    Turbodbc has support for Apache Arrow: https://arrow.apache.org/

    View Slide

  23. ODBC
    • With turbodbc + Arrow we get the following performance

    • 3-4x for MS SQL, see https://youtu.be/B-uj8EDcjLY?t=1208

    • 3-4x speedup for Exasol, see https://youtu.be/B-uj8EDcjLY?t=1390

    View Slide

  24. Snowflake
    • Turbodbc is a solution that retrofits performance

    • Snowflake drivers already come with built-in speed

    • Default response is JSON-based, BUT:

    • The database server can answer directly with Arrow

    • Client only needs the Arrow->pandas conversion (lightning fast⚡)

    • Up to 10x faster, see https://www.snowflake.com/blog/fetching-

    View Slide

  25. JDBC
    • Blogged about this at: https://uwekorn.com/2019/11/17/fast-jdbc-

    • Not yet so convenient and read-only

    • First, you need all your Java dependencies incl arrow-jdbc in your

    • Start JVM and load the driver, setup Arrow Java

    View Slide

  26. JDBC
    • Then:

    • Fetch result using the Arrow Java JDBC adapter

    • Use pyarrow.jvm to get a Python reference to the JVM memory

    • Convert to pandas 136x speedup!

    View Slide

  27. Postgres
    Not yet opensourced but this is how it works:

    View Slide

  28. How do we get this
    into pandas.read_sql?

    View Slide

  29. API troubles
    • pandas’ simple API: 

    • turbodbc

    View Slide

  30. API troubles
    • pandas’ simple API: 

    • Snowflake

    View Slide

  31. API troubles
    • pandas’ simple API: 

    • pyarrow.jvm + JDBC

    View Slide

  32. Building a better API
    • We want to use pandas’ simple API but with the nice performance

    • One idea: Dispatching based on the connection class

    • User doesn’t need to learn a new API

    • Performance improvements come via optional packages

    View Slide

  33. Building a better API
    Alternative idea:

    View Slide

  34. Building a better API
    Discussion in https://github.com/pandas-dev/pandas/issues/36893

    View Slide

  35. Thanks
    Follow me on Twitter: @xhochy

    View Slide