Upgrade to Pro — share decks privately, control downloads, hide ads and more …

pd.{read/to}_sql is simple but not fast

Uwe L. Korn
November 13, 2020

pd.{read/to}_sql is simple but not fast

Pandas provides convenience methods to read and write to databases using to_sql and read_sql. They provide great usability and a uniform interface for all databases that support an SQL Alchemy connection. Sadly, the layer of convenience also introduces a performance loss. Luckily, for a lot of databases, a performant access layer is available.

Uwe L. Korn

November 13, 2020

More Decks by Uwe L. Korn

Other Decks in Programming


  1. About me • Engineering at QuantCo • Apache {Arrow, Parquet}

    PMC • Turbodbc Maintainer • Other OSS stuff @xhochy @xhochy [email protected] https://uwekorn.com
  2. Our setting • We like tabular data • Thus we

    use pandas • We want large amounts of this data in pandas • The traditional storage for it is SQL databases • How do we get from one to another?
  3. SQL • Very very brief intro: • „domain-specific language for

    accessing data held in a relational database management system“ • The one language in data systems that precedes all the Python, R, Julia, … we use as our „main“ language, also much wider user base • SELECT * FROM table
  4. • Two main arguments: • sql: SQL query to be

    executed or a table name. • con: SQLAlchemy connectable, str, or sqlite3 connection
  5. • Two main arguments: • name: Name of SQL table.

    • con: SQLAlchemy connectable, str, or sqlite3 connection
  6. • Let’s look at the other nice bits („additional arguments“)

    • if_exists: „What should we do when the target already exists?“ • fail • replace • append
  7. • index: „What should we with this one magical column?“

    (bool) • index_label • chunksize: „Write less data at once“ • dtype: „What should we with this one magical column?“ (bool) • method: „Supply some magic insertion hook“ (callable)
  8. SQLAlchemy • SQLAlchemy is a Python SQL toolkit and Object

    Relational Mapper (ORM) • We only use the toolkit part for: • Metadata about schema and tables (incl. creation) • Engine for connecting to various databases using a uniform interface
  9. How does it work (read_sql)? • pandas.read_sql [1] calls SQLDatabase.read_query

    [2] • This then does
 • Depending on whether a chunksize was given, this fetches all or parts of the result [1] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L509-L516 [2] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1243
  10. How does it work (read_sql)? • Passes in the data

    into the from_records constructor • Optionally parses dates and sets an index
  11. How does it work (to_sql)? • This is more tricky

    as we modify the database. • to_sql [1] may need to create the target • If not existing, it will call CREATE TABLE [2] • Afterwards, we INSERT [3] into the (new) table • The insertion step is where we convert from DataFrame back into records [4]
 [1] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1320 [2] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1383-L1393 [3] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L1398 [4] https://github.com/pandas-dev/pandas/blob/d9fff2792bf16178d4e450fe7384244e50635733/pandas/io/sql.py#L734-L747
  12. ODBC • Open Database Connectivity (ODBC) is a standard API

    for accessing databases • Most databases provide an ODBC interface, some of them are efficient • Two popular Python libraries for that: • https://github.com/mkleehammer/pyodbc • https://github.com/blue-yonder/turbodbc
  13. ODBC • With turbodbc + Arrow we get the following

    performance improvements: • 3-4x for MS SQL, see https://youtu.be/B-uj8EDcjLY?t=1208 • 3-4x speedup for Exasol, see https://youtu.be/B-uj8EDcjLY?t=1390
  14. Snowflake • Turbodbc is a solution that retrofits performance •

    Snowflake drivers already come with built-in speed • Default response is JSON-based, BUT: • The database server can answer directly with Arrow • Client only needs the Arrow->pandas conversion (lightning fast⚡) • Up to 10x faster, see https://www.snowflake.com/blog/fetching- query-results-from-snowflake-just-got-a-lot-faster-with-apache- arrow/
  15. JDBC • Blogged about this at: https://uwekorn.com/2019/11/17/fast-jdbc- access-in-python-using-pyarrow-jvm.html • Not

    yet so convenient and read-only • First, you need all your Java dependencies incl arrow-jdbc in your classpath • Start JVM and load the driver, setup Arrow Java
  16. JDBC • Then: • Fetch result using the Arrow Java

    JDBC adapter • Use pyarrow.jvm to get a Python reference to the JVM memory • Convert to pandas 136x speedup!
  17. Building a better API • We want to use pandas’

    simple API but with the nice performance benefits • One idea: Dispatching based on the connection class
 • User doesn’t need to learn a new API • Performance improvements come via optional packages