
Tooling for updating large tables in Postgres

In this talk André presents tooling that helps you update large tables in Postgres: observability of progress, the ability to pause and resume work, control over update throughput, and atomic operations that lock only the rows you want to update rather than the whole table.

André Freitas

February 13, 2020


Transcript

  1. Use cases
     In this table we sometimes:
     • Add a new field and backfill data
     • Fix data
     That table is used for:
     • Our main product
     • Data Science projects
     • Business Intelligence
     • Running our data pipeline
  2. Table Batch Iterator

     BATCH_SIZE = 500
     PROCESSES = 2

     table_batch_iterator = TableBatchIterator(
         conn=connection,
         table='users',
         batch_size=BATCH_SIZE,
     )

     def _update_last_name_in_batch(boundaries):
         begin_id, end_id = boundaries
         with connection.cursor() as cursor:
             cursor.execute(
                 f"""
                 UPDATE users
                 SET last_name = 'freitas'
                 WHERE id >= {begin_id} AND id <= {end_id}
                 """
             )

     def update_all_users():
         with Pool(processes=PROCESSES) as pool:
             pool.map(_update_last_name_in_batch, table_batch_iterator)

     In [8]: update_all_users()
     Updated batch 25001 - 25500
     Updated batch 25501 - 26000
     Updated batch 26001 - 26500
     Updated batch 26501 - 27000
     Updated batch 27001 - 27500
     Updated batch 27501 - 28000
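The slides use TableBatchIterator without showing its implementation. A minimal sketch, assuming it simply walks the table's integer id range and yields (begin_id, end_id) boundaries: the class name, constructor arguments and the total_batches attribute used on the next slide come from the deck, everything else here is an assumption.

    import math

    class TableBatchIterator:
        """Yields (begin_id, end_id) boundaries covering the table in fixed-size id ranges."""

        def __init__(self, conn, table, batch_size):
            self.conn = conn
            self.table = table
            self.batch_size = batch_size
            with conn.cursor() as cursor:
                # Assumes an integer primary key named id.
                cursor.execute(f"SELECT min(id), max(id) FROM {table}")
                self.min_id, self.max_id = cursor.fetchone()

        @property
        def total_batches(self):
            return math.ceil((self.max_id - self.min_id + 1) / self.batch_size)

        def __iter__(self):
            begin_id = self.min_id
            while begin_id <= self.max_id:
                end_id = min(begin_id + self.batch_size - 1, self.max_id)
                yield begin_id, end_id
                begin_id = end_id + 1

Iterating over id ranges rather than OFFSET/LIMIT keeps each batch a cheap index range scan, and each UPDATE locks only the rows inside its range.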
  3. ETA calculator

     eta_calculator = eta.ProgressCalculator(table_batch_iterator.total_batches)

     def _update_last_name_in_batch(boundaries):
         ...
         eta_calculator.update_progress_and_print(1)
         ...

     def update_all_users():
         eta_calculator.start()
         with Pool(processes=PROCESSES) as pool:
             pool.map(_update_last_name_in_batch, table_batch_iterator)
         eta_calculator.done()
         eta_calculator.print_total_time()

     In [2]: update_all_users()
     Progress: 50.88% | Done: 29 | ETA: 0h 0m 4s
     Progress: 70.18% | Done: 40 | ETA: 0h 0m 3s
     Progress: 100.00% | Done: 57 | ETA: 0h 0m 0s
     Total time: 0h 0m 9s
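eta.ProgressCalculator is part of the talk's internal tooling and its code is not shown. The sketch below is an assumed equivalent: it extrapolates the ETA from the average time per completed batch and keeps the counter in a multiprocessing.Value so forked Pool workers update a shared count. Only the method names come from the slides; the internals are guesses.

    import time
    from multiprocessing import Value

    class ProgressCalculator:
        """Tracks completed batches and prints progress plus an extrapolated ETA."""

        def __init__(self, total_batches):
            self.total_batches = total_batches
            # Shared counter: with the fork start method, Pool workers inherit
            # and update the same underlying value.
            self.completed = Value('i', 0)

        def start(self):
            self.started_at = time.monotonic()

        def update_progress_and_print(self, batches_done):
            with self.completed.get_lock():
                self.completed.value += batches_done
                completed = self.completed.value
            elapsed = time.monotonic() - self.started_at
            remaining = (elapsed / completed) * (self.total_batches - completed)
            percent = 100 * completed / self.total_batches
            print(f"Progress: {percent:.2f}% | Done: {completed} | ETA: {self._format(remaining)}")

        def done(self):
            self.finished_at = time.monotonic()

        def print_total_time(self):
            print(f"Total time: {self._format(self.finished_at - self.started_at)}")

        @staticmethod
        def _format(seconds):
            hours, rest = divmod(int(seconds), 3600)
            minutes, secs = divmod(rest, 60)
            return f"{hours}h {minutes}m {secs}s"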
  4. Batch Update SQL Builder

     UPDATE users
     SET last_name = (
         CASE
             WHEN id = 3 THEN 'Afonso'
             WHEN id = 1 THEN 'Carla'
             WHEN id = 2 THEN 'John'
             ELSE last_name
         END
     )
     WHERE id >= 1 AND id <= 500

     Builder for generating complex update statements on multiple rows and multiple fields.
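The builder itself is not shown on the slides. Below is a minimal sketch that renders statements of the shape above and supports several fields per row; the class and method names (update, build_sql) are taken from the next slide, while the implementation and the naive quoting are assumptions — a real builder should escape values or compose the statement with parameters.

    class BatchUpdateBuilder:
        """Collects per-row field updates and renders one UPDATE ... CASE WHEN statement."""

        def __init__(self, table, begin_id, end_id):
            self.table = table
            self.begin_id = begin_id
            self.end_id = end_id
            self.updates = {}  # field name -> {row id: new value}

        def update(self, row_id, field, value):
            self.updates.setdefault(field, {})[row_id] = value

        def build_sql(self):
            if not self.updates:
                return None
            assignments = []
            for field, values in self.updates.items():
                whens = " ".join(
                    f"WHEN id = {row_id} THEN '{value}'" for row_id, value in values.items()
                )
                assignments.append(f"{field} = (CASE {whens} ELSE {field} END)")
            return (
                f"UPDATE {self.table} SET {', '.join(assignments)} "
                f"WHERE id >= {self.begin_id} AND id <= {self.end_id}"
            )

A single UPDATE per batch means one round trip and one transaction for hundreds of rows, instead of one statement per row.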
  5. Batch Update SQL Builder

     def _build_batch_update_sql(begin_id, end_id):
         batch_update_builder = BatchUpdateBuilder('beyond_users', begin_id, end_id)
         users = (
             User.objects
             .select_for_update()
             .filter(id__gte=begin_id, id__lte=end_id)
         )
         for user in users:
             if user.first_name:
                 batch_update_builder.update(user.id, 'last_name', user.first_name[::-1])
         return batch_update_builder.build_sql()

     @transaction.atomic()
     def _update_last_name_in_batch(boundaries):
         begin_id, end_id = boundaries
         update_sql = _build_batch_update_sql(begin_id, end_id)
         if update_sql:
             cursor = connection.cursor()
             cursor.execute(update_sql)
             cursor.close()
         _sleep_throttle()
         eta_calculator.update_progress_and_print(1)
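_sleep_throttle is referenced on the slide but never defined; presumably it just pauses between batches to cap write throughput. A trivial sketch, with an illustrative, made-up delay:

    import time

    # Assumed throttle between batches; the actual delay is not shown on the slides.
    THROTTLE_SECONDS = 0.1

    def _sleep_throttle():
        time.sleep(THROTTLE_SECONDS)

Pausing between batch transactions leaves headroom for the application's own writes to the same table and helps keep replication lag in check.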
  6. Partitioned Persistent Queue

     A persistent queue on disk that assigns a partition to each process. It serializes Python objects and recovers queue processing if a process dies. Built on top of persist-queue.

     [Diagram: jobs are enqueued into partitions; worker processes consume the partitions and apply the DB updates.]
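persist-queue is the open-source library the tooling builds on. A short usage sketch of its file-backed queue, which pickles items to disk so pending work survives a crash or restart; the queue path and the item are illustrative:

    import persistqueue

    # Items are serialized to disk under the given path, so enqueued work
    # survives process restarts and a crashed worker can resume where it left off.
    queue = persistqueue.Queue('queues/demo-partition-0')

    queue.put((25001, 25500))    # enqueue a batch boundary
    boundaries = queue.get()     # blocks until an item is available
    # ... process the batch ...
    queue.task_done()            # mark the item as fully processed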
  7. Partitioned Persistent Queue

     queues = PartitionedPersistentQueue('demo-queue', partitions_number=PROCESSES)

     def enqueue_jobs():
         queues.clean()
         queues.create()
         for boundaries in table_batch_iterator:
             queues.cycle_queue.put(boundaries)

     def _process_partition(partition):
         queue = queues.process_queue
         while not queue.empty():
             boundaries = queue.get()
             _update_last_name_in_batch(boundaries)
             queue.task_done()

     def process_jobs():
         ...
         connections.close_all()
         with Pool(processes=PROCESSES) as pool:
             pool.map(_process_partition, range(PROCESSES))
         ...
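PartitionedPersistentQueue is internal tooling, so the sketch below is a guess at its shape: one persist-queue directory per partition, with cycle_queue rotating round-robin for producers and an explicit partition lookup for consumers. The slide's process_queue presumably resolves the current worker's partition automatically; that wiring is omitted here.

    import itertools
    import os
    import shutil
    import persistqueue

    class PartitionedPersistentQueue:
        """One on-disk queue per partition; producers cycle over partitions,
        each worker process consumes a single partition."""

        def __init__(self, name, partitions_number):
            self.name = name
            self.partitions_number = partitions_number
            self.queues = []

        def clean(self):
            # Drop any leftover on-disk state from a previous run.
            shutil.rmtree(self.name, ignore_errors=True)

        def create(self):
            self.queues = [
                persistqueue.Queue(os.path.join(self.name, f"partition-{i}"))
                for i in range(self.partitions_number)
            ]
            self._cycle = itertools.cycle(self.queues)

        @property
        def cycle_queue(self):
            # Round-robin: each access returns the next partition's queue.
            return next(self._cycle)

        def partition_queue(self, partition):
            return self.queues[partition]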
  8. Conclusions
     • We used to copy-paste a lot of code to process backfills
     • Investing in tooling was important to reduce boilerplate code
     • Visibility is important, especially when processing takes months
     • Being able to pause backfills allows database maintenance