
Tooling for updating large tables in Postgres

In this talk André presents tooling that helps you update large tables in Postgres: observability of progress, the ability to pause and resume work, control over update throughput, and atomic operations that lock only the rows you want to update rather than the whole table.

André Freitas

February 13, 2020


Transcript

  1. Use cases
     In this table we sometimes:
     • Add a new field and backfill data
     • Fix data
     That table is used for:
     • Our main product
     • Data Science projects
     • Business Intelligence
     • Running our data pipeline
  2. Table Batch Iterator

     BATCH_SIZE = 500
     PROCESSES = 2

     table_batch_iterator = TableBatchIterator(
         conn=connection,
         table='users',
         batch_size=BATCH_SIZE,
     )

     def _update_last_name_in_batch(boundaries):
         begin_id, end_id = boundaries
         with connection.cursor() as cursor:
             cursor.execute(
                 f"""
                 UPDATE users
                 SET last_name = 'freitas'
                 WHERE id >= {begin_id} AND id <= {end_id}
                 """
             )

     def update_all_users():
         with Pool(processes=PROCESSES) as pool:
             pool.map(_update_last_name_in_batch, table_batch_iterator)

     In [8]: update_all_users()
     Updated batch 25001 - 25500
     Updated batch 25501 - 26000
     Updated batch 26001 - 26500
     Updated batch 26501 - 27000
     Updated batch 27001 - 27500
     Updated batch 27501 - 28000
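The slides use TableBatchIterator without showing its implementation. A minimal sketch, assuming it simply walks the table's integer id range and yields (begin_id, end_id) boundaries: the class name, constructor arguments and the total_batches attribute used on the next slide come from the deck, everything else here is an assumption.

    import math

    class TableBatchIterator:
        """Yields (begin_id, end_id) boundaries covering the table in fixed-size id ranges."""

        def __init__(self, conn, table, batch_size):
            self.conn = conn
            self.table = table
            self.batch_size = batch_size
            with conn.cursor() as cursor:
                # Assumes an integer primary key named id.
                cursor.execute(f"SELECT min(id), max(id) FROM {table}")
                self.min_id, self.max_id = cursor.fetchone()

        @property
        def total_batches(self):
            return math.ceil((self.max_id - self.min_id + 1) / self.batch_size)

        def __iter__(self):
            begin_id = self.min_id
            while begin_id <= self.max_id:
                end_id = min(begin_id + self.batch_size - 1, self.max_id)
                yield begin_id, end_id
                begin_id = end_id + 1

Iterating over id ranges rather than OFFSET/LIMIT keeps each batch a cheap index range scan, and each UPDATE locks only the rows inside its range.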
  3. ETA calculator

     eta_calculator = eta.ProgressCalculator(table_batch_iterator.total_batches)

     def _update_last_name_in_batch(boundaries):
         ...
         eta_calculator.update_progress_and_print(1)
         ...

     def update_all_users():
         eta_calculator.start()
         with Pool(processes=PROCESSES) as pool:
             pool.map(_update_last_name_in_batch, table_batch_iterator)
         eta_calculator.done()
         eta_calculator.print_total_time()

     In [2]: update_all_users()
     Progress: 50.88% | Done: 29 | ETA: 0h 0m 4s
     Progress: 70.18% | Done: 40 | ETA: 0h 0m 3s
     Progress: 100.00% | Done: 57 | ETA: 0h 0m 0s
     Total time: 0h 0m 9s
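eta.ProgressCalculator is part of the talk's internal tooling and its code is not shown. The sketch below is an assumed equivalent: it extrapolates the ETA from the average time per completed batch and keeps the counter in a multiprocessing.Value so forked Pool workers update a shared count. Only the method names come from the slides; the internals are guesses.

    import time
    from multiprocessing import Value

    class ProgressCalculator:
        """Tracks completed batches and prints progress plus an extrapolated ETA."""

        def __init__(self, total_batches):
            self.total_batches = total_batches
            # Shared counter: with the fork start method, Pool workers inherit
            # and update the same underlying value.
            self.completed = Value('i', 0)

        def start(self):
            self.started_at = time.monotonic()

        def update_progress_and_print(self, batches_done):
            with self.completed.get_lock():
                self.completed.value += batches_done
                completed = self.completed.value
            elapsed = time.monotonic() - self.started_at
            remaining = (elapsed / completed) * (self.total_batches - completed)
            percent = 100 * completed / self.total_batches
            print(f"Progress: {percent:.2f}% | Done: {completed} | ETA: {self._format(remaining)}")

        def done(self):
            self.finished_at = time.monotonic()

        def print_total_time(self):
            print(f"Total time: {self._format(self.finished_at - self.started_at)}")

        @staticmethod
        def _format(seconds):
            hours, rest = divmod(int(seconds), 3600)
            minutes, secs = divmod(rest, 60)
            return f"{hours}h {minutes}m {secs}s"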
  4. Batch Update SQL Builder

     UPDATE users
     SET last_name = (
         CASE
             WHEN id = 3 THEN 'Afonso'
             WHEN id = 1 THEN 'Carla'
             WHEN id = 2 THEN 'John'
             ELSE last_name
         END
     )
     WHERE id >= 1 AND id <= 500

     Builder for generating complex update statements on multiple rows and multiple fields.
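The builder itself is not shown on the slides. Below is a minimal sketch that renders statements of the shape above and supports several fields per row; the class and method names (update, build_sql) are taken from the next slide, while the implementation and the naive quoting are assumptions — a real builder should escape values or compose the statement with parameters.

    class BatchUpdateBuilder:
        """Collects per-row field updates and renders one UPDATE ... CASE WHEN statement."""

        def __init__(self, table, begin_id, end_id):
            self.table = table
            self.begin_id = begin_id
            self.end_id = end_id
            self.updates = {}  # field name -> {row id: new value}

        def update(self, row_id, field, value):
            self.updates.setdefault(field, {})[row_id] = value

        def build_sql(self):
            if not self.updates:
                return None
            assignments = []
            for field, values in self.updates.items():
                whens = " ".join(
                    f"WHEN id = {row_id} THEN '{value}'" for row_id, value in values.items()
                )
                assignments.append(f"{field} = (CASE {whens} ELSE {field} END)")
            return (
                f"UPDATE {self.table} SET {', '.join(assignments)} "
                f"WHERE id >= {self.begin_id} AND id <= {self.end_id}"
            )

A single UPDATE per batch means one round trip and one transaction for hundreds of rows, instead of one statement per row.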
  5. Batch Update SQL Builder

     def _build_batch_update_sql(begin_id, end_id):
         batch_update_builder = BatchUpdateBuilder('beyond_users', begin_id, end_id)
         users = (
             User.objects
             .select_for_update()
             .filter(id__gte=begin_id, id__lte=end_id)
         )
         for user in users:
             if user.first_name:
                 batch_update_builder.update(user.id, 'last_name', user.first_name[::-1])
         return batch_update_builder.build_sql()

     @transaction.atomic()
     def _update_last_name_in_batch(boundaries):
         begin_id, end_id = boundaries
         update_sql = _build_batch_update_sql(begin_id, end_id)
         if update_sql:
             cursor = connection.cursor()
             cursor.execute(update_sql)
             cursor.close()
         _sleep_throttle()
         eta_calculator.update_progress_and_print(1)
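_sleep_throttle is referenced on the slide but never defined; presumably it just pauses between batches to cap write throughput. A trivial sketch, with an illustrative, made-up delay:

    import time

    # Assumed throttle between batches; the actual delay is not shown on the slides.
    THROTTLE_SECONDS = 0.1

    def _sleep_throttle():
        time.sleep(THROTTLE_SECONDS)

Pausing between batch transactions leaves headroom for the application's own writes to the same table and helps keep replication lag in check.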
  6. Partitioned Persistent Queue

     A persistent queue on disk that assigns a partition to each process. It serializes Python objects and recovers queue processing if a process dies. Built on top of persist-queue.

     [Diagram: jobs are enqueued into partitions; worker processes consume the partitions and apply the DB updates.]
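persist-queue is the open-source library the tooling builds on. A short usage sketch of its file-backed queue, which pickles items to disk so pending work survives a crash or restart; the queue path and the item are illustrative:

    import persistqueue

    # Items are serialized to disk under the given path, so enqueued work
    # survives process restarts and a crashed worker can resume where it left off.
    queue = persistqueue.Queue('queues/demo-partition-0')

    queue.put((25001, 25500))    # enqueue a batch boundary
    boundaries = queue.get()     # blocks until an item is available
    # ... process the batch ...
    queue.task_done()            # mark the item as fully processed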
  7. Partitioned Persistent Queue

     queues = PartitionedPersistentQueue('demo-queue', partitions_number=PROCESSES)

     def enqueue_jobs():
         queues.clean()
         queues.create()
         for boundaries in table_batch_iterator:
             queues.cycle_queue.put(boundaries)

     def _process_partition(partition):
         queue = queues.process_queue
         while not queue.empty():
             boundaries = queue.get()
             _update_last_name_in_batch(boundaries)
             queue.task_done()

     def process_jobs():
         ...
         connections.close_all()
         with Pool(processes=PROCESSES) as pool:
             pool.map(_process_partition, range(PROCESSES))
         ...
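PartitionedPersistentQueue is internal tooling, so the sketch below is a guess at its shape: one persist-queue directory per partition, with cycle_queue rotating round-robin for producers and an explicit partition lookup for consumers. The slide's process_queue presumably resolves the current worker's partition automatically; that wiring is omitted here.

    import itertools
    import os
    import shutil
    import persistqueue

    class PartitionedPersistentQueue:
        """One on-disk queue per partition; producers cycle over partitions,
        each worker process consumes a single partition."""

        def __init__(self, name, partitions_number):
            self.name = name
            self.partitions_number = partitions_number
            self.queues = []

        def clean(self):
            # Drop any leftover on-disk state from a previous run.
            shutil.rmtree(self.name, ignore_errors=True)

        def create(self):
            self.queues = [
                persistqueue.Queue(os.path.join(self.name, f"partition-{i}"))
                for i in range(self.partitions_number)
            ]
            self._cycle = itertools.cycle(self.queues)

        @property
        def cycle_queue(self):
            # Round-robin: each access returns the next partition's queue.
            return next(self._cycle)

        def partition_queue(self, partition):
            return self.queues[partition]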
  8. Conclusions
     • We used to copy-paste a lot of code to process backfills
     • Investing in tooling was important to reduce boilerplate code
     • Visibility is important, especially when processing takes months
     • Being able to pause backfills allows database maintenance