Slide 1

Tooling for updating large tables in Postgres
André Freitas - [email protected]
13 Feb 2020

Slide 2

What do we do?

Slide 3

(image-only slide)

Slide 4

22TB in a single table

Slide 5

Use cases

In this table we sometimes:
● Add a new field and backfill data
● Fix data

That table is used for:
● Our main product
● Data Science projects
● Business Intelligence
● Running our data pipeline

Slide 6

Problems and solutions (tools)

Slide 7

1. We cannot update the whole table at once because of locks

Slide 8

Table Batch Iterator

from multiprocessing import Pool

BATCH_SIZE = 500
PROCESSES = 2

table_batch_iterator = TableBatchIterator(
    conn=connection,
    table='users',
    batch_size=BATCH_SIZE,
)

def _update_last_name_in_batch(boundaries):
    begin_id, end_id = boundaries
    with connection.cursor() as cursor:
        cursor.execute(
            f"""
            UPDATE users
            SET last_name = 'freitas'
            WHERE id >= {begin_id} AND id <= {end_id}
            """
        )

def update_all_users():
    with Pool(processes=PROCESSES) as pool:
        pool.map(_update_last_name_in_batch, table_batch_iterator)

In [8]: update_all_users()
Updated batch 25001 - 25500
Updated batch 25501 - 26000
Updated batch 26001 - 26500
Updated batch 26501 - 27000
Updated batch 27001 - 27500
Updated batch 27501 - 28000
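The slides use TableBatchIterator without showing it. Below is a minimal sketch of what such an iterator could look like, assuming it scans the table's id range and yields inclusive (begin_id, end_id) tuples; only the class name, constructor arguments, and the total_batches attribute used later come from the slides, the body is an assumption:

import math

# Hypothetical sketch - not the speaker's implementation.
class TableBatchIterator:
    def __init__(self, conn, table, batch_size):
        self.conn = conn
        self.table = table
        self.batch_size = batch_size
        with conn.cursor() as cursor:
            cursor.execute(f'SELECT min(id), max(id) FROM {table}')
            self.min_id, self.max_id = cursor.fetchone()

    @property
    def total_batches(self):
        return math.ceil((self.max_id - self.min_id + 1) / self.batch_size)

    def __iter__(self):
        # Yield inclusive id boundaries: (1, 500), (501, 1000), ...
        begin_id = self.min_id
        while begin_id <= self.max_id:
            end_id = min(begin_id + self.batch_size - 1, self.max_id)
            yield (begin_id, end_id)
            begin_id = end_id + 1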

Slide 9

2. We need visibility on progress

Slide 10

ETA calculator

eta_calculator = eta.ProgressCalculator(table_batch_iterator.total_batches)

def _update_last_name_in_batch(boundaries):
    ...
    eta_calculator.update_progress_and_print(1)
    ...

def update_all_users():
    eta_calculator.start()
    with Pool(processes=PROCESSES) as pool:
        pool.map(_update_last_name_in_batch, table_batch_iterator)
    eta_calculator.done()
    eta_calculator.print_total_time()

In [2]: update_all_users()
Progress: 50.88% | Done: 29 | ETA: 0h 0m 4s
Progress: 70.18% | Done: 40 | ETA: 0h 0m 3s
Progress: 100.00% | Done: 57 | ETA: 0h 0m 0s
Total time: 0h 0m 9s
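eta.ProgressCalculator appears only through its interface. A minimal sketch matching the calls and output format above, assuming a shared counter so forked pool workers can report progress; the internals of the eta module are not from the talk:

import time
from multiprocessing import Value

def _format_time(seconds):
    seconds = int(seconds)
    return f'{seconds // 3600}h {seconds % 3600 // 60}m {seconds % 60}s'

# Hypothetical sketch - not the speaker's implementation.
class ProgressCalculator:
    def __init__(self, total_batches):
        self.total = total_batches
        self.done_count = Value('i', 0)  # shared with forked pool workers
        self.started_at = None
        self.total_time = None

    def start(self):
        self.started_at = time.monotonic()

    def update_progress_and_print(self, n):
        with self.done_count.get_lock():
            self.done_count.value += n
            done = self.done_count.value
        elapsed = time.monotonic() - self.started_at
        # Naive linear extrapolation from the batches finished so far.
        remaining = elapsed / done * (self.total - done)
        print(f'Progress: {done / self.total:.2%} | Done: {done} | '
              f'ETA: {_format_time(remaining)}')

    def done(self):
        self.total_time = time.monotonic() - self.started_at

    def print_total_time(self):
        print(f'Total time: {_format_time(self.total_time)}')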

Slide 11

3. Updating row by row takes a lot of time

Slide 12

Batch Update SQL Builder

Builder for generating complex update statements on multiple rows and multiple fields.

UPDATE users
SET last_name = (
    CASE
        WHEN id = 3 THEN 'Afonso'
        WHEN id = 1 THEN 'Carla'
        WHEN id = 2 THEN 'John'
        ELSE last_name
    END
)
WHERE id >= 1 AND id <= 500
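The builder class itself isn't on the slides. Here is a minimal sketch that produces the CASE-based statement above and matches the interface used on the next slide; real code would also need to quote and escape values properly:

from collections import defaultdict

# Hypothetical sketch - not the speaker's implementation.
class BatchUpdateBuilder:
    def __init__(self, table, begin_id, end_id):
        self.table = table
        self.begin_id = begin_id
        self.end_id = end_id
        self.updates = defaultdict(dict)  # field name -> {row id -> new value}

    def update(self, row_id, field, value):
        self.updates[field][row_id] = value

    def build_sql(self):
        if not self.updates:
            return None  # nothing to do in this batch
        assignments = []
        for field, row_values in self.updates.items():
            # One CASE expression per field; rows without a new value keep the old one.
            whens = ' '.join(
                f"WHEN id = {row_id} THEN '{value}'"  # NOTE: real code must escape values
                for row_id, value in row_values.items()
            )
            assignments.append(f"{field} = (CASE {whens} ELSE {field} END)")
        return (
            f"UPDATE {self.table} SET {', '.join(assignments)} "
            f"WHERE id >= {self.begin_id} AND id <= {self.end_id}"
        )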

Slide 13

Batch Update SQL Builder

def _build_batch_update_sql(begin_id, end_id):
    batch_update_builder = BatchUpdateBuilder('beyond_users', begin_id, end_id)
    users = (
        User.objects
        .select_for_update()
        .filter(id__gte=begin_id, id__lte=end_id)
    )
    for user in users:
        if user.first_name:
            batch_update_builder.update(user.id, 'last_name', user.first_name[::-1])
    return batch_update_builder.build_sql()

@transaction.atomic()
def _update_last_name_in_batch(boundaries):
    begin_id, end_id = boundaries
    update_sql = _build_batch_update_sql(begin_id, end_id)
    if update_sql:
        cursor = connection.cursor()
        cursor.execute(update_sql)
        cursor.close()
    _sleep_throttle()
    eta_calculator.update_progress_and_print(1)
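_sleep_throttle isn't shown either; presumably it just pauses between batches to keep load on the primary and replication lag under control. A trivial sketch, with the delay value assumed:

import time

THROTTLE_SECONDS = 0.1  # assumed value; would be tuned per environment

def _sleep_throttle():
    # Pause between batches so the database and its replicas can keep up.
    time.sleep(THROTTLE_SECONDS)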

Slide 14

4. We need to pause/resume the process

Slide 15

Partitioned Persistent Queue

A persistent on-disk queue that assigns a partition to each process. It serializes Python objects and recovers queue processing if a process dies. Built on top of persist-queue.

[Diagram: enqueued jobs flow into partitions; worker processes consume them and apply the DB updates]

Slide 16

Partitioned Persistent Queue

queues = PartitionedPersistentQueue('demo-queue', partitions_number=PROCESSES)

def enqueue_jobs():
    queues.clean()
    queues.create()
    for boundaries in table_batch_iterator:
        queues.cycle_queue.put(boundaries)

def _process_partition(partition):
    queue = queues.process_queue
    while not queue.empty():
        boundaries = queue.get()
        _update_last_name_in_batch(boundaries)
        queue.task_done()

def process_jobs():
    ...
    connections.close_all()
    with Pool(processes=PROCESSES) as pool:
        pool.map(_process_partition, range(PROCESSES))
    ...
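PartitionedPersistentQueue is an in-house wrapper; only its interface appears on the slide. A minimal sketch assuming one persist-queue file queue per partition, a round-robin cycle_queue for the producer, and a process_queue resolved from the pool worker's identity; all internals here are assumptions:

import itertools
import multiprocessing
import os
import shutil

import persistqueue

# Hypothetical sketch - not the speaker's implementation.
class PartitionedPersistentQueue:
    def __init__(self, name, partitions_number):
        self.name = name
        self.partitions_number = partitions_number
        self.queues = []

    def clean(self):
        # Drop any queue state left on disk from a previous run.
        shutil.rmtree(self.name, ignore_errors=True)

    def create(self):
        # One persistent on-disk queue per partition; persist-queue pickles items,
        # and items that were get() but never task_done() survive a crash.
        self.queues = [
            persistqueue.Queue(os.path.join(self.name, str(partition)))
            for partition in range(self.partitions_number)
        ]
        self._cycle = itertools.cycle(self.queues)

    @property
    def cycle_queue(self):
        # Each access returns the next partition, so successive put() calls
        # spread jobs round-robin across partitions.
        return next(self._cycle)

    @property
    def process_queue(self):
        # Map the current pool worker to its partition (_identity is 1-based;
        # relying on this private attribute is part of the assumption).
        worker_id = multiprocessing.current_process()._identity[0]
        return self.queues[(worker_id - 1) % self.partitions_number]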

Slide 17

Conclusions
● We used to copy-paste a lot of code to process backfills
● Investing in tooling was important to reduce boilerplate code
● Visibility is important, especially when processing takes months
● Being able to pause backfills allows database maintenance