
Citus Con 2023, Citus and JSON at Vizor Games

Slides from Citus Con 2023.
Describes our experience with Citus & JSON for real-time analytics, including common caveats and workarounds, maintenance practices, and interaction with users.

Ivan Vyazmitinov

April 18, 2023

Transcript

  1. Achievements:
     • 2014: Klondike was named one of the 10 best new games on Facebook.
     • 2015: Knights and Brides won Facebook’s best web game award.
     • 2020: VIZOR ranked 12th in the TOP 30 EMEA-headquartered overall publishers by revenue (iOS App Store & Google Play), according to data.ai.
     • 2021: VIZOR ranked 12th in the TOP 30 EMEA-headquartered overall publishers worldwide by consumer spend (iOS App Store & Google Play), according to data.ai.
  2. Pros and Cons (2017):
     • Open-source
     • JSONB
     • SQL
     • OLAP
     • No columnar
  3. Pros and Cons (2017):
     • Open-source
     • JSONB
     • SQL
     • OLAP
     • No columnar
     • No rebalancing
  4. Pros and Cons (2023):
     • Open Source
     • JSONB
     • SQL
     • OLAP
     • Columnar (was: No columnar)
     • Rebalancing (was: No rebalancing)
  5. Coordinator:
     • AMD Ryzen™ 9 5950X
     • 128 GB DDR4 ECC
     • 2 x 3.84 TB NVMe SSD Datacenter Edition
     • 1 GBit/s port
     • Gentoo
     • BTRFS
  6. Worker (x40):
     • AMD Ryzen 7 3700X Octa-Core "Matisse" (Zen2)
     • 64 GB DDR4 ECC RAM
     • 2 x 1 TB NVMe SSD + Backup SSD
     • 1 GBit/s port
     • Gentoo
     • BTRFS
  7. Postgres configs:
     • huge_pages = 'on'
     • shared_buffers = '16GB'
     • effective_cache_size = '54GB'
     • random_page_cost = 1.1
  8. BTRFS is a modern copy-on-write (COW) filesystem for Linux aimed at implementing advanced features while also focusing on fault tolerance, repair and easy administration. (Btrfs wiki)
  9. ~ $ compsize /var/lib/postgresql/14/data/
     Processed 1898298 files, 114140807 regular extents (118227654 refs), 120 inline.
     Type    Perc  Disk Usage  Uncompressed  Referenced
     TOTAL    18%        1.5T          8.6T        8.4T
     none    100%        489G          489G        443G
     zstd     13%        1.1T          8.1T        8.0T
  10. Afterthought: build with --with-segsize=SEGSIZE (SEGSIZE > 1 GB). "Set the segment size, in gigabytes. Large tables are divided into multiple operating-system files, each of size equal to the segment size." (docs)
  11. {
        "event_name": "some_event",
        "user_id": "some_uid",
        "event_time": 1672531200000000,  # epoch microseconds
        "data": {
          "some_data": "f432917",
          "int_data": 1580465207693,
          "nested_data": { ... }
        }
      }
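
      A minimal sketch of how such events could be stored and distributed; the table layout and the choice of user_id as distribution column are assumptions based on the queries shown on later slides:

          CREATE SCHEMA IF NOT EXISTS events;
          CREATE TABLE IF NOT EXISTS events.some_event (
              user_id VARCHAR     NOT NULL,  -- distribution column (assumed)
              time    TIMESTAMPTZ NOT NULL,  -- derived from event_time (epoch microseconds)
              data    JSONB       NOT NULL   -- free-form event payload
          );
          -- Shard the event stream by user so it co-locates with the profile tables.
          SELECT create_distributed_table('events.some_event', 'user_id');
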
  12. SELECT *
      FROM events."some_event" AS event
           INNER JOIN profiles.permanent perm USING (user_id)
           INNER JOIN profiles.volatile vol
                      ON (event."user_id" = vol."user_id"
                          AND vol.updated_time <= event.time
                          AND event.time < vol.up_to_time);
  13. +-------------------+--------+
      |Total queries      |22990   |
      +-------------------+--------+
      |Total calls        |29667156|
      +-------------------+--------+
      |Average (s)        |0.17    |
      +-------------------+--------+
      |95th percentile (s)|33.71   |
      +-------------------+--------+
      |99th percentile (s)|265.01  |
      +-------------------+--------+
  14. PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code. The technique is affectionately known as TOAST (or "the best thing since sliced bread"). The TOAST infrastructure is also used to improve handling of large data values in-memory. (docs)
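
      A quick way to see how much of a table has spilled into TOAST storage; a hedged example using standard catalog functions (the table name is a placeholder):

          -- Compare main-heap size with TOAST size for one table.
          SELECT c.relname,
                 pg_size_pretty(pg_relation_size(c.oid))           AS heap_size,
                 pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
          FROM pg_class c
          WHERE c.relname = 'some_event';
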
  15. -> Parallel Seq Scan on some_event_36234134 some_event
         (cost=0.00..730710.38 rows=9100 width=17)
         (actual time=114.659..29370.110 rows=823924 loops=1)
         Filter: ((data ->> 'some_key'::text) = 'some_value'::text)
         Rows Removed by Filter: 11134520
  16. TOAST:
      1. Thin events / big profiles
      2. Restrict size on ingestion via CHECK (not cool)
      3. Increase toast_tuple_target / BLCKSZ (debatable; see the sketch below)
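
      A sketch of options 2 and 3; the size threshold and toast_tuple_target value below are illustrative assumptions, not the values used in production (BLCKSZ itself can only be changed at compile time):

          -- Option 2: reject oversized payloads at ingestion time.
          ALTER TABLE events.some_event
              ADD CONSTRAINT data_size_check CHECK (pg_column_size(data) < 8192);

          -- Option 3: keep larger tuples in line before TOAST kicks in.
          ALTER TABLE events.some_event SET (toast_tuple_target = 4080);
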
  17. GIN stands for Generalized Inverted Index. GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items. For example, the items could be documents, and the queries could be searches for documents containing specific words. (docs)
  18. The default GIN operator class for JSONB supports queries with the key-exists operators ?, ?| and ?&, the containment operator @>, and the jsonpath match operators @? and @@. (docs)
  19. CREATE INDEX IF NOT EXISTS some_event_value_idx
          ON some_event USING gin (data jsonb_path_ops);
      ANALYZE some_event;
      EXPLAIN (ANALYZE)
      SELECT * FROM some_event WHERE data @@ '$.some_key == "some_value"';
  20. -> Bitmap Heap Scan on some_event_36234138 some_event
         (cost=6343.61..522449.43 rows=776660 width=17)
         (actual time=435.453..28274.869 rows=822426 loops=1)
         Recheck Cond: (data @@ '($."some_key" == "some_value")'::jsonpath)
         Heap Blocks: exact=443181
         -> Bitmap Index Scan on some_event_value_idx_36234138
            (cost=0.00..6149.45 rows=776660 width=0)
            (actual time=187.053..187.054 rows=822427 loops=1)
            Index Cond: (data @@ '($."some_key" == "some_value")'::jsonpath)
  21. GIN index cons:
      • Slower than B-tree
      • jsonpath ≈ regex
  22. CREATE INDEX IF NOT EXISTS some_event_some_key_idx
          ON some_event ((data ->> 'some_key'));
      ANALYZE some_event;
      EXPLAIN (ANALYZE)
      SELECT * FROM some_event WHERE data ->> 'some_key' = 'some_value';
  23. -> Index Scan using some_event_some_key_idx_36234128 on some_event_36234128 some_event
         (cost=0.43..385981.93 rows=822579 width=17)
         (actual time=5.678..8079.450 rows=821785 loops=1)
         Index Cond: ((data ->> 'some_key'::text) = 'some_value'::text)
  24. SELECT * from citus_add_node('worker-1', 5432);
      SELECT * from citus_add_node('worker-2', 5432);
      SELECT * from citus_add_node('worker-3', 5432);
      -- Two screens later...
      SELECT * from citus_add_node('worker-n', 5432);
      -- Blazingly fast
      SELECT * FROM some_event WHERE data ->> 'some_key' = 'some_value';
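
      Newly added workers only receive data once shards are moved; a hedged example of triggering and watching a rebalance with Citus's built-in functions (arguments left at their defaults):

          -- Spread existing shards across the newly added workers.
          SELECT rebalance_table_shards();
          -- Watch progress from another session.
          SELECT * FROM get_rebalance_progress();
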
  25. CREATE TABLE IF NOT EXISTS profiles.permanent (
          user_id VARCHAR NOT NULL CONSTRAINT pk_permanent PRIMARY KEY,
          -- ...
      );
      CREATE INDEX IF NOT EXISTS volatile_profiles_user_id_time_index
          ON profiles.volatile (user_id, update_time, up_to_time);
  26. +----------+--------------+
      |table_name|pg_size_pretty|
      +----------+--------------+
      |********  |28 TB         |
      |********  |15 TB         |
      |********  |3179 GB       |
      |********  |2812 GB       |
      |********  |1977 GB       |
      |********  |1559 GB       |
      |********  |1553 GB       |
      |********  |1435 GB       |
      |********  |1306 GB       |
      |********  |1304 GB       |
      +----------+--------------+
  27. SELECT user_id, time, perm.data ->> 'some_key' AS some_key
      FROM events.some_event AS event
           INNER JOIN profiles.permanent perm USING (user_id);
  28. ALTER TABLE events.some_event ADD COLUMN some_key TEXT;
      UPDATE events.some_event event_to_inject
      SET some_key = data_to_inject.some_key
      FROM (SELECT ... FROM events."some_event" AS event ...) data_to_inject
      WHERE event_to_inject.user_id = data_to_inject.user_id
        AND event_to_inject.time = data_to_inject.time;
  29. CREATE OR REPLACE PROCEDURE events.inject_fields()
          LANGUAGE sql
      BEGIN ATOMIC
          UPDATE events.some_event event_to_inject
          SET some_key = data_to_inject.some_key
          FROM (SELECT ... FROM events."some_event" AS event ...) data_to_inject
          WHERE event_to_inject.user_id = data_to_inject.user_id
            AND event_to_inject.time = data_to_inject.time;
      END;
  30. SELECT user_id, time, some_key
      FROM events.some_event AS event
      WHERE event.time < current_date
      UNION ALL
      SELECT user_id, time, data ->> 'some_key' AS some_key
      FROM events.some_event AS event
           INNER JOIN profiles.permanent perm USING (user_id)
      WHERE event.time > current_date;
  31. CREATE MATERIALIZED VIEW counts AS
      SELECT user_id, count(*) FROM events.some_event GROUP BY user_id;
      SELECT cron.schedule('0 0 * * *', $cmd$REFRESH MATERIALIZED VIEW counts;$cmd$);
  32. CREATE MATERIALIZED VIEW counts AS
      SELECT user_id, count(*) FROM events.some_event GROUP BY user_id;
      SELECT create_distributed_table('counts', 'user_id');
      SELECT cron.schedule('0 0 * * *', $cmd$REFRESH MATERIALIZED VIEW counts;$cmd$);
  33. CREATE TABLE counts (
          user_id TEXT,
          count   BIGINT
      );
      SELECT create_distributed_table('counts', 'user_id');
  34. CREATE OR REPLACE PROCEDURE counts_refresh()
          LANGUAGE sql AS $cmd$
          TRUNCATE counts;
          INSERT INTO counts
          SELECT user_id, count(*) FROM events.some_event GROUP BY user_id;
      $cmd$;
      SELECT cron.schedule('0 0 * * *', $cmd$CALL counts_refresh();$cmd$);
  35. DATABASE-based multitenancy Pros and Cons (see the setup sketch below):
      • Strong isolation
      • Custom configuration
      • Parallel migrations
      • No X-DB queries
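
      A minimal sketch of what DATABASE-per-tenant looks like with Citus; database names are placeholders, and the Citus extension and worker nodes have to be registered separately inside every tenant database:

          -- Each tenant gets its own database on the same cluster.
          CREATE DATABASE tenant_one;
          CREATE DATABASE tenant_two;

          -- Then, connected to each tenant database in turn (e.g. \c tenant_one):
          CREATE EXTENSION citus;
          SELECT citus_add_node('worker-1', 5432);
          SELECT citus_add_node('worker-2', 5432);
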
  36. SELECT *
      FROM tenant_one.events.some_event
           INNER JOIN tenant_two.events.some_event USING (user_id);
      [0A000] ERROR: cross-database references are not implemented: "tenant_two.events.some_event"
  37. CREATE EXTENSION dblink; -- CREATE EXTENSION postgres_fdw;
      SELECT user_id, avg((data ->> 'some_key')::numeric) AS some_key_avg
      FROM events.some_event
      GROUP BY user_id
      UNION ALL
      SELECT *
      FROM dblink('tenant_two', $$ SELECT ... FROM events.some_event ...;$$)
           AS t(user_id TEXT, some_key_avg NUMERIC);
  38. DATABASE-based multitenancy Pros and Cons:
      • Strong isolation
      • Custom configuration
      • Parallel migrations
      • No X-DB queries
      • Poor support
  39. One DATABASE with Citus requires:
      • 1 daemon process on coordinator and workers
      • N connections per worker for deadlock detection
  40. One DATABASE with Citus requires:
      • 1 daemon process on coordinator and workers
      • N connections per worker for deadlock detection
      • 2 connections per worker for transaction recovery
  41. Little math... N = 40; D = 30
      D * (N + 2) =
  42. Little math... N = 40; D = 30
      30 * (40 + 2) =
  43. Little math... N = 40; D = 30
      30 * (40 + 2) = 1260 connections
  44. Little math... N = 40; D = 30
      30 * (40 + 2) = 1260 connections
      ~60 connections every citus.recover_2pc_interval = '1min'
  45. Little math... N = 40; D = 30
      30 * (40 + 2) = 1260 connections
      ~60 connections every citus.recover_2pc_interval = '1min'
      1200 idle connections due to citus.distributed_deadlock_detection_factor = '2ms'
  46. Workaround:
      ALTER SYSTEM SET citus.recover_2pc_interval TO -1;
      ALTER SYSTEM SET citus.distributed_deadlock_detection_factor TO -1;
      SELECT cron.schedule(
          schedule := '*/5 * * * *',
          command  := $cron$CALL execute_on_databases_with_citus(
                          $cmd$SELECT check_distributed_deadlocks();$cmd$);$cron$);
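
      The helper procedure called above is not shown on the slides; a minimal sketch of what execute_on_databases_with_citus might look like, assuming it simply loops over databases via dblink (simplified: it does not check whether Citus is actually installed in each one):

          CREATE EXTENSION IF NOT EXISTS dblink;

          CREATE OR REPLACE PROCEDURE execute_on_databases_with_citus(cmd TEXT)
              LANGUAGE plpgsql AS $proc$
          DECLARE
              db TEXT;
          BEGIN
              FOR db IN SELECT datname FROM pg_database
                        WHERE NOT datistemplate AND datallowconn
              LOOP
                  -- Assumes cmd returns a single value, as check_distributed_deadlocks() does.
                  PERFORM * FROM dblink(format('dbname=%L', db), cmd) AS t(result TEXT);
              END LOOP;
          END;
          $proc$;
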
  47. CREATE MATERIALIZED VIEW one AS SELECT ...;
      SELECT cron.schedule('0 0 * * *', $$REFRESH MATERIALIZED VIEW one;$$);
      CREATE MATERIALIZED VIEW two AS SELECT ...;
      SELECT cron.schedule('0 1 * * *', $$REFRESH MATERIALIZED VIEW two;$$);
      CREATE MATERIALIZED VIEW three AS SELECT * FROM one JOIN two ...;
      SELECT cron.schedule('0 2 * * *', $$REFRESH MATERIALIZED VIEW three;$$);
      CREATE MATERIALIZED VIEW four AS SELECT * FROM one JOIN three ...;
      SELECT cron.schedule('0 3 * * *', $$REFRESH MATERIALIZED VIEW four;$$);
  48. https://gitlab.com/ongresinc/extensions/noset
      CREATE EXTENSION noset;
      ALTER ROLE analytics_user SET noset.enabled = true;
      ALTER ROLE analytics_user SET noset.parameters = 'STATEMENT_TIMEOUT';
  49. CREATE OR REPLACE PROCEDURE admin.kill_long_analytical_queries()
          LANGUAGE plpgsql AS $proc$
      DECLARE
          problematic_query RECORD;
      BEGIN
          FOR problematic_query IN (
              SELECT pid, query
              FROM (SELECT now() - query_start AS run_time, query, pid
                    FROM pg_stat_activity
                    WHERE usename IN (...)) sub
              WHERE run_time >= INTERVAL '20 MINUTES')
          LOOP
              RAISE WARNING 'Killing pid % that was running %',
                  problematic_query.pid, problematic_query.query;
              PERFORM pg_terminate_backend(problematic_query.pid);
          END LOOP;
      END;
      $proc$;
      SELECT cron.schedule('*/5 * * * *', $$CALL admin.kill_long_analytical_queries()$$);
  50. PIDS=$( ps -o pid --sort=-size --no-headers \
               $( psql postgres -qt -c "SELECT pid FROM pg_stat_activity WHERE usename!='postgres'" ) | head -20 )
      for PID in ${PIDS}
      do
          QUERY_ID=$( psql postgres -qt -c "SELECT COALESCE(query_id, 0) FROM pg_stat_activity WHERE pid=${PID}" | sed 's/ //g' )
          MEM=$( ps -o size --no-headers ${PID} )
          if [ -n "${MEM}" ]
          then
              echo '{"key":"processes_postgres_memory_usage_kilobytes", "pid":"'${PID}'", "query_id":"'${QUERY_ID}'", "value":'${MEM}'},'
          fi
      done
  51. CREATE EXTENSION pg_stat_statements;
      SELECT queryid, datname::TEXT, rolname::TEXT, now(), calls, mean_exec_time, total_exec_time
      FROM pg_stat_statements t1
           JOIN pg_database t2 ON (t1.dbid = t2.oid)
           JOIN pg_roles t3 ON (t1.userid = t3.oid)
      WHERE t3.rolname IN (...)
        AND total_exec_time > 0;
  52. ALTER SYSTEM SET log_destination = 'stderr,csvlog';
      ALTER SYSTEM SET log_min_duration_statement = 60000;
      COPY long_query_examples
      FROM '/var/lib/postgresql/14/data/log/postgresql-<day_of_week>.csv'
      WITH CSV
      WHERE command_tag IN ('SELECT', 'REFRESH MATERIALIZED VIEW', 'CALL')
        AND message LIKE '%duration:%';
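
      The target table is not shown on the slides; for the COPY above to work it needs to match the csvlog column layout. A sketch based on the csvlog format documented for PostgreSQL 14 (column names and types as in the docs, no constraints added):

          CREATE TABLE long_query_examples (
              log_time               timestamp(3) with time zone,
              user_name              text,
              database_name          text,
              process_id             integer,
              connection_from        text,
              session_id             text,
              session_line_num       bigint,
              command_tag            text,
              session_start_time     timestamp with time zone,
              virtual_transaction_id text,
              transaction_id         bigint,
              error_severity         text,
              sql_state_code         text,
              message                text,
              detail                 text,
              hint                   text,
              internal_query         text,
              internal_query_pos     integer,
              context                text,
              query                  text,
              query_pos              integer,
              location               text,
              application_name       text,
              backend_type           text,
              leader_pid             integer,
              query_id               bigint
          );
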
  53. WITH sc AS (SELECT * FROM events."some_event"
                  WHERE time >= '2020-01-01' AND time < '2020-02-01')
      SELECT '*******' AS user_id_masked, count(er.user_id) AS "amount"
      FROM sc
           INNER JOIN events."another_event" er
                      ON sc.data ->> 'country' = er.data ->> 'country'
      GROUP BY er.user_id;
  54. Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=72)
        -> Distributed Subplan 3_1
           -> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=72)
              Task Count: 80
              Tasks Shown: One of 80
              -> Task
                 Node: host=10.97.97.10 port=5432 dbname=some_db
                 -> Seq Scan on "some_event_2020_JAN_36474381" "some_event"
                    (cost=0.00..59133.21 rows=2339605 width=47)
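
      The "Distributed Subplan" in this plan means Citus materializes the CTE and ships it to the workers as an intermediate result, since the join is not on the distribution column. A hedged example of the setting that bounds how large such intermediate results may grow before the query is aborted (the value here is only an illustration):

          SET citus.max_intermediate_result_size TO '5GB';
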
  55. -- Incorrect
      SELECT * FROM events.some_event WHERE date_trunc('DAY', time) >= '2023-01-01';
      -- Correct
      EXPLAIN SELECT * FROM events.some_event WHERE time >= '2023-01-01';
  56. CREATE TABLE measurement (
          city_id  INT  NOT NULL,
          logdate  DATE NOT NULL,
          peaktemp INT  NOT NULL
      ) PARTITION BY RANGE (logdate);
      SELECT create_distributed_table('measurement', 'city_id');
      CREATE TABLE measurement_y2006m02 PARTITION OF measurement
          FOR VALUES FROM ('2006-02-01') TO ('2006-03-01')
          PARTITION BY RANGE (peaktemp);
      [0A000] ERROR: distributing multi-level partitioned tables is not supported
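
      What is supported is a single level of partitioning on a distributed table; a minimal sketch using the same hypothetical measurement table:

          CREATE TABLE measurement (
              city_id  INT  NOT NULL,
              logdate  DATE NOT NULL,
              peaktemp INT
          ) PARTITION BY RANGE (logdate);

          SELECT create_distributed_table('measurement', 'city_id');

          -- One level of time partitioning works; every partition is also sharded by city_id.
          CREATE TABLE measurement_y2006m02 PARTITION OF measurement
              FOR VALUES FROM ('2006-02-01') TO ('2006-03-01');
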
  57. Q&A