
Citus Con 2023, Citus and JSON at Vizor Games

Slides from Citus Con 2023.
Describes our experience with Citus & JSON for real-time analytics, including common caveats and workarounds, maintenance practices, and interaction with users.

Ivan Vyazmitinov

April 18, 2023

Transcript

  1. Achievements:
     • 2014: Klondike was named one of the 10 best new games on Facebook.
     • 2015: Knights and Brides won Facebook’s best web game award.
     • 2020: VIZOR ranked 12th in the TOP 30 EMEA-headquartered overall publishers by revenue (iOS App Store & Google Play), according to data.ai.
     • 2021: VIZOR ranked 12th in the TOP 30 EMEA-headquartered overall publishers worldwide by consumer spend (iOS App Store & Google Play), according to data.ai.
  2. Pros and Cons (2017):
     • Open-source
     • JSONB
     • SQL
     • OLAP
     • No columnar
  3. Pros and Cons (2017):
     • Open-source
     • JSONB
     • SQL
     • OLAP
     • No columnar
     • No rebalancing
  4. Pros and Cons (2023):
     • Open Source
     • JSONB
     • SQL
     • OLAP
     • Columnar (was: No columnar)
     • Rebalancing (was: No rebalancing)
  5. Coordinator:
     • AMD Ryzen™ 9 5950X
     • 128 GB DDR4 ECC
     • 2 x 3.84 TB NVMe SSD Datacenter Edition
     • 1 GBit/s port
     • Gentoo
     • BTRFS
  6. Worker (x40):
     • AMD Ryzen 7 3700X Octa-Core "Matisse" (Zen2)
     • 64 GB DDR4 ECC RAM
     • 2 x 1 TB NVMe SSD + Backup SSD
     • 1 GBit/s port
     • Gentoo
     • BTRFS
  7. Postgres configs:
     • huge_pages = 'on'
     • shared_buffers = '16GB'
     • effective_cache_size = '54GB'
     • random_page_cost = 1.1
  8. BTRFS is a modern copy-on-write (COW) filesystem for Linux aimed at implementing advanced features while also focusing on fault tolerance, repair and easy administration. (Btrfs wiki)
  9. ~ $ compsize /var/lib/postgresql/14/data/
     Processed 1898298 files, 114140807 regular extents (118227654 refs), 120 inline.
     Type    Perc  Disk Usage  Uncompressed  Referenced
     TOTAL    18%        1.5T          8.6T        8.4T
     none    100%        489G          489G        443G
     zstd     13%        1.1T          8.1T        8.0T
  10. Afterthought: build with --with-segsize=SEGSIZE (SEGSIZE > 1 GB). "Set the segment size, in gigabytes. Large tables are divided into multiple operating-system files, each of size equal to the segment size." (docs)
  11. {
        "event_name": "some_event",
        "user_id": "some_uid",
        "event_time": 1672531200000000,  # epoch microseconds
        "data": {
          "some_data": "f432917",
          "int_data": 1580465207693,
          "nested_data": { ... }
        }
      }
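
      A minimal sketch of how such events could be stored and distributed; the table layout and the choice of user_id as distribution column are assumptions based on the queries shown on later slides:

          CREATE SCHEMA IF NOT EXISTS events;
          CREATE TABLE IF NOT EXISTS events.some_event (
              user_id VARCHAR     NOT NULL,  -- distribution column (assumed)
              time    TIMESTAMPTZ NOT NULL,  -- derived from event_time (epoch microseconds)
              data    JSONB       NOT NULL   -- free-form event payload
          );
          -- Shard the event stream by user so it co-locates with the profile tables.
          SELECT create_distributed_table('events.some_event', 'user_id');
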
  12. SELECT *
      FROM events."some_event" AS event
           INNER JOIN profiles.permanent perm USING (user_id)
           INNER JOIN profiles.volatile vol
                      ON (event."user_id" = vol."user_id"
                          AND vol.updated_time <= event.time
                          AND event.time < vol.up_to_time);
  13. +-------------------+--------+
      |Total queries      |22990   |
      +-------------------+--------+
      |Total calls        |29667156|
      +-------------------+--------+
      |Average (s)        |0.17    |
      +-------------------+--------+
      |95th percentile (s)|33.71   |
      +-------------------+--------+
      |99th percentile (s)|265.01  |
      +-------------------+--------+
  14. PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code. The technique is affectionately known as TOAST (or "the best thing since sliced bread"). The TOAST infrastructure is also used to improve handling of large data values in-memory. (docs)
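
      A quick way to see how much of a table has spilled into TOAST storage; a hedged example using standard catalog functions (the table name is a placeholder):

          -- Compare main-heap size with TOAST size for one table.
          SELECT c.relname,
                 pg_size_pretty(pg_relation_size(c.oid))           AS heap_size,
                 pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
          FROM pg_class c
          WHERE c.relname = 'some_event';
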
  15. -> Parallel Seq Scan on some_event_36234134 some_event
         (cost=0.00..730710.38 rows=9100 width=17)
         (actual time=114.659..29370.110 rows=823924 loops=1)
         Filter: ((data ->> 'some_key'::text) = 'some_value'::text)
         Rows Removed by Filter: 11134520
  16. TOAST:
      1. Thin events / big profiles
      2. Restrict size on ingestion via CHECK (not cool)
      3. Increase toast_tuple_target / BLCKSZ (debatable; see the sketch below)
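
      A sketch of options 2 and 3; the size threshold and toast_tuple_target value below are illustrative assumptions, not the values used in production (BLCKSZ itself can only be changed at compile time):

          -- Option 2: reject oversized payloads at ingestion time.
          ALTER TABLE events.some_event
              ADD CONSTRAINT data_size_check CHECK (pg_column_size(data) < 8192);

          -- Option 3: keep larger tuples in line before TOAST kicks in.
          ALTER TABLE events.some_event SET (toast_tuple_target = 4080);
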
  17. GIN stands for Generalized Inverted Index. GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items. For example, the items could be documents, and the queries could be searches for documents containing specific words. (docs)
  18. The default GIN operator class for JSONB supports queries with the key-exists operators ?, ?| and ?&, the containment operator @>, and the jsonpath match operators @? and @@. (docs)
  19. CREATE INDEX IF NOT EXISTS some_event_value_idx
          ON some_event USING gin (data jsonb_path_ops);
      ANALYZE some_event;
      EXPLAIN (ANALYZE)
      SELECT * FROM some_event WHERE data @@ '$.some_key == "some_value"';
  20. -> Bitmap Heap Scan on some_event_36234138 some_event
         (cost=6343.61..522449.43 rows=776660 width=17)
         (actual time=435.453..28274.869 rows=822426 loops=1)
         Recheck Cond: (data @@ '($."some_key" == "some_value")'::jsonpath)
         Heap Blocks: exact=443181
         -> Bitmap Index Scan on some_event_value_idx_36234138
            (cost=0.00..6149.45 rows=776660 width=0)
            (actual time=187.053..187.054 rows=822427 loops=1)
            Index Cond: (data @@ '($."some_key" == "some_value")'::jsonpath)
  21. GIN index cons:
      • Slower than B-tree
      • jsonpath ≈ regex
  22. CREATE INDEX IF NOT EXISTS some_event_some_key_idx
          ON some_event ((data ->> 'some_key'));
      ANALYZE some_event;
      EXPLAIN (ANALYZE)
      SELECT * FROM some_event WHERE data ->> 'some_key' = 'some_value';
  23. -> Index Scan using some_event_some_key_idx_36234128 on some_event_36234128 some_event
         (cost=0.43..385981.93 rows=822579 width=17)
         (actual time=5.678..8079.450 rows=821785 loops=1)
         Index Cond: ((data ->> 'some_key'::text) = 'some_value'::text)
  24. SELECT * from citus_add_node('worker-1', 5432);
      SELECT * from citus_add_node('worker-2', 5432);
      SELECT * from citus_add_node('worker-3', 5432);
      -- Two screens later...
      SELECT * from citus_add_node('worker-n', 5432);
      -- Blazingly fast
      SELECT * FROM some_event WHERE data ->> 'some_key' = 'some_value';
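
      Newly added workers only receive data once shards are moved; a hedged example of triggering and watching a rebalance with Citus's built-in functions (arguments left at their defaults):

          -- Spread existing shards across the newly added workers.
          SELECT rebalance_table_shards();
          -- Watch progress from another session.
          SELECT * FROM get_rebalance_progress();
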
  25. CREATE TABLE IF NOT EXISTS profiles.permanent (
          user_id VARCHAR NOT NULL CONSTRAINT pk_permanent PRIMARY KEY,
          -- ...
      );
      CREATE INDEX IF NOT EXISTS volatile_profiles_user_id_time_index
          ON profiles.volatile (user_id, update_time, up_to_time);
  26. +----------+--------------+
      |table_name|pg_size_pretty|
      +----------+--------------+
      |********  |28 TB         |
      |********  |15 TB         |
      |********  |3179 GB       |
      |********  |2812 GB       |
      |********  |1977 GB       |
      |********  |1559 GB       |
      |********  |1553 GB       |
      |********  |1435 GB       |
      |********  |1306 GB       |
      |********  |1304 GB       |
      +----------+--------------+
  27. SELECT user_id, time, perm.data ->> 'some_key' AS some_key
      FROM events.some_event AS event
           INNER JOIN profiles.permanent perm USING (user_id);
  28. ALTER TABLE events.some_event ADD COLUMN some_key TEXT;
      UPDATE events.some_event event_to_inject
      SET some_key = data_to_inject.some_key
      FROM (SELECT ... FROM events."some_event" AS event ...) data_to_inject
      WHERE event_to_inject.user_id = data_to_inject.user_id
        AND event_to_inject.time = data_to_inject.time;
  29. CREATE OR REPLACE PROCEDURE events.inject_fields()
          LANGUAGE sql
      BEGIN ATOMIC
          UPDATE events.some_event event_to_inject
          SET some_key = data_to_inject.some_key
          FROM (SELECT ... FROM events."some_event" AS event ...) data_to_inject
          WHERE event_to_inject.user_id = data_to_inject.user_id
            AND event_to_inject.time = data_to_inject.time;
      END;
  30. SELECT user_id, time, some_key
      FROM events.some_event AS event
      WHERE event.time < current_date
      UNION ALL
      SELECT user_id, time, data ->> 'some_key' AS some_key
      FROM events.some_event AS event
           INNER JOIN profiles.permanent perm USING (user_id)
      WHERE event.time > current_date;
  31. CREATE MATERIALIZED VIEW counts AS
      SELECT user_id, count(*) FROM events.some_event GROUP BY user_id;
      SELECT cron.schedule('0 0 * * *', $cmd$REFRESH MATERIALIZED VIEW counts;$cmd$);
  32. CREATE MATERIALIZED VIEW counts AS
      SELECT user_id, count(*) FROM events.some_event GROUP BY user_id;
      SELECT create_distributed_table('counts', 'user_id');
      SELECT cron.schedule('0 0 * * *', $cmd$REFRESH MATERIALIZED VIEW counts;$cmd$);
  33. CREATE TABLE counts (
          user_id TEXT,
          count   BIGINT
      );
      SELECT create_distributed_table('counts', 'user_id');
  34. CREATE OR REPLACE PROCEDURE counts_refresh()
          LANGUAGE sql AS $cmd$
          TRUNCATE counts;
          INSERT INTO counts
          SELECT user_id, count(*) FROM events.some_event GROUP BY user_id;
      $cmd$;
      SELECT cron.schedule('0 0 * * *', $cmd$CALL counts_refresh();$cmd$);
  35. DATABASE-based multitenancy Pros and Cons (see the setup sketch below):
      • Strong isolation
      • Custom configuration
      • Parallel migrations
      • No X-DB queries
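
      A minimal sketch of what DATABASE-per-tenant looks like with Citus; database names are placeholders, and the Citus extension and worker nodes have to be registered separately inside every tenant database:

          -- Each tenant gets its own database on the same cluster.
          CREATE DATABASE tenant_one;
          CREATE DATABASE tenant_two;

          -- Then, connected to each tenant database in turn (e.g. \c tenant_one):
          CREATE EXTENSION citus;
          SELECT citus_add_node('worker-1', 5432);
          SELECT citus_add_node('worker-2', 5432);
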
  36. SELECT *
      FROM tenant_one.events.some_event
           INNER JOIN tenant_two.events.some_event USING (user_id);
      [0A000] ERROR: cross-database references are not implemented: "tenant_two.events.some_event"
  37. CREATE EXTENSION dblink; -- CREATE EXTENSION postgres_fdw;
      SELECT user_id, avg((data ->> 'some_key')::numeric) AS some_key_avg
      FROM events.some_event
      GROUP BY user_id
      UNION ALL
      SELECT *
      FROM dblink('tenant_two', $$ SELECT ... FROM events.some_event ...;$$)
           AS t(user_id TEXT, some_key_avg NUMERIC);
  38. DATABASE-based multitenancy Pros and Cons:
      • Strong isolation
      • Custom configuration
      • Parallel migrations
      • No X-DB queries
      • Poor support
  39. One DATABASE with Citus requires:
      • 1 daemon process on coordinator and workers
      • N connections per worker for deadlock detection
  40. One DATABASE with Citus requires:
      • 1 daemon process on coordinator and workers
      • N connections per worker for deadlock detection
      • 2 connections per worker for transaction recovery
  41. Little math... N = 40; D = 30
      D * (N + 2) =
  42. Little math... N = 40; D = 30
      30 * (40 + 2) =
  43. Little math... N = 40; D = 30
      30 * (40 + 2) = 1260 connections
  44. Little math... N = 40; D = 30
      30 * (40 + 2) = 1260 connections
      ~60 connections every citus.recover_2pc_interval = '1min'
  45. Little math... N = 40; D = 30
      30 * (40 + 2) = 1260 connections
      ~60 connections every citus.recover_2pc_interval = '1min'
      1200 idle connections due to citus.distributed_deadlock_detection_factor = '2ms'
  46. Workaround:
      ALTER SYSTEM SET citus.recover_2pc_interval TO -1;
      ALTER SYSTEM SET citus.distributed_deadlock_detection_factor TO -1;
      SELECT cron.schedule(
          schedule := '*/5 * * * *',
          command  := $cron$CALL execute_on_databases_with_citus(
                          $cmd$SELECT check_distributed_deadlocks();$cmd$);$cron$);
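
      The helper procedure called above is not shown on the slides; a minimal sketch of what execute_on_databases_with_citus might look like, assuming it simply loops over databases via dblink (simplified: it does not check whether Citus is actually installed in each one):

          CREATE EXTENSION IF NOT EXISTS dblink;

          CREATE OR REPLACE PROCEDURE execute_on_databases_with_citus(cmd TEXT)
              LANGUAGE plpgsql AS $proc$
          DECLARE
              db TEXT;
          BEGIN
              FOR db IN SELECT datname FROM pg_database
                        WHERE NOT datistemplate AND datallowconn
              LOOP
                  -- Assumes cmd returns a single value, as check_distributed_deadlocks() does.
                  PERFORM * FROM dblink(format('dbname=%L', db), cmd) AS t(result TEXT);
              END LOOP;
          END;
          $proc$;
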
  47. CREATE MATERIALIZED VIEW one AS SELECT ...;
      SELECT cron.schedule('0 0 * * *', $$REFRESH MATERIALIZED VIEW one;$$);
      CREATE MATERIALIZED VIEW two AS SELECT ...;
      SELECT cron.schedule('0 1 * * *', $$REFRESH MATERIALIZED VIEW two;$$);
      CREATE MATERIALIZED VIEW three AS SELECT * FROM one JOIN two ...;
      SELECT cron.schedule('0 2 * * *', $$REFRESH MATERIALIZED VIEW three;$$);
      CREATE MATERIALIZED VIEW four AS SELECT * FROM one JOIN three ...;
      SELECT cron.schedule('0 3 * * *', $$REFRESH MATERIALIZED VIEW four;$$);
  48. https://gitlab.com/ongresinc/extensions/noset
      CREATE EXTENSION noset;
      ALTER ROLE analytics_user SET noset.enabled = true;
      ALTER ROLE analytics_user SET noset.parameters = 'STATEMENT_TIMEOUT';
  49. CREATE OR REPLACE PROCEDURE admin.kill_long_analytical_queries()
          LANGUAGE plpgsql AS $proc$
      DECLARE
          problematic_query RECORD;
      BEGIN
          FOR problematic_query IN (
              SELECT pid, query
              FROM (SELECT now() - query_start AS run_time, query, pid
                    FROM pg_stat_activity
                    WHERE usename IN (...)) sub
              WHERE run_time >= INTERVAL '20 MINUTES')
          LOOP
              RAISE WARNING 'Killing pid % that was running %',
                  problematic_query.pid, problematic_query.query;
              PERFORM pg_terminate_backend(problematic_query.pid);
          END LOOP;
      END;
      $proc$;
      SELECT cron.schedule('*/5 * * * *', $$CALL admin.kill_long_analytical_queries()$$);
  50. PIDS=$( ps -o pid --sort=-size --no-headers \
               $( psql postgres -qt -c "SELECT pid FROM pg_stat_activity WHERE usename!='postgres'" ) | head -20 )
      for PID in ${PIDS}
      do
          QUERY_ID=$( psql postgres -qt -c "SELECT COALESCE(query_id, 0) FROM pg_stat_activity WHERE pid=${PID}" | sed 's/ //g' )
          MEM=$( ps -o size --no-headers ${PID} )
          if [ -n "${MEM}" ]
          then
              echo '{"key":"processes_postgres_memory_usage_kilobytes", "pid":"'${PID}'", "query_id":"'${QUERY_ID}'", "value":'${MEM}'},'
          fi
      done
  51. CREATE EXTENSION pg_stat_statements;
      SELECT queryid, datname::TEXT, rolname::TEXT, now(), calls, mean_exec_time, total_exec_time
      FROM pg_stat_statements t1
           JOIN pg_database t2 ON (t1.dbid = t2.oid)
           JOIN pg_roles t3 ON (t1.userid = t3.oid)
      WHERE t3.rolname IN (...)
        AND total_exec_time > 0;
  52. ALTER SYSTEM SET log_destination = 'stderr,csvlog';
      ALTER SYSTEM SET log_min_duration_statement = 60000;
      COPY long_query_examples
      FROM '/var/lib/postgresql/14/data/log/postgresql-<day_of_week>.csv'
      WITH CSV
      WHERE command_tag IN ('SELECT', 'REFRESH MATERIALIZED VIEW', 'CALL')
        AND message LIKE '%duration:%';
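
      The target table is not shown on the slides; for the COPY above to work it needs to match the csvlog column layout. A sketch based on the csvlog format documented for PostgreSQL 14 (column names and types as in the docs, no constraints added):

          CREATE TABLE long_query_examples (
              log_time               timestamp(3) with time zone,
              user_name              text,
              database_name          text,
              process_id             integer,
              connection_from        text,
              session_id             text,
              session_line_num       bigint,
              command_tag            text,
              session_start_time     timestamp with time zone,
              virtual_transaction_id text,
              transaction_id         bigint,
              error_severity         text,
              sql_state_code         text,
              message                text,
              detail                 text,
              hint                   text,
              internal_query         text,
              internal_query_pos     integer,
              context                text,
              query                  text,
              query_pos              integer,
              location               text,
              application_name       text,
              backend_type           text,
              leader_pid             integer,
              query_id               bigint
          );
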
  53. WITH sc AS (SELECT * FROM events."some_event"
                  WHERE time >= '2020-01-01' AND time < '2020-02-01')
      SELECT '*******' AS user_id_masked, count(er.user_id) AS "amount"
      FROM sc
           INNER JOIN events."another_event" er
                      ON sc.data ->> 'country' = er.data ->> 'country'
      GROUP BY er.user_id;
  54. Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=72)
        -> Distributed Subplan 3_1
           -> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=72)
              Task Count: 80
              Tasks Shown: One of 80
              -> Task
                 Node: host=10.97.97.10 port=5432 dbname=some_db
                 -> Seq Scan on "some_event_2020_JAN_36474381" "some_event"
                    (cost=0.00..59133.21 rows=2339605 width=47)
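
      The "Distributed Subplan" in this plan means Citus materializes the CTE and ships it to the workers as an intermediate result, since the join is not on the distribution column. A hedged example of the setting that bounds how large such intermediate results may grow before the query is aborted (the value here is only an illustration):

          SET citus.max_intermediate_result_size TO '5GB';
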
  55. -- Incorrect
      SELECT * FROM events.some_event WHERE date_trunc('DAY', time) >= '2023-01-01';
      -- Correct
      EXPLAIN SELECT * FROM events.some_event WHERE time >= '2023-01-01';
  56. CREATE TABLE measurement (
          city_id  INT  NOT NULL,
          logdate  DATE NOT NULL,
          peaktemp INT  NOT NULL
      ) PARTITION BY RANGE (logdate);
      SELECT create_distributed_table('measurement', 'city_id');
      CREATE TABLE measurement_y2006m02 PARTITION OF measurement
          FOR VALUES FROM ('2006-02-01') TO ('2006-03-01')
          PARTITION BY RANGE (peaktemp);
      [0A000] ERROR: distributing multi-level partitioned tables is not supported
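
      What is supported is a single level of partitioning on a distributed table; a minimal sketch using the same hypothetical measurement table:

          CREATE TABLE measurement (
              city_id  INT  NOT NULL,
              logdate  DATE NOT NULL,
              peaktemp INT
          ) PARTITION BY RANGE (logdate);

          SELECT create_distributed_table('measurement', 'city_id');

          -- One level of time partitioning works; every partition is also sharded by city_id.
          CREATE TABLE measurement_y2006m02 PARTITION OF measurement
              FOR VALUES FROM ('2006-02-01') TO ('2006-03-01');
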
  57. Q&A