Slide 1

Slide 1 text

Citus and JSON for real-time analytics at Vizor Games

Slide 2

Slide 2 text

● Ivan Vyazmitinov, Vizor Games ● Technical Lead, Internal Tools, 5+ years ● Java developer, accidental DBA

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Achievements: ● 2014: Klondike named one of the 10 best new games on Facebook. ● 2015: Knights and Brides won Facebook’s best web game award. ● 2020: VIZOR ranked 12th in TOP 30 EMEA-headquartered overall publishers by revenue (iOS App Store & Google Play), according to data.ai. ● 2021: VIZOR ranked 12th in TOP 30 EMEA-headquartered overall publishers worldwide by consumer spend (iOS App Store & Google Play), according to data.ai.

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Phew...

Slide 7

Slide 7 text

Phew... Let’s get started!

Slide 8

Slide 8 text

Division of labor

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

Philosophy

Slide 13

Slide 13 text

Ingest everything.

Slide 14

Slide 14 text

Ingest everything. At max rate.

Slide 15

Slide 15 text

Ingest everything. At max rate. At any cost.

Slide 16

Slide 16 text

Goals

Slide 17

Slide 17 text

Goals: ● Schemaless (semistructured)

Slide 18

Slide 18 text

Goals: ● Schemaless (semistructured) ● Realtime

Slide 19

Slide 19 text

Goals: ● Schemaless (semistructured) ● Realtime ● Scalable

Slide 20

Slide 20 text

Goals: ● Schemaless (semistructured) ● Realtime ● Scalable ● Efficient

Slide 21

Slide 21 text

Goals: ● Schemaless (semistructured) ● Realtime ● Scalable ● Efficient ● SQL-compliant

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Pros and Cons

Slide 29

Slide 29 text

Pros and Cons (2017)

Slide 30

Slide 30 text

Pros and Cons (2017): ● Open-source

Slide 31

Slide 31 text

Pros and Cons (2017): ● Open-source ● JSONB

Slide 32

Slide 32 text

Pros and Cons (2017): ● Open-source ● JSONB ● SQL

Slide 33

Slide 33 text

Pros and Cons (2017): ● Open-source ● JSONB ● SQL ● OLAP

Slide 34

Slide 34 text

Pros and Cons (2017): ● Open-source ● JSONB ● SQL ● OLAP ● No columnar

Slide 35

Slide 35 text

Pros and Cons (2017): ● Open-source ● JSONB ● SQL ● OLAP ● No columnar ● No rebalancing

Slide 36

Slide 36 text

Pros and Cons (2023)

Slide 37

Slide 37 text

Pros and Cons (2023): ● Open-source ● JSONB ● SQL ● OLAP ● Columnar ● Rebalancing (the 2017 cons, "No columnar" and "No rebalancing", now crossed out)

Slide 38

Slide 38 text

Setup

Slide 39

Slide 39 text

Coordinator: ● AMD Ryzen™ 9 5950X ● 128 GB DDR4 ECC ● 2 x 3.84 TB NVMe SSD Datacenter Edition ● 1 GBit/s port ● Gentoo ● BTRFS

Slide 40

Slide 40 text

Worker (x40): ● AMD Ryzen 7 3700X Octa-Core "Matisse" (Zen2) ● 64 GB DDR4 ECC RAM ● 2 x 1 TB NVMe SSD + Backup SSD ● 1 GBit/s port ● Gentoo ● BTRFS

Slide 41

Slide 41 text

Linux configs: ● transparent_hugepage=never hugepagesz=1G default_hugepagesz=1G ● vm.nr_hugepages=20 ● no swap

Slide 42

Slide 42 text

Postgres configs: ● huge_pages = 'on' ● shared_buffers = '16GB' ● effective_cache_size = '54GB' ● random_page_cost = 1.1

Slide 43

Slide 43 text

BTRFS is a modern copy on write (COW) filesystem for Linux aimed at implementing advanced features while also focusing on fault tolerance, repair and easy administration. Btrfs wiki

Slide 44

Slide 44 text

● File system snapshots ●

Slide 45

Slide 45 text

● File system snapshots ● Transparent compression
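The deck doesn't show how compression was enabled; a common approach (an assumption on my part, with an illustrative device, mount point, and compression level) is the compress mount option in /etc/fstab:

```
# /etc/fstab: transparent zstd compression for the Postgres data volume
/dev/nvme0n1p2  /var/lib/postgresql  btrfs  compress=zstd:3,noatime  0 0
```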

Slide 46

Slide 46 text

~ $ compsize /var/lib/postgresql/14/data/
Processed 1898298 files, 114140807 regular extents (118227654 refs), 120 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       18%        1.5T         8.6T         8.4T
none       100%        489G         489G         443G
zstd        13%        1.1T         8.1T         8.0T

Slide 47

Slide 47 text

Afterthought: build with --with-segsize=(SEGSIZE > 1GB). "Set the segment size, in gigabytes. Large tables are divided into multiple operating-system files, each of size equal to the segment size." docs

Slide 48

Slide 48 text

Solution

Slide 49

Slide 49 text

{ "event_name": "some_event", "user_id": "some_uid", "event_time": 1672531200000000, #epoch microseconds "data": { "some_data": "f432917", "int_data": 1580465207693, "nested_data": { ... } } }

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

No content

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

+-------+-----------+----------+----------+
|user_id|update_time|up_to_time|data      |
+-------+-----------+----------+----------+
|U1     |2023-04-11 |2023-04-12|{"lvl": 1}|
|U1     |2023-04-12 |2023-04-13|{"lvl": 2}|
|U1     |2023-04-13 |infinity  |{"lvl": 3}|
+-------+-----------+----------+----------+

Slide 54

Slide 54 text

SELECT *
FROM events."some_event" AS event
         INNER JOIN profiles.permanent perm USING (user_id)
         INNER JOIN profiles.volatile vol ON (
             event."user_id" = vol."user_id"
         AND vol.update_time <= event.time
         AND event.time < vol.up_to_time);

Slide 55

Slide 55 text

SELECT create_distributed_table('events.some_event', 'user_id');
SELECT create_distributed_table('profiles.permanent', 'user_id');
SELECT create_distributed_table('profiles.volatile', 'user_id');

Slide 56

Slide 56 text

Third-party data sources: ● Regular wide (50+ columns) tables ● Local or another distribution

Slide 57

Slide 57 text

Results

Slide 58

Slide 58 text

+-------------------+--------+
|Total queries      |22990   |
+-------------------+--------+
|Total calls        |29667156|
+-------------------+--------+
|Average (s)        |0.17    |
+-------------------+--------+
|95th percentile (s)|33.71   |
+-------------------+--------+
|99th percentile (s)|265.01  |
+-------------------+--------+

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

JSONB

Slide 61

Slide 61 text

Caveats

Slide 62

Slide 62 text

Caveats: 1. TOASTed 2.

Slide 63

Slide 63 text

PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code. The technique is affectionately known as TOAST (or “the best thing since sliced bread”). The TOAST infrastructure is also used to improve handling of large data values in-memory. docs

Slide 64

Slide 64 text

Caveats: 1. TOASTed 2. No statistics blog post

Slide 65

Slide 65 text

EXPLAIN (ANALYZE)
SELECT *
FROM some_event
WHERE data ->> 'some_key' = 'some_value';

Slide 66

Slide 66 text

-> Parallel Seq Scan on some_event_36234134 some_event
   (cost=0.00..730710.38 rows=9100 width=17)
   (actual time=114.659..29370.110 rows=823924 loops=1)
     Filter: ((data ->> 'some_key'::text) = 'some_value'::text)
     Rows Removed by Filter: 11134520

Slide 67

Slide 67 text

Workarounds

Slide 68

Slide 68 text

TOAST

Slide 69

Slide 69 text

TOAST: 1. Thin events/big profiles 2.

Slide 70

Slide 70 text

TOAST: 1. Thin events/big profiles 2. Restrict size on ingestion via CHECK (not cool) 3.

Slide 71

Slide 71 text

TOAST: 1. Thin events/big profiles 2. Restrict size on ingestion via CHECK (not cool) 3. Increase toast_tuple_target/BLCKSZ (debatable)
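Options 2 and 3 above can be sketched in SQL; `toast_tuple_target` is a real per-table storage parameter, but the table name and both thresholds here are illustrative only, not taken from the deck:

```sql
-- Option 3: raise the threshold at which tuples are TOASTed
-- (default is ~2 kB with an 8 kB BLCKSZ; 6144 is only an example value).
ALTER TABLE events.some_event SET (toast_tuple_target = 6144);

-- Option 2 ("not cool"): reject oversized payloads on ingestion.
ALTER TABLE events.some_event
    ADD CONSTRAINT data_size_check CHECK (pg_column_size(data) < 2048);
```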

Slide 72

Slide 72 text

Statistics

Slide 73

Slide 73 text

Statistics: 1. Acceptance 2.

Slide 74

Slide 74 text

Statistics: 1. Acceptance 2. GIN Indexes 3.

Slide 75

Slide 75 text

GIN stands for Generalized Inverted Index. GIN is designed for handling cases where the items to be indexed are composite values, and the queries to be handled by the index need to search for element values that appear within the composite items. For example, the items could be documents, and the queries could be searches for documents containing specific words. docs

Slide 76

Slide 76 text

The default GIN operator class for JSONB supports queries with the key-exists operators ?, ?| and ?&, the containment operator @>, and the jsonpath match operators @? and @@. docs

Slide 77

Slide 77 text

CREATE INDEX IF NOT EXISTS some_event_value_idx
    ON some_event USING gin (data jsonb_path_ops);
ANALYZE some_event;
EXPLAIN (ANALYZE)
SELECT *
FROM some_event
WHERE data @@ '$.some_key == "some_value"';

Slide 78

Slide 78 text

-> Bitmap Heap Scan on some_event_36234138 some_event
   (cost=6343.61..522449.43 rows=776660 width=17)
   (actual time=435.453..28274.869 rows=822426 loops=1)
     Recheck Cond: (data @@ '($."some_key" == "some_value")'::jsonpath)
     Heap Blocks: exact=443181
     -> Bitmap Index Scan on some_event_value_idx_36234138
        (cost=0.00..6149.45 rows=776660 width=0)
        (actual time=187.053..187.054 rows=822427 loops=1)
          Index Cond: (data @@ '($."some_key" == "some_value")'::jsonpath)

Slide 79

Slide 79 text

GIN index cons: ● Slower than Btree

Slide 80

Slide 80 text

GIN index cons: ● Slower than Btree ● jsonpath

Slide 81

Slide 81 text

GIN index cons: ● Slower than Btree ● jsonpath ≈ regex

Slide 82

Slide 82 text

● Slower than Btree ● j̵̨͎̗̼̯̪̺̮͗̈̎̽ͬ̒̏ͮ̈́̋̚͟ͅ _̧̡̙̖̽̾̽̓͐͗ͧ͡ s̵̡̳̹ͬͩ͌̈͠ ơ̡̺̳͎̤̪͚̻̙ͨͭ̈́̉ͦͨ͟ n͎̦̞͔̉͠͡ p̨̝̱͓̜͍͚̜͉͖͉͊̏ͤ̉̆̀͐̆̃ a̴̵̢̭̦̟͍̺͈̮̳͚̾̒̿̏ͥͣ̒̎̄͌̀ͦ̉ͣ̔̄͌̾̕̚͠͞͠͡ t̛̰͖̲̻͒̂́̃̐̓̉̓̽̄̆͊͘̕͜ ḩ̡̫̭͈̫̝͍̃́͂͆̀̒̚͟͟͞ ≈ r̲̄ͮͦ̋ _̸̸̖̬̫̺͚̩̫̱̙̯ͬͮ͌̎̎̀̈́̔̈̅̀̕ e͕̝̓ ǵ̡̮̠̯͇̻͍̪̣͔͕̬̓̒̈̽̋ͬ̾̂̚͘͘͢͞ ȩ̛̹͈̘̌̀ͯ̽̔ͨͯ͠ x̌ͩͅ GIN index cons

Slide 83

Slide 83 text

Statistics: 1. Acceptance 2. GIN Indexes 3. Btree Indexes 4.

Slide 84

Slide 84 text

CREATE INDEX IF NOT EXISTS some_event_some_key_idx
    ON some_event ((data ->> 'some_key'));
ANALYZE some_event;
EXPLAIN (ANALYZE)
SELECT *
FROM some_event
WHERE data ->> 'some_key' = 'some_value';

Slide 85

Slide 85 text

-> Index Scan using some_event_some_key_idx_36234128 on some_event_36234128 some_event
   (cost=0.43..385981.93 rows=822579 width=17)
   (actual time=5.678..8079.450 rows=821785 loops=1)
     Index Cond: ((data ->> 'some_key'::text) = 'some_value'::text)

Slide 86

Slide 86 text

Statistics: 1. Acceptance 2. GIN Indexes 3. Btree Indexes 4. Ultimate answer

Slide 87

Slide 87 text

Just throw at it!

Slide 88

Slide 88 text

SELECT * FROM citus_add_node('worker-1', 5432);
SELECT * FROM citus_add_node('worker-2', 5432);
SELECT * FROM citus_add_node('worker-3', 5432);
-- Two screens later...
SELECT * FROM citus_add_node('worker-n', 5432);
-- Blazingly fast
SELECT *
FROM some_event
WHERE data ->> 'some_key' = 'some_value';

Slide 89

Slide 89 text

More optimizations: 1. Default indexes 2.

Slide 90

Slide 90 text

CREATE TABLE IF NOT EXISTS profiles.permanent
(
    user_id VARCHAR NOT NULL
        CONSTRAINT pk_permanent PRIMARY KEY
    -- ...
);
CREATE INDEX IF NOT EXISTS volatile_profiles_user_id_time_index
    ON profiles.volatile (user_id, update_time, up_to_time);

Slide 91

Slide 91 text

More optimizations: 1. Default indexes 2. Denormalization

Slide 92

Slide 92 text

+----------+--------------+
|table_name|pg_size_pretty|
+----------+--------------+
|********  |28 TB         |
|********  |15 TB         |
|********  |3179 GB       |
|********  |2812 GB       |
|********  |1977 GB       |
|********  |1559 GB       |
|********  |1553 GB       |
|********  |1435 GB       |
|********  |1306 GB       |
|********  |1304 GB       |
+----------+--------------+

Slide 93

Slide 93 text

SELECT user_id,
       time,
       perm.data ->> 'some_key' AS some_key
FROM events.some_event AS event
         INNER JOIN profiles.permanent perm USING (user_id);

Slide 94

Slide 94 text

ALTER TABLE events.some_event
    ADD COLUMN some_key TEXT;
UPDATE events.some_event event_to_inject
SET some_key = data_to_inject.some_key
FROM (SELECT ... FROM events."some_event" AS event ...) data_to_inject
WHERE event_to_inject.user_id = data_to_inject.user_id
  AND event_to_inject.time = data_to_inject.time;

Slide 95

Slide 95 text

CREATE OR REPLACE PROCEDURE events.inject_fields()
    LANGUAGE sql
BEGIN ATOMIC
    UPDATE events.some_event event_to_inject
    SET some_key = data_to_inject.some_key
    FROM (SELECT ... FROM events."some_event" AS event ...) data_to_inject
    WHERE event_to_inject.user_id = data_to_inject.user_id
      AND event_to_inject.time = data_to_inject.time;
END;

Slide 96

Slide 96 text

SELECT cron.schedule('0 0 * * *', $$CALL events.inject_fields();$$);

Slide 97

Slide 97 text

SELECT user_id, time, some_key FROM events.some_event;

Slide 98

Slide 98 text

SELECT user_id, time, some_key
FROM events.some_event
WHERE time < current_date
UNION ALL
SELECT user_id, time, data ->> 'some_key' AS some_key
FROM events.some_event AS event
         INNER JOIN profiles.permanent perm USING (user_id)
WHERE event.time >= current_date;

Slide 99

Slide 99 text

Pre-aggregation

Slide 100

Slide 100 text

SELECT user_id, count(*)
FROM events.some_event
GROUP BY user_id;

Slide 101

Slide 101 text

CREATE MATERIALIZED VIEW counts AS
SELECT user_id, count(*)
FROM events.some_event
GROUP BY user_id;
SELECT cron.schedule('0 0 * * *',
                     $cmd$REFRESH MATERIALIZED VIEW counts;$cmd$);

Slide 102

Slide 102 text

CREATE MATERIALIZED VIEW counts AS
SELECT user_id, count(*)
FROM events.some_event
GROUP BY user_id;
SELECT create_distributed_table('counts', 'user_id');
SELECT cron.schedule('0 0 * * *',
                     $cmd$REFRESH MATERIALIZED VIEW counts;$cmd$);

Slide 103

Slide 103 text

Nope.
SELECT create_distributed_table('counts', 'user_id');
[42809] ERROR: counts is not a regular, foreign or partitioned table

Slide 104

Slide 104 text

https://github.com/citusdata/citus/issues/40

Slide 105

Slide 105 text

CREATE TABLE counts
(
    user_id TEXT,
    count   BIGINT
);
SELECT create_distributed_table('counts', 'user_id');

Slide 106

Slide 106 text

CREATE OR REPLACE PROCEDURE counts_refresh()
    LANGUAGE sql AS
$cmd$
TRUNCATE counts;
INSERT INTO counts
SELECT user_id, count(*)
FROM events.some_event
GROUP BY user_id;
$cmd$;
SELECT cron.schedule('0 0 * * *', $cmd$CALL counts_refresh();$cmd$);

Slide 107

Slide 107 text

Maintenance

Slide 108

Slide 108 text

No content

Slide 109

Slide 109 text

No content

Slide 110

Slide 110 text

GitOps

Slide 111

Slide 111 text

GitOps: 1. Repository 2.

Slide 112

Slide 112 text

GitOps: 1. Repository 2. Automated delivery 3.

Slide 113

Slide 113 text

GitOps: 1. Repository 2. Automated delivery 3. Changes only via commits

Slide 114

Slide 114 text

Multitenancy

Slide 115

Slide 115 text

Preferred ways

Slide 116

Slide 116 text

Preferred ways: ● Column-based multitenancy ●

Slide 117

Slide 117 text

Preferred ways: ● Column-based multitenancy ● Schema-based multitenancy

Slide 118

Slide 118 text

Our way

Slide 119

Slide 119 text

Our way: ● DATABASE-based multitenancy

Slide 120

Slide 120 text

DATABASE-based multitenancy Pros and Cons: ● Strong isolation

Slide 121

Slide 121 text

DATABASE-based multitenancy Pros and Cons: ● Strong isolation ● Custom configuration

Slide 122

Slide 122 text

DATABASE-based multitenancy Pros and Cons: ● Strong isolation ● Custom configuration ● Parallel migrations

Slide 123

Slide 123 text

DATABASE-based multitenancy Pros and Cons: ● Strong isolation ● Custom configuration ● Parallel migrations ● No X-DB queries

Slide 124

Slide 124 text

SELECT *
FROM tenant_one.events.some_event
         INNER JOIN tenant_two.events.some_event USING (user_id);
[0A000] ERROR: cross-database references are not implemented: "tenant_two.events.some_event"

Slide 125

Slide 125 text

CREATE EXTENSION dblink; -- CREATE EXTENSION postgres_fdw;
SELECT user_id,
       avg((data ->> 'some_key')::numeric) AS some_key_avg
FROM events.some_event
GROUP BY user_id
UNION ALL
SELECT *
FROM dblink('tenant_two',
            $$SELECT ... FROM events.some_event ...;$$)
         AS t(user_id TEXT, some_key_avg NUMERIC);

Slide 126

Slide 126 text

DATABASE-based multitenancy Pros and Cons: ● Strong isolation ● Custom configuration ● Parallel migrations ● No X-DB queries ● Poor support

Slide 127

Slide 127 text

One DATABASE with Citus requires: ● 1 daemon process on coordinator and workers

Slide 128

Slide 128 text

One DATABASE with Citus requires: ● 1 daemon process on coordinator and workers ● N connections per worker for deadlock detection

Slide 129

Slide 129 text

One DATABASE with Citus requires: ● 1 daemon process on coordinator and workers ● N connections per worker for deadlock detection ● 2 connections per worker for transaction recovery

Slide 130

Slide 130 text

Little math... N = 40; D = 30

Slide 131

Slide 131 text

Little math... N = 40; D = 30; D * (N + 2) =

Slide 132

Slide 132 text

Little math... N = 40; D = 30; 30 * (40 + 2) =

Slide 133

Slide 133 text

Little math... N = 40; D = 30; 30 * (40 + 2) = 1260 connections

Slide 134

Slide 134 text

Little math... N = 40; D = 30; 30 * (40 + 2) = 1260 connections; ~60 connections every citus.recover_2pc_interval = '1min'

Slide 135

Slide 135 text

Little math... N = 40;D = 30 30 * (40 + 2) = 1260 connections ~60 connections every citus.recover_2pc_interval = '1min' 1200 idle connections due to citus.distributed_deadlock_detection_factor = '2ms'
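The slide arithmetic, written out as a quick check (the per-database split of the formula is my reading of the bullets above, matching the ~60 and 1200 figures):

```python
# Maintenance-connection math for DATABASE-based multitenancy with Citus,
# using the numbers from the slides: N workers, D databases (tenants).
N = 40  # workers
D = 30  # databases

# 1 idle connection per worker, per database, for distributed deadlock detection
deadlock_detection = D * N          # 1200 idle connections

# 2 connections per database, opened every citus.recover_2pc_interval
transaction_recovery = D * 2        # ~60 connections per interval

total = deadlock_detection + transaction_recovery
assert total == D * (N + 2)
print(total)  # 1260 connections
```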

Slide 136

Slide 136 text

Workaround:
ALTER SYSTEM SET citus.recover_2pc_interval TO -1;
ALTER SYSTEM SET citus.distributed_deadlock_detection_factor TO -1;
SELECT cron.schedule(
    schedule := '*/5 * * * *',
    command := $cron$CALL execute_on_databases_with_citus(
        $cmd$SELECT check_distributed_deadlocks();$cmd$);$cron$);

Slide 137

Slide 137 text

https://github.com/citusdata/citus/issues/6548

Slide 138

Slide 138 text

Multi-level views

Slide 139

Slide 139 text

CREATE MATERIALIZED VIEW one AS SELECT ...;
SELECT cron.schedule('0 0 * * *', $$REFRESH MATERIALIZED VIEW one;$$);
CREATE MATERIALIZED VIEW two AS SELECT ...;
SELECT cron.schedule('0 1 * * *', $$REFRESH MATERIALIZED VIEW two;$$);
CREATE MATERIALIZED VIEW three AS SELECT * FROM one JOIN two ...;
SELECT cron.schedule('0 2 * * *', $$REFRESH MATERIALIZED VIEW three;$$);
CREATE MATERIALIZED VIEW four AS SELECT * FROM one JOIN three ...;
SELECT cron.schedule('0 3 * * *', $$REFRESH MATERIALIZED VIEW four;$$);

Slide 140

Slide 140 text

What’s wrong

Slide 141

Slide 141 text

What’s wrong: ● Pick cron manually ●

Slide 142

Slide 142 text

What’s wrong: ● Pick cron manually ● Hope there are no delays

Slide 143

Slide 143 text

What we did

Slide 144

Slide 144 text

No content

Slide 145

Slide 145 text

What we got

Slide 146

Slide 146 text

What we got: ● No cron (almost) ●

Slide 147

Slide 147 text

What we got: ● No cron (almost) ● Delay-resistant

Slide 148

Slide 148 text

What we got: ● No cron (almost) ● Delay-resistant ● Data for dashboard

Slide 149

Slide 149 text

No content

Slide 150

Slide 150 text

Rebalancing

Slide 151

Slide 151 text

https://www.citusdata.com/blog/2021/03/13/scaling-out-postgres-with-citus-open-source-shard-rebalancer/

Slide 152

Slide 152 text

https://www.citusdata.com/blog/2021/03/13/scaling-out-postgres-with-citus-open-source-shard-rebalancer/

Slide 153

Slide 153 text

Minor issues

Slide 154

Slide 154 text

Minor issues: ● Issues ●

Slide 155

Slide 155 text

https://github.com/citusdata/citus/issues/6705

Slide 156

Slide 156 text

Minor issues: ● Issues ● Underutilization

Slide 157

Slide 157 text

No content

Slide 158

Slide 158 text

No content

Slide 159

Slide 159 text

SELECT citus_rebalance_start();

Slide 160

Slide 160 text

No content

Slide 161

Slide 161 text

2 < 4

Slide 162

Slide 162 text

2 < 4; 2 = underutilization

Slide 163

Slide 163 text

-- SELECT citus_rebalance_start();
SELECT alter_distributed_table(
    table_name := 'events.some_event',
    shard_count := 16);

Slide 164

Slide 164 text

No content

Slide 165

Slide 165 text

Drawbacks

Slide 166

Slide 166 text

Drawbacks: ● No *start() function ●

Slide 167

Slide 167 text

Drawbacks: ● No *start() function ● Table is recreated ●

Slide 168

Slide 168 text

Drawbacks: ● No *start() function ● Table is recreated ● Table is recreated with issues

Slide 169

Slide 169 text

https://github.com/citusdata/citus/issues/5968

Slide 170

Slide 170 text

Advertisement

Slide 171

Slide 171 text

https://dzone.com/users/4759096/ivanvyazmitinov.html Skip ad ->>

Slide 172

Slide 172 text

User experience

Slide 173

Slide 173 text

Expectations

Slide 174

Slide 174 text

Expectations: ● Frequent (100/hour) ●

Slide 175

Slide 175 text

Expectations: ● Frequent (100/hour) ● Small (~100KB)

Slide 176

Slide 176 text

Reality

Slide 177

Slide 177 text

Reality: ● Infrequent (1/day) ● Large (~1.5 GB)

Slide 178

Slide 178 text

Reasons

Slide 179

Slide 179 text

Reasons: ● Tableau extracts (hyper files) ●

Slide 180

Slide 180 text

Reasons: ● Tableau extracts (hyper files) ● Pandas DataFrame ●

Slide 181

Slide 181 text

Reasons: ● Tableau extracts (hyper files) ● Pandas DataFrame ● Common practice

Slide 182

Slide 182 text

Result

Slide 183

Slide 183 text

Result: ● Slow-query tolerance ●

Slide 184

Slide 184 text

Result: ● Slow-query tolerance ● Inefficient SQL

Slide 185

Slide 185 text

Measures

Slide 186

Slide 186 text

Measures: ● Education ●

Slide 187

Slide 187 text

Measures: ● Education ● Restrictions ●

Slide 188

Slide 188 text

ALTER SYSTEM SET statement_timeout = '40min';
ALTER ROLE analytics_user SET statement_timeout = '20min';

Slide 189

Slide 189 text

https://gitlab.com/ongresinc/extensions/noset
CREATE EXTENSION noset;
ALTER ROLE analytics_user SET noset.enabled = true;
ALTER ROLE analytics_user SET noset.parameters = 'statement_timeout';

Slide 190

Slide 190 text

CREATE OR REPLACE PROCEDURE admin.kill_long_analytical_queries()
    LANGUAGE plpgsql AS
$proc$
DECLARE
    problematic_query RECORD;
BEGIN
    FOR problematic_query IN
        (SELECT pid, query
         FROM (SELECT now() - query_start AS run_time, query, pid
               FROM pg_stat_activity
               WHERE usename IN (...)) sub
         WHERE run_time >= INTERVAL '20 MINUTES')
        LOOP
            RAISE WARNING 'Killing pid % that was running %',
                problematic_query.pid, problematic_query.query;
            PERFORM pg_terminate_backend(problematic_query.pid);
        END LOOP;
END;
$proc$;
SELECT cron.schedule('*/5 * * * *', $$CALL admin.kill_long_analytical_queries()$$);

Slide 191

Slide 191 text

Measures: ● Education ● Restrictions ● Monitoring

Slide 192

Slide 192 text

Monitoring

Slide 193

Slide 193 text

What to monitor: ● Memory consumption ●

Slide 194

Slide 194 text

No content

Slide 195

Slide 195 text

PIDS=$(
  ps -o pid --sort=-size --no-headers $(
    psql postgres -qt -c "SELECT pid FROM pg_stat_activity WHERE usename != 'postgres'"
  ) | head -20
)
for PID in ${PIDS}
do
  QUERY_ID=$(
    psql postgres -qt -c "SELECT COALESCE(query_id, 0) FROM pg_stat_activity WHERE pid=${PID}" | sed 's/ //g'
  )
  MEM=$(ps -o size --no-headers ${PID})
  if [ -n "${MEM}" ]
  then
    echo '{"key":"processes_postgres_memory_usage_kilobytes","pid":"'${PID}'","query_id":"'${QUERY_ID}'","value":'${MEM}'},'
  fi
done

Slide 196

Slide 196 text

https://www.citusdata.com/blog/2022/07/21/citus-stat-activity-views-for-distributed-postgres/#gpid

Slide 197

Slide 197 text

What to monitor: ● Memory consumption ● Query execution times

Slide 198

Slide 198 text

No content

Slide 199

Slide 199 text

CREATE EXTENSION pg_stat_statements;
SELECT queryid,
       datname::TEXT,
       rolname::TEXT,
       now(),
       calls,
       mean_exec_time,
       total_exec_time
FROM pg_stat_statements t1
         JOIN pg_database t2 ON (t1.dbid = t2.oid)
         JOIN pg_roles t3 ON (t1.userid = t3.oid)
WHERE t3.rolname IN (...)
  AND total_exec_time > 0;

Slide 200

Slide 200 text

ALTER SYSTEM SET log_destination = 'stderr,csvlog';
ALTER SYSTEM SET log_min_duration_statement = 60000;
COPY long_query_examples
    FROM '/var/lib/postgresql/14/data/log/postgresql-.csv'
    WITH CSV
    WHERE command_tag IN ('SELECT', 'REFRESH MATERIALIZED VIEW', 'CALL')
      AND message LIKE '%duration:%';

Slide 201

Slide 201 text

Optimizations

Slide 202

Slide 202 text

Common optimizations: ● Look out for Distributed Subplan ●

Slide 203

Slide 203 text

WITH sc AS (SELECT *
            FROM events."some_event"
            WHERE time >= '2020-01-01'
              AND time < '2020-02-01')
SELECT '*******' AS user_id_masked,
       count(er.user_id) AS "amount"
FROM sc
         INNER JOIN events."another_event" er
                    ON sc.data ->> 'country' = er.data ->> 'country'
GROUP BY er.user_id;

Slide 204

Slide 204 text

Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=72)
  -> Distributed Subplan 3_1
     -> Custom Scan (Citus Adaptive) (cost=0.00..0.00 rows=100000 width=72)
        Task Count: 80
        Tasks Shown: One of 80
        -> Task
           Node: host=10.97.97.10 port=5432 dbname=some_db
           -> Seq Scan on "some_event_2020_JAN_36474381" "some_event"
              (cost=0.00..59133.21 rows=2339605 width=47)

Slide 205

Slide 205 text

Common optimizations: ● Look out for Distributed Subplan ● Partitions ●

Slide 206

Slide 206 text

-- Incorrect
SELECT *
FROM events.some_event
WHERE date_trunc('DAY', time) >= '2023-01-01';
-- Correct
SELECT *
FROM events.some_event
WHERE time >= '2023-01-01';

Slide 207

Slide 207 text

Common optimizations: ● Look out for Distributed Subplan ● Partitions ● Indexes

Slide 208

Slide 208 text

Phew… x2

Slide 209

Slide 209 text

Conclusion

Slide 210

Slide 210 text

Top wanted features

Slide 211

Slide 211 text

Top wanted features: ● JSONB statistics ●

Slide 212

Slide 212 text

Top wanted features: ● JSONB statistics ● Multi-level schemas ●

Slide 213

Slide 213 text

Top wanted features: ● JSONB statistics ● Multi-level schemas ● Sub partitions ●

Slide 214

Slide 214 text

CREATE TABLE measurement
(
    city_id  INT  NOT NULL,
    logdate  DATE NOT NULL,
    peaktemp INT
) PARTITION BY RANGE (logdate);
SELECT create_distributed_table('measurement', 'city_id');
CREATE TABLE measurement_y2006m02
    PARTITION OF measurement
    FOR VALUES FROM ('2006-02-01') TO ('2006-03-01')
    PARTITION BY RANGE (peaktemp);
[0A000] ERROR: distributing multi-level partitioned tables is not supported

Slide 215

Slide 215 text

Top wanted features: ● JSONB statistics ● Multi-level schemas ● Sub partitions ● Distributed mat views

Slide 216

Slide 216 text

Final Thoughts

Slide 217

Slide 217 text

Citus is a great extension for Postgres

Slide 218

Slide 218 text

https://speakerdeck.com/ivyazmitinov/citus-con-2023

Slide 219

Slide 219 text

Q&A