
Make use of CrateDB's performance potential


Datastores promise easy scaling of your cluster to handle your data needs with top performance, but it (almost) never works as promised with an out-of-the-box configuration. So, how can you scale efficiently with CrateDB and achieve performant queries? You will learn how sharding & partitioning choices, along with correct data modeling, can get you there. You'll also discover how to monitor your cluster to avoid unpleasant surprises, as Marios shares his best practices for keeping your cluster healthy at all times! Finally, you'll learn about the key factors that will allow you to deploy CrateDB to the cloud and achieve the desired performance.

cratedb

June 23, 2023



Transcript

  1. About the speaker
     Marios Trivyzas, Senior Software Engineer @Crate.io, [email protected]
     • MSc in Computer Science from the University of Athens
     • Background in software engineering and databases/datastores
     • Interests: databases & datastores, Java performance & benchmarking, Lucene
  2. Data Insights at Scale - Agenda
     1. Sharding
     2. Partitioning
     3. Upsert Performance
     4. Cluster Monitoring
     5. Query Performance
     6. Q&A
  3. Scale out as your data grows - Sharding
     Number of Shards
     § Too many shards waste cluster resources
     § Too few, and you don't use the full performance capacity of your cluster
     § Ideally, for every table on every node, you want #shards = #CPU cores
     § Plan for the future: #tables, #concurrent queries on those tables
     § Use dedicated master (master-eligible) nodes in the cluster
     § 1 GB of heap on dedicated master nodes ~ 3,000 shards
     Number of Replicas
     § At least one is mandatory to avoid downtime
     § More replicas = more throughput with many concurrent clients querying the same table(s)
     § More replicas also consume cluster resources
     § More replicas can slow down inserts/updates
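The #shards = #CPU cores rule of thumb above can be sketched in DDL. This is a hypothetical example (table name, columns, and cluster size are assumptions, not from the talk): on a 3-node cluster with 4 CPU cores per node, 12 shards per table matches the guideline.

```sql
-- Hypothetical cluster: 3 nodes x 4 CPU cores = 12 shards in total,
-- so each node ends up with #shards = #cores for this table.
CREATE TABLE sensor_readings (
    sensor_id TEXT,
    ts TIMESTAMP WITH TIME ZONE,
    value DOUBLE PRECISION
) CLUSTERED BY (sensor_id) INTO 12 SHARDS
  WITH (number_of_replicas = 1);  -- at least one replica to avoid downtime
```

Remember to budget for replicas too: with `number_of_replicas = 1`, this table actually occupies 24 shards across the cluster.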
  4. Scale out as your data grows - Sharding
     Size of Shards
     § Too-large shards can cause long delays during cluster recovery and sharding-scheme changes
     § Better to plan ahead than to increase/decrease #shards on a production cluster
     § Aim for shards of 10GB-50GB
     § Choose the CLUSTERED BY key wisely
     Run tests with production-like data and get metrics from sys.shards
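Those metrics can be read from the `sys.shards` system table. A sketch of such a check (the table you inspect and the 20-row cutoff are arbitrary; `size` is reported in bytes):

```sql
-- Find the largest shards and check they stay within the 10-50 GB guideline
SELECT schema_name,
       table_name,
       id AS shard_id,
       size / 1073741824.0 AS size_gb,  -- bytes -> GiB
       num_docs,
       "primary"
FROM sys.shards
ORDER BY size DESC
LIMIT 20;
```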
  5. Scale out as your data grows - Partitioning/Shard allocation
     Partitioning
     § Each table partition is internally just another table with its own shards
     § A table with 20 partitions and 10 shards per partition has 200 shards
     § Choose the correct scheme for your needs, e.g. daily partitions could be too fine-grained, leaving you with lots of shards
     § Changes to the sharding scheme (#shards/#replicas) apply only to new partitions
     § Snapshot and close/drop old partitions; if you need to query archived data, restore the partitions you need
     Shard Allocation
     § Divide your cluster into node groups
     § Allocate shards of non-critical tables to the low-spec nodes
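A minimal sketch of these partitioning ideas, assuming a hypothetical metrics table with monthly (rather than too-fine-grained daily) partitions and a snapshot repository named backups (both invented for illustration):

```sql
-- Monthly partitions via a generated column, to keep the shard count in check
CREATE TABLE metrics (
    ts TIMESTAMP WITH TIME ZONE,
    month TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('month', ts),
    payload OBJECT
) CLUSTERED INTO 6 SHARDS
  PARTITIONED BY (month)
  WITH (number_of_replicas = 1);

-- Changing the sharding scheme applies only to partitions created afterwards:
ALTER TABLE metrics SET (number_of_shards = 12);

-- Archive an old partition, then drop it; restore later if you need to query it:
CREATE SNAPSHOT backups.metrics_2023_01
    TABLE metrics PARTITION (month = '2023-01-01')
    WITH (wait_for_completion = true);
DELETE FROM metrics WHERE month = '2023-01-01';
```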
  6. Query Performance - Understand your queries

     EXPLAIN SELECT count(*), userid FROM hits GROUP BY userid ORDER BY 1 DESC LIMIT 50
     Eval[count(*), userid]
       └ Limit[50::bigint;0]
         └ OrderBy[count(*) DESC]
           └ GroupHashAggregate[userid | count(*)]
             └ Collect[doc.hits | [userid] | true]

     EXPLAIN SELECT counterid, regionid, userid FROM hits WHERE userid IN (?) LIMIT 10
     Fetch[counterid, regionid, userid]
       └ Limit[10::bigint;0]
         └ Collect[doc.hits | [_fetchid] | (userid = ANY(_array($1)))]

     EXPLAIN SELECT * FROM hits WHERE userid IN (?) LIMIT 10
     Fetch[watchid, javaenable, title, goodevent, … (100+ fields!)]
       └ Limit[10::bigint;0]
         └ Collect[doc.hits | [_fetchid] | (userid = ANY(_array($1)))]
  7. Query Performance - Tips & Tricks
     § Hit as few partitions as possible – WHERE partKey IN (...)
     § Hit as few shards as possible – WHERE shKey IN (...)
     § Don't select unnecessary fields (avoid SELECT * FROM ...)
     § Use a sensible LIMIT; 10k rows are rarely needed
     § Avoid joins as much as possible; use denormalized table schemas
     § Hash joins can be fast, but nested loop joins usually are not
     § When joining frequently with a mostly static "lookup" table, give the "lookup" table few shards (ideally 1)
     § Don't refresh your dashboards too often!
     § A relatively light query can have an enormous performance hit if it's run every second by many clients
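Several of these tips combine naturally in one query. A hypothetical example against the hits table from the EXPLAIN slide (the day partition column is an assumption for illustration; userid is the CLUSTERED BY key in this sketch):

```sql
-- Restrict partitions and shards hit, select only needed columns,
-- and use a sensible LIMIT instead of SELECT * over everything
SELECT counterid, regionid, userid
FROM hits
WHERE day IN ('2023-06-01', '2023-06-02')  -- prune partitions (hypothetical partition key)
  AND userid IN (1001, 1002)               -- CLUSTERED BY key limits the shards queried
LIMIT 100;
```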
  8. Upsert performance and trade-offs
     § Use batch inserts; test to find the ideal batch size
     § Large documents (many and/or large fields) may require a smaller batch size
     § Disable indexing for columns that don't need to be indexed (also saves space)
     § Adjust the refresh interval (disable it and refresh manually between batch inserts)
     § Set validation = false when using COPY FROM
     § Don't use unnecessary CHECK constraints and generated columns
     Trade-offs
     § More replicas: better read throughput, worse write throughput
     § Small refresh interval: up-to-date query results, worse write throughput
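The indexing and refresh-interval tips above can be sketched as follows (the events table, column names, and file path are assumptions for illustration):

```sql
-- Disable indexing for a column you never filter on (also saves space)
CREATE TABLE events (
    id TEXT,
    raw_payload TEXT INDEX OFF
);

-- Disable the periodic refresh during a bulk load, refresh manually after
ALTER TABLE events SET (refresh_interval = 0);
COPY events FROM 'file:///tmp/events.json' WITH (validation = false);
REFRESH TABLE events;
ALTER TABLE events RESET (refresh_interval);
```

Note the trade-off from the slide: while the refresh interval is disabled, queries won't see the newly written rows until the manual REFRESH runs.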
  9. Cluster Monitoring - Be proactive
     § Check CPU/Memory/Disk - avoid node hotspots
     § JVM stats - check GC performance
     § Replicas that are not keeping up: check sys.shards.seq_no_stats
     § Segments health: OPTIMIZE your tables (especially with lots of upserts/deletes); use sys.segments
     § Up-to-date stats: ANALYZE your tables (use stats.service.max_bytes_per_sec for throttling)
     § Check slow queries and long-running jobs: sys.jobs is your friend!
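The checks above are plain SQL against CrateDB's sys tables; a sketch (the events table name is an assumption):

```sql
-- Replica lag: compare local vs. global checkpoints on replica shards
SELECT table_name, id,
       seq_no_stats['local_checkpoint'],
       seq_no_stats['global_checkpoint']
FROM sys.shards
WHERE "primary" = false;

-- Merge small segments after heavy upserts/deletes
OPTIMIZE TABLE events WITH (max_num_segments = 1);

-- Refresh optimizer statistics (throttle via stats.service.max_bytes_per_sec)
ANALYZE;

-- Spot long-running queries
SELECT id, started, stmt FROM sys.jobs ORDER BY started ASC LIMIT 10;
```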
  10. Do you want to learn more?
      Visit us
      § https://crate.io/products/cratedb
      Read more
      § Documentation: https://crate.io/docs/crate/
      § Performance guides: https://crate.io/docs/crate/howtos/en/latest/performance/index.html
      Try CrateDB
      § In the cloud for free: https://crate.io/lp-free-trial
      § On-premise/own cloud account: https://crate.io/download#cratedb