
Make use of CrateDB's performance potential


Datastores promise easy scaling of your cluster to handle your data needs with top performance, but it (almost) never works as promised with an out-of-the-box configuration. So, how can you scale efficiently with CrateDB and achieve performant queries? You will learn how sharding & partitioning choices, along with correct data modeling, can get you there. You'll also discover how to monitor your cluster to avoid unpleasant surprises, as Marios shares his best practices for keeping your cluster healthy at all times! Finally, you'll learn about the key factors that will allow you to deploy CrateDB to the cloud and achieve the desired performance.

cratedb

June 23, 2023



Transcript

  1. About the speaker
     Marios Trivyzas, Senior Software Engineer @Crate.io, [email protected]
     • MSc in Computer Science from the University of Athens
     • Background in software engineering and databases/datastores
     • Interests: databases & datastores, Java performance & benchmarking, Lucene
  2. Data Insights at Scale - Agenda
     1. Sharding
     2. Partitioning
     3. Upsert Performance
     4. Cluster Monitoring
     5. Query Performance
     6. Q&A
  3. Scale out as your data grows - Sharding
     Number of Shards
     § Too many shards waste cluster resources
     § Too few, and you don't use the full performance capacity of your cluster
     § Ideally, for every table on every node, you want #shards = #CPU cores
     § Plan for the future: #tables, #concurrent queries on those tables
     § Use dedicated master (master-eligible) nodes in the cluster
     § 1 GB of heap on dedicated master nodes ~ 3,000 shards
     Number of Replicas
     § At least one is mandatory to avoid downtime
     § More replicas = more throughput with many concurrent clients querying the same table(s)
     § More replicas also consume cluster resources
     § More replicas can slow down inserts/updates
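The #shards = #CPU cores rule of thumb above can be sketched in DDL. This is a hypothetical example (table name, columns, and cluster size are assumptions, not from the talk): on a 3-node cluster with 4 CPU cores per node, 12 shards per table matches the guideline.

```sql
-- Hypothetical cluster: 3 nodes x 4 CPU cores = 12 shards in total,
-- so each node ends up with #shards = #cores for this table.
CREATE TABLE sensor_readings (
    sensor_id TEXT,
    ts TIMESTAMP WITH TIME ZONE,
    value DOUBLE PRECISION
) CLUSTERED BY (sensor_id) INTO 12 SHARDS
  WITH (number_of_replicas = 1);  -- at least one replica to avoid downtime
```

Remember to budget for replicas too: with `number_of_replicas = 1`, this table actually occupies 24 shards across the cluster.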
  4. Scale out as your data grows - Sharding
     Size of Shards
     § Too-large shards can cause long delays during cluster recovery and sharding-scheme changes
     § Better to plan ahead than to increase/decrease #shards on a production cluster
     § Aim for shards of 10GB-50GB
     § Choose the CLUSTERED BY key wisely
     Run tests with production-like data and get metrics from sys.shards
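Those metrics can be read from the `sys.shards` system table. A sketch of such a check (the table you inspect and the 20-row cutoff are arbitrary; `size` is reported in bytes):

```sql
-- Find the largest shards and check they stay within the 10-50 GB guideline
SELECT schema_name,
       table_name,
       id AS shard_id,
       size / 1073741824.0 AS size_gb,  -- bytes -> GiB
       num_docs,
       "primary"
FROM sys.shards
ORDER BY size DESC
LIMIT 20;
```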
  5. Scale out as your data grows - Partitioning/Shard allocation
     Partitioning
     § Each table partition is internally just another table with its own shards
     § A table with 20 partitions and 10 shards per partition has 200 shards
     § Choose the correct scheme for your needs, e.g. daily partitions could be too fine-grained, leaving you with lots of shards
     § Changes to the sharding scheme (#shards/#replicas) apply only to new partitions
     § Snapshot and close/drop old partitions; if you need to query archived data, restore the partitions you need
     Shard Allocation
     § Divide your cluster into node groups
     § Allocate shards of non-critical tables to the low-spec nodes
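A minimal sketch of these partitioning ideas, assuming a hypothetical metrics table with monthly (rather than too-fine-grained daily) partitions and a snapshot repository named backups (both invented for illustration):

```sql
-- Monthly partitions via a generated column, to keep the shard count in check
CREATE TABLE metrics (
    ts TIMESTAMP WITH TIME ZONE,
    month TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('month', ts),
    payload OBJECT
) CLUSTERED INTO 6 SHARDS
  PARTITIONED BY (month)
  WITH (number_of_replicas = 1);

-- Changing the sharding scheme applies only to partitions created afterwards:
ALTER TABLE metrics SET (number_of_shards = 12);

-- Archive an old partition, then drop it; restore later if you need to query it:
CREATE SNAPSHOT backups.metrics_2023_01
    TABLE metrics PARTITION (month = '2023-01-01')
    WITH (wait_for_completion = true);
DELETE FROM metrics WHERE month = '2023-01-01';
```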
  6. Query Performance - Understand your queries

     EXPLAIN SELECT count(*), userid FROM hits GROUP BY userid ORDER BY 1 DESC LIMIT 50
     Eval[count(*), userid]
       └ Limit[50::bigint;0]
         └ OrderBy[count(*) DESC]
           └ GroupHashAggregate[userid | count(*)]
             └ Collect[doc.hits | [userid] | true]

     EXPLAIN SELECT counterid, regionid, userid FROM hits WHERE userid IN (?) LIMIT 10
     Fetch[counterid, regionid, userid]
       └ Limit[10::bigint;0]
         └ Collect[doc.hits | [_fetchid] | (userid = ANY(_array($1)))]

     EXPLAIN SELECT * FROM hits WHERE userid IN (?) LIMIT 10
     Fetch[watchid, javaenable, title, goodevent, … (100+ fields!)]
       └ Limit[10::bigint;0]
         └ Collect[doc.hits | [_fetchid] | (userid = ANY(_array($1)))]
  7. Query Performance - Tips & Tricks
     § Hit as few partitions as possible – WHERE partKey IN (...)
     § Hit as few shards as possible – WHERE shKey IN (...)
     § Don't select unnecessary fields (avoid SELECT * FROM ...)
     § Use a sensible LIMIT; 10k rows are rarely needed
     § Avoid joins as much as possible; use denormalized table schemas
     § Hash joins can be fast, but nested loop joins usually are not
     § When joining frequently with a mostly static "lookup" table, give the "lookup" table few shards (ideally 1)
     § Don't refresh your dashboards too often!
     § A relatively light query can have an enormous performance hit if it's run every second by many clients
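Several of these tips combine naturally in one query. A hypothetical example against the hits table from the EXPLAIN slide (the day partition column is an assumption for illustration; userid is the CLUSTERED BY key in this sketch):

```sql
-- Restrict partitions and shards hit, select only needed columns,
-- and use a sensible LIMIT instead of SELECT * over everything
SELECT counterid, regionid, userid
FROM hits
WHERE day IN ('2023-06-01', '2023-06-02')  -- prune partitions (hypothetical partition key)
  AND userid IN (1001, 1002)               -- CLUSTERED BY key limits the shards queried
LIMIT 100;
```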
  8. Upsert performance and trade-offs
     § Use batch inserts; test to find the ideal batch size
     § Large documents (many and/or large fields) may require a smaller batch size
     § Disable indexing for columns that don't need to be indexed (also saves space)
     § Adjust the refresh interval (disable it and refresh manually between batch inserts)
     § Set validation = false when using COPY FROM
     § Don't use unnecessary CHECK constraints and generated columns
     Trade-offs
     § More replicas: better read throughput, worse write throughput
     § Small refresh interval: up-to-date query results, worse write throughput
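The indexing and refresh-interval tips above can be sketched as follows (the events table, column names, and file path are assumptions for illustration):

```sql
-- Disable indexing for a column you never filter on (also saves space)
CREATE TABLE events (
    id TEXT,
    raw_payload TEXT INDEX OFF
);

-- Disable the periodic refresh during a bulk load, refresh manually after
ALTER TABLE events SET (refresh_interval = 0);
COPY events FROM 'file:///tmp/events.json' WITH (validation = false);
REFRESH TABLE events;
ALTER TABLE events RESET (refresh_interval);
```

Note the trade-off from the slide: while the refresh interval is disabled, queries won't see the newly written rows until the manual REFRESH runs.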
  9. Cluster Monitoring - Be proactive
     § Check CPU/Memory/Disk - avoid node hotspots
     § JVM stats - check GC performance
     § Replicas that are not keeping up: check sys.shards.seq_no_stats
     § Segments health: OPTIMIZE your tables (especially with lots of upserts/deletes); use sys.segments
     § Up-to-date stats: ANALYZE your tables (use stats.service.max_bytes_per_sec for throttling)
     § Check slow queries and long-running jobs: sys.jobs is your friend!
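The checks above are plain SQL against CrateDB's sys tables; a sketch (the events table name is an assumption):

```sql
-- Replica lag: compare local vs. global checkpoints on replica shards
SELECT table_name, id,
       seq_no_stats['local_checkpoint'],
       seq_no_stats['global_checkpoint']
FROM sys.shards
WHERE "primary" = false;

-- Merge small segments after heavy upserts/deletes
OPTIMIZE TABLE events WITH (max_num_segments = 1);

-- Refresh optimizer statistics (throttle via stats.service.max_bytes_per_sec)
ANALYZE;

-- Spot long-running queries
SELECT id, started, stmt FROM sys.jobs ORDER BY started ASC LIMIT 10;
```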
  10. Do you want to learn more?
      Visit us
      § https://crate.io/products/cratedb
      Read more
      § Documentation: https://crate.io/docs/crate/
      § Performance guides: https://crate.io/docs/crate/howtos/en/latest/performance/index.html
      Try CrateDB
      § In the cloud for free: https://crate.io/lp-free-trial
      § On-premise/own cloud account: https://crate.io/download#cratedb