Slide 1

Slide 1 text

The System Architecture of Real-time Reporting in Laravel
@LaravelConf 2022
Albert Chen

Slide 2

Slide 2 text

Outline
01 Aggregation Report in MySQL
02 Aggregation Improvements in MySQL
03 Row vs Column Oriented Databases
04 Integration with ClickHouse
05 Deduplication in ClickHouse
06 Optimizations in ClickHouse
07 Q&A

Slide 3

Slide 3 text

Slido #1478692 https://sli.do

Slide 4

Slide 4 text

Report is Everywhere
Daily revenue for product orders
Analysis for top-selling products
Ranks for most-clicked articles or videos
Total cost for business operation
Statements for financial department
Summaries of registered users
Statistics of unique visitors
... etc.

Slide 5

Slide 5 text

Basic Elements of Report
Data Source: database, files, services
Condition Filters: resource id, time range, numbers, text, data type
Aggregations: sum, average, count, max, min
Transformers: concat, mask, numbers format
Groups
Orders
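These elements map directly onto the clauses of a typical report query. A minimal sketch in MySQL, assuming a hypothetical orders table (table and column names are illustrative, not from the talk):

SELECT product_id,                                            -- group key
       SUM(total) AS revenue,                                 -- aggregation
       CONCAT('$', FORMAT(SUM(total), 2)) AS revenue_label    -- transformer
FROM orders                                                   -- data source
WHERE status != 'canceled'                                    -- condition filters
  AND created_at >= '2022-11-01'
GROUP BY product_id                                           -- groups
ORDER BY revenue DESC                                         -- orders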

Slide 6

Slide 6 text

Report Performance in MySQL
How much time does this query take if your table has 1,000, 10k, 1M, 10M or 100M rows?

SELECT product_id, COUNT(id) AS count
FROM orders
WHERE status != 'canceled'
GROUP BY product_id
ORDER BY count DESC
LIMIT 10

id | user_id | product_id | amount | discount | shipping_fee | total | status | created_at
1 | 35 | 2 | 1000 | 50 | 0 | 950 | shipped | 2022-11-11 09:24:13
2 | 78 | 5 | 860 | 20 | 60 | 900 | paid | 2022-11-11 10:12:04
3 | 93 | 11 | 780 | 0 | 60 | 840 | canceled | 2022-11-12 01:39:17

Slide 7

Slide 7 text

Aggregation Improvements in MySQL

Add Appropriate Indexes
Only helps filter rows, not aggregation
Too many indexes is not a good idea

Partitions and Sharding
Suitable for fixed combinations of conditions
Complicated and limited usage (joins, transactions, etc.)

Vertical Scaling
There's a hard limit for a single server
Assumes money is not a problem
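For illustration, a minimal index sketch in MySQL, assuming the hypothetical orders table from the earlier slide:

ALTER TABLE orders ADD INDEX idx_status_product (status, product_id);

-- A covering index can let MySQL answer the query from the index alone,
-- but the per-product aggregation still has to scan every matching entry.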

Slide 8

Slide 8 text

Aggregation Improvements in MySQL

Pre-aggregated Results
MySQL doesn't support materialized views
Aggregations for different combinations of conditions are difficult to maintain
Aggregation results need to be refreshed whenever the source data changes

id | amount | discount | shipping_fee | total | status | period
1 | 2 | 50 | 0 | 950 | shipped | 2022-11-11 09:00
2 | 5 | 20 | 60 | 900 | paid | 2022-11-11 10:00
3 | 11 | 0 | 60 | 840 | canceled | 2022-11-11 11:00
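A minimal sketch of maintaining such a pre-aggregated table by hand in MySQL; the table and column names below are assumptions for illustration, not from the talk:

CREATE TABLE order_summaries_hourly (
    period        DATETIME PRIMARY KEY,
    order_count   INT,
    total_revenue DECIMAL(12, 2)
);

-- Must be re-run for a bucket whenever its source rows change
REPLACE INTO order_summaries_hourly
SELECT DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00') AS period,
       COUNT(*),
       SUM(total)
FROM orders
WHERE created_at >= '2022-11-11 09:00:00'
  AND created_at <  '2022-11-11 10:00:00'
GROUP BY period;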

Slide 9

Slide 9 text

Aggregation Improvements in MySQL

Filters in Our Imagination
Status: (select)  Order Date: (date)  [Search]

Slide 10

Slide 10 text

Aggregation Improvements in MySQL

Filters for Users' Needs
Name: (text)  Gender: (select)  City: (select)  Status: (select)  Agent: (select)
Shipping Date: (date)  Order Date: (date)  Accounting Date: (date)  Updated Date: (date)  [Search]

Slide 11

Slide 11 text

Aggregation Improvements in MySQL

There's not much you can do with MySQL if...
Data rows are large for aggregation (over 10 million)
Filtering conditions are flexible
Data rows are mutable
Near real-time requirement

Slide 12

Slide 12 text

Row vs Column Oriented Databases

Row Oriented Databases
Data associated with the same record is kept next to each other in memory
Optimized for reading and writing a single row of data
Common in OLTP databases like MySQL, Postgres, etc.

id | name | country | age
1 | Taylor | USA | 35
2 | Evan | China | 35

Slide 13

Slide 13 text

Row vs Column Oriented Databases

Column Oriented Databases
Data associated with the same column is kept next to each other in memory
Optimized for reads that analyze data
Common in data warehouses like Cassandra, HBase, BigQuery, ClickHouse, etc.

id: 1, 2
name: Taylor, Evan
country: USA, China
age: 35, 35

Slide 14

Slide 14 text

Row vs Column Oriented Databases

Pros of Columnar Database
Queries that use only a few columns out of many
Aggregating queries against large volumes of data
Column-wise data compression

Cons of Columnar Database
Poor at data mutations like update or delete
Not optimized for querying row-specific data
Lack of unique constraints

Slide 15

Slide 15 text

Aggregation in Row Oriented Databases Row vs Column Oriented Databases (https://clickhouse.com/docs/en/faq/general/columnar-database)

Slide 16

Slide 16 text

Aggregation in Column Oriented Databases Row vs Column Oriented Databases (https://clickhouse.com/docs/en/faq/general/columnar-database)

Slide 17

Slide 17 text

Benchmark Row vs Column Oriented Databases (https://benchmark.clickhouse.com)

Slide 18

Slide 18 text

Introduction of ClickHouse
An open source (Apache 2.0), relational, OLAP database, available since 2016
Originally developed as an internal project by Yandex (a Russian search engine)
The name ClickHouse is a combination of Clickstream and Data WareHouse
Known for extremely high query speed and performance
Uses a SQL dialect largely based on ANSI SQL, which makes it easy to interact with
Very active development on GitHub
Many companies use ClickHouse in production, like Uber, Slack, Tesla, Tencent, TikTok, Cloudflare, etc.

Slide 19

Slide 19 text

Introduction of ClickHouse

Why is ClickHouse so fast?
Column-oriented storage
Data compression with specialized codecs that make data even more compact
Vectorized query execution that takes advantage of SIMD CPU instructions
ClickHouse can leverage all available CPU cores and disks to execute even a single query

Slide 20

Slide 20 text

Introduction of ClickHouse

Downsides of ClickHouse
Limited updates and deletes: mutations in ClickHouse are heavy operations and not designed for frequent use
Limited transactional (ACID) support at this moment (only INSERTs into one partition of one MergeTree-family table)
ClickHouse Keeper is not yet as stable as the official announcement suggests

Slide 21

Slide 21 text

Introduction of ClickHouse

Table Engines
MergeTree family: MergeTree, ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, CollapsingMergeTree, etc.
Log family: Log, TinyLog and StripeLog
Integration engines: MySQL, Postgres, ODBC, JDBC, MongoDB, Kafka, etc.
Special engines: File, MaterializedView, URL, Buffer, Memory, etc.
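For reference, a minimal sketch of creating a MergeTree-family table; the authors table mirrors the one used in later slides, but the exact schema is an assumption:

CREATE TABLE authors
(
    id         UInt64,
    name       String,
    age        UInt8,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)   -- keeps the row with the latest updated_at after merges
PARTITION BY toYYYYMM(updated_at)
ORDER BY id;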

Slide 22

Slide 22 text

Integration with ClickHouse
Report Data Syncing Flow

Slide 23

Slide 23 text

Integration with ClickHouse

Data Syncing Flow

Source table (MySQL):
id | name | age | updated_at
1 | Taylor | 35 | 2022-12-01 00:00:00
2 | Evan | 35 | 2022-12-01 00:00:00

Destination table (ClickHouse):
id | name | age | updated_at
1 | Taylor | 35 | 2022-12-01 00:00:00
2 | Evan | 35 | 2022-12-01 00:00:00

Slide 24

Slide 24 text

Integration with ClickHouse

Data Syncing Flow

Source table (MySQL), after a row is updated:
id | name | age | updated_at
1 | Taylor | 36 | 2022-12-01 00:01:00
2 | Evan | 35 | 2022-12-01 00:00:00

insert →

Destination table (ClickHouse), not yet synced:
id | name | age | updated_at
1 | Taylor | 35 | 2022-12-01 00:00:00
2 | Evan | 35 | 2022-12-01 00:00:00

Slide 25

Slide 25 text

Integration with ClickHouse

Data Syncing Flow

Destination table (ClickHouse) after the changed row is inserted:
id | name | age | updated_at
1 | Taylor | 35 | 2022-12-01 00:00:00
2 | Evan | 35 | 2022-12-01 00:00:00
1 | Taylor | 36 | 2022-12-01 00:01:00

? →

Desired result:
id | name | age | updated_at
1 | Taylor | 36 | 2022-12-01 00:01:00
2 | Evan | 35 | 2022-12-01 00:00:00

Slide 26

Slide 26 text

Integration with ClickHouse

Data Syncing Flow

Source table (MySQL):
id | name | age | updated_at
1 | Taylor | 36 | 2022-12-01 00:01:00
2 | Evan | 35 | 2022-12-01 00:00:00

Deduplicate →

Destination table (ClickHouse) after deduplication:
id | name | age | updated_at
1 | Taylor | 36 | 2022-12-01 00:01:00
2 | Evan | 35 | 2022-12-01 00:00:00

Slide 27

Slide 27 text

Deduplication in ClickHouse

UPDATE/DELETE mutations
ReplacingMergeTree / CollapsingMergeTree
  with FINAL modifier
  with OPTIMIZE statement
Aggregate Function
MaterializedMySQL Database Engine
Altinity Sink Connector
Lightweight DELETE feature
UniqueMergeTree in ByteHouse (not implemented)

Slide 28

Slide 28 text

Deduplication in ClickHouse

UPDATE/DELETE Mutations
UPDATE and DELETE are not supported in early versions
Mutations are extremely heavy operations
Mutations are asynchronous by default
There's no transaction guarantee

ALTER TABLE authors DELETE WHERE id = 1
INSERT INTO authors VALUES (1, 'Taylor', 36, '2022-12-01 00:00:00')

Slide 29

Slide 29 text

Deduplication in ClickHouse

ReplacingMergeTree / CollapsingMergeTree
ReplacingMergeTree deduplicates data by itself in the background
Merge timing is controlled by ClickHouse
Need to use FINAL or OPTIMIZE to force merging

Before merge:
id | name | age | updated_at
1 | Taylor | 35 | 2022-12-01 00:00:00
2 | Evan | 35 | 2022-12-01 00:00:00
1 | Taylor | 36 | 2022-12-01 00:01:00

After merge (deduplicated):
id | name | age | updated_at
1 | Taylor | 36 | 2022-12-01 00:01:00
2 | Evan | 35 | 2022-12-01 00:00:00

Slide 30

Slide 30 text

Deduplication in ClickHouse

ReplacingMergeTree / CollapsingMergeTree with FINAL modifier
FINAL slows down query performance
FINAL used to run on a single thread; it has become multi-threaded in recent versions

SELECT * FROM authors FINAL

Slide 31

Slide 31 text

Deduplication in ClickHouse

ReplacingMergeTree / CollapsingMergeTree with OPTIMIZE statement
OPTIMIZE triggers partition merges immediately
OPTIMIZE uses a lot of CPU and disk resources
Enable optimize_throw_if_noop to get an exception when nothing is merged
Can be limited to specific partitions

OPTIMIZE TABLE authors PARTITION '202212' FINAL

Partition 202211: ... unchanged
Partition 202212: duplicates merged

Slide 32

Slide 32 text

Deduplication in ClickHouse

Aggregate Function
argMax with GROUP BY
Create a view for a better query experience
Duplicated data still exists, but is filtered out at query time
Difficult to optimize performance for some complicated queries

CREATE VIEW authors_view AS
SELECT
    id,
    argMax(name, updated_at) AS name,
    argMax(age, updated_at) AS age,
    max(updated_at) AS updated_at
FROM authors
GROUP BY id

Slide 33

Slide 33 text

Deduplication in ClickHouse

MaterializedMySQL Database Engine
Experimental feature, not production ready
Some data types are not supported
Replication errors can silently stop the whole replication process

CREATE DATABASE mysql ENGINE = MaterializedMySQL('localhost:3306', 'db', 'user', '***')
SETTINGS allows_query_when_mysql_lost = true,
         max_wait_time_when_mysql_unavailable = 10000;

Slide 34

Slide 34 text

MaterializedMySQL Database Engine Deduplication in ClickHouse (https://altinity.com/blog/using-clickhouse-as-an-analytic-extension-for-mysql)

Slide 35

Slide 35 text

Deduplication in ClickHouse

Altinity Sink Connector
Still in an early stage of development
Supports INSERT, UPDATE and DELETE
Relies on row policy and the FINAL modifier

(https://altinity.com/blog/using-clickhouse-as-an-analytic-extension-for-mysql)

Slide 36

Slide 36 text

Altinity Sink Connector Deduplication in ClickHouse (https://altinity.com/blog/using-clickhouse-as-an-analytic-extension-for-mysql)

Slide 37

Slide 37 text

Deduplication in ClickHouse

Lightweight DELETE Feature
Experimental feature in ClickHouse 22.8
Lightweight deletes are asynchronous by default
Still shouldn't be used as aggressively as on an OLTP system
Implemented with a bitmap (row mask) mechanism; deleted rows are filtered out at query time
There's no transaction guarantee between the delete and the insert

DELETE FROM authors WHERE id = 1
INSERT INTO authors VALUES (1, 'Taylor', 36, '2022-12-01 00:00:00')

Slide 38

Slide 38 text

Lightweight DELETE Feature Deduplication in ClickHouse (https://github.com/ClickHouse/ClickHouse/issues/19627)

Slide 39

Slide 39 text

Lightweight DELETE Feature Deduplication in ClickHouse (https://github.com/ClickHouse/ClickHouse/issues/19627)

Slide 40

Slide 40 text

Deduplication in ClickHouse

Adopted Solutions
ReplacingMergeTree with FINAL
INSERT and OPTIMIZE tables
Lightweight DELETE (current version)
How about transaction guarantees?

DELETE FROM authors WHERE id = 1
INSERT INTO authors VALUES (1, 'Taylor', 36, '2022-12-01 00:00:00')

Slide 41

Slide 41 text

Deduplication in ClickHouse

Alternative for FINAL Modifier
Prevents duplicated data from being visible during the merge
More performant than querying with FINAL
Sync to a replica table first, then redirect queries to the replica table while the deduplication process runs

Original Table:
id | name | age | updated_at
1 | Taylor | 36 | 2022-12-01 00:01:00
2 | Evan | 35 | 2022-12-01 00:00:00

sync →

Replica Table:
id | name | age | updated_at
1 | Taylor | 36 | 2022-12-01 00:01:00
2 | Evan | 35 | 2022-12-01 00:00:00
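A rough sketch of how this could look in SQL; the talk doesn't show the exact commands, so the replica table name and refresh strategy below are assumptions:

-- Rebuild the replica from the original table, deduplicating with FINAL
TRUNCATE TABLE authors_replica;
INSERT INTO authors_replica SELECT * FROM authors FINAL;

-- Report queries read from authors_replica while the original table
-- is being merged/optimized, so they never see duplicated rows.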

Slide 42

Slide 42 text

Deduplication in ClickHouse

UniqueMergeTree in ByteHouse
Implements mutations with mark-delete and insert
Similar concept to lightweight delete + insert, but at the table engine level
There's an RFC, but it hasn't been accepted yet: https://github.com/ClickHouse/ClickHouse/issues/41817

Slide 43

Slide 43 text

Integration with Laravel

ClickHouse Packages
smi2/phpclickhouse
the-tinderbox/clickhouse-builder (supports Eloquent builder)
bavix/laravel-clickhouse (supports Eloquent model)
Only basic grammar support for ClickHouse features

Slide 44

Slide 44 text

Integration with Laravel Unsupported Features

Slide 45

Slide 45 text

Optimizations in ClickHouse

Only Select Necessary Columns
ClickHouse is a column-oriented database; selecting all columns of a row increases disk I/O

(https://altinity.com/blog/using-clickhouse-as-an-analytic-extension-for-mysql)
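A minimal illustration, assuming the hypothetical orders table from the earlier slides now lives in ClickHouse:

-- Reads only the product_id and total column files
SELECT product_id, total FROM orders WHERE status = 'paid'

-- Forces ClickHouse to read every column file, even if only a few values are needed
SELECT * FROM orders WHERE status = 'paid'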

Slide 46

Slide 46 text

Optimizations in ClickHouse Partitions (https://www.slideshare.net/Altinity/webinar-secrets-of-clickhouse-query-performance-by-robert-hodges-173379008)
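The linked figure isn't reproduced here; as a rough sketch of the idea, a table partitioned by month lets ClickHouse skip entire partitions that can't match the filter (the schema below is an assumption for illustration):

CREATE TABLE orders_by_month
(
    id         UInt64,
    total      Decimal(12, 2),
    status     LowCardinality(String),
    created_at DateTime
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(created_at)
ORDER BY (created_at, id);

-- Only the 202211 partition is read
SELECT sum(total)
FROM orders_by_month
WHERE created_at >= '2022-11-01' AND created_at < '2022-12-01';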

Slide 47

Slide 47 text

Optimizations in ClickHouse

Use SubQuery Rather Than JOIN

SELECT orders.user_id, sum(orders.revenue)
FROM orders
LEFT JOIN users ON orders.user_id = users.id
WHERE users.rank = 'gold'
GROUP BY orders.user_id
ORDER BY orders.user_id
LIMIT 10

SELECT orders.user_id, sum(orders.revenue)
FROM orders
WHERE orders.user_id IN (SELECT id FROM users WHERE rank = 'gold')
GROUP BY orders.user_id
ORDER BY orders.user_id
LIMIT 10

Slide 48

Slide 48 text

Optimizations in ClickHouse

Use Small Table as Right Table in JOIN
ClickHouse reads all data from the right table first to build a hash map

(https://zhuanlan.zhihu.com/p/377506070)
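A minimal illustration, assuming hypothetical orders (large) and users (small) tables:

-- users (small) goes on the right: it is the table loaded into the in-memory hash table
SELECT orders.user_id, sum(orders.revenue)
FROM orders
LEFT JOIN users ON orders.user_id = users.id
GROUP BY orders.user_id

-- Putting the large orders table on the right side would load it into memory instead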

Slide 49

Slide 49 text

Optimizations in ClickHouse

Filter before Aggregation

SELECT COUNT(), SUM(revenue), platform_id
FROM orders
GROUP BY platform_id
HAVING finished_at > '2022-01-01'

SELECT COUNT(), SUM(revenue), platform_id
FROM orders
WHERE finished_at > '2022-01-01'
GROUP BY platform_id

Slide 50

Slide 50 text

Optimizations in ClickHouse

ORDER BY is Evil
Ordering a large amount of data is extremely slow because of ClickHouse's completely different index design
For example, ordering 20 million rows scans the whole table and takes 21G of RAM in 2.3 seconds
Queries with GROUP BY plus ORDER BY are not optimized

SELECT * FROM orders ORDER BY finished_at DESC LIMIT 10

Slide 51

Slide 51 text

Optimizations in ClickHouse

ORDER BY is Evil
It's not optimized by ClickHouse's query optimizer yet
Open issue: https://github.com/ClickHouse/ClickHouse/issues/5329
The manually optimized query still scans nearly the whole 20 million rows, but takes 197M of RAM in 0.1 seconds

SELECT * FROM orders
WHERE id IN (SELECT id FROM orders ORDER BY finished_at DESC LIMIT 10)
ORDER BY finished_at DESC
LIMIT 10

Slide 52

Slide 52 text

Optimizations in ClickHouse

ORDER BY is Evil
We implemented some query optimizers in the Eloquent Builder
They optimize ORDER BY queries automatically

Slide 53

Slide 53 text

Optimizations in ClickHouse EXPLAIN Query Pipelines
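For reference, a minimal example of inspecting how a query executes; the query itself is illustrative:

EXPLAIN PIPELINE
SELECT product_id, count()
FROM orders
GROUP BY product_id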

Slide 54

Slide 54 text

Q&A