
The System Architecture of Real-time Report in Laravel

Albert Chen
LaravelVueConf Taiwan 2022
December 20, 2022

  1. The System Architecture
    of Real-time Reporting
    in Laravel
    @LaravelConf 2022
    Albert Chen


  2. Outline
    01 Aggregation Report in MySQL
    02 Aggregation Improvements in MySQL
    03 Row vs Column Oriented Databases
    04 Integration with ClickHouse
    05 Deduplication in ClickHouse
    06 Optimizations in ClickHouse
    07 Q&A


  3. Slido
    #1478692
    https://sli.do


  4. Report is Everywhere
    Daily revenue for product orders
    Analysis of top-selling products
    Rankings of the most-clicked articles or videos
    Total cost of business operations
    Statements for the financial department
    Summaries of registered users
    Statistics of unique visitors
    ... etc.


  5. Basic Elements of a Report
    Data Source: database, files, services
    Condition Filters: resource id, time range, numbers, text, data type
    Aggregations: sum, average, count, max, min
    Transformers: concat, mask, number format
    Groups
    Orders


  6. Report Performance in MySQL
    How long does this query take if your table has 1,000, 10k, 1M, 10M, or 100M rows?
    SELECT product_id, COUNT(id) AS count FROM orders WHERE status != 'canceled'
    GROUP BY product_id ORDER BY count DESC LIMIT 10
    id user_id product_id amount discount shipping_fee total status created_at
    1 35 2 1000 50 0 950 shipped 2022-11-11 09:24:13
    2 78 5 860 20 60 900 paid 2022-11-11 10:12:04
    3 93 11 780 0 60 840 canceled 2022-11-12 01:39:17


  7. Add Appropriate Indexes
    Only helps with filtering rows, not with aggregation (see the sketch below)
    Too many indexes is not a good idea
    Partitions and Sharding
    Suitable only for fixed combinations of conditions
    Complicated and limited usage (joins, transactions, etc.)
    Vertical Scaling of the Database
    There's a hard limit for a single server
    Assuming money is not a problem
    Aggregation Improvements in MySQL
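    A minimal sketch of the indexing point, assuming the orders table from the
    previous slide (the index name is made up):

    -- Speeds up filtering rows by status, but MySQL still has to read and
    -- count every matching row for the GROUP BY aggregation.
    ALTER TABLE orders ADD INDEX idx_orders_status (status);

    -- The aggregation itself still scans all non-canceled rows:
    SELECT product_id, COUNT(id) AS count FROM orders
    WHERE status != 'canceled'
    GROUP BY product_id ORDER BY count DESC LIMIT 10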


  8. Pre-aggregated Results
    MySQL doesn't support materialized views
    Aggregations for different combinations of conditions are difficult to maintain
    Aggregated results need to be refreshed whenever the source data changes (a sketch of such a summary table follows below)
    Aggregation Improvements in MySQL
    id amount discount shipping_fee total status period
    1 2 50 0 950 shipped 2022-11-11 09:00
    2 5 20 60 900 paid 2022-11-11 10:00
    3 11 0 60 840 canceled 2022-11-11 11:00
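    A hedged sketch of a pre-aggregated summary table in MySQL (table and
    column names are assumptions, not from the talk), refreshed by a scheduled
    job whenever the source rows change:

    CREATE TABLE order_hourly_summary (
      period DATETIME NOT NULL,
      status VARCHAR(20) NOT NULL,
      order_count INT NOT NULL,
      total_amount DECIMAL(12, 2) NOT NULL,
      PRIMARY KEY (period, status)
    );

    -- Re-run periodically (e.g. from the Laravel scheduler); any change to the
    -- source rows of a period invalidates that period's aggregates.
    REPLACE INTO order_hourly_summary
    SELECT DATE_FORMAT(created_at, '%Y-%m-%d %H:00:00') AS period,
           status,
           COUNT(*) AS order_count,
           SUM(total) AS total_amount
    FROM orders
    GROUP BY period, status;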


  9. Filters in Our Imagination
    Status: [Select]    Order Date: [Date]    [Search]
    Aggregation Improvements in MySQL


  10. Filters for Users' Needs
    Name: [Text]    Gender: [Select]    City: [Select]    Status: [Select]
    Shipping Date: [Date]    Order Date: [Date]    Accounting Date: [Date]
    Agent: [Select]    Updated Date: [Date]    [Search]
    Aggregation Improvements in MySQL


  11. There's not much you can do with MySQL if...
    The number of rows to aggregate is large (over 10 million)
    Filtering conditions are flexible
    Data rows are mutable
    There is a near real-time requirement
    Aggregation Improvements in MySQL


  12. Row Oriented Databases
    Data associated with the same record is kept next to each other in memory
    Optimized for reading and writing a single row of data
    Common in OLTP databases like MySQL, Postgres, etc.
    Row vs Column Oriented Databases
    id name country age
    1 Taylor USA 35
    2 Evan China 35


  13. Column Oriented Databases
    Data associated with the same column is kept next to each other in memory
    Optimized for reads, to support analyzing data
    Common in data warehouses like Cassandra, HBase, BigQuery, ClickHouse, etc.
    Row vs Column Oriented Databases
    id:      1, 2
    name:    Taylor, Evan
    country: USA, China
    age:     35, 35


  14. Pros of Columnar Database
    Queries that use only a few columns out of many
    Aggregating queries against large volumes of data
    Column-wise data compression
    Cons of Columnar Database
    Poor at data mutations like update or delete
    Not optimized for querying row-specific data
    Lack of unique constraints
    Row vs Column Oriented Databases


  15. Aggregation in Row Oriented Databases
    Row vs Column Oriented Databases
    (https://clickhouse.com/docs/en/faq/general/columnar-database)


  16. Aggregation in Column Oriented Databases
    Row vs Column Oriented Databases
    (https://clickhouse.com/docs/en/faq/general/columnar-database)


  17. Benchmark
    Row vs Column Oriented Databases
    (https://benchmark.clickhouse.com)


  18. An open-source (Apache 2.0), relational, OLAP database since 2016
    Originally developed as an internal project by Yandex (a Russian search engine)
    The name ClickHouse is a combination of Clickstream and Data Warehouse
    Known for extremely high query speed and performance
    Largely based on ANSI SQL, which makes it easy to interact with
    Highly productive development on GitHub
    Many companies use ClickHouse in production, like Uber, Slack, Tesla, Tencent, TikTok, Cloudflare, etc.
    Introduction to ClickHouse


  19. Why is ClickHouse so fast?
    Column-oriented storage
    Data compression, with specialized codecs that can make data even more compact
    Vectorized query execution that allows the use of SIMD CPU instructions
    ClickHouse can leverage all available CPU cores and disks to execute even a single query
    Introduction to ClickHouse


  20. Downsides of ClickHouse
    Limited updates and deletes
    Mutations in ClickHouse are heavy operations and not designed for frequent use
    Limited transactional (ACID) support at this moment
    (INSERT into one partition of one table of the MergeTree family)
    ClickHouse Keeper is not as stable as the official announcement suggests
    Introduction to ClickHouse


  21. Table Engines
    MergeTree
    ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, CollapsingMergeTree, etc.
    Log
    Log, TinyLog and StripeLog
    Integration engines
    MySQL, Postgres, ODBC, JDBC, MongoDB, Kafka, etc.
    Special engines
    File, MaterializedView, URL, Buffer, Memory, etc.
    Introduction to ClickHouse
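    A minimal sketch of a MergeTree-family table for the orders data used
    earlier (column types, the partition key, and the sorting key are
    assumptions for illustration):

    CREATE TABLE orders
    (
        id          UInt64,
        user_id     UInt64,
        product_id  UInt32,
        total       Decimal(12, 2),
        status      LowCardinality(String),
        created_at  DateTime
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(created_at)
    ORDER BY (created_at, id);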


  22. Data Syncing Flow
    (architecture diagram: data is synced from the source database into ClickHouse, which serves the report)
    Integration with ClickHouse


  23. Data Syncing Flow
    Integration with ClickHouse
    MySQL (source):
    id name age updated_at
    1 Taylor 35 2022-12-01 00:00:00
    2 Evan 35 2022-12-01 00:00:00
    ClickHouse (destination), after the initial sync:
    id name age updated_at
    1 Taylor 35 2022-12-01 00:00:00
    2 Evan 35 2022-12-01 00:00:00


  24. Data Syncing Flow
    Integration with ClickHouse
    MySQL (source), after a row is updated:
    id name age updated_at
    1 Taylor 36 2022-12-01 00:01:00
    2 Evan 35 2022-12-01 00:00:00
    insert (the changed row is inserted into ClickHouse)
    ClickHouse (destination), before the insert:
    id name age updated_at
    1 Taylor 35 2022-12-01 00:00:00
    2 Evan 35 2022-12-01 00:00:00


  25. Data Syncing Flow
    Integration with ClickHouse
    ClickHouse, after the insert (old and new versions of row 1 coexist):
    id name age updated_at
    1 Taylor 35 2022-12-01 00:00:00
    2 Evan 35 2022-12-01 00:00:00
    1 Taylor 36 2022-12-01 00:01:00
    Desired result (one row per id):
    id name age updated_at
    1 Taylor 36 2022-12-01 00:01:00
    2 Evan 35 2022-12-01 00:00:00
    ?


  26. Data Syncing Flow
    Integration with ClickHouse
    After deduplication, MySQL and ClickHouse match again:
    MySQL (source):
    id name age updated_at
    1 Taylor 36 2022-12-01 00:01:00
    2 Evan 35 2022-12-01 00:00:00
    ClickHouse (destination):
    id name age updated_at
    1 Taylor 36 2022-12-01 00:01:00
    2 Evan 35 2022-12-01 00:00:00


  27. UPDATE/DELETE mutations
    ReplacingMergeTree / CollapsingMergeTree
    with FINAL modifier
    with OPTIMIZE statement
    Aggregate Function
    MaterializedMySQL Database Engine
    Altinity Sink Connector
    Lightweight DELETE feature
    UniqueMergeTree in ByteHouse (not implemented)
    Deduplication in ClickHouse


  28. UPDATE/DELETE Mutations
    UPDATE and DELETE were not supported in early versions
    Mutations are extremely heavy operations
    Mutations are asynchronous by default
    There's no transaction guarantee
    Deduplication in ClickHouse
    ALTER TABLE authors DELETE WHERE id = 1
    INSERT INTO authors VALUES (1, 'Taylor', 36, '2022-12-01 00:00:00')


  29. ReplacingMergeTree / CollapsingMergeTree
    ReplacingMergeTree will deduplicate data by itself
    Merge timing is controlled by ClickHouse
    Need to use FINAL or OPTIMIZE to force merging
    Deduplication in ClickHouse
    Before the merge (duplicated versions of row 1):
    id name age updated_at
    1 Taylor 35 2022-12-01 00:00:00
    2 Evan 35 2022-12-01 00:00:00
    1 Taylor 36 2022-12-01 00:01:00
    After the merge (deduplicated):
    id name age updated_at
    1 Taylor 36 2022-12-01 00:01:00
    2 Evan 35 2022-12-01 00:00:00
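    A minimal sketch of the authors table as a ReplacingMergeTree (the exact
    DDL is an assumption; the version column follows the slides' updated_at):

    CREATE TABLE authors
    (
        id         UInt64,
        name       String,
        age        UInt8,
        updated_at DateTime
    )
    ENGINE = ReplacingMergeTree(updated_at)
    ORDER BY id;

    -- When parts are merged, ClickHouse keeps only the row with the highest
    -- updated_at for each sorting key (id); until then, duplicates remain.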


  30. ReplacingMergeTree / CollapsingMergeTree with FINAL modifier
    FINAL will slow down query performance
    FINAL used to run on a single thread; it was made multi-threaded in later versions
    Deduplication in ClickHouse
    SELECT * FROM authors FINAL


  31. ReplacingMergeTree / CollapsingMergeTree with OPTIMIZE statement
    OPTIMIZE will trigger a partition merge immediately
    OPTIMIZE will use a lot of CPU and disk resources
    Enable optimize_throw_if_noop to get an exception when the merge can't run
    Can be limited to specific partitions
    Deduplication in ClickHouse
    OPTIMIZE TABLE authors PARTITION '202212' FINAL
    (diagram: partition 202211 and earlier are unchanged; only partition 202212, which contains the duplicates, gets merged)


  32. Aggregate Function
    argMax with GROUP BY
    Create a view for a better query experience
    Duplicated data still exists, but is filtered out
    Performance is difficult to optimize for some complicated queries
    Deduplication in ClickHouse
    CREATE VIEW authors_view AS SELECT id,
    argMax(name, updated_at) as name,
    argMax(age, updated_at) as age,
    max(updated_at) as updated_at
    FROM authors
    GROUP BY id
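    Querying through the view then returns one row per id (a usage sketch):

    -- The older versions of row 1 are still stored, but argMax picks the
    -- values with the latest updated_at.
    SELECT * FROM authors_view WHERE id = 1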


  33. MaterializedMySQL Database Engine
    Experimental feature, not production ready
    Some data types are not supported
    Unhandled errors will stop the whole replication process
    Deduplication in ClickHouse
    CREATE DATABASE mysql ENGINE =
    MaterializedMySQL('localhost:3306', 'db', 'user', '***')
    SETTINGS
    allows_query_when_mysql_lost = true,
    max_wait_time_when_mysql_unavailable = 10000;


  34. MaterializedMySQL Database Engine
    Deduplication in ClickHouse
    (https://altinity.com/blog/using-clickhouse-as-an-analytic-extension-for-mysql)


  35. Altinity Sink Connector
    Still in an early stage of development
    Supports INSERT, UPDATE and DELETE
    Relies on row policies and the FINAL modifier
    Deduplication in ClickHouse
    (https://altinity.com/blog/using-clickhouse-as-an-analytic-extension-for-mysql)


  36. Altinity Sink Connector
    Deduplication in ClickHouse
    (https://altinity.com/blog/using-clickhouse-as-an-analytic-extension-for-mysql)


  37. Lightweight DELETE Feature
    Experimental feature introduced in ClickHouse 22.8
    Lightweight deletes are asynchronous by default
    Still shouldn't be used as aggressively as on an OLTP system
    Implemented with a bitmap mechanism; deleted rows are filtered out at query time
    There's no transaction guarantee between the delete and the insert
    Deduplication in ClickHouse
    DELETE FROM authors WHERE id = 1
    INSERT INTO authors VALUES (1, 'Taylor', 36, '2022-12-01 00:00:00')


  38. Lightweight DELETE Feature
    Deduplication in ClickHouse
    (https://github.com/ClickHouse/ClickHouse/issues/19627)


  39. Lightweight DELETE Feature
    Deduplication in ClickHouse
    (https://github.com/ClickHouse/ClickHouse/issues/19627)


  40. Adopted Solutions
    ReplacingMergeTree with FINAL
    INSERT plus OPTIMIZE TABLE
    Lightweight DELETE (current version)
    What about the transaction guarantee?
    Deduplication in ClickHouse
    DELETE FROM authors WHERE id = 1
    INSERT INTO authors VALUES (1, 'Taylor', 36, '2022-12-01 00:00:00')


  41. Alternative to the FINAL Modifier
    To prevent duplicated data from showing up during the merge
    To make it more performant than FINAL
    Sync to a replica table first, then redirect queries to the replica table while the deduplication runs (see the sketch below)
    Deduplication in ClickHouse
    Original Table:
    id name age updated_at
    1 Taylor 36 2022-12-01 00:01:00
    2 Evan 35 2022-12-01 00:00:00
    sync ->
    Replica Table:
    id name age updated_at
    1 Taylor 36 2022-12-01 00:01:00
    2 Evan 35 2022-12-01 00:00:00
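    A hedged sketch of that flow (table names and the exact statements are
    assumptions; the talk describes the idea, not this concrete SQL):

    -- 1. Copy the current data into a replica table.
    INSERT INTO authors_replica SELECT * FROM authors;
    -- 2. Point report queries at authors_replica (done in the application).
    -- 3. Deduplicate the original table in the background.
    OPTIMIZE TABLE authors FINAL;
    -- 4. Switch queries back to authors once the merge has finished.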


  42. Deduplication in ClickHouse
    UniqueMergeTree in ByteHouse
    Implements mutations with mark-delete and insert
    Similar in concept to lightweight delete + insert, but implemented at the table-engine level
    There's an RFC, but it hasn't been accepted yet:
    https://github.com/ClickHouse/ClickHouse/issues/41817


  43. Integration with Laravel
    ClickHouse Packages
    smi2/phpclickhouse
    the-tinderbox/clickhouse-builder (supports the Eloquent builder)
    bavix/laravel-clickhouse (supports Eloquent models)
    Only basic grammar support for ClickHouse-specific features


  44. Integration with Laravel
    Unsupported Features


  45. Optimizations in ClickHouse
    Only Select Necessary Columns
    ClickHouse is a column-oriented database; selecting all columns of a row increases disk I/O
    (https://altinity.com/blog/using-clickhouse-as-an-analytic-extension-for-mysql)
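    A small illustrative query (column names follow the earlier orders example):

    -- Reads only the user_id and total column files.
    SELECT user_id, total FROM orders WHERE created_at >= '2022-11-01'
    -- SELECT * would force ClickHouse to read every column file of the table.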


  46. Optimizations in ClickHouse
    Partitions
    (https://www.slideshare.net/Altinity/webinar-secrets-of-clickhouse-query-performance-by-robert-hodges-173379008)
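    A sketch of how a partition key helps (assuming the PARTITION BY
    toYYYYMM(created_at) definition from the MergeTree example above):

    -- Only the 2022-12 partition is read; all other partitions are skipped.
    SELECT count() FROM orders
    WHERE created_at >= '2022-12-01' AND created_at < '2023-01-01'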


  47. Use a Subquery Rather Than a JOIN
    JOIN version:
    SELECT orders.user_id, sum(orders.revenue)
    FROM orders
    LEFT JOIN users ON orders.user_id = users.id
    WHERE users.rank = 'gold'
    GROUP BY orders.user_id
    ORDER BY orders.user_id
    LIMIT 10
    Subquery version:
    SELECT orders.user_id, sum(orders.revenue)
    FROM orders
    WHERE orders.user_id IN (SELECT id FROM users WHERE rank = 'gold')
    GROUP BY orders.user_id
    ORDER BY orders.user_id
    LIMIT 10
    Optimizations in ClickHouse


  48. Optimizations in ClickHouse
    Use the Small Table as the Right Table in a JOIN
    ClickHouse reads all data from the right table first and builds a hash table from it (see the sketch below)
    (https://zhuanlan.zhihu.com/p/377506070)
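    A minimal sketch (same orders/users tables as above): keep the smaller
    users table on the right side, since the right table is the one that gets
    materialized in memory as the hash table for the join.

    SELECT orders.user_id, sum(orders.revenue)
    FROM orders                                     -- large table on the left
    LEFT JOIN users ON orders.user_id = users.id    -- small table on the right
    GROUP BY orders.user_id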


  49. Optimizations in ClickHouse
    Filter Before Aggregation
    HAVING filters only after the aggregation has been computed:
    SELECT COUNT(), SUM(revenue), platform_id
    FROM orders
    GROUP BY platform_id
    HAVING finished_at > '2022-01-01'
    WHERE filters rows before the aggregation:
    SELECT COUNT(), SUM(revenue), platform_id
    FROM orders
    WHERE finished_at > '2022-01-01'
    GROUP BY platform_id


  50. Optimizations in ClickHouse
    ORDER BY is Evil
    Ordering a large amount of data is extremely slow because ClickHouse's index design is completely different
    For example, ordering 20 million rows scans the whole table and takes 21 GB of RAM in 2.3 seconds
    Queries that combine GROUP BY with ORDER BY are not optimized
    SELECT * FROM orders
    ORDER BY finished_at DESC
    LIMIT 10


  51. Optimizations in ClickHouse
    ORDER BY is Evil
    It's not optimized by ClickHouse's query optimizer yet
    Open issue: https://github.com/ClickHouse/ClickHouse/issues/5329
    The manually optimized query below still scans nearly the whole table (20 million rows), but takes 197 MB of RAM and 0.1 seconds
    SELECT * FROM orders
    WHERE id IN (SELECT id FROM orders ORDER BY finished_at DESC LIMIT 10)
    ORDER BY finished_at DESC
    LIMIT 10



  52. Optimizations in ClickHouse
    ORDER BY is Evil
    We implemented query optimizers in the Eloquent builder
    They rewrite ORDER BY queries like this automatically


  53. Optimizations in ClickHouse
    EXPLAIN Query Pipelines
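    A small sketch of inspecting a query pipeline (the query itself is illustrative):

    -- EXPLAIN PIPELINE shows the processors ClickHouse will use to execute the
    -- query, including how many threads each stage gets.
    EXPLAIN PIPELINE
    SELECT product_id, count() FROM orders GROUP BY product_id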


  54. Q&A
