$30 off During Our Annual Pro Sale. View Details »

Generate anonymised database with MasKING - 2023.09.21 Euruko Unconference Talk

Generate anonymised database with MasKING - 2023.09.21 Euruko Unconference Talk

Chikahiro Tokoro

September 22, 2023
Tweet

More Decks by Chikahiro Tokoro

Other Decks in Programming

Transcript

  1. Generate anonymised database
    with MasKING
    Chikahiro Tokoro Euruko 2023 at Vilnius: Unconference Talk

    View Slide

  2. Why do we want anonymised
    database?

    View Slide

  3. One day, during the release….

    View Slide

  4. One day, during the release….
    the main table was locked
    during database migration😱

    View Slide

  5. staging
    (QA)
    production
    local
    (development)
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    0.2 sec 3 sec 1 hour
    the postmortem: check the delivery process

    View Slide

  6. If with similar dataset with production...
    staging
    (QA)
    production
    local
    (development)
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    1 hour

    View Slide

  7. But, we should not use
    the production dataset
    Let’s use
    anonymised dataset!

    View Slide

  8. Use cases:
    ● simulate database migration
    local
    (development)
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    —-
    1 hour

    View Slide

  9. Use cases:
    ● SQL performance optimization
    accurate execution plan
    # Time: 2023-09-13T15:45:12.987654Z
    # User@Host: web_app[web_db] @ 10.0.0.2
    # Query_time: 25.987654 Lock_time: 2.345678
    Rows_sent: 50 Rows_examined: 5000
    SET timestamp=1234567890;
    SELECT
    users.id,
    users.username,
    COUNT(posts.id) AS post_count,
    SUM(posts.likes) AS total_likes
    FROM users
    LEFT JOIN posts ON users.id = posts.user_id
    WHERE users.created_at >= '2023-01-01'
    GROUP BY users.id
    HAVING post_count >= 5
    ORDER BY total_likes DESC
    LIMIT 10;
    slow query log

    View Slide

  10. Use cases:
    ● make ease to reproduce and fix bugs
    unexpected
    values
    missing
    related record
    duplicated
    records

    View Slide

  11. Use cases:
    ● better feature preview
    ● stress test
    ● analyse / Business Intelligence data source
    And… your idea here!

    View Slide

  12. https://www.snaplet.dev/
    Founded on 2020

    View Slide

  13. https://www.snaplet.dev/
    Founded on 2020

    View Slide

  14. How to implement anonymisation process?
    1. copy database and “UPDATE” records
    production
    production
    (copy)
    UPDATE users
    SET name = 'User1',
    email = '[email protected]',
    ....
    (most common way)

    View Slide

  15. How to implement anonymisation process?
    2. setup replica and database trigger
    production
    production
    (replica)
    CREATE TRIGGER after_user_insert
    AFTER INSERT ON users FOR EACH ROW
    BEGIN
    UPDATE users
    SET username = CONCAT('User', NEW.user_id)
    WHERE user_id = NEW.user_id;
    ...

    View Slide

  16. How to implement anonymisation process?
    3. proxy connection and interrupt SQL on the fly
    production
    INSERT INTO users (username, email, ...)
    VALUES ('JohnDoe', '[email protected]'
    , ...);
    production
    (copy)
    Connection Proxy
    INSERT INTO users (username, email, ...)
    VALUES ('JohnDoe', '[email protected]'
    , ...);
    INSERT INTO users (username, email, ...)
    VALUES ('User1', '[email protected]
    ', ...);

    View Slide

  17. How to implement anonymisation process?
    4. read database and create anonymised database
    production
    dump
    anonymised
    dump
    production
    (anonymised)
    database
    client
    (second common way)

    View Slide

  18. How to implement anonymisation process?
    4.
    data leak
    risk
    Medium
    Low
    Low
    effort
    Medium
    High
    Medium
    Low
    config
    High
    Medium
    Low
    Low
    time gap
    realtime
    realtime
    Medium
    Long
    performance
    regression
    No
    Yes
    No
    No
    2. DB trigger
    3. DB proxy
    read &
    create
    1. copy &
    UPDATE High

    View Slide

  19. How to implement anonymisation process?
    4.
    data leak
    risk
    Medium
    Low
    Low
    effort
    Medium
    High
    Medium
    Low
    config
    High
    Medium
    Low
    Low
    time gap
    realtime
    realtime
    Medium
    Long
    performance
    regression
    No
    Yes
    No
    No
    2. DB trigger
    3. DB proxy
    read &
    create
    1. copy &
    UPDATE High

    View Slide

  20. https://github.com/kibitan/masking

    View Slide

  21. Usage
    [email protected]
    [email protected]
    [email protected]

    users:
    email: anonymized+%{n}@example.com

    View Slide

  22. Usage

    $ mysql --complete-insert DB_NAME | masking | mysql ANONYMIZED_DB_NAME

    View Slide

  23. Generate anonymised database with MasKING - Demo - YouTube
    Demo

    View Slide

  24. Generate anonymised database with MasKING - Demo - YouTube
    Demo
    masking.yml

    View Slide

  25. Generate anonymised database with MasKING - Demo - YouTube
    Demo

    View Slide

  26. Generate anonymised database with MasKING - Demo - YouTube
    Demo

    View Slide

  27. Design Concepts

    View Slide

  28. Design Concept: KISS ~ Keep it simple, stupid! ~
    “Do one thing, and do it well”
    Unix philosophy
    ❌ Database connection
    ❌ File handling
    ✅ Standard input/output

    View Slide

  29. Design Concept: No external dependency
    No gems dependencies, use only standard libraries

    View Slide

  30. Design Concept: Quality

    View Slide

  31. Design Concept: Quality - metrics

    View Slide

  32. Design Concept: Quality - Acceptance test
    setup
    test
    assertion
    docker-compose.yml
    acceptance/run_test.sh

    View Slide

  33. Design Concept: Quality - Acceptance test on CI

    View Slide

  34. How to implement?

    View Slide

  35. users:
    name: Anonymized Name
    email: anonymized+%{n}@example.com
    masking.yml
    dump.sql anonymised_dump.sql
    masking

    View Slide

  36. Domains: uncertainties
    Mapping
    configuration
    ?
    Parse/Rebuild
    SQL

    View Slide

  37. Domains: uncertainties
    Mapping
    configuration
    Parse/Rebuild
    SQL
    Introduction to Test-Driven Development (TDD) with Classic TDD Example | Khalil Stemmler

    View Slide

  38. ① Outside-in
    Mapping
    configuration
    stdin
    ① Outside-in
    ② Inside-out
    Parse/Rebuild
    SQL
    ③ Inside-out
    Domains: finding architecture/models
    Output
    Line
    Input
    Line
    stdout
    Processor
    ⑤ Outside-in ⑤ Outside-in
    ④ Inside-out

    View Slide

  39. How to parse dump?

    View Slide

  40. sample_dump.sql
    setup database
    define table schema
    insert data

    View Slide

  41. https://dev.mysql.com/doc/refman/8.0/en/mysqldump.html

    View Slide

  42. setup database
    define table schema
    insert data
    sample_dump_with_complete-insert_option.sql

    View Slide

  43. setup database
    define table schema
    insert data
    sample_dump_with_complete-insert_option.sql

    View Slide

  44. How to parse SQL?

    View Slide

  45. https://github.com/kibitan/masking/blob/main/lib/masking/insert_statement.rb
    dynamically make
    regular expression object
    ?
    ?
    ?
    VALUE_REGEXP x 3

    View Slide

  46. https://github.com/kibitan/masking/blob/main/lib/masking/insert_statement.rb
    dynamically make
    regular expression object
    ?
    ?
    ?
    VALUE_REGEXP x 3

    View Slide

  47. $ mysqldump --complete-insert production_db | masking > anonymised_dump.sql
    /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:15:in `block in initialize':
    Masking::Error::InsertStatementParseError (Masking::Error::InsertStatementParseError)
    from :90:in `tap'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:14:in `initialize'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `new'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `initialize'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `new'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `mask'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:23:in `block in run'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `each_line'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `run'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:10:in `run'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/cli.rb:17:in `run'
    from exe/masking:10:in `'
    let’s try with production data!

    View Slide

  48. $ mysqldump --complete-insert production_db | masking > anonymised_dump.sql
    /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:15:in `block in initialize':
    Masking::Error::InsertStatementParseError (Masking::Error::InsertStatementParseError)
    from :90:in `tap'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:14:in `initialize'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `new'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `initialize'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `new'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `mask'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:23:in `block in run'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `each_line'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `run'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:10:in `run'
    from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/cli.rb:17:in `run'
    from exe/masking:10:in `'
    let’s try with production data!

    View Slide

  49. the journey to ready for production data
    https://github.com/kibitan/masking/commit/792e73b7d316fac1fa6694d74b6adcfee19dccc4

    View Slide

  50. https://github.com/kibitan/masking/commit/8ca7fdad2c52d3af5e6c56faadc08e8768500108
    the journey to ready for production data

    View Slide

  51. https://github.com/kibitan/masking/commit/04515ec0f9c8a5457c10b8224b488b90c90c6aa2
    the journey to ready for production data

    View Slide

  52. https://github.com/kibitan/masking/pull/3
    the journey to ready for production data

    View Slide

  53. $ mysqldump --complete-insert production_db | masking > anonymised_dump.sql
    $ mysql anonymised_db < anonymised_dump.sql
    ERROR 1064 (42000) at line 547659: You have an error in your SQL
    syntax; check the manual that corresponds to your MySQL server
    version for the right syntax to use near 'smelati’),' at line 547659
    the journey to ready for production data

    View Slide

  54. (‘ab’,1),(’cd’,2),(’ef’,3);
    the journey to ready for production data

    View Slide

  55. (‘ab’,1),(’cd’,2),(’ef’,3);
    ‘ab’,1,’cd’,2,’ef’,3;
    same with
    CSV!!!
    the journey to ready for production data

    View Slide

  56. https://www.clear-code.com/blog/2018/12/25.html
    the journey to ready for production data

    View Slide

  57. 2.even? 3.even?
    4.even?
    concat with next
    4.even?
    the journey to ready for production data

    View Slide

  58. https://github.com/kibitan/masking/pull/5
    the journey to ready for production data

    View Slide

  59. https://github.com/kibitan/masking/pull/5
    the journey to ready for production data

    View Slide

  60. View Slide

  61. what’s next?
    ● support random/fake data
    ● parse table schema
    ● performance optimization
    ○ type validation for config
    ○ working without “--complete-insert” option

    View Slide

  62. what’s next?: for PostgreSQL
    https://github.com/kibitan/pgmasking

    View Slide

  63. Thank you!
    Any feedback is welcome!
    https://github.com/kibitan/masking
    Chikahiro
    x/twitter: @chikahirotokoro
    https://chikahirotokoro.com

    View Slide