Slide 1

Slide 1 text

Generate anonymised database with MasKING Chikahiro Tokoro Euruko 2023 at Vilnius Ruby Unconf 2024 at Hamburg

Slide 2

Slide 2 text

1. What is the anonymised database? 2. How it was build? ● Use cases ● Anonymising Strategies ● Usage and Demo: MasKING πŸ₯‘ You can build your anonymised database ● Design concepts ● Implementation detail ● The journey for production πŸ₯‘ You can make your library in quality Agenda:

Slide 3

Slide 3 text

Why do we want anonymised database?

Slide 4

Slide 4 text

One day, during the release….

Slide 5

Slide 5 text

One day, during the release…. the main table was locked during database migration😱

Slide 6

Slide 6 text

staging (QA) production local (development) β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- 0.2 sec 3 sec 1 hour the postmortem: check the delivery process

Slide 7

Slide 7 text

If with similar dataset with production... staging (QA) production local (development) β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- 1 hour

Slide 8

Slide 8 text

But, we should not use the production dataset Let’s use anonymised dataset!

Slide 9

Slide 9 text

Use cases: ● simulate database migration local (development) β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- β€”- 1 hour

Slide 10

Slide 10 text

Use cases: ● SQL performance optimization accurate execution plan # Time: 2023-09-13T15:45:12.987654Z # User@Host: web_app[web_db] @ 10.0.0.2 # Query_time: 25.987654 Lock_time: 2.345678 Rows_sent: 50 Rows_examined: 5000 SET timestamp=1234567890; SELECT users.id, users.username, COUNT(posts.id) AS post_count, SUM(posts.likes) AS total_likes FROM users LEFT JOIN posts ON users.id = posts.user_id WHERE users.created_at >= '2023-01-01' GROUP BY users.id HAVING post_count >= 5 ORDER BY total_likes DESC LIMIT 10; slow query log

Slide 11

Slide 11 text

Use cases: ● make ease to reproduce and fix bugs unexpected values missing related record duplicated records

Slide 12

Slide 12 text

Use cases: ● better feature preview ● stress test ● analyse / Business Intelligence data source And… your idea here!

Slide 13

Slide 13 text

https://www.snaplet.dev/ Founded on 2020

Slide 14

Slide 14 text

https://www.snaplet.dev/ Founded on 2020

Slide 15

Slide 15 text

How to implement anonymisation process? 1. copy database and β€œUPDATE” records production production (copy) UPDATE users SET name = 'User1', email = '[email protected]', .... (most common way)

Slide 16

Slide 16 text

How to implement anonymisation process? 2. setup replica and database trigger production production (replica) CREATE TRIGGER after_user_insert AFTER INSERT ON users FOR EACH ROW BEGIN UPDATE users SET username = CONCAT('User', NEW.user_id) WHERE user_id = NEW.user_id; ...

Slide 17

Slide 17 text

How to implement anonymisation process? 3. proxy connection and interrupt SQL on the fly production INSERT INTO users (username, email, ...) VALUES ('JohnDoe', '[email protected]' , ...); production (copy) Connection Proxy INSERT INTO users (username, email, ...) VALUES ('JohnDoe', '[email protected]' , ...); INSERT INTO users (username, email, ...) VALUES ('User1', '[email protected] ', ...);

Slide 18

Slide 18 text

How to implement anonymisation process? 4. read database and create anonymised database production dump anonymised dump production (anonymised) database client (second common way)

Slide 19

Slide 19 text

How to implement anonymisation process? 4. data leak risk Medium Low Low effort Medium High Medium Low config High Medium Low Low time gap realtime realtime Medium Long performance regression No Yes No No 2. DB trigger 3. DB proxy read & create 1. copy & UPDATE High

Slide 20

Slide 20 text

How to implement anonymisation process? 4. data leak risk Medium Low Low effort Medium High Medium Low config High Medium Low Low time gap realtime realtime Medium Long performance regression No Yes No No 2. DB trigger 3. DB proxy read & create 1. copy & UPDATE High

Slide 21

Slide 21 text

https://github.com/kibitan/masking

Slide 22

Slide 22 text

Daily Download ranked in Top 4% πŸš€ Total Download ranked in Top 20% πŸš€ (2024.07)

Slide 23

Slide 23 text

Usage [email protected] [email protected] [email protected] … users: email: anonymized+%{n}@example.com

Slide 24

Slide 24 text

Usage [email protected] [email protected] [email protected] … [email protected] NULL [email protected] … users: email: anonymized+%{n}@example.com users: email?: anonymized+%{n}@example.com

Slide 25

Slide 25 text

Usage $ mysql --complete-insert DB_NAME | masking | mysql ANONYMIZED_DB_NAME

Slide 26

Slide 26 text

Generate anonymised database with MasKING - Demo - YouTube Demo

Slide 27

Slide 27 text

Generate anonymised database with MasKING - Demo - YouTube Demo masking.yml

Slide 28

Slide 28 text

Generate anonymised database with MasKING - Demo - YouTube Demo

Slide 29

Slide 29 text

Generate anonymised database with MasKING - Demo - YouTube Demo

Slide 30

Slide 30 text

Design Concepts

Slide 31

Slide 31 text

Design Concept: KISS ~ Keep it simple, stupid! ~ β€œDo one thing, and do it well” Unix philosophy ❌ Database connection ❌ File handling βœ… Standard input/output

Slide 32

Slide 32 text

Design Concept: No external dependency No gems dependencies, use only standard libraries

Slide 33

Slide 33 text

Design Concept: Quality

Slide 34

Slide 34 text

Design Concept: Quality - metrics

Slide 35

Slide 35 text

Design Concept: Quality - Acceptance test setup test assertion docker-compose.yml acceptance/run_test.sh

Slide 36

Slide 36 text

Design Concept: Quality - Acceptance test on CI

Slide 37

Slide 37 text

How to implement?

Slide 38

Slide 38 text

users: name: Anonymized Name email: anonymized+%{n}@example.com masking.yml dump.sql anonymised_dump.sql masking

Slide 39

Slide 39 text

Domains: uncertainties Mapping configuration ? Parse/Rebuild SQL

Slide 40

Slide 40 text

Domains: uncertainties Mapping configuration Parse/Rebuild SQL Introduction to Test-Driven Development (TDD) with Classic TDD Example | Khalil Stemmler

Slide 41

Slide 41 text

β‘  Outside-in Mapping configuration stdin β‘  Outside-in β‘‘ Inside-out Parse/Rebuild SQL β‘’ Inside-out Domains: finding architecture/models Output Line Input Line stdout Processor β‘€ Outside-in β‘€ Outside-in β‘£ Inside-out

Slide 42

Slide 42 text

How to parse dump?

Slide 43

Slide 43 text

sample_dump.sql setup database define table schema insert data

Slide 44

Slide 44 text

https://dev.mysql.com/doc/refman/8.0/en/mysqldump.html

Slide 45

Slide 45 text

setup database define table schema insert data sample_dump_with_complete-insert_option.sql

Slide 46

Slide 46 text

setup database define table schema insert data sample_dump_with_complete-insert_option.sql

Slide 47

Slide 47 text

How to parse SQL?

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

https://github.com/kibitan/masking/blob/main/lib/masking/insert_statement.rb dynamically make regular expression object ? ? ? VALUE_REGEXP x 3

Slide 50

Slide 50 text

https://github.com/kibitan/masking/blob/main/lib/masking/insert_statement.rb dynamically make regular expression object ? ? ? VALUE_REGEXP x 3

Slide 51

Slide 51 text

$ mysqldump --complete-insert production_db | masking > anonymised_dump.sql /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:15:in `block in initialize': Masking::Error::InsertStatementParseError (Masking::Error::InsertStatementParseError) from :90:in `tap' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:14:in `initialize' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `new' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `initialize' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `new' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `mask' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:23:in `block in run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `each_line' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:10:in `run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/cli.rb:17:in `run' from exe/masking:10:in `' let’s try with production data!

Slide 52

Slide 52 text

$ mysqldump --complete-insert production_db | masking > anonymised_dump.sql /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:15:in `block in initialize': Masking::Error::InsertStatementParseError (Masking::Error::InsertStatementParseError) from :90:in `tap' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:14:in `initialize' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `new' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `initialize' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `new' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `mask' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:23:in `block in run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `each_line' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:10:in `run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/cli.rb:17:in `run' from exe/masking:10:in `' let’s try with production data!

Slide 53

Slide 53 text

the journey to ready for production data https://github.com/kibitan/masking/commit/792e73b7d316fac1fa6694d74b6adcfee19dccc4

Slide 54

Slide 54 text

https://github.com/kibitan/masking/commit/8ca7fdad2c52d3af5e6c56faadc08e8768500108 the journey to ready for production data

Slide 55

Slide 55 text

https://github.com/kibitan/masking/commit/04515ec0f9c8a5457c10b8224b488b90c90c6aa2 the journey to ready for production data

Slide 56

Slide 56 text

https://github.com/kibitan/masking/pull/3 the journey to ready for production data

Slide 57

Slide 57 text

$ mysqldump --complete-insert production_db | masking > anonymised_dump.sql $ mysql anonymised_db < anonymised_dump.sql ERROR 1064 (42000) at line 547659: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'smelati’),' at line 547659 the journey to ready for production data

Slide 58

Slide 58 text

(β€˜ab’,1),(’cd’,2),(’ef’,3); the journey to ready for production data

Slide 59

Slide 59 text

(β€˜ab’,1),(’cd’,2),(’ef’,3); β€˜ab’,1,’cd’,2,’ef’,3; same with CSV!!! the journey to ready for production data

Slide 60

Slide 60 text

https://www.clear-code.com/blog/2018/12/25.html the journey to ready for production data

Slide 61

Slide 61 text

2.even? 3.even? 4.even? concat with next 4.even? the journey to ready for production data

Slide 62

Slide 62 text

https://github.com/kibitan/masking/pull/5 the journey to ready for production data

Slide 63

Slide 63 text

https://github.com/kibitan/masking/pull/5 the journey to ready for production data

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

what’s next? ● Supporting hashed anonymizing ● Parse table schema ● Strict mode β—‹ type validation for config β—‹ working without β€œ--complete-insert” option ● Performance optimization β—‹ Custom command

Slide 66

Slide 66 text

what’s next?: for PostgreSQL https://github.com/kibitan/pgmasking

Slide 67

Slide 67 text

Thank you! Any feedback is welcome! https://github.com/kibitan/masking Chikahiro x/twitter: @chikahirotokoro https://chikahirotokoro.com