Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Generate anonymised database with MasKING - 2023.09.21 Euruko Unconference Talk

Generate anonymised database with MasKING - 2023.09.21 Euruko Unconference Talk

Chikahiro Tokoro

September 22, 2023
Tweet

More Decks by Chikahiro Tokoro

Other Decks in Programming

Transcript

  1. staging (QA) production local (development) —- —- —- —- —-

    —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- 0.2 sec 3 sec 1 hour the postmortem: check the delivery process
  2. If with similar dataset with production... staging (QA) production local

    (development) —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- 1 hour
  3. Use cases: • simulate database migration local (development) —- —-

    —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- 1 hour
  4. Use cases: • SQL performance optimization accurate execution plan #

    Time: 2023-09-13T15:45:12.987654Z # User@Host: web_app[web_db] @ 10.0.0.2 # Query_time: 25.987654 Lock_time: 2.345678 Rows_sent: 50 Rows_examined: 5000 SET timestamp=1234567890; SELECT users.id, users.username, COUNT(posts.id) AS post_count, SUM(posts.likes) AS total_likes FROM users LEFT JOIN posts ON users.id = posts.user_id WHERE users.created_at >= '2023-01-01' GROUP BY users.id HAVING post_count >= 5 ORDER BY total_likes DESC LIMIT 10; slow query log
  5. Use cases: • make ease to reproduce and fix bugs

    unexpected values missing related record duplicated records
  6. Use cases: • better feature preview • stress test •

    analyse / Business Intelligence data source And… your idea here!
  7. How to implement anonymisation process? 1. copy database and “UPDATE”

    records production production (copy) UPDATE users SET name = 'User1', email = '[email protected]', .... (most common way)
  8. How to implement anonymisation process? 2. setup replica and database

    trigger production production (replica) CREATE TRIGGER after_user_insert AFTER INSERT ON users FOR EACH ROW BEGIN UPDATE users SET username = CONCAT('User', NEW.user_id) WHERE user_id = NEW.user_id; ...
  9. How to implement anonymisation process? 3. proxy connection and interrupt

    SQL on the fly production INSERT INTO users (username, email, ...) VALUES ('JohnDoe', '[email protected]' , ...); production (copy) Connection Proxy INSERT INTO users (username, email, ...) VALUES ('JohnDoe', '[email protected]' , ...); INSERT INTO users (username, email, ...) VALUES ('User1', '[email protected] ', ...);
  10. How to implement anonymisation process? 4. read database and create

    anonymised database production dump anonymised dump production (anonymised) database client (second common way)
  11. How to implement anonymisation process? 4. data leak risk Medium

    Low Low effort Medium High Medium Low config High Medium Low Low time gap realtime realtime Medium Long performance regression No Yes No No 2. DB trigger 3. DB proxy read & create 1. copy & UPDATE High
  12. How to implement anonymisation process? 4. data leak risk Medium

    Low Low effort Medium High Medium Low config High Medium Low Low time gap realtime realtime Medium Long performance regression No Yes No No 2. DB trigger 3. DB proxy read & create 1. copy & UPDATE High
  13. Design Concept: KISS ~ Keep it simple, stupid! ~ “Do

    one thing, and do it well” Unix philosophy ❌ Database connection ❌ File handling ✅ Standard input/output
  14. ① Outside-in Mapping configuration stdin ① Outside-in ② Inside-out Parse/Rebuild

    SQL ③ Inside-out Domains: finding architecture/models Output Line Input Line stdout Processor ⑤ Outside-in ⑤ Outside-in ④ Inside-out
  15. $ mysqldump --complete-insert production_db | masking > anonymised_dump.sql /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:15:in `block

    in initialize': Masking::Error::InsertStatementParseError (Masking::Error::InsertStatementParseError) from <internal:kernel>:90:in `tap' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:14:in `initialize' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `new' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `initialize' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `new' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `mask' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:23:in `block in run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `each_line' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:10:in `run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/cli.rb:17:in `run' from exe/masking:10:in `<main>' let’s try with production data!
  16. $ mysqldump --complete-insert production_db | masking > anonymised_dump.sql /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:15:in `block

    in initialize': Masking::Error::InsertStatementParseError (Masking::Error::InsertStatementParseError) from <internal:kernel>:90:in `tap' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:14:in `initialize' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `new' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `initialize' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `new' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `mask' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:23:in `block in run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `each_line' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:10:in `run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/cli.rb:17:in `run' from exe/masking:10:in `<main>' let’s try with production data!
  17. $ mysqldump --complete-insert production_db | masking > anonymised_dump.sql $ mysql

    anonymised_db < anonymised_dump.sql ERROR 1064 (42000) at line 547659: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'smelati’),' at line 547659 the journey to ready for production data
  18. what’s next? • support random/fake data • parse table schema

    • performance optimization ◦ type validation for config ◦ working without “--complete-insert” option