Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Generate anonymised database with MasKING - 202...

Generate anonymised database with MasKING - 2023.09.21 Euruko Unconference Talk

Chikahiro Tokoro

September 22, 2023
Tweet

More Decks by Chikahiro Tokoro

Other Decks in Programming

Transcript

  1. 1. What is the anonymised database? 2. How it was

    build? • Use cases • Anonymising Strategies • Usage and Demo: MasKING 🥡 You can build your anonymised database • Design concepts • Implementation detail • The journey for production 🥡 You can make your library in quality Agenda:
  2. staging (QA) production local (development) —- —- —- —- —-

    —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- 0.2 sec 3 sec 1 hour the postmortem: check the delivery process
  3. If with similar dataset with production... staging (QA) production local

    (development) —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- 1 hour
  4. Use cases: • simulate database migration local (development) —- —-

    —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- —- 1 hour
  5. Use cases: • SQL performance optimization accurate execution plan #

    Time: 2023-09-13T15:45:12.987654Z # User@Host: web_app[web_db] @ 10.0.0.2 # Query_time: 25.987654 Lock_time: 2.345678 Rows_sent: 50 Rows_examined: 5000 SET timestamp=1234567890; SELECT users.id, users.username, COUNT(posts.id) AS post_count, SUM(posts.likes) AS total_likes FROM users LEFT JOIN posts ON users.id = posts.user_id WHERE users.created_at >= '2023-01-01' GROUP BY users.id HAVING post_count >= 5 ORDER BY total_likes DESC LIMIT 10; slow query log
  6. Use cases: • make ease to reproduce and fix bugs

    unexpected values missing related record duplicated records
  7. Use cases: • better feature preview • stress test •

    analyse / Business Intelligence data source And… your idea here!
  8. How to implement anonymisation process? 1. copy database and “UPDATE”

    records production production (copy) UPDATE users SET name = 'User1', email = '[email protected]', .... (most common way)
  9. How to implement anonymisation process? 2. setup replica and database

    trigger production production (replica) CREATE TRIGGER after_user_insert AFTER INSERT ON users FOR EACH ROW BEGIN UPDATE users SET username = CONCAT('User', NEW.user_id) WHERE user_id = NEW.user_id; ...
  10. How to implement anonymisation process? 3. proxy connection and interrupt

    SQL on the fly production INSERT INTO users (username, email, ...) VALUES ('JohnDoe', '[email protected]' , ...); production (copy) Connection Proxy INSERT INTO users (username, email, ...) VALUES ('JohnDoe', '[email protected]' , ...); INSERT INTO users (username, email, ...) VALUES ('User1', '[email protected] ', ...);
  11. How to implement anonymisation process? 4. read database and create

    anonymised database production dump anonymised dump production (anonymised) database client (second common way)
  12. How to implement anonymisation process? 4. data leak risk Medium

    Low Low effort Medium High Medium Low config High Medium Low Low time gap realtime realtime Medium Long performance regression No Yes No No 2. DB trigger 3. DB proxy read & create 1. copy & UPDATE High
  13. How to implement anonymisation process? 4. data leak risk Medium

    Low Low effort Medium High Medium Low config High Medium Low Low time gap realtime realtime Medium Long performance regression No Yes No No 2. DB trigger 3. DB proxy read & create 1. copy & UPDATE High
  14. Usage [email protected] [email protected] [email protected][email protected] NULL [email protected] … users:

    email: anonymized+%{n}@example.com users: email?: anonymized+%{n}@example.com
  15. Design Concept: KISS ~ Keep it simple, stupid! ~ “Do

    one thing, and do it well” Unix philosophy ❌ Database connection ❌ File handling ✅ Standard input/output
  16. ① Outside-in Mapping configuration stdin ① Outside-in ② Inside-out Parse/Rebuild

    SQL ③ Inside-out Domains: finding architecture/models Output Line Input Line stdout Processor ⑤ Outside-in ⑤ Outside-in ④ Inside-out
  17. $ mysqldump --complete-insert production_db | masking > anonymised_dump.sql /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:15:in `block

    in initialize': Masking::Error::InsertStatementParseError (Masking::Error::InsertStatementParseError) from <internal:kernel>:90:in `tap' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:14:in `initialize' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `new' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `initialize' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `new' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `mask' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:23:in `block in run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `each_line' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:10:in `run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/cli.rb:17:in `run' from exe/masking:10:in `<main>' let’s try with production data!
  18. $ mysqldump --complete-insert production_db | masking > anonymised_dump.sql /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:15:in `block

    in initialize': Masking::Error::InsertStatementParseError (Masking::Error::InsertStatementParseError) from <internal:kernel>:90:in `tap' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/insert_statement.rb:14:in `initialize' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `new' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/data_mask_processor.rb:12:in `initialize' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `new' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/sql_dump_line.rb:13:in `mask' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:23:in `block in run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `each_line' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:22:in `run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking.rb:10:in `run' from /Users/kibitan/repo/github.com/kibitan/masking/lib/masking/cli.rb:17:in `run' from exe/masking:10:in `<main>' let’s try with production data!
  19. $ mysqldump --complete-insert production_db | masking > anonymised_dump.sql $ mysql

    anonymised_db < anonymised_dump.sql ERROR 1064 (42000) at line 547659: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'smelati’),' at line 547659 the journey to ready for production data
  20. what’s next? • Supporting hashed anonymizing • Parse table schema

    • Strict mode ◦ type validation for config ◦ working without “--complete-insert” option • Performance optimization ◦ Custom command