TMPA-2021: An approach to create a synthetic financial transactions dataset based on NDA-protected dataset

Luba Konnova, Yuri Silenok, Dmitry Fomin, Andrey Novikov, Egor Kolesnikov, Ksenia Vorontsova and Daria Degtyarenko, Exactpro

TMPA is an annual International Conference on Software Testing, Machine Learning and Complex Process Analysis. The conference will focus on the application of modern methods of data science to the analysis of software quality.

To learn more about Exactpro, visit our website https://exactpro.com/

Follow us on
LinkedIn https://www.linkedin.com/company/exactpro-systems-llc
Twitter https://twitter.com/exactpro

Exactpro

November 27, 2021
Transcript

  1. Software Testing, Machine Learning and Complex Process Analysis, 25-27 November

     An approach for creating a synthetic financial transactions dataset based on NDA-protected dataset
     Andrey Novikov, Egor Kolesnikov – syndata.io
     Luba Konnova, Yuri Silenok, Dmitry Fomin, Ksenia Vorontsova, Daria Degtyarenko – Exactpro
  2. SWIFT Hackathon 2021, Challenge 2: Building ‘synthetic’ data-sets required for AI-based product development, whilst protecting privacy

     In the digital world, banks are striving to create new intelligent services that improve customer experiences of banking services, all underpinned by machine learning algorithms that recognise transaction types and learn from user behaviour. Building, maintaining and improving such services requires using large datasets to train machine learning models. Yet financial institutions cannot use their own customer datasets due to data protection laws. Teams must develop novel simulation techniques that maintain the ‘utility’ of the original transaction data, whilst fully protecting the privacy of the institutions involved.
  3. Original Dataset Description

     The analysed dataset contains around 400 thousand SWIFT MT103 Single Customer Credit Transfer messages.
  4. The Nature of the Challenge

     • Unique global dataset, interesting to many parties:
       ◦ Government analytics (geographies, amounts)
       ◦ Central banks (banks, currencies)
       ◦ Currency traders (banks, currencies)
       ◦ Industry analytics (industries, amounts, distributions)
       ◦ …
     • Basic obfuscation is potentially reversible.
       ◦ Motivating example: a large-volume retailer, legally incorporated in a small city
     • Scammers, fraudsters, hackers… use the same ML as intended information consumers.
     What valuable information is actually shareable?
  5. Solution

     • Adjustable dataset generation process for each usage scenario. Leaves a desired level of precision in each of the dimensions:
       ◦ Geographies
       ◦ Amounts, distributions, attribution to banks/customers
       ◦ Banks
     • More restrictive usage rights for more precise data
  6. Generation Process

     INPUT
     • Initial Dataset
     • Senders/Receivers graph parameters
     • Ordering/Beneficiary: distributions of addresses, transactions quantities and amounts
     • Transaction amounts distributions for sender + receiver + currency + outlier structure
     • Charges distributions

     SOLUTION
     • Generate graph: Sender-Receiver-Currency (Value Date/Currency/Interbank Settled Amount 32A)
     • Populate clients base: geo, name (Ordering Customer, Beneficiary Customer 50A/F/K, 59F)
     • Generate amounts (32A: Interbank Settled Amount and 33B: Instructed Amount)
     • Populate charges (71A: Details of Charges, 71F: Sender's Charges, 71G: Receiver's Charges)
     • Populate blanks, constants
     • Assemble SWIFT-style dataset
     • Evaluate metrics

     METRICS PRESERVED
     • #vertices / edges
     • #cycles
     • Currencies distribution
     • Triangulation index
     • Connectivity components number + sizes distribution
     • InEdge/OutEdge distribution
     • Addresses distribution in Sender/Receiver/Currency bins
     • 50A/F/K, 59 distribution, blanks percentage
     • Amounts distributions in each Sender/Receiver/Currency/Geo bin
     • Charges distribution
  7. Dataset: EDA

     As a result of the analysis, it was discovered:
     – how many unique Senders/Receivers there are;
     – which columns are blank;
     – which columns are obfuscated;
     – how column values interrelate;
     – how transaction volumes are distributed per Sender, per Sender/Receiver, per Sender/Receiver/currency;
     – how Sender/Receiver charges are populated;
     – how Ordering Customer/Beneficiary Customer are distributed per Sender/Receiver, and the geography of their distribution.
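The first few EDA questions on this slide can be illustrated with a minimal sketch. It assumes each message has already been parsed into a dictionary; the field names (`sender`, `receiver`, `currency`, `charges`, `amount`) and the toy records are hypothetical and are not taken from the NDA-protected dataset.

```python
from collections import Counter

# Hypothetical toy records; names and values are illustrative assumptions only.
records = [
    {"sender": "BANKA", "receiver": "BANKB", "currency": "USD", "charges": "", "amount": 120.0},
    {"sender": "BANKA", "receiver": "BANKB", "currency": "USD", "charges": "", "amount": 80.0},
    {"sender": "BANKA", "receiver": "BANKC", "currency": "EUR", "charges": "", "amount": 300.0},
    {"sender": "BANKB", "receiver": "BANKC", "currency": "EUR", "charges": "", "amount": 50.0},
]

def profile(rows):
    """Summarise unique parties, blank columns and transaction counts per bin."""
    columns = list(rows[0].keys())
    # A column counts as blank when every row holds an empty value in it.
    blank = sorted(c for c in columns if all(r[c] in ("", None) for r in rows))
    return {
        "unique_senders": len({r["sender"] for r in rows}),
        "unique_receivers": len({r["receiver"] for r in rows}),
        "blank_columns": blank,
        "per_sender": Counter(r["sender"] for r in rows),
        "per_pair": Counter((r["sender"], r["receiver"]) for r in rows),
        "per_pair_currency": Counter(
            (r["sender"], r["receiver"], r["currency"]) for r in rows
        ),
    }

summary = profile(records)
```

The per-bin counters correspond to the "per Sender, per Sender/Receiver, per Sender/Receiver/currency" breakdown; detecting obfuscated columns and column interrelations would need additional checks that this sketch does not attempt.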
  8. Banks/Currency Graph

  9. Generating the Bank/Currency Graph

     Scalability: generate a larger graph with similar relations:
     • #nodes / #edges
     • Connectivity rate
     • Triangulation index
     • Distributions of volumes and currencies vs edge types
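To make the listed graph relations concrete, here is a minimal standard-library sketch that measures node count, edge count, connected component sizes and the number of triangles for a sender-receiver edge list. The deck does not define "triangulation index" or "connectivity rate", so the triangle count and component sizes are stand-ins chosen for illustration, and the graph is treated as undirected, which is also an assumption.

```python
from collections import defaultdict

def graph_metrics(edges):
    """Compute simple structural metrics of an undirected graph.

    `edges` is an iterable of (sender, receiver) pairs; self-loops are ignored.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:
            adjacency[u].add(v)
            adjacency[v].add(u)

    # Each undirected edge appears in two adjacency sets.
    n_edges = sum(len(neighbours) for neighbours in adjacency.values()) // 2

    # Connected components via iterative depth-first search.
    seen, component_sizes = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        component_sizes.append(size)

    # Common neighbours of both endpoints close a triangle; every triangle
    # is found once per each of its three edges, hence the division by 3.
    triangles = sum(
        len(adjacency[u] & adjacency[v])
        for u in adjacency
        for v in adjacency[u]
        if u < v
    ) // 3

    return {
        "nodes": len(adjacency),
        "edges": n_edges,
        "component_sizes": sorted(component_sizes, reverse=True),
        "triangles": triangles,
    }

metrics = graph_metrics(
    [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("E", "F")]
)
```

Comparing these numbers between the original and a generated graph is one possible way to check that a scaled-up graph keeps similar relations.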
  10. Generation Process (same diagram as slide 6)
  11. Geographies

     Initial: Name Kevin, Zip code 123456, Street Lenina st, City Moscow, Country RU
     Obfuscated: Name Patricia, Zip code 481516, Street Kirova st (Faker lib); City/Country: ?
  12. Geographies

     City      Country   Longitude   Latitude
     Tokyo     JP        139.656     35.674
     Rio       BR        -43.183     -22.907
     Nairobi   KE        36.809      -1.299
     Denver    US        -105.012    39.748
     Berlin    DE        13.376      52.518
     ...       ...       ...         ...
     Lisbon    PT        -9.177      38.684

     Sources: API (e.g. mapquest.com), Geopy lib
  13. Geographies

  14. Geographies: from the original dataset vs. from a publicly available dataset

  15. Geographies

     Initial: Name Kevin, Zip code 123456, Street Lenina st, City Moscow, Country RU
     Obfuscated: Name Patricia, Zip code 481516, Street Kirova st (Faker lib); City Obninsk, Country RU (API, Geopy lib)
  16. The Principles of Geographies Generation

     • Replicating densities in geo areas, keeping or modifying the number of cities.
     • Level of distortion is adjustable
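One way to picture an adjustable level of distortion is a radius-based city substitution: a city is replaced by another city within a chosen distance. The sketch below is an illustration under assumptions, not the authors' method. It uses the coordinates listed on the Geographies slide, the haversine great-circle distance, and a hypothetical `max_km` parameter as the distortion knob.

```python
import math
import random

# (latitude, longitude) pairs as listed on the Geographies slide.
CITIES = {
    "Tokyo": (35.674, 139.656),
    "Rio": (-22.907, -43.183),
    "Nairobi": (-1.299, 36.809),
    "Denver": (39.748, -105.012),
    "Berlin": (52.518, 13.376),
    "Lisbon": (38.684, -9.177),
}

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def substitute_city(city, max_km, rng):
    """Pick a replacement city within max_km of the original.

    Assumption: if no other city lies within the radius, fall back to the
    nearest one so that the original city is never emitted unchanged.
    """
    origin = CITIES[city]
    distances = {
        name: haversine_km(origin, position)
        for name, position in CITIES.items()
        if name != city
    }
    nearby = sorted(name for name, d in distances.items() if d <= max_km)
    if nearby:
        return rng.choice(nearby)
    return min(distances, key=distances.get)

rng = random.Random(42)
replacement = substitute_city("Berlin", max_km=3000, rng=rng)
```

A larger `max_km` gives stronger obfuscation at the cost of geographic fidelity. Replicating densities per geo area, as the slide describes, would additionally require weighting the candidates, which this sketch leaves out.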
  17. Generation Process (same diagram as slide 6)
  18. Generating Transactions

     Step 1: Generate end customer entities (initiators, beneficiaries) as multi-dimensional random variables:
     • The customer is a client of a certain bank(s)
     • In a specific geo area
     • Has a certain number of (in, out) transactions and a distribution of their volumes and currencies.
     Numbers and correlations are preserved.
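The deck does not say how the multi-dimensional sampling is implemented. One simple way to keep the correlations between dimensions is to resample whole observed profiles rather than each attribute independently. The sketch below assumes that approach; the profile tuples `(bank, geo_area, n_out, n_in)` and all values in them are hypothetical.

```python
import random
from collections import Counter

# Hypothetical observed customer profiles: (bank, geo_area, n_out, n_in).
observed_profiles = [
    ("BANKA", "area-1", 12, 3),
    ("BANKA", "area-1", 12, 3),
    ("BANKA", "area-2", 1, 0),
    ("BANKB", "area-2", 4, 9),
]

def generate_customers(profiles, n, seed=0):
    """Draw n synthetic customers from the empirical joint distribution.

    Sampling complete tuples keeps the joint frequencies of bank, geo area
    and transaction counts, so correlations between them carry over.
    """
    rng = random.Random(seed)
    frequencies = Counter(profiles)
    keys = list(frequencies)
    weights = [frequencies[k] for k in keys]
    drawn = rng.choices(keys, weights=weights, k=n)
    return [
        {
            "customer_id": f"C{index:05d}",
            "bank": bank,
            "geo_area": geo_area,
            "n_out": n_out,
            "n_in": n_in,
        }
        for index, (bank, geo_area, n_out, n_in) in enumerate(drawn)
    ]

customers = generate_customers(observed_profiles, n=10, seed=7)
```

Resampling exact tuples reproduces observed combinations verbatim, so a real pipeline would likely add noise or binning before release; that step is outside this sketch.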
  20. Generating Transactions

     Step 2: Generate transactions and distribute them between customers, filling their patterns
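As a rough illustration of "filling their patterns", the sketch below treats each customer's outgoing and incoming transaction counts as quotas and pairs outgoing slots with incoming slots at random. This is an assumed mechanism, not one stated in the deck; the field names `n_out` and `n_in` are hypothetical, and amounts, currencies and bank routing are left out.

```python
import random
from collections import Counter

# Hypothetical customers with outgoing/incoming transaction quotas.
customers = [
    {"customer_id": "C1", "n_out": 2, "n_in": 1},
    {"customer_id": "C2", "n_out": 1, "n_in": 2},
    {"customer_id": "C3", "n_out": 1, "n_in": 1},
]

def distribute_transactions(customers, seed=0):
    """Pair outgoing slots with incoming slots until one side runs out.

    Note: this simple pairing may match a customer with itself; a fuller
    version would reject or re-draw such pairs.
    """
    rng = random.Random(seed)
    out_slots = [c["customer_id"] for c in customers for _ in range(c["n_out"])]
    in_slots = [c["customer_id"] for c in customers for _ in range(c["n_in"])]
    rng.shuffle(out_slots)
    rng.shuffle(in_slots)
    return [
        {"ordering_customer": sender, "beneficiary_customer": receiver}
        for sender, receiver in zip(out_slots, in_slots)
    ]

transactions = distribute_transactions(customers, seed=3)
sent = Counter(t["ordering_customer"] for t in transactions)
received = Counter(t["beneficiary_customer"] for t in transactions)
```

When the total outgoing and incoming quotas match, as in this toy data, every customer's pattern is filled exactly.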
  22. Generating Transactions

     Step 3: Substitute geographies with obfuscated ones, keeping geo correlations
  24. Generation Process (same diagram as slide 6)
  25. Question remaining: To what extent should the data be “spoiled”?

     If any data can be potentially misused, what is the “necessary and sufficient” level of data obfuscation?
     Considerations:
     • Commercial common sense (competition, etc.)
     • Legal responsibility (GDPR, etc.)
  27. Thank You

     Follow TMPA on Facebook
     TMPA-2021 Conference