Slide 1

Slide 1 text

25-27 NOVEMBER
SOFTWARE TESTING, MACHINE LEARNING AND COMPLEX PROCESS ANALYSIS

An approach for creating a synthetic financial transactions dataset based on NDA-protected dataset

Andrey Novikov, Egor Kolesnikov – syndata.io
Luba Konnova, Yuri Silenok, Dmitry Fomin, Ksenia Vorontsova, Daria Degtyarenko – Exactpro

Slide 2

Slide 2 text

SWIFT Hackathon 2021

Challenge 2: Building ‘synthetic’ data-sets required for AI-based product development, whilst protecting privacy

In the digital world, banks are striving to create new intelligent services that improve customer experiences of banking services – all underpinned by machine learning algorithms that recognise transaction types and learn from user behaviour. Building, maintaining and improving such services requires using large datasets to train machine learning models. Yet financial institutions cannot use their own customer datasets due to data protection laws.

Teams must develop novel simulation techniques that maintain the ‘utility’ of the original transaction data, whilst fully protecting the privacy of the institutions involved.

Slide 3

Slide 3 text

Original Dataset Description

The analysed dataset contains around 400 thousand SWIFT MT103 Single Customer Credit Transfer messages.

Slide 4

Slide 4 text

The Nature of the Challenge

● Unique global dataset, interesting to many parties:
   ○ Government analytics (geographies, amounts)
   ○ Central banks (banks, currencies)
   ○ Currency traders (banks, currencies)
   ○ Industry analytics (industries, amounts, distributions)
   ○ …
● Basic obfuscation is potentially reversible.
   ○ Motivating example: a large-volume retailer, legally incorporated in a small city
● Scammers, fraudsters, hackers… use the same ML as intended information consumers.

What valuable information is actually shareable?

Slide 5

Slide 5 text

Solution

● Adjustable dataset generation process for each usage scenario. Leaves a desired level of precision in each of the dimensions:
   ○ Geographies
   ○ Amounts, distributions, attribution to banks/customers
   ○ Banks
● More restrictive usage rights for more precise data
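One way to picture "a desired level of precision in each of the dimensions" is a per-scenario profile that coarsens each field before release. The sketch below is illustrative only: the profile class, the field names, and the rounding rules are assumptions made for this example, not the generator presented in the talk.

```python
from dataclasses import dataclass
from math import floor, log10


@dataclass(frozen=True)
class PrecisionProfile:
    """Hypothetical per-scenario settings; every name here is illustrative."""
    geo_level: str          # "city" keeps the city, "country" drops it
    amount_sig_digits: int  # significant digits kept in amounts
    keep_bank_id: bool      # False collapses banks into a placeholder


def round_sig(value: float, digits: int) -> float:
    """Round a value to the given number of significant digits."""
    if value == 0:
        return 0.0
    return round(value, digits - 1 - floor(log10(abs(value))))


def coarsen(record: dict, profile: PrecisionProfile) -> dict:
    """Return a copy of the record reduced to the profile's precision."""
    out = dict(record)
    if profile.geo_level == "country":
        out["city"] = None
    out["amount"] = round_sig(record["amount"], profile.amount_sig_digits)
    if not profile.keep_bank_id:
        out["sender"] = "BANK"
    return out


# Two assumed usage scenarios: a widely shared view and a restricted one.
PUBLIC = PrecisionProfile(geo_level="country", amount_sig_digits=1, keep_bank_id=False)
RESTRICTED = PrecisionProfile(geo_level="city", amount_sig_digits=4, keep_bank_id=True)

record = {"sender": "BANKAAXX", "city": "Lisbon", "country": "PT", "amount": 12345.67}
public_view = coarsen(record, PUBLIC)
restricted_view = coarsen(record, RESTRICTED)
```

Here the more precise `RESTRICTED` view would be the one released under stricter usage rights.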

Slide 6

Slide 6 text

Generation Process

INPUT
● Initial Dataset
● Senders/Receivers graph parameters
● Ordering/Beneficiary: distributions of addresses, transactions quantities and amounts
● Transaction amounts distributions for sender + receiver + currency + outlier structure
● Charges distributions

SOLUTION
● Generate graph: Sender-Receiver-Currency (Value Date/Currency/Interbank Settled Amount 32A)
● Populate clients base: geo, name (Ordering Customer, Beneficiary Customer: 50A/F/K, 59F)
● Generate amounts (32A: Interbank Settled Amount and 33B: Instructed Amount)
● Populate charges (71A: Details of Charges, 71F: Sender's Charges, 71G: Receiver's Charges)
● Populate blanks, constants
● Assemble SWIFT-style dataset
● Evaluate metrics

METRICS PRESERVED
● #vertices / edges
● #cycles
● Currencies distribution
● Triangulation index
● Connectivity components number + sizes distribution
● InEdge/OutEdge distribution
● Addresses distribution in Sender/Receiver/Currency bins
● 50A/F/K, 59 distribution, blanks percentage
● Amounts distributions in each Sender/Receiver/Currency/Geo bin
● Charges distribution
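Several stages of the process refer to MT103 field 32A (Value Date/Currency/Interbank Settled Amount). As a small sketch of how such a value might be split into its three parts before any distribution is computed, the function below parses the standard layout: six digits of date (YYMMDD), a three-letter currency code, and an amount with a comma as the decimal separator. The function name and the sample value are made up for illustration.

```python
import re
from datetime import date, datetime
from decimal import Decimal

# 6 digits (YYMMDD), 3 letters (currency), amount with a mandatory decimal comma.
FIELD_32A = re.compile(r"^(\d{6})([A-Z]{3})(\d+,\d*)$")


def parse_32a(value: str) -> tuple[date, str, Decimal]:
    """Split a 32A field value into (value date, currency, amount)."""
    match = FIELD_32A.match(value)
    if match is None:
        raise ValueError(f"not a valid 32A value: {value!r}")
    yymmdd, currency, amount = match.groups()
    value_date = datetime.strptime(yymmdd, "%y%m%d").date()
    # "1234,56" -> "1234.56"; "1234," (no decimals) -> "1234"
    normalised = amount.replace(",", ".").rstrip(".")
    return value_date, currency, Decimal(normalised)


# Made-up sample value, not taken from the analysed dataset.
parsed = parse_32a("211125USD1234,56")
```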

Slide 7

Slide 7 text

Dataset: EDA

As a result of the analysis, it was discovered:
– how many unique Senders/Receivers there are;
– which columns are blank;
– which columns are obfuscated;
– how column values interrelate;
– how transaction volumes are distributed per Sender, per Sender/Receiver, per Sender/Receiver/Currency;
– how Sender/Receiver charges are populated;
– how Ordering Customer/Beneficiary Customer are distributed per Sender/Receiver, and the geography of their distribution.
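A few of the checks listed above can be sketched with the standard library alone. The rows and column names below (including the always-empty `field_72`) are toy stand-ins assumed for this example; they are not taken from the NDA-protected dataset.

```python
from collections import Counter

# Toy rows standing in for parsed MT103 messages; column names are assumptions.
rows = [
    {"sender": "AAA", "receiver": "BBB", "currency": "USD", "amount": 100.0, "field_72": ""},
    {"sender": "AAA", "receiver": "BBB", "currency": "USD", "amount": 250.0, "field_72": ""},
    {"sender": "AAA", "receiver": "CCC", "currency": "EUR", "amount": 40.0, "field_72": ""},
    {"sender": "BBB", "receiver": "AAA", "currency": "USD", "amount": 75.0, "field_72": ""},
]

# How many unique Senders/Receivers there are.
unique_senders = {r["sender"] for r in rows}
unique_receivers = {r["receiver"] for r in rows}

# Which columns are blank in every row.
blank_columns = [col for col in rows[0] if all(r[col] in ("", None) for r in rows)]

# Transaction counts at three levels of grouping.
per_sender = Counter(r["sender"] for r in rows)
per_pair = Counter((r["sender"], r["receiver"]) for r in rows)
per_pair_currency = Counter((r["sender"], r["receiver"], r["currency"]) for r in rows)
```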

Slide 8

Slide 8 text

Banks/Currency Graph

Slide 9

Slide 9 text

Generating the Bank/Currency Graph

Scalability: generate a larger graph with similar relations:
● #nodes / #edges
● Connectivity rate
● Triangulation index
● Distributions of volumes and currencies vs edge types
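To compare an original and a generated graph on the relations listed above, the metrics have to be computed on both. The sketch below computes node and edge counts, connected-component sizes, and transitivity for an undirected view of the graph. Treating the "triangulation index" as transitivity (closed triples divided by all connected triples) is an assumption made for this example; the slides do not define the term.

```python
from collections import defaultdict
from itertools import combinations


def graph_metrics(edges):
    """Compute basic metrics for (sender, receiver) pairs, ignoring direction."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    adj = dict(neighbours)

    n_nodes = len(adj)
    n_edges = sum(len(nbs) for nbs in adj.values()) // 2

    # Connected components via iterative depth-first search.
    seen, component_sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        component_sizes.append(size)

    # Transitivity: share of connected triples whose end points are also linked.
    triples = closed = 0
    for nbs in adj.values():
        for u, v in combinations(nbs, 2):
            triples += 1
            if v in adj[u]:
                closed += 1

    return {
        "nodes": n_nodes,
        "edges": n_edges,
        "component_sizes": sorted(component_sizes),
        "transitivity": closed / triples if triples else 0.0,
    }


# Toy graph: one triangle with a pendant node, plus a separate pair.
metrics = graph_metrics([("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("E", "F")])
```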

Slide 10

Slide 10 text

Generation Process (diagram repeated from Slide 6)

Slide 11

Slide 11 text

Geographies

            Name      Zip code  Street     City    Country
Initial     Kevin     123456    Lenina st  Moscow  RU
Obfuscated  Patricia  481516    Kirova st  ?

- Faker lib

Slide 12

Slide 12 text

Geographies

City     Country  Longitude  Latitude
Tokyo    JP       139.656    35.674
Rio      BR       -43.183    -22.907
Nairobi  KE       36.809     -1.299
Denver   US       -105.012   39.748
Berlin   DE       13.376     52.518
...      ...      ...        ...
Lisbon   PT       -9.177     38.684

- API (e.g. mapquest.com)
- Geopy lib

Slide 13

Slide 13 text

Geographies

Slide 14

Slide 14 text

Geographies

[Figure panels: "From the original dataset" and "From publicly available dataset"]

Slide 15

Slide 15 text

Geographies

            Name      Zip code  Street     City     Country
Initial     Kevin     123456    Lenina st  Moscow   RU
Obfuscated  Patricia  481516    Kirova st  Obninsk  RU

- Faker lib
- API
- Geopy lib

Slide 16

Slide 16 text

The Principles of Geographies Generation

● Replicating densities in geo areas, keeping or modifying the number of cities.
● Level of distortion is adjustable
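The two principles can be sketched as sampling synthetic locations in proportion to the observed customer density per city, then shifting each point by a bounded random offset. The function name, the per-city counts, and the use of a uniform offset in degrees are assumptions for illustration; the coordinates reuse rows of the city table from the earlier Geographies slide.

```python
import random


def synth_locations(city_counts, coords, n, distortion_deg, seed=0):
    """Draw n (lat, lon) points, weighted by city_counts, jittered by up to distortion_deg."""
    rng = random.Random(seed)
    cities = list(city_counts)
    weights = [city_counts[c] for c in cities]
    points = []
    for city in rng.choices(cities, weights=weights, k=n):
        lat, lon = coords[city]
        points.append((
            lat + rng.uniform(-distortion_deg, distortion_deg),
            lon + rng.uniform(-distortion_deg, distortion_deg),
        ))
    return points


# (latitude, longitude) pairs taken from the city table.
coords = {"Tokyo": (35.674, 139.656), "Berlin": (52.518, 13.376), "Lisbon": (38.684, -9.177)}
# Hypothetical customer counts per city.
counts = {"Tokyo": 50, "Berlin": 30, "Lisbon": 20}

exact = synth_locations(counts, coords, n=50, distortion_deg=0.0)
blurred = synth_locations(counts, coords, n=50, distortion_deg=0.5)
```

Setting `distortion_deg` to zero reproduces the source cities exactly, while larger values spread points around them, which is one simple reading of an "adjustable" level of distortion.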

Slide 17

Slide 17 text

Generation Process (diagram repeated from Slide 6)

Slide 18

Slide 18 text

Generating Transactions

Step 1: Generate end customer entities (initiators, beneficiaries) as multi-dimensional random variables:
● The customer is a client of one or more banks
● In a specific geo area
● Has a certain number of (in, out) transactions and a distribution of their volumes and currencies.

Numbers and correlations are preserved.
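A minimal way to keep correlations between a customer's attributes is to resample whole attribute tuples from the empirical joint distribution instead of drawing each attribute independently. The sketch below does exactly that; it is a stand-in technique chosen for brevity, and the tuple layout, identifiers, and toy profiles are assumptions rather than the model used by the authors.

```python
import random
from collections import Counter


def sample_customers(observed_profiles, n, seed=0):
    """Draw n synthetic customers from the empirical joint distribution of profiles.

    Each profile is a tuple (bank, geo_area, n_out, n_in). Sampling whole tuples
    means combinations never observed together are never generated.
    """
    rng = random.Random(seed)
    joint = Counter(observed_profiles)
    combos = list(joint)
    weights = [joint[c] for c in combos]
    customers = []
    for i, (bank, geo_area, n_out, n_in) in enumerate(rng.choices(combos, weights=weights, k=n)):
        customers.append({
            "customer_id": f"CUST{i:06d}",  # synthetic identifier
            "bank": bank,
            "geo_area": geo_area,
            "n_out": n_out,
            "n_in": n_in,
        })
    return customers


# Toy observed profiles; values are illustrative only.
observed = [
    ("BANK_A", "area_1", 10, 2),
    ("BANK_A", "area_1", 10, 2),
    ("BANK_A", "area_2", 1, 0),
    ("BANK_B", "area_2", 3, 3),
]
customers = sample_customers(observed, n=100)
```

Pure resampling cannot produce profiles that were never observed, so a real generator would likely add smoothing or a fitted model on top.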

Slide 19

Slide 19 text


Slide 20

Slide 20 text

Generating Transactions

Step 2: Generate transactions and distribute them between customers, filling their patterns
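One simple reading of "filling their patterns" is that every customer has a quota of outgoing and incoming transactions, and the generated transactions must use up those quotas exactly. The sketch below expands the quotas into slots and pairs them at random; the function and the quotas are hypothetical, and amounts, currencies, and bank-level constraints are left out.

```python
import random
from collections import Counter


def assign_transactions(out_quota, in_quota, seed=0):
    """Pair sender slots with receiver slots so that every quota is met exactly."""
    if sum(out_quota.values()) != sum(in_quota.values()):
        raise ValueError("total outgoing and incoming quotas must match")
    rng = random.Random(seed)
    # One slot per expected transaction on each side.
    senders = [c for c, k in out_quota.items() for _ in range(k)]
    receivers = [c for c, k in in_quota.items() for _ in range(k)]
    rng.shuffle(receivers)
    return list(zip(senders, receivers))


# Hypothetical quotas: initiators c1, c2 and beneficiaries c3, c4.
transactions = assign_transactions({"c1": 3, "c2": 1}, {"c3": 2, "c4": 2})
sent = Counter(s for s, _ in transactions)
received = Counter(r for _, r in transactions)
```

If a customer appears on both sides, this naive pairing can produce self-payments, which a fuller implementation would need to reject or re-draw.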

Slide 21

Slide 21 text


Slide 22

Slide 22 text

Generating Transactions

Step 3: Substitute geographies with obfuscated ones, keeping geo correlations
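Keeping geo correlations implies at least that customers who shared a city before substitution still share one afterwards. A minimal sketch of this is a one-to-one mapping from each original city to a substitute in the same country, applied consistently to all records. The function name and the substitute pool are assumptions for illustration; in practice the pool would come from a public city list.

```python
import random


def build_city_mapping(original_cities, substitute_pool, seed=0):
    """Map each (country, city) to a distinct substitute city in the same country."""
    rng = random.Random(seed)
    by_country = {}
    for country, city in sorted(set(original_cities)):
        by_country.setdefault(country, []).append(city)

    mapping = {}
    for country, cities in by_country.items():
        pool = list(substitute_pool[country])
        if len(pool) < len(cities):
            raise ValueError(f"not enough substitute cities for {country}")
        rng.shuffle(pool)
        for city, substitute in zip(cities, pool):
            mapping[(country, city)] = substitute
    return mapping


# Toy inputs; the pool of substitute cities is hypothetical.
original = [("RU", "Moscow"), ("RU", "Kazan"), ("PT", "Lisbon"), ("RU", "Moscow")]
pool = {"RU": ["Obninsk", "Tula", "Kaluga"], "PT": ["Porto", "Braga"]}

mapping = build_city_mapping(original, pool)
obfuscated = [(country, mapping[(country, city)]) for country, city in original]
```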

Slide 23

Slide 23 text


Slide 24

Slide 24 text

Generation Process (diagram repeated from Slide 6)

Slide 25

Slide 25 text

Question remaining: To what extent should the data be “spoiled”?

If any data can be potentially misused, what is the “necessary and sufficient” level of data obfuscation?

Considerations:
● Commercial common sense (competition, etc.)
● Legal responsibility (GDPR, etc.)

Slide 26

Slide 26 text


Slide 27

Slide 27 text

Thank You

Follow TMPA on Facebook
TMPA-2021 Conference