Detection of Sensitive Personal Data in Large Data Sets Using Ray with Deep Learning (Srinivasa Rao Aravilli, Visa)

Detection of sensitive data in large data sets using Ray
Srinivasa Rao Aravilli, Director, Visa, https://www.linkedin.com/in/aravilli/

Data Protection - Use case
Enterprise companies collect and process large amounts of data, and this data grows exponentially. The data can be structured and/or unstructured, and it may contain sensitive information. Detecting and protecting sensitive data is extremely important because of the directives and laws imposed by regulations in several countries (GDPR, CCPA, LGPD, IDP Bill). Image/data source: United Nations

Traditional Approaches for Sensitive Data Detection & Data Loss Prevention
• Static rules with custom code
• Regular expressions
• Static sampling

Rule based / RegEx based

Table       Column Name
Employee    First_Name
Country     C_Name
Product     Product_Name
Telemetry   Metric_Name

Comparison columns on the slide: RegEx Classification, Actual

Limitations
• Missing context
• High to moderate false positive rate
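The false positive problem can be illustrated with a minimal sketch. It assumes a hypothetical rule that flags any column whose name contains "name"; the deck does not show the actual expressions used. All four columns from the slide match, although presumably only Employee.First_Name holds a person's name.

```python
import re

# Hypothetical rule (an assumption, not from the deck):
# flag any column whose name contains "name".
NAME_RULE = re.compile(r"name", re.IGNORECASE)

# (table, column) pairs taken from the slide.
COLUMNS = [
    ("Employee", "First_Name"),
    ("Country", "C_Name"),
    ("Product", "Product_Name"),
    ("Telemetry", "Metric_Name"),
]


def flag_by_regex(columns):
    """Return the (table, column) pairs whose column name matches the rule."""
    return [(table, column) for table, column in columns if NAME_RULE.search(column)]


flagged = flag_by_regex(COLUMNS)
for table, column in flagged:
    print(f"{table}.{column}: flagged as a personal name")
```

The rule sees only the column name, so it cannot tell a person's first name from a product or metric name. This is the missing context the slide refers to.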

Static Sampling
• Based on the number of rows in the structured data
• Based on a % of the data (rows) in the structured data
• Based on file size (semi-structured/unstructured data) or number of files

Limitations
• The same configuration is applied to all data sets, or different static configurations are chosen per data set (HR data, marketing data, sales data, log data, transactional data, telemetry data, etc.)
• No learning based on the scanned data / scan results

Approaches to overcome the limitations
• Context-aware sensitive data detection using machine learning/deep learning
• Dynamic sampling based on the accuracy of the sensitive data scans
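The deck does not spell out the dynamic sampling rule, so the following is only a sketch of one way accuracy could drive the sample size. The function name, the 0.95 target, the doubling/halving step and the bounds are all assumptions made for illustration.

```python
def next_sample_fraction(current_fraction, scan_accuracy,
                         target_accuracy=0.95, step=2.0,
                         min_fraction=0.01, max_fraction=1.0):
    """Pick the sampling fraction for the next scan of a data set.

    Hypothetical rule: if the last scan was less accurate than the target,
    sample more rows; if it met the target, sample fewer rows to save compute.
    """
    if scan_accuracy < target_accuracy:
        proposed = current_fraction * step
    else:
        proposed = current_fraction / step
    # Keep the fraction inside the allowed range.
    return max(min_fraction, min(max_fraction, proposed))


fraction = 0.05
for accuracy in [0.80, 0.90, 0.97]:
    fraction = next_sample_fraction(fraction, accuracy)
    print(f"accuracy={accuracy:.2f} -> next sample fraction={fraction:.2f}")
```

Unlike a static configuration, the fraction here is adjusted per data set from the results of earlier scans.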

ML/DL Approach using Spark
• Load Sensitive Data Set
• Data Pre-processing
• Features Generation
• Word Embedding (Context Vectors): word2vec, GloVe, fastText, BERT
• Classifiers (MLlib): Logistic Regression, Decision Tree, Support Vector Machine, Random Forest
• Cross Validation
• Hyper Parameters Tuning
• Sensitivity Classifiers Evaluation & Selection
• Accuracy-based Sampling

Sensitive Data Detection using Ray: Ray Functions and Features used
• ray.init()
• ray.shutdown()
• @ray.remote
• num_cpus
• num_returns
• .remote()
• ray.get()
• Ray Actors
• Ray Dashboard

Sensitive Data Detection using Ray: Data Loading and Feature Generation

Table       Column Name    Data Type   Data Length   Nullable (Yes/No)   Sensitive   Type
Employee    First_Name     CHAR        100           No                  Yes         F_NAME
Product     Product_Name   CHAR        255           No                  No          F_NAME
Telemetry   Metric_Name                255           No                  No          F_NAME
Country     Country_Name   CHAR        150           No                  No          F_NAME
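As a rough sketch of the feature generation step, each metadata row can be flattened into a list of string tokens, which is the kind of input an embedding model expects. The dictionary keys and the helper name are assumptions, and the missing data type for Telemetry is kept as missing rather than guessed.

```python
# Metadata rows copied from the slide. None marks the data type that is
# missing on the slide for Telemetry.Metric_Name.
ROWS = [
    {"table": "Employee", "column": "First_Name", "data_type": "CHAR",
     "length": 100, "nullable": "No", "sensitive": "Yes"},
    {"table": "Product", "column": "Product_Name", "data_type": "CHAR",
     "length": 255, "nullable": "No", "sensitive": "No"},
    {"table": "Telemetry", "column": "Metric_Name", "data_type": None,
     "length": 255, "nullable": "No", "sensitive": "No"},
    {"table": "Country", "column": "Country_Name", "data_type": "CHAR",
     "length": 150, "nullable": "No", "sensitive": "No"},
]


def to_tokens(row):
    """Turn one metadata row into a token list, skipping missing values."""
    fields = [row["table"], row["column"], row["data_type"],
              row["length"], row["nullable"]]
    return [str(value) for value in fields if value is not None]


token_lists = [to_tokens(row) for row in ROWS]
for tokens in token_lists:
    print(tokens)
```

The "sensitive" field is left out of the tokens on purpose, because it is the label the model is meant to predict.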

Sensitive Data Detection using Ray: Word-embeddings using Gensim
Input_vec: [Employee, FirstName, CHAR, 100, No]
vec_features: [-0.0041883737, -0.001030758, -0.00023175575, …]

Sensitive Data Detection using Ray: Label Extraction, Train/Test Data for Model, Input/Output Features
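The slide names these steps without showing them. The sketch below is one plausible reading: the Sensitive column becomes a binary label and the rows are split into training and test sets. The helper names, the 25% test fraction and the fixed seed are assumptions, and raw token lists stand in for the embedding vectors.

```python
import random

# (tokens, sensitive flag) pairs built from the metadata table in the deck.
RECORDS = [
    (["Employee", "First_Name", "CHAR", "100", "No"], "Yes"),
    (["Product", "Product_Name", "CHAR", "255", "No"], "No"),
    (["Telemetry", "Metric_Name", "255", "No"], "No"),
    (["Country", "Country_Name", "CHAR", "150", "No"], "No"),
]


def extract_labels(records):
    """Split records into model inputs and binary labels (1 = sensitive)."""
    inputs = [tokens for tokens, _ in records]
    labels = [1 if flag == "Yes" else 0 for _, flag in records]
    return inputs, labels


def train_test_split(inputs, labels, test_fraction=0.25, seed=42):
    """Shuffle indices with a fixed seed, then hold out a fraction for testing."""
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(inputs, train_idx), pick(inputs, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


inputs, labels = extract_labels(RECORDS)
X_train, X_test, y_train, y_test = train_test_split(inputs, labels)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

With only one sensitive row in four, a real data set would likely need a stratified split so that both sets contain sensitive examples.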

Sensitive Data Detection using Ray: Model Training, Prediction

Future journey of Ray
• Benchmark Spark ML vs. Ray with Spark
• Ray using a cluster of multiple CPUs and GPUs
• Benchmark Ray Serve vs. TensorFlow Serving
• Ray Cluster & Ray Tune

Thank you. Be safe in the pandemic.
