Detection of Sensitive Personal Data in Large Data Sets Using Ray with Deep Learning (Srinivasa Rao Aravilli, Visa)

In this talk, Srinivasa discusses:

* Traditional approaches to detecting sensitive data for data protection and data loss prevention, and the drawbacks of those approaches
* Novel ways of using ML, DL, and NLP along with Spark/Ray to detect sensitive data at petabyte scale
* Novel approaches/algorithms to address the scaling challenges in detecting sensitive data
* Benchmark numbers and accuracy results with the novel approach


July 14, 2021



  1. Detection of sensitive data in large data sets using Ray

    Srinivasa Rao Aravilli, Director, Visa. https://www.linkedin.com/in/aravilli/
  2. Data Protection - Use case

    Enterprise companies collect and process large amounts of data, and this data grows exponentially. The data can be structured and/or unstructured, and it may contain sensitive information. Detecting and protecting sensitive data is extremely important because of the directives and laws imposed by various regulations (GDPR, CCPA, LGPD, IDP Bill) in several countries. Image/Data Source: United Nations
  3. Traditional Approaches for Sensitive Data Detection & Data Loss Prevention

    • Static rules with custom code
    • Regular expressions
    • Static sampling
  4. Rule based / RegEx based

    Table      Column Name
    Employee   First_Name
    Country    C_Name
    Product    Product_Name
    Telemetry  Metric_Name

    Other columns on the slide: RegEx Classification, Actual (values not captured in the transcript)

    Limitations
    • Missing context
    • High to moderate false positive rate
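
The false-positive problem can be illustrated with a minimal sketch. The rule set below is an assumption made for illustration (the talk does not show its actual rules): a single hypothetical pattern that flags any column whose name ends in "name". Because the rule sees only the column name and not its context, it flags all four columns from the slide, although only Employee.First_Name is marked as sensitive later in the deck.

```python
import re
from typing import Optional

# Hypothetical rule set (not from the talk): one pattern per sensitive type,
# matched against the column name only.
RULES = {
    "F_NAME": re.compile(r"(^|_)name$", re.IGNORECASE),
}

def classify_column(column_name: str) -> Optional[str]:
    """Return the first sensitive type whose pattern matches, else None."""
    for sensitive_type, pattern in RULES.items():
        if pattern.search(column_name):
            return sensitive_type
    return None

columns = [
    ("Employee", "First_Name"),
    ("Country", "C_Name"),
    ("Product", "Product_Name"),
    ("Telemetry", "Metric_Name"),
]

# Every column is flagged as a first name, regardless of the table it lives in.
for table, column in columns:
    print(f"{table}.{column}: {classify_column(column)}")
```

Tightening the pattern would remove some of these matches, but only by hard-coding knowledge about each table, which is the "static rules with custom code" approach the talk lists as a limitation.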
  5. Static Sampling

    • Based on number of rows in the structured data
    • Based on % of data (rows) in the structured data
    • File size (semi-structured/unstructured data) or number of files

    Limitations
    • Same configuration applied to all data sets, or different static configurations based on the data set
    • HR data, marketing data, sales data, log data, transactional data, telemetry data, etc.
    • No learning based on the scanned data / scan results
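
As a rough sketch of what static sampling looks like in practice, the two helpers below sample by a fixed row count and by a fixed percentage. The function names, the seeds, and the toy data sets are assumptions for illustration, not code from the talk.

```python
import random
from typing import List, Sequence

def sample_fixed_rows(rows: Sequence, n_rows: int, seed: int = 0) -> List:
    """Static sampling by row count: scan at most n_rows rows."""
    rng = random.Random(seed)
    return rng.sample(list(rows), min(n_rows, len(rows)))

def sample_percentage(rows: Sequence, percent: float, seed: int = 0) -> List:
    """Static sampling by percentage: scan the same share of every data set."""
    if not rows:
        return []
    rng = random.Random(seed)
    k = max(1, round(len(rows) * percent / 100))
    return rng.sample(list(rows), k)

# Toy stand-ins for two very different data sets.
hr_rows = [f"hr-{i}" for i in range(1_000)]
log_rows = [f"log-{i}" for i in range(100_000)]

# The same configuration is applied no matter what the data set contains.
for name, rows in [("HR", hr_rows), ("Logs", log_rows)]:
    print(name, len(sample_fixed_rows(rows, 500)), len(sample_percentage(rows, 1.0)))
```

Neither helper looks at earlier scan results, so a data set that is dense with sensitive values is sampled exactly like one that has none.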
  6. Approaches to overcome the limitations

    • Context-aware sensitive data detection using machine learning/deep learning
    • Dynamic sampling based on the accuracy of the sensitive data scans
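
The slides do not describe the dynamic sampling algorithm, so the following is only one possible reading of "sampling based on accuracy": start with a small sample rate and grow it until the measured scan accuracy reaches a target. The doubling policy, the thresholds, and the `fake_scan_accuracy` stand-in are all assumptions.

```python
from typing import Callable

def dynamic_sample_rate(
    scan_accuracy: Callable[[float], float],
    start_rate: float = 0.01,
    target_accuracy: float = 0.95,
    max_rate: float = 1.0,
) -> float:
    """Grow the sample rate until the scan is accurate enough (hypothetical policy)."""
    rate = start_rate
    while rate < max_rate and scan_accuracy(rate) < target_accuracy:
        rate = min(max_rate, rate * 2)  # doubling is an arbitrary choice
    return rate

def fake_scan_accuracy(rate: float) -> float:
    """Stand-in for a real scan: pretend accuracy improves with the sample rate."""
    return min(1.0, 0.80 + 2.0 * rate)

print(dynamic_sample_rate(fake_scan_accuracy))
```

In a real system `scan_accuracy` would run the detector on a sample of the given size and compare it against labeled or previously verified results; that part is outside what the deck shows.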
  7. ML/DL Approach using Spark

    Pipeline stages (from the slide diagram): Load Sensitive Data Set, Data Pre-processing, Features Generation, Cross Validation, Hyper Parameters Tuning, Sensitivity Classifiers Evaluation & Selection, Accuracy based Sampling
    • Classifiers: Logistic Regression, Decision Tree, Support Vector Machine, Random Forest
    • Word Embedding (Context Vectors): word2vec, GloVe, fastText, BERT
    • Library: MLlib
  8. Sensitive Data Detection using Ray

    Ray Functions and Features used
    • ray.init()
    • ray.shutdown()
    • @ray.remote
    • num_cpus
    • num_returns
    • .remote()
    • ray.get()
    • Ray Actors
    • Ray Dashboard
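
A minimal sketch of how several of these calls fit together is shown below. It is not the speaker's code: `scan_chunk` is a hypothetical placeholder detector, and the chunking of columns is an assumption. `ray.remote(num_cpus=1)(scan_chunk)` is the function-call form of the `@ray.remote(num_cpus=1)` decorator; `num_returns`, actors, and the dashboard are not exercised here.

```python
from typing import Dict, List, Optional, Tuple

Column = Tuple[str, str]  # (table, column name)

def scan_chunk(chunk: List[Column]) -> Dict[str, Optional[str]]:
    """Placeholder detector: flags columns whose name contains 'first_name'."""
    return {
        f"{table}.{column}": ("F_NAME" if "first_name" in column.lower() else None)
        for table, column in chunk
    }

def scan_with_ray(chunks: List[List[Column]]) -> Dict[str, Optional[str]]:
    """Scan every chunk as a separate Ray task and merge the results."""
    import ray  # imported lazily so scan_chunk stays usable without Ray installed

    ray.init()  # start a local Ray runtime
    try:
        # Equivalent to decorating scan_chunk with @ray.remote(num_cpus=1).
        remote_scan = ray.remote(num_cpus=1)(scan_chunk)
        # .remote() schedules a task and returns an object reference immediately.
        futures = [remote_scan.remote(chunk) for chunk in chunks]
        merged: Dict[str, Optional[str]] = {}
        for partial in ray.get(futures):  # blocks until every task has finished
            merged.update(partial)
        return merged
    finally:
        ray.shutdown()

chunks = [
    [("Employee", "First_Name"), ("Product", "Product_Name")],
    [("Telemetry", "Metric_Name"), ("Country", "Country_Name")],
]
```

With Ray installed, `scan_with_ray(chunks)` runs one task per chunk in parallel, each reserving one CPU, and returns a single dictionary keyed by `table.column`.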
  9. Sensitive Data Detection using Ray

    Data Loading and Feature Generation

    Table      Column Name   Data Type  Data Length  Nullable  Sensitive (Yes/No)  Type
    Employee   First_Name    CHAR       100          No        Yes                 F_NAME
    Product    Product_Name  CHAR       255          No        No                  F_NAME
    Telemetry  Metric_Name              255          No        No                  F_NAME
    Country    Country_Name  CHAR       150          No        No                  F_NAME
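
One plausible way to turn each metadata row into model input is to flatten it into a list of tokens, matching the `[ Employee, FirstName, CHAR, 100, No]` form shown on the following slide. The `ColumnMetadata` class, the `to_tokens` helper, and the `UNKNOWN` placeholder for a missing data type are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ColumnMetadata:
    table: str
    column_name: str
    data_type: Optional[str]  # None when no data type is reported
    data_length: int
    nullable: bool

def to_tokens(meta: ColumnMetadata) -> List[str]:
    """Flatten one metadata row into a token list for an embedding model."""
    return [
        meta.table,
        meta.column_name.replace("_", ""),  # First_Name -> FirstName
        meta.data_type or "UNKNOWN",        # hypothetical placeholder token
        str(meta.data_length),
        "Yes" if meta.nullable else "No",
    ]

rows = [
    ColumnMetadata("Employee", "First_Name", "CHAR", 100, False),
    ColumnMetadata("Telemetry", "Metric_Name", None, 255, False),
]
for row in rows:
    print(to_tokens(row))
```

The sensitivity label and type are left out of the tokens on purpose: they are the targets a classifier would predict, not input features.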
  10. Sensitive Data Detection using Ray

    Word-embeddings using Gensim

    Input_vec: [ Employee, FirstName, CHAR, 100, No]
    vec_features: [-0.0041883737, -0.001030758, -0.00023175575, …..
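
The slide does not say which Gensim model was used or how token vectors were combined into `vec_features`. The sketch below assumes Word2Vec (Gensim 4.x, where the dimension parameter is `vector_size`) and a simple component-wise average per row; both choices, and the helper names, are assumptions.

```python
from typing import List, Sequence

def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average token vectors component-wise into one fixed-length feature vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def embed_rows(token_rows: List[List[str]], vector_size: int = 16) -> List[List[float]]:
    """Train a tiny Word2Vec model on the token rows and return one vector per row."""
    from gensim.models import Word2Vec  # lazy import; requires gensim >= 4.0

    # min_count=1 keeps every token, since this toy corpus is tiny.
    model = Word2Vec(sentences=token_rows, vector_size=vector_size,
                     window=5, min_count=1, workers=1, seed=42)
    return [
        mean_vector([model.wv[token].tolist() for token in tokens])
        for tokens in token_rows
    ]

token_rows = [
    ["Employee", "FirstName", "CHAR", "100", "No"],
    ["Product", "ProductName", "CHAR", "255", "No"],
]
```

With Gensim installed, `embed_rows(token_rows)` returns one 16-dimensional vector per metadata row, which could then be fed to a classifier. A corpus this small will not produce meaningful embeddings; it only shows the shape of the data.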
  11. Future journey of Ray

    • Benchmark Spark ML vs Ray with Spark
    • Ray on a cluster with multiple CPUs and GPUs
    • Benchmark Ray Serve vs TensorFlow Serving
    • Ray Cluster & Ray Tune