PII Analyzer

Savio Abuga
November 02, 2015

Analyzing PII (Personally Identifiable Information) in datasets

Transcript

  1. HDX: The Humanitarian Data Exchange (HDX) is an open platform for sharing data. The goal of HDX is to make humanitarian data easy to find and use for analysis.
  2. Problem: HDX wants a tool to determine whether new datasets uploaded to HDX contain any personally identifiable information, that is, data that can be used on its own or with other information to identify, contact, or locate a single person, or to identify an individual in context.
  3. PII Data: Personally identifiable information includes:
    • Full name (if not common)
    • Home address
    • Email address
    • National identification number and passport number
    • IP address (in some cases)
    • Credit card numbers
  4. My Approach: PII can be divided into two groups:
    • Data with a pattern: emails, phone numbers, street addresses, IP addresses
    • Data without a pattern: names of people
  5. ...my approach: Create a Python package to analyze the data, since HDX uses Python. It uses two methods:
    • Regular expressions, for PII with a pattern
    • Machine learning, for PII with no particular pattern
  6. ...my approach: Using the following packages:
    • Regular expressions: CommonRegex
    • Machine learning: Stanford Named Entity Recognizer and the Python Natural Language Toolkit (NLTK)
  7. Python Package: Install it from GitHub (https://github.com/savioabuga/piianalyzer). Use it:

    >>> from piianalyzer.analyzer import PiiAnalyzer
    >>> filepath = '/path/or/url/to/your/file.csv'
    >>> piianalyzer = PiiAnalyzer(filepath)
    >>> analysis = piianalyzer.analysis()
  8. Findings & Improvements:
    • Perfect PII classification is not possible
    • Using Stanford NER together with a locations database
    • Adding a score to datasets after analysis
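
The pattern-based half of the approach and the idea of scoring a dataset can be illustrated with a minimal sketch. It uses only Python's standard `re` module with deliberately simplified patterns, as a stand-in for CommonRegex rather than a reproduction of it. The names `PATTERNS`, `find_patterned_pii`, and `score_rows` are hypothetical, and the score (the fraction of rows with at least one match) is an assumption, since the slides do not define how a score would be computed. Names of people would still need the NER step and are not covered here.

```python
import re

# Simplified illustrative patterns (assumption): CommonRegex ships more
# thorough expressions; these only catch the most common shapes.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "phone": re.compile(r"\+?\d[\d\s()-]{7,}\d"),
}


def find_patterned_pii(value):
    """Return the sorted names of the patterns that match a cell value."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(value))


def score_rows(rows):
    """Hypothetical score: fraction of rows with at least one pattern match."""
    if not rows:
        return 0.0
    flagged = sum(
        1 for row in rows if any(find_patterned_pii(cell) for cell in row)
    )
    return flagged / len(rows)


if __name__ == "__main__":
    rows = [
        ["Nairobi", "jane@example.org"],
        ["Kisumu", "n/a"],
        ["Mombasa", "10.0.0.1"],
        ["Nakuru", "none"],
    ]
    for row in rows:
        print(row, [find_patterned_pii(cell) for cell in row])
    print("score:", score_rows(rows))
```

A score like this only reflects patterned PII. Because perfect classification is not possible, it is better read as a prompt for manual review than as a verdict on the dataset.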