Valerio Maggio
November 15, 2017

BIForum 2017 - Keystroke Analysis for Fraud Detection

User identification is a fundamental yet still open problem in fraud detection.
Traditional approaches rely on user account information or browsing history.
However, such information can pose security and privacy risks, and it is not robust because it
can easily change, e.g., when the user switches to a new device or uses a different application.
Monitoring biometric information, including a user’s typing behaviour, tends to produce
consistent results over time while being less disruptive to the user’s experience.

In this talk I will present the Machine Learning pipeline I set up to prevent fraud in user
authentication. I will discuss the challenges of processing and filtering real data from users
accessing bank accounts from web and mobile devices, along
with the deep neural networks adopted to learn to detect impostors.
During the talk, I will present the Python tools (e.g. `pandas`) and data formats
(i.e. `hdf5` and `json`) I used to collect and store data, as well as those used to
configure the machine learning process (i.e. `scipy.cluster`, `sklearn` and `keras`).

The talk is meant for data scientists, as well as for practitioners with no specific background in
machine learning or deep learning. Basic knowledge of `pandas` and other `numpy`-based
scientific libraries is assumed.
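
To make "typing behaviour" concrete: the slides in the transcript name four timings (Down-Down, Dwell, Flight and Up-Up). The following is a minimal sketch of how they could be derived from raw key events. The `KeyEvent` record, its field names, the millisecond unit and the sample values are assumptions made for illustration, not the data format used in the talk.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class KeyEvent:
    """Hypothetical record for one keystroke (timestamps in milliseconds)."""
    key: str
    down: float  # time the key was pressed
    up: float    # time the key was released


def timing_features(events: List[KeyEvent]) -> Dict[str, List[float]]:
    """Compute the four keystroke timings for events ordered by press time."""
    pairs = list(zip(events, events[1:]))  # consecutive keystrokes
    return {
        # press -> release of the same key
        "dwell": [e.up - e.down for e in events],
        # press -> press of the next key
        "down_down": [b.down - a.down for a, b in pairs],
        # release -> press of the next key
        "flight": [b.down - a.up for a, b in pairs],
        # release -> release of the next key
        "up_up": [b.up - a.up for a, b in pairs],
    }


# Made-up sample: three keystrokes typed in sequence.
events = [
    KeyEvent("p", down=0.0, up=90.0),
    KeyEvent("a", down=150.0, up=230.0),
    KeyEvent("s", down=260.0, up=350.0),
]
features = timing_features(events)
print(features)
```

Note that in this sketch the flight time becomes negative whenever the next key is pressed before the previous one is released, which is common in fast typing.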

Transcript

  1. Keystroke Behavioural Analysis For Fraud Detection
     Valerio Maggio (@leriomaggio), Data Scientist and Researcher, Fondazione Bruno Kessler (FBK), Trento, Italy
  2. Keystroke Dynamics
     Keystroke dynamics consists of analysing the way a user types by monitoring keyboard inputs thousands of times per second and processing this data through an algorithm, which then defines a pattern for future comparison.
     Identifying an individual based on their way of typing on a physical or virtual keyboard.
  3. Keystroke Dynamic Analysis
     • Down-Down Time: time between two key presses
     • Dwell Time: time between one press and one release
     • Flight Time: time between one release and one press
     • Up-Up Time: time between two key releases
  4. Data Pipeline: (1) Data Collection
     • Down-Down Time: time between two key presses
     • Dwell Time: time between one press and one release
     • Flight Time: time between one release and one press
     • Up-Up Time: time between two key releases
  5. Data Pipeline: (2) Feature Extraction
     • Time between two key presses
     • Time between one press and one release
     • Time between one release and one press
     • Time between two key releases
     • Time-shifting key presses if deletions happen
     • Only data leading to a successful login
  6. Data Analysis Protocol (DAP)
     • Reduce the Selection Bias!!
     • 80% / 20%
     • Use separately for HyperParams Search
     • Don’t Mix
  7. Deep Keystroke Learning
     • Deep AutoEncoder (Encoder, Decoder, …)
     • Classification Deep Network
     • One AutoEncoder + FC Network
     • Outlier Detector (per user)
  8. Deep Keystroke Learning: User Identification
     • Deep AutoEncoder (Encoder, Decoder, …)
     • Classification Deep Network
     • Classification Confusion Matrix
     • Avg. Accuracy Score: 0.999090
     • One AutoEncoder + FC Network
     • Outlier Detector (per user)
     • Avg. FPR: 0.002246
  9. Conclusions and Take Aways
     • Data Processing and Cleaning is never painless
     • 80% of the time for Data Science Processing
     • 20% is for Machine/Deep Learning Code
     • 90% of which is looking for Optimum HyperParameters (esp. for Deep Learning)
     • Use Unsupervised Approaches to get useful insights on the data
     • Feature Scaling is paramount
     • Beware of the Selection Bias (Multiple Time K-Fold CV)
     • DL is not a silver bullet
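
The slides give only keywords for the data analysis protocol (an 80%/20% split, "Don't Mix", "Multiple Time K-Fold CV", feature scaling). One possible reading is sketched below: hold out 20% of the samples, run repeated K-fold cross-validation on the remaining 80% only, and fit the scaler on each training fold alone. The helper names, the seeds, `k=5`, `repeats=3` and the synthetic dwell times are all assumptions for illustration. The sketch uses only the standard library; with `sklearn`, `RepeatedKFold` and `StandardScaler` would cover the same steps.

```python
import random
from statistics import mean, pstdev
from typing import Iterator, List, Sequence, Tuple


def holdout_split(n: int, test_frac: float, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle indices 0..n-1 and split them into (development, held-out)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_frac))
    return idx[n_test:], idx[:n_test]


def repeated_kfold(indices: Sequence[int], k: int, repeats: int,
                   seed: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists; reshuffle before each repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = list(indices)
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i, val in enumerate(folds):
            train = [j for m, fold in enumerate(folds) if m != i for j in fold]
            yield train, val


def fit_scaler(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, std) of the training values; std falls back to 1.0 if zero."""
    return mean(values), (pstdev(values) or 1.0)


def scale(values: Sequence[float], mu: float, sigma: float) -> List[float]:
    """Standardise values with statistics learned elsewhere (the training fold)."""
    return [(v - mu) / sigma for v in values]


# Synthetic dwell times (made up) standing in for one real feature column.
dwell = [80.0 + (i % 7) * 5.0 for i in range(100)]

# 80% for development (model and hyper-parameter search), 20% never touched.
dev, held_out = holdout_split(len(dwell), test_frac=0.2, seed=0)

splits = list(repeated_kfold(dev, k=5, repeats=3, seed=1))
for train, val in splits:
    mu, sigma = fit_scaler([dwell[i] for i in train])  # training fold only
    train_scaled = scale([dwell[i] for i in train], mu, sigma)
    val_scaled = scale([dwell[i] for i in val], mu, sigma)
    # ... fit and evaluate a candidate model on train_scaled / val_scaled ...
```

Fitting the scaler inside each fold, rather than once on all the data, keeps validation statistics from leaking into training, which is one common source of the selection bias the slides warn about.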