Your Smartphone Knows What You're Doing

C. Todd Lombardo

Outline
• Problem statement
• Performance summary
• Process
  ◦ Exploratory data analysis
  ◦ Model selection
  ◦ Feature selection
  ◦ Model tuning
  ◦ Error analysis
  ◦ Tried and failed
• Next steps

Problem Statement

Hypothesis: By examining accelerometer and gyroscope sensor data from a smartphone, a model can classify which activity a person performed.

Goals: Accurately predict a human movement from accelerometer and gyroscope smartphone data, both the data provided in the data repo and data captured by a mobile app.

Risks and limitations: Feature engineering may need to go beyond the scope of the course content. One example is the possible exploration of principal component analysis (PCA) and other feature reduction techniques.

Performance Summary

Results: Classify by logistic regression
• Logistic regression accuracy = 0.98640 on the test data set
• With PCA, the 561 features could be reduced to 120 principal components
• Sitting and Standing were the most difficult to tell apart: 20 false positives between them

Process

Target variable: Activity
The model will need to accurately predict one of these movements:
• Dynamic: Walking, Walking_upstairs, Walking_downstairs
• Static: Sitting, Standing, Laying

Exploratory Data Analysis

About the dataset
Human Activity Recognition w/Smartphone (Sources: Kaggle, UC Irvine)
1. Inertial sensor data
   a. Raw triaxial signals from the accelerometer and gyroscope for all the trials with participants
   b. The labels of all the performed activities
2. Records of activity windows, each composed of:
   a. A 561-feature vector with time and frequency domain variables
   b. Its associated activity label
   c. An identifier of the subject who carried out the experiment

The experiments were carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING-UPSTAIRS, WALKING-DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, 3-axial linear acceleration and 3-axial angular velocity were captured at a constant rate of 50 Hz. The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% for the test data.
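The 70/30 partition above is by volunteer, not by row, so no subject's windows appear on both sides. A minimal sketch of that kind of split, assuming scikit-learn's GroupShuffleSplit and entirely synthetic stand-in data (the array names and sizes here are hypothetical, and the published dataset already ships with its own split):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 30 subjects, 20 activity windows each, 5 features.
n_subjects, windows_per_subject, n_features = 30, 20, 5
subject = np.repeat(np.arange(1, n_subjects + 1), windows_per_subject)
X = rng.normal(size=(subject.size, n_features))
y = rng.integers(0, 6, size=subject.size)  # six activity labels

# Hold out 30% of the *subjects* (not 30% of the rows).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subject))

train_subjects = set(subject[train_idx].tolist())
test_subjects = set(subject[test_idx].tolist())
print(len(train_subjects), "training subjects,", len(test_subjects), "test subjects")
```

Splitting by subject gives a more honest estimate of how a model behaves on a person it has never seen, which is the situation when running it on a new phone.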

About the dataset
Features will be selected and/or engineered from:

Time-series signals
▸ tBodyAcc-XYZ
▸ tGravityAcc-XYZ
▸ tBodyAccJerk-XYZ
▸ tBodyGyro-XYZ
▸ tBodyGyroJerk-XYZ
▸ tBodyAccMag
▸ tGravityAccMag
▸ tBodyAccJerkMag
▸ tBodyGyroMag
▸ tBodyGyroJerkMag

Fourier-transformed signals
▸ fBodyAcc-XYZ
▸ fBodyAccJerk-XYZ
▸ fBodyGyro-XYZ
▸ fBodyAccMag
▸ fBodyAccJerkMag
▸ fBodyGyroMag
▸ fBodyGyroJerkMag

Features are normalized and bounded within [-1, 1]. Each feature vector is a row in the text file. Some feature derivations are included in the dataset. The units used for the accelerations (total and body) are g (standard gravity, 9.80665 m/s²). The gyroscope units are rad/s.

About the dataset Count of the Features

Data: Basic statistics
    hua.describe()

So about those correlations...

Data: By activity
    hua.groupby('Activity').count()

Which correlations matter?

Feature_1                        Feature_2                        Correlation  Abs_Corr
tBodyAccJerk-energy()-X          fBodyAccJerk-energy()-X          0.999999     0.999999
fBodyAccJerk-energy()-X          tBodyAccJerk-energy()-X          0.999999     0.999999
fBodyAcc-bandsEnergy()-1,24      fBodyAcc-energy()-X              0.999878     0.999878
fBodyAcc-energy()-X              fBodyAcc-bandsEnergy()-1,24      0.999878     0.999878
fBodyGyro-energy()-X             fBodyGyro-bandsEnergy()-1,24     0.999767     0.999767
fBodyGyro-bandsEnergy()-1,24     fBodyGyro-energy()-X             0.999767     0.999767
fBodyAcc-bandsEnergy()-1,24.1    fBodyAcc-energy()-Y              0.999661     0.999661
fBodyAcc-energy()-Y              fBodyAcc-bandsEnergy()-1,24.1    0.999661     0.999661
tBodyAccJerkMag-mean()           tBodyAccJerk-sma()               0.999656     0.999656
tBodyAccJerk-sma()               tBodyAccJerkMag-mean()           0.999656     0.999656

    # .corr() returns a square matrix; the reshape to one row per pair is assumed (the slide omits it)
    corr_values = hua[feature_cols].corr().stack().rename_axis(['Feature_1', 'Feature_2']).rename('Correlation').reset_index()
    corr_values['Abs_Corr'] = corr_values.Correlation.abs()
    corr_values[(corr_values.Correlation < 1.0) & (corr_values.Correlation > 0.9)]

What do the activities look like for one subject?

What do the activities look like for all subjects?

What does one activity look like for different subjects?

Can we separate static and dynamic activities?
Yes:
• If tBodyAccMag < -0.5, it’s a static activity
• If tBodyAccMag > -0.5, it’s a dynamic activity
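The threshold rule above fits in a few lines. This is a minimal sketch: the function name is hypothetical, and the slide does not say what happens at exactly -0.5, so treating that boundary as dynamic is an assumption.

```python
def motion_type(t_body_acc_mag: float, threshold: float = -0.5) -> str:
    """Label a window as 'static' or 'dynamic' from its tBodyAccMag value.

    Values below the threshold are static; everything else is dynamic.
    The value exactly at the threshold is treated as dynamic (an assumption).
    """
    return "static" if t_body_acc_mag < threshold else "dynamic"


# Hypothetical example values, within the dataset's [-1, 1] feature range.
for value in (-0.95, -0.6, -0.2, 0.4):
    print(value, motion_type(value))
```

A rule like this could act as a cheap first stage, leaving a classifier to distinguish the three activities within each group.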

Is it possible to separate specific activities?

Is it possible to separate all the activities? (t-SNE)

Model Selection

Model Selection: “Naive” (no feature selection, no tuning)

Algorithm                 Train Cross-Validation Score   Test Accuracy Score
DecisionTreeClassifier    0.840870                       0.852392
RandomForestClassifier    0.917985                       0.922973
KNeighborsClassifier      0.897175                       0.900238
LogisticRegression        0.933495                       0.957923

Naive running means fitting models with no feature selection, tuning, or optimization. Yes, I ran all 561 features. And yes, it took a while. The task calls for a classification algorithm, so I ran four.
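The naive comparison above can be sketched as a loop over untuned classifiers. This is an illustration under assumptions, not the deck's notebook: the data is synthetic (make_classification), the 5-fold cross-validation and 70/30 split are choices made here, and max_iter is raised only so logistic regression converges.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 561-feature, six-class activity data.
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=10,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    # Mean cross-validation accuracy on the training set only.
    cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()
    # Refit on the full training set, then score on the held-out test set.
    test_score = model.fit(X_train, y_train).score(X_test, y_test)
    results[name] = (cv_score, test_score)
    print(f"{name:<24} cv={cv_score:.3f} test={test_score:.3f}")
```

The scores printed for synthetic data say nothing about the real dataset; the point is only the shape of the comparison.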

Feature Selection

Principal Component Analysis: Visualization of PC1 & PC2

PCA – Project Back to the Features

Map back to actual features? Which are the most predictive signals?

Signal                                  Value
angle(tBodyGyroJerkMean,gravityMean)    0.479798
tBodyAccJerk-mean()-X                   0.201355
tGravityAcc-energy()-Y                  0.115365
fBodyAcc-kurtosis()-Z                   0.098975
fBodyAcc-skewness()-Z                   0.098076
tGravityAcc-correlation()-X,Y           0.085849
angle(tBodyGyroMean,gravityMean)        0.075399
angle(Z,gravityMean)                    0.074471
tGravityAcc-min()-Y                     0.059661
tGravityAcc-mean()-Y                    0.055262

    pd.Series(pc1, index=hua_pca.columns).sort_values(ascending=False)
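The slide's one-liner relies on pc1 and hua_pca, which are not defined in the transcript. One plausible reading, sketched here on synthetic data, is that pc1 is the first row of a fitted PCA's components_ array and hua_pca is the feature DataFrame the PCA was fitted on. The feature names below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical feature frame standing in for hua_pca.
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]
hua_pca = pd.DataFrame(rng.normal(size=(200, 4)), columns=feature_names)
hua_pca["feat_a"] *= 5.0  # give one feature most of the variance

pca = PCA(n_components=2).fit(hua_pca)

# Each row of components_ holds one principal component's weights
# (loadings) on the original features.
pc1 = pca.components_[0]
loadings = pd.Series(pc1, index=hua_pca.columns).sort_values(ascending=False)
print(loadings)
```

Because the sign of a principal component is arbitrary, sorting by absolute value (loadings.abs().sort_values(ascending=False)) may give a fairer ranking than sorting the signed values.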

This matches up to the corr heatmap...

Model Tuning

How many PCA features are needed to increase accuracy?
Plot of the accuracy scores as the number of PCA variables in the model increases
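A sweep like the one plotted on this slide can be sketched by refitting a PCA plus logistic regression pipeline for several component counts and recording test accuracy. The data here is synthetic and the component counts are arbitrary choices for illustration; the deck's own sweep ran on the 561-feature data and settled on 120 components.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the activity data.
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=12,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

scores = {}
for n_components in (2, 5, 10, 20, 40):
    # The pipeline fits PCA on the training data only, so the test set
    # does not leak into the projection.
    model = make_pipeline(
        PCA(n_components=n_components),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    scores[n_components] = model.score(X_test, y_test)

for n_components, accuracy in scores.items():
    print(f"{n_components:>3} components: accuracy {accuracy:.3f}")
```

Plotting scores against the component count shows where accuracy levels off, which is one way to pick the number of components to keep.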

Which are the most important PCA features?

coef         variable   abscoef
6.015424     PC3        6.015424
3.503071     PC22       3.503071
2.330216     PC48       2.330216
1.976726     PC32       1.976726
-1.844982    PC26       1.844982
1.678715     PC23       1.678715
1.291059     PC45       1.291059
1.269009     PC13       1.269009
-1.153241    PC6        1.153241
1.136382     PC24       1.136382

    coefs_vars.sort_values('abscoef', ascending=False, inplace=True)
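The transcript does not show how coefs_vars was built. A rough sketch of one way to get a table with the same three columns, assuming the coefficients come from a logistic regression fitted on the principal components: for a multiclass model, coef_ has one row per class, so this sketch picks a single class (class_index is an assumption, as is the synthetic data).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=8,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

n_components = 10
X_pca = PCA(n_components=n_components).fit_transform(X)
logreg = LogisticRegression(max_iter=1000).fit(X_pca, y)

# coef_ has shape (n_classes, n_components); inspect one class at a time.
class_index = 0
coefs_vars = pd.DataFrame({
    "coef": logreg.coef_[class_index],
    "variable": [f"PC{i + 1}" for i in range(n_components)],
})
coefs_vars["abscoef"] = coefs_vars["coef"].abs()
coefs_vars.sort_values("abscoef", ascending=False, inplace=True)
print(coefs_vars.head())
```

Coefficient size is only comparable across components because they share a common scale here; it is a rough ranking, not a formal importance measure.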

Error Analysis

Confusion matrix: 120 PCA features, with 20 false positives
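For readers who have not built one, a confusion matrix like the one on this slide can be produced with scikit-learn's confusion_matrix. The labels and predictions below are a tiny hand-made example, not the deck's results.

```python
from sklearn.metrics import confusion_matrix

labels = ["SITTING", "STANDING", "LAYING"]

# Hand-made toy data: one SITTING window predicted as STANDING and
# one STANDING window predicted as SITTING.
y_true = ["SITTING"] * 4 + ["STANDING"] * 4 + ["LAYING"] * 2
y_pred = [
    "SITTING", "SITTING", "SITTING", "STANDING",
    "STANDING", "STANDING", "SITTING", "STANDING",
    "LAYING", "LAYING",
]

# Rows are true labels, columns are predicted labels, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Every off-diagonal cell is a misclassification.
errors = int(cm.sum() - cm.trace())
print("misclassified windows:", errors)
```

Reading across a row shows where a given true activity was sent; in the deck's matrix the Sitting and Standing rows are where the errors concentrate.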

Tried and Failed

These didn’t work so well
1. Scaling the data with StandardScaler: accuracy dropped to 0.911 (the dataset documentation already states “features are normalized”)
2. Further feature reduction of PCA or signal features
   a. It was a challenge to find specific signals; the combination of signals seems far more useful
   b. Even when using the top 20 principal components
3. Ridge: struggled to get this to work properly

Next Steps

With more time
1. Horn's parallel analysis
2. Optimize Lasso and Ridge: would a penalty improve the model?
3. Dig further into t-SNE for feature selection and reduction
4. Run an experiment with my own phone
   a. Perform the six activities with a sensor recorder app
   b. Process the raw data, run the model, and score it
5. Try this with data from a smartwatch!

References & credits
https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones
http://www.cis.fordham.edu/wisdm/public_files/sensorKDD-2010.pdf
https://upcommons.upc.edu/bitstream/handle/2117/101769/IWAAL2012.pdf
https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2013-11.pdf
https://github.com/anas337/Human-Activity-Recognition-Using-Smartphones
https://github.com/markdregan/K-Nearest-Neighbors-with-Dynamic-Time-Warping
http://www.cis.fordham.edu/wisdm/dataset.php
With code lovingly borrowed from many GA Jupyter and Kaggle notebooks
