C. Todd Lombardo
May 06, 2020

# Your Smartphone Knows What You're Doing

A data science project that used a Kaggle dataset to build a logistic regression model that predicted, with 98% accuracy, which of six activities a group of test participants performed.

I'm no data scientist, nor do I play one on YouTube, but this was a great project for extending my knowledge of how data from gyroscopes and accelerometers can be used in the real world.


## Transcript

2. ### OUTLINE

• Problem statement
• Performance summary
• Process
  ◦ Exploratory data analysis
  ◦ Model selection
  ◦ Feature selection
  ◦ Model tuning
  ◦ Error analysis
  ◦ Tried and failed
• Next steps

4. ### Problem statement

Hypothesis: By examining accelerometer and gyroscope sensor data from a smartphone, a model can classify which activity a person performed.

Goals: Accurately predict a human movement from accelerometer and gyroscope smartphone data, both provided in the data repo and captured by a mobile app.

Risks and limitations: Feature engineering may need to go beyond the scope of the course content. One possibility is exploring principal component analysis and other feature reduction techniques.

6. ### Results: Classify by logistic regression

• Logistic Regression Accuracy = 0.98640 on the test data set
• With PCA, the 561 features could be reduced to 120 principal components
• Sitting and Standing were the most difficult to discern: 20 false positives among them
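The Sitting versus Standing confusion is the kind of thing a confusion matrix makes visible. A minimal sketch follows, assuming scikit-learn is available; the labels and predictions are invented for illustration and are not the project's results.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented example labels (NOT the project's data): three static activities.
labels = ["SITTING", "STANDING", "LAYING"]
y_true = ["SITTING"] * 5 + ["STANDING"] * 5 + ["LAYING"] * 3
y_pred = (
    ["SITTING"] * 4 + ["STANDING"] * 1      # one sitting window called standing
    + ["STANDING"] * 3 + ["SITTING"] * 2    # two standing windows called sitting
    + ["LAYING"] * 3                        # laying is never confused
)

# Rows are the true activity, columns are the predicted activity.
cm = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=labels),
    index=labels,
    columns=labels,
)
accuracy = accuracy_score(y_true, y_pred)

print(cm)
print(f"accuracy = {accuracy:.3f}")
```

Off-diagonal cells show which pairs of activities get mixed up; in this toy data all errors sit in the SITTING/STANDING cells.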

8. ### Target variable: Activity

The model will need to accurately predict one of these movements:

• Dynamic: Walking, Walking_upstairs, Walking_downstairs
• Static: Sitting, Standing, Laying

10. ### About the dataset

Human Activity Recognition w/Smartphone (Sources: Kaggle, UC Irvine)

1. Inertial sensor data
   a. Raw triaxial signals from the accelerometer & gyroscope of all the trials with participants
   b. The labels of all the performed activities
2. Records of activity windows. Each one is composed of:
   a. A 561-feature vector with time and frequency domain variables
   b. Its associated activity label
   c. An identifier of the subject who carried out the experiment

The experiments were carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING-UPSTAIRS, WALKING-DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% the test data.
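Because the split is made by volunteer rather than by row, no subject appears in both sets. A minimal sketch of that kind of split, assuming scikit-learn's `GroupShuffleSplit`; the subject IDs and features below are synthetic stand-ins, and this is not necessarily how the dataset authors produced their partition.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-in: 30 subjects with 20 activity windows each.
subjects = np.repeat(np.arange(1, 31), 20)
rng = np.random.default_rng(0)
X = rng.normal(size=(subjects.size, 5))      # placeholder feature vectors
y = rng.integers(0, 6, size=subjects.size)   # placeholder activity labels

# Hold out roughly 30% of the *subjects*, not 30% of the rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subjects))

train_subjects = set(subjects[train_idx].tolist())
test_subjects = set(subjects[test_idx].tolist())
print("train subjects:", len(train_subjects))
print("test subjects:", len(test_subjects))
```

Splitting by subject keeps the test score honest: the model is evaluated on people it has never seen.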
11. ### About the dataset

Features will be selected and/or engineered from:

Time-series signals
▸ tBodyAcc-XYZ
▸ tGravityAcc-XYZ
▸ tBodyAccJerk-XYZ
▸ tBodyGyro-XYZ
▸ tBodyGyroJerk-XYZ
▸ tBodyAccMag
▸ tGravityAccMag
▸ tBodyAccJerkMag
▸ tBodyGyroMag
▸ tBodyGyroJerkMag

Fourier Transformed Signals
▸ fBodyAcc-XYZ
▸ fBodyAccJerk-XYZ
▸ fBodyGyro-XYZ
▸ fBodyAccMag
▸ fBodyAccJerkMag
▸ fBodyGyroMag
▸ fBodyGyroJerkMag

Features are normalized and bounded within [-1,1]. Each feature vector is a row in the text file. Some feature derivations are included in the dataset. The units used for the accelerations (total and body) are 'g's (gravity of earth -> 9.80665 m/s²). The gyroscope units are rad/s.

16. ### Which correlations matter?

| Feature_1 | Feature_2 | Correlation | Abs_Corr |
| --- | --- | --- | --- |
| tBodyAccJerk-energy()-X | fBodyAccJerk-energy()-X | 0.999999 | 0.999999 |
| fBodyAccJerk-energy()-X | tBodyAccJerk-energy()-X | 0.999999 | 0.999999 |
| fBodyAcc-bandsEnergy()-1,24 | fBodyAcc-energy()-X | 0.999878 | 0.999878 |
| fBodyAcc-energy()-X | fBodyAcc-bandsEnergy()-1,24 | 0.999878 | 0.999878 |
| fBodyGyro-energy()-X | fBodyGyro-bandsEnergy()-1,24 | 0.999767 | 0.999767 |
| fBodyGyro-bandsEnergy()-1,24 | fBodyGyro-energy()-X | 0.999767 | 0.999767 |
| fBodyAcc-bandsEnergy()-1,24.1 | fBodyAcc-energy()-Y | 0.999661 | 0.999661 |
| fBodyAcc-energy()-Y | fBodyAcc-bandsEnergy()-1,24.1 | 0.999661 | 0.999661 |
| tBodyAccJerkMag-mean() | tBodyAccJerk-sma() | 0.999656 | 0.999656 |
| tBodyAccJerk-sma() | tBodyAccJerkMag-mean() | 0.999656 | 0.999656 |

`corr_values = hua[feature_cols].corr()`
`corr_values[(corr_values.Correlation < 1.0) & (corr_values.Correlation > 0.9)]`
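`DataFrame.corr()` returns a square matrix, so a long-form table with a `Correlation` column needs a reshaping step that the slide does not show. Here is one possible way to do it, sketched on synthetic data; the helper name `top_correlated_pairs` and the demo columns are made up for illustration. Unlike the slide's table, it keeps each pair only once by using the upper triangle.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df, lower=0.9, upper=1.0):
    """Return feature pairs whose absolute correlation lies in (lower, upper)."""
    corr = df.corr()
    # Keep only the upper triangle (k=1) to drop self-pairs and mirrored rows.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna().reset_index()
    pairs.columns = ["Feature_1", "Feature_2", "Correlation"]
    pairs["Abs_Corr"] = pairs["Correlation"].abs()
    keep = (pairs["Abs_Corr"] > lower) & (pairs["Abs_Corr"] < upper)
    return (
        pairs[keep]
        .sort_values("Abs_Corr", ascending=False)
        .reset_index(drop=True)
    )


# Synthetic demo: "a_noisy" is nearly a copy of "a"; "b" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_noisy": a + rng.normal(scale=0.05, size=200),
    "b": rng.normal(size=200),
})
pairs = top_correlated_pairs(demo)
print(pairs)
```

Pairs this close to 1.0 carry almost the same information, which is one reason a reduction technique like PCA can shrink the feature set so much.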

20. ### Can we separate static and dynamic activities?

Yes:

• If tBodyAccMag < -0.5, then it’s a static activity
• If tBodyAccMag > -0.5, then it’s a dynamic activity
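The threshold rule above can be sketched in a few lines. The column name `tBodyAccMag` and the sample values are illustrative assumptions (the actual dataset column may carry a suffix), and the slide does not say how a value of exactly -0.5 should be treated; this sketch calls it dynamic.

```python
import numpy as np
import pandas as pd

THRESHOLD = -0.5  # cut-off taken from the slide


def label_motion(acc_mag, threshold=THRESHOLD):
    """Label each value 'static' if below the threshold, else 'dynamic'."""
    values = np.asarray(acc_mag, dtype=float)
    return np.where(values < threshold, "static", "dynamic")


# Invented sample values in the normalized [-1, 1] range.
demo = pd.DataFrame({"tBodyAccMag": [-0.98, -0.93, -0.12, 0.31]})
demo["motion"] = label_motion(demo["tBodyAccMag"])
print(demo)
```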

24. ### Model Selection: “Naive” (No feature selection, no tuning)

| Algorithm | Train Cross-Validation Score | Test Accuracy Score |
| --- | --- | --- |
| DecisionTreeClassifier | 0.840870 | 0.852392 |
| RandomForestClassifier | 0.917985 | 0.922973 |
| KNeighborsClassifier | 0.897175 | 0.900238 |
| LogisticRegression | 0.933495 | 0.957923 |

Naive running means fitting models with no feature selection, tuning, or optimization. Yes, I ran all 561 features. And yes, it took a while. The task calls for a classification algorithm, so I ran four.
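A "naive" comparison like this one can be sketched as a loop over untuned scikit-learn classifiers. The data here is synthetic (`make_classification`), so the scores will not match the table; the 5-fold cross-validation and the 70/30 split are assumptions for illustration, not settings confirmed by the slides.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic six-class problem standing in for the 561-feature dataset.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    # Mean cross-validated accuracy on the training split.
    cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()
    # Accuracy on the held-out test split after fitting on all training rows.
    test_score = model.fit(X_train, y_train).score(X_test, y_test)
    results[name] = (cv_score, test_score)
    print(f"{name:24s} cv={cv_score:.3f} test={test_score:.3f}")
```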

29. ### Map back to actual features?

Which are the most predictive signals?

| Signal | Value |
| --- | --- |
| angle(tBodyGyroJerkMean,gravityMean) | 0.479798 |
| tBodyAccJerk-mean()-X | 0.201355 |
| tGravityAcc-energy()-Y | 0.115365 |
| fBodyAcc-kurtosis()-Z | 0.098975 |
| fBodyAcc-skewness()-Z | 0.098076 |
| tGravityAcc-correlation()-X,Y | 0.085849 |
| angle(tBodyGyroMean,gravityMean) | 0.075399 |
| angle(Z,gravityMean) | 0.074471 |
| tGravityAcc-min()-Y | 0.059661 |
| tGravityAcc-mean()-Y | 0.055262 |

`pd.Series(pc1, index=hua_pca.columns).sort_values(ascending=False)`
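Mapping a principal component back to the original signals amounts to reading its loadings from the fitted PCA. A minimal sketch on synthetic data, assuming `pc1` on the slide is the first row of scikit-learn's `components_`; the column names are invented. It ranks by absolute loading, whereas the slide sorts by signed value.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic features; "sig_a" is scaled up so it dominates the variance.
rng = np.random.default_rng(0)
features = pd.DataFrame(
    rng.normal(size=(200, 4)),
    columns=["sig_a", "sig_b", "sig_c", "sig_d"],
)
features["sig_a"] *= 5

pca = PCA(n_components=2).fit(features)

# Each row of components_ holds one component's weights on the original features.
pc1 = pca.components_[0]
loadings = pd.Series(pc1, index=features.columns)

# Rank signals by the size of their contribution to PC1, ignoring sign.
order = loadings.abs().sort_values(ascending=False).index
ranked = loadings.reindex(order)
print(ranked)
```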

32. ### How many PCA features needed to increase accuracy?

Plotting all the accuracy scores by increasing the number of PCA variables in the model
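One way to produce the numbers behind such a plot is to refit a PCA plus logistic regression pipeline for each candidate component count and record the test accuracy. This is a sketch on synthetic data with arbitrarily chosen component counts, not the project's actual sweep.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic six-class data standing in for the real feature matrix.
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=10,
    n_classes=6, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

component_counts = [2, 5, 10, 20]  # illustrative choices
scores = {}
for n in component_counts:
    # PCA is fit on the training split only, inside the pipeline.
    model = make_pipeline(PCA(n_components=n), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    scores[n] = model.score(X_test, y_test)
    print(f"{n:3d} components -> accuracy {scores[n]:.3f}")
```

Plotting `scores` against the component count shows where adding more components stops paying off.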
33. ### Which are the most important PCA Features?

| coef | variable | abscoef |
| --- | --- | --- |
| 6.015424 | PC3 | 6.015424 |
| 3.503071 | PC22 | 3.503071 |
| 2.330216 | PC48 | 2.330216 |
| 1.976726 | PC32 | 1.976726 |
| -1.844982 | PC26 | 1.844982 |
| 1.678715 | PC23 | 1.678715 |
| 1.291059 | PC45 | 1.291059 |
| 1.269009 | PC13 | 1.269009 |
| -1.153241 | PC6 | 1.153241 |
| 1.136382 | PC24 | 1.136382 |

`coefs_vars.sort_values('abscoef', ascending=False, inplace=True)`
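A table like this can be built from the fitted model's coefficients. The sketch below uses a synthetic binary target so that `coef_` has a single row; with six activities scikit-learn returns one row of coefficients per class, and the slides do not say which row (or what aggregation) the table reflects. Only the column names `coef`, `variable`, and `abscoef` come from the slide.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (an assumption made to keep coef_ one-dimensional).
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=6, random_state=0
)

n_components = 5
X_pca = PCA(n_components=n_components).fit_transform(X)
pc_names = [f"PC{i + 1}" for i in range(n_components)]

clf = LogisticRegression(max_iter=1000).fit(X_pca, y)

# One coefficient per principal component, ranked by absolute size.
coefs_vars = pd.DataFrame({"coef": clf.coef_[0], "variable": pc_names})
coefs_vars["abscoef"] = coefs_vars["coef"].abs()
coefs_vars = coefs_vars.sort_values("abscoef", ascending=False).reset_index(drop=True)
print(coefs_vars)
```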

37. ### These didn’t work so well

1. Scaling the data with StandardScaler: Accuracy dropped to 0.911 (“features are normalized”)
2. Further feature reduction of PCA or Signal features
   a. It was a challenge to find specific signals; the combination of signals seems far more useful
   b. Even when using the top 20 principal components
3. Ridge: struggled to get this to work properly

39. ### With more time

1. Horn's parallel analysis
2. Optimize Lasso and Ridge: would a penalty improve the model?
3. Dig further into t-SNE for feature selection & reduction
4. Run experiment with my own phone
   a. Perform the six activities with a sensor recorder app
   b. Process the raw data, run the model and score it
5. Try this with data from a smartwatch!
40. ### References & credits

• https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones
• http://www.cis.fordham.edu/wisdm/public_files/sensorKDD-2010.pdf
• https://upcommons.upc.edu/bitstream/handle/2117/101769/IWAAL2012.pdf
• https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2013-11.pdf
• https://github.com/anas337/Human-Activity-Recognition-Using-Smartphones
• https://github.com/markdregan/K-Nearest-Neighbors-with-Dynamic-Time-Warping
• http://www.cis.fordham.edu/wisdm/dataset.php

With code lovingly borrowed from many GA Jupyter and Kaggle notebooks