A data science project that took a Kaggle dataset and built a logistic regression model that predicts, with 98% accuracy, which of six activities a test participant was performing.
I'm no data scientist, nor do I play one on YouTube, but this was a great project for extending my knowledge of how data from gyroscopes and accelerometers can be used in the real world.
YOUR SMARTPHONE KNOWS WHAT YOU’RE DOING
C Todd Lombardo
● Problem statement
● Performance summary
○ Exploratory data analysis
○ Model selection
○ Feature selection
○ Model tuning
○ Error analysis
○ Tried and failed
● Next steps
By examining accelerometer and gyroscope sensor data from a smartphone, a model can classify which activity was performed by a person.
Accurately predict a human movement from accelerometer and gyroscope smartphone data, both provided in the data repo and captured by a mobile app.
Risks and limitations
Feature engineering may need to go beyond the scope of the course content. One possibility is exploring principal component analysis and other feature-reduction techniques.
Results: Classify by logistic regression
● Accuracy = 0.98640 on the test data set
● With PCA, the 561 features could be reduced to 120
● Sitting and Standing were the most difficult to discern: 20 false positives
Target variable: Activity
The model will need to accurately predict one of these six movements: WALKING, WALKING-UPSTAIRS, WALKING-DOWNSTAIRS, SITTING, STANDING, LAYING
Exploratory Data Analysis
About the dataset
Human Activity Recognition with Smartphones (sources: Kaggle, UC Irvine)
1. Inertial sensor data
a. Raw triaxial signals from the accelerometer & gyroscope of all
the trials with participants
b. The labels of all the performed activities
2. Records of activity windows. Each one composed of:
a. A 561-feature vector with time and frequency domain variables.
b. Its associated activity label
c. An identifier of the subject who carried out the experiment.
The experiments were carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING-UPSTAIRS, WALKING-DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, 3-axial linear acceleration and 3-axial angular velocity were captured at a constant rate of 50 Hz. The obtained dataset was randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% for the test data.
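For reference, a minimal sketch of how the Kaggle splits could be loaded; the file names, the 'Activity'/'subject' column names, and the choice to call the training frame hua (the name used later in this deck) are assumptions, not the project's actual code.

import pandas as pd

# Minimal loading sketch; file and column names are assumptions based on the
# Kaggle "Human Activity Recognition with Smartphones" layout.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Everything except the label and subject id is a sensor feature (561 columns)
feature_cols = [c for c in train.columns if c not in ('Activity', 'subject')]
X_train, y_train = train[feature_cols], train['Activity']
X_test, y_test = test[feature_cols], test['Activity']

# 'hua' is the dataframe name used later in this deck; treating it as the
# training split is an assumption.
hua = train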
About the dataset
Features will be selected and/or engineered from:
Fourier Transformed Signals
Features are normalized and bounded within [-1,1].
Each feature vector is a row in the text file. Some feature derivations are included in the dataset.
The units used for the accelerations (total and body) are 'g' (standard gravity, 9.80665 m/s²).
The gyroscope units are rad/s.
About the dataset
Count of the Features
Data: Basic statistics
So about those correlations...
Data: By activity
Which correlations matter?
Feature_1 Feature_2 Correlation Abs_Corr
tBodyAccJerk-energy()-X fBodyAccJerk-energy()-X 0.999999 0.999999
fBodyAccJerk-energy()-X tBodyAccJerk-energy()-X 0.999999 0.999999
fBodyAcc-bandsEnergy()-1,24 fBodyAcc-energy()-X 0.999878 0.999878
fBodyAcc-energy()-X fBodyAcc-bandsEnergy()-1,24 0.999878 0.999878
fBodyGyro-energy()-X fBodyGyro-bandsEnergy()-1,24 0.999767 0.999767
fBodyGyro-bandsEnergy()-1,24 fBodyGyro-energy()-X 0.999767 0.999767
fBodyAcc-bandsEnergy()-1,24.1 fBodyAcc-energy()-Y 0.999661 0.999661
fBodyAcc-energy()-Y fBodyAcc-bandsEnergy()-1,24.1 0.999661 0.999661
tBodyAccJerkMag-mean() tBodyAccJerk-sma() 0.999656 0.999656
tBodyAccJerk-sma() tBodyAccJerkMag-mean() 0.999656 0.999656
corr_values = hua[feature_cols].corr().stack().reset_index(name='Correlation')
corr_values.columns = ['Feature_1', 'Feature_2', 'Correlation']
corr_values['Abs_Corr'] = corr_values['Correlation'].abs()
corr_values[(corr_values.Correlation < 1.0) & (corr_values.Correlation > 0.9)]
What do the activities look like for one subject?
What do the activities look like for all subjects?
What does one activity look like for different subjects?
Can we separate static and dynamic activities?
Yes → If tBodyAccMag < -0.5 then it’s a static activity
If tBodyAccMag > -0.5 then it’s a dynamic activity
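That rule can be sanity-checked directly against the labels; a minimal sketch, assuming the hua dataframe used above and a mean body-acceleration-magnitude column named 'tBodyAccMag-mean()' (the exact column name is an assumption):

# Check the static/dynamic threshold rule against the activity labels.
static = {'SITTING', 'STANDING', 'LAYING'}
is_static_pred = hua['tBodyAccMag-mean()'] < -0.5
is_static_true = hua['Activity'].isin(static)
print((is_static_pred == is_static_true).mean())  # fraction of windows the rule gets right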
Is it possible to separate specific activities?
Is it possible to separate all the activities? (t-SNE)
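The t-SNE projection might be reproduced along these lines; a sketch only, with the sample size and perplexity chosen arbitrarily rather than taken from the project:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# 2-D t-SNE embedding of the 561 features, colored by activity.
# Sample size and perplexity are assumptions, not the project's settings.
sample = hua.sample(2000, random_state=42)   # t-SNE is slow on the full set
emb = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(sample[feature_cols])
for activity in sample['Activity'].unique():
    mask = (sample['Activity'] == activity).to_numpy()
    plt.scatter(emb[mask, 0], emb[mask, 1], s=5, label=activity)
plt.legend()
plt.show()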
Model Selection: “Naive” (No feature selection, no tuning)
Algorithm                Train Cross-Validation Score    Test Accuracy Score
DecisionTreeClassifier   0.840870                        0.852392
RandomForestClassifier   0.917985                        0.922973
KNeighborsClassifier     0.897175                        0.900238
LogisticRegression       0.933495                        0.957923
Naive running means fitting models with no feature selection, tuning, or optimization. Yes, I ran all 561 features. And yes, it took a while.
The task requires a classification algorithm; four classifiers were compared.
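The naive comparison can be reproduced roughly as follows; a sketch assuming the train/test frames loaded earlier, default hyperparameters, and 5-fold cross-validation (the fold count is an assumption):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# "Naive" baseline: all 561 features, no tuning.
models = {
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=42),
    'RandomForestClassifier': RandomForestClassifier(random_state=42),
    'KNeighborsClassifier': KNeighborsClassifier(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    cv_acc = cross_val_score(model, X_train, y_train, cv=5).mean()
    test_acc = model.fit(X_train, y_train).score(X_test, y_test)
    print(f'{name}: cv={cv_acc:.6f} test={test_acc:.6f}')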
Principal Component Analysis: Visualization of PC1 & PC2
PCA – Project Back to the Features
Map back to actual features?
Which are the most important features?
This matches up with the correlation heatmap...
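Projecting components back onto the original features can be done through the PCA loadings; a minimal sketch, with the number of components (120) taken from the results slide and everything else assumed:

import pandas as pd
from sklearn.decomposition import PCA

# Fit PCA and inspect which original features load most heavily on PC1.
pca = PCA(n_components=120).fit(X_train)
loadings = pd.DataFrame(pca.components_,
                        index=[f'PC{i+1}' for i in range(pca.n_components_)],
                        columns=feature_cols)
print(loadings.loc['PC1'].abs().sort_values(ascending=False).head(10))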
How many PCA features are needed to increase accuracy?
Plotting accuracy scores as the number of PCA variables in the model increases
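One way to produce that plot; a sketch that sweeps an assumed grid of component counts through a PCA + logistic regression pipeline:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Test accuracy as a function of the number of principal components;
# the grid of component counts is an assumption.
scores = {}
for k in range(10, 201, 10):
    pipe = make_pipeline(PCA(n_components=k), LogisticRegression(max_iter=1000))
    scores[k] = pipe.fit(X_train, y_train).score(X_test, y_test)
plt.plot(list(scores), list(scores.values()))
plt.xlabel('Number of PCA components')
plt.ylabel('Test accuracy')
plt.show()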
Which are the most important PCA Features?
coef variable abscoef
6.015424 PC3 6.015424
3.503071 PC22 3.503071
2.330216 PC48 2.330216
1.976726 PC32 1.976726
-1.844982 PC26 1.844982
1.678715 PC23 1.678715
1.291059 PC45 1.291059
1.269009 PC13 1.269009
-1.153241 PC6 1.153241
1.136382 PC24 1.136382
# Rank the principal components by absolute coefficient size
coefs_vars.sort_values('abscoef', ascending=False, inplace=True)
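For context, one way the coefs_vars table could have been assembled upstream of that sort; with six activity classes logreg.coef_ has one row per class, so collapsing to a single coefficient per component (largest magnitude across classes) is purely an assumption about how the table above was built, and logreg is an assumed name:

import numpy as np
import pandas as pd

# logreg: a LogisticRegression fitted on the PCA-transformed features (assumed name).
# Hypothetical reconstruction of coefs_vars: pair each principal component
# with its largest-magnitude coefficient across classes, sign preserved.
coef = logreg.coef_                                   # shape (n_classes, n_components)
strongest = coef[np.abs(coef).argmax(axis=0), np.arange(coef.shape[1])]
coefs_vars = pd.DataFrame({'coef': strongest,
                           'variable': [f'PC{i+1}' for i in range(coef.shape[1])],
                           'abscoef': np.abs(strongest)})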
Which activities are the most difficult to discern?
With 120 PCA features, there are 20 false positives (Sitting vs. Standing)
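The false-positive count comes from the confusion matrix; a sketch of how it might be inspected, assuming the same 120-component PCA + logistic regression pipeline:

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.pipeline import make_pipeline

# Where do Sitting and Standing get confused? Pipeline settings are assumptions.
pipe = make_pipeline(PCA(n_components=120), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
labels = sorted(y_test.unique())
cm = confusion_matrix(y_test, pipe.predict(X_test), labels=labels)
ConfusionMatrixDisplay(cm, display_labels=labels).plot()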
Tried and Failed
These didn’t work so well
1. Scaling the data with StandardScaler: accuracy dropped to 0.911
(the features are already normalized to [-1, 1])
2. Further feature reduction of PCA or signal features
a. It was a challenge to find specific signals; the combination of signals
seems far more useful
b. Even when using the top 20 principal components
3. Ridge: struggled to get this to work properly (a sketch of one possible
approach follows below)
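For item 3, a minimal sketch of what a working ridge (L2-penalized) logistic regression might look like, with the C grid chosen arbitrarily; this is not the approach that was actually attempted:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Tune the strength of an L2 ("ridge") penalty; the C grid is an assumption.
grid = GridSearchCV(LogisticRegression(penalty='l2', max_iter=1000),
                    param_grid={'C': [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))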
With more time
1. Horn's parallel analysis
2. Optimize Lasso and Ridge; would a penalty improve the model?
3. Dig further into t-SNE for feature selection & reduction
4. Run experiment with my own phone
a. Perform the six activities with a sensor recorder app
b. Process the raw data, run the model and score it
5. Try this with data from a smartwatch!
References & credits
With code lovingly borrowed from many GA Jupyter and Kaggle notebooks