
Your Smartphone Knows What You're Doing

A data science project that took a Kaggle dataset and built a logistic regression model that predicts, with 98% accuracy, which of six activities a group of test participants performed.

I'm no data scientist, nor do I play one on YouTube, but this was a great project for extending my knowledge of how data from gyroscopes and accelerometers can be used in the real world.

C. Todd Lombardo

May 06, 2020

Transcript

  1. YOUR SMARTPHONE KNOWS
    WHAT YOU’RE DOING
    C Todd Lombardo


  2. OUTLINE
    ● Problem statement
    ● Performance summary
    ● Process
    ○ Exploratory data analysis
    ○ Model selection
    ○ Feature selection
    ○ Model tuning
    ○ Error analysis
    ○ Tried and failed
    ● Next steps


  3. Problem Statement


  4. Problem statement
    Hypothesis
    By examining accelerometer and gyroscope sensor data from a smartphone, a
    model can classify which activity was performed by a person
    Goals
    Accurately predict a human movement from accelerometer and gyroscope
    smartphone data, both provided in the data repo and captured by a mobile app.
    Risks and limitations
    Feature engineering may need to go beyond the scope of the course content. One
    is the possible exploration of a principal component analysis and other
    feature reduction techniques.


  5. Performance Summary


  6. Results: Classify by logistic regression
    Logistic Regression
    Accuracy = 0.98640 on the
    test data set
    With PCA, the 561 features
    could be reduced to 120
    principal components
    Sitting and Standing were
    the most difficult to
    discern: 20 false positives
    among them
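The reported pipeline (PCA down to 120 components, then logistic regression) can be sketched as follows. This is a minimal sketch: synthetic data stands in for the real 561-feature train/test split, so the accuracy computed here is meaningless.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
# Synthetic stand-in for the 561-feature table: 600 windows, 6 activity labels.
X = rng.uniform(-1.0, 1.0, size=(600, 561))
y = rng.integers(0, 6, size=600)

# Reduce 561 features to 120 principal components, then classify.
model = make_pipeline(PCA(n_components=120), LogisticRegression(max_iter=1000))
model.fit(X, y)
accuracy = model.score(X, y)
```

On the real dataset, `X` and `y` come from the Kaggle train split and the score is evaluated on the held-out test split.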


  7. Process


  8. Target variable: Activity
    The model will need to accurately predict one
    of these movements:
    Walking (dynamic)
    Walking_upstairs (dynamic)
    Walking_downstairs (dynamic)
    Sitting (static)
    Standing (static)
    Laying (static)


  9. Exploratory Data Analysis


  10. About the dataset
    Human Activity Recognition w/Smartphone (Sources Kaggle, UC Irvine)
    1. Inertial sensor data
    a. Raw triaxial signals from the accelerometer & gyroscope of all
    the trials with participants
    b. The labels of all the performed activities
    2. Records of activity windows. Each one composed of:
    a. A 561-feature vector with time and frequency domain variables.
    b. Its associated activity label
    c. An identifier of the subject who carried out the experiment.
    The experiments were carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING,
    WALKING-UPSTAIRS, WALKING-DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer
    and gyroscope, 3-axial linear acceleration and 3-axial angular velocity were captured at a constant rate of 50 Hz. The obtained dataset was randomly
    partitioned into two sets: 70% of the volunteers were selected for generating the training data, and 30% the test data.


  11. About the dataset
    Features will be selected and/or engineered from:
    Time-series signals
    ▸ tBodyAcc-XYZ
    ▸ tGravityAcc-XYZ
    ▸ tBodyAccJerk-XYZ
    ▸ tBodyGyro-XYZ
    ▸ tBodyGyroJerk-XYZ
    ▸ tBodyAccMag
    ▸ tGravityAccMag
    ▸ tBodyAccJerkMag
    ▸ tBodyGyroMag
    ▸ tBodyGyroJerkMag
    Fourier Transformed Signals
    ▸ fBodyAcc-XYZ
    ▸ fBodyAccJerk-XYZ
    ▸ fBodyGyro-XYZ
    ▸ fBodyAccMag
    ▸ fBodyAccJerkMag
    ▸ fBodyGyroMag
    ▸ fBodyGyroJerkMag
    Features are normalized and bounded within [-1, 1].
    Each feature vector is a row in the text file. Some feature derivations are included in the dataset.
    The units for the accelerations (total and body) are 'g' (Earth gravity, 9.80665 m/s²).
    The gyroscope units are rad/s.


  12. About the dataset
    Count of the Features


  13. Data: Basic statistics
    hua.describe()


  14. So about those correlations...


  15. Data: By activity
    hua.groupby('Activity').count()


  16. Which correlations matter?
    Feature_1 Feature_2 Correlation Abs_Corr
    tBodyAccJerk-energy()-X fBodyAccJerk-energy()-X 0.999999 0.999999
    fBodyAccJerk-energy()-X tBodyAccJerk-energy()-X 0.999999 0.999999
    fBodyAcc-bandsEnergy()-1,24 fBodyAcc-energy()-X 0.999878 0.999878
    fBodyAcc-energy()-X fBodyAcc-bandsEnergy()-1,24 0.999878 0.999878
    fBodyGyro-energy()-X fBodyGyro-bandsEnergy()-1,24 0.999767 0.999767
    fBodyGyro-bandsEnergy()-1,24 fBodyGyro-energy()-X 0.999767 0.999767
    fBodyAcc-bandsEnergy()-1,24.1 fBodyAcc-energy()-Y 0.999661 0.999661
    fBodyAcc-energy()-Y fBodyAcc-bandsEnergy()-1,24.1 0.999661 0.999661
    tBodyAccJerkMag-mean() tBodyAccJerk-sma() 0.999656 0.999656
    tBodyAccJerk-sma() tBodyAccJerkMag-mean() 0.999656 0.999656
    corr_values = hua[feature_cols].corr().stack().reset_index()
    corr_values.columns = ['Feature_1', 'Feature_2', 'Correlation']
    corr_values['Abs_Corr'] = corr_values.Correlation.abs()
    corr_values[(corr_values.Abs_Corr < 1.0) & (corr_values.Abs_Corr > 0.9)]


  17. What do the activities look like for one subject?


  18. What do the activities look like for all subjects?


  19. What does one activity look like for different subjects?


  20. Can we separate static and dynamic activities?
    Yes: if tBodyAccMag < -0.5 it’s a static activity;
    if tBodyAccMag > -0.5 it’s a dynamic activity
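The split rule above is a one-line threshold test; a minimal sketch, assuming `tBodyAccMag` holds the normalized body-acceleration magnitude in [-1, 1] (the sample values below are made up):

```python
import numpy as np
import pandas as pd

# Normalized magnitudes: static activities cluster near -1, dynamic ones sit higher.
hua = pd.DataFrame({"tBodyAccMag": [-0.95, -0.90, -0.2, 0.1, -0.99, 0.3]})

# Threshold at -0.5, per the rule on the slide.
hua["kind"] = np.where(hua["tBodyAccMag"] < -0.5, "static", "dynamic")
```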


  21. Is it possible to separate specific activities?


  22. Is it possible to separate all the activities? (t-SNE)
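A t-SNE projection like the one on this slide can be produced with scikit-learn. This sketch runs on synthetic data; the perplexity value is an assumption, not necessarily what was used here.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 561))  # stand-in for the feature table

# Project the high-dimensional activity windows down to 2-D for plotting.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
```

On the real data, the 2-D embedding is scatter-plotted with points colored by activity label.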


  23. Model Selection


  24. Model Selection: “Naive” (No feature selection, no tuning)
    Algorithm Train Cross-Validation Score Test Accuracy Score
    DecisionTreeClassifier 0.840870 0.852392
    RandomForestClassifier 0.917985 0.922973
    KNeighborsClassifier 0.897175 0.900238
    LogisticRegression 0.933495 0.957923
    “Naive” here means fitting models with no feature selection, tuning, or optimization. Yes, I ran all 561 features. And yes, it took a while.
    The task requires a classification algorithm; four were compared.
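The four-way comparison can be scripted as a loop; a sketch on synthetic data (so the scores it produces are meaningless), comparing the same four classifiers:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 30))  # stand-in features
y = rng.integers(0, 6, size=200)            # six activity labels

models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}
# Mean 5-fold cross-validation accuracy for each untuned model.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
```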


  25. Feature Selection


  26. Principal Component Analysis: Visualization of PC1 & PC2


  27. PCA


  28. PCA – Project Back to the Features


  29. Map back to actual features?
    Signal Value
    angle(tBodyGyroJerkMean,gravityMean) 0.479798
    tBodyAccJerk-mean()-X 0.201355
    tGravityAcc-energy()-Y 0.115365
    fBodyAcc-kurtosis()-Z 0.098975
    fBodyAcc-skewness()-Z 0.098076
    tGravityAcc-correlation()-X,Y 0.085849
    angle(tBodyGyroMean,gravityMean) 0.075399
    angle(Z,gravityMean) 0.074471
    tGravityAcc-min()-Y 0.059661
    tGravityAcc-mean()-Y 0.055262
    pd.Series(pc1, index=hua_pca.columns).sort_values(ascending=False)
    Which are the most
    predictive signals?
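The `pc1` vector above comes from the PCA loadings, which scikit-learn exposes as `components_` (row 0 is PC1). A sketch with hypothetical column names standing in for the real signal names:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
cols = [f"signal_{i}" for i in range(10)]  # hypothetical feature names
hua_pca = pd.DataFrame(rng.uniform(-1.0, 1.0, size=(50, 10)), columns=cols)

pca = PCA(n_components=5).fit(hua_pca)
# Row 0 of components_ holds PC1's loading on each original feature.
pc1 = pca.components_[0]
loadings = pd.Series(pc1, index=hua_pca.columns).sort_values(ascending=False)
```

Sorting the loadings by magnitude ranks the original signals by how strongly they contribute to the first principal component.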


  30. This matches up to the corr heatmap...


  31. Model Tuning


  32. How many PCA features needed to increase accuracy?
    Accuracy scores plotted as the
    number of principal components
    in the model increases
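A curve like this can be generated by refitting over a grid of component counts; a sketch on synthetic data, with a small grid standing in for the full sweep up to 561:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X_train = rng.uniform(-1.0, 1.0, size=(200, 50))
y_train = rng.integers(0, 6, size=200)
X_test = rng.uniform(-1.0, 1.0, size=(80, 50))
y_test = rng.integers(0, 6, size=80)

# Refit the PCA + logistic regression pipeline for each component count.
accuracies = {}
for n in [5, 10, 20, 40]:
    model = make_pipeline(PCA(n_components=n), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    accuracies[n] = model.score(X_test, y_test)
```

Plotting `accuracies` (component count on the x-axis, test accuracy on the y-axis) yields the curve shown on the slide.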


  33. Which are the most important PCA Features?
    coef variable abscoef
    6.015424 PC3 6.015424
    3.503071 PC22 3.503071
    2.330216 PC48 2.330216
    1.976726 PC32 1.976726
    -1.844982 PC26 1.844982
    1.678715 PC23 1.678715
    1.291059 PC45 1.291059
    1.269009 PC13 1.269009
    -1.153241 PC6 1.153241
    1.136382 PC24 1.136382
    coefs_vars.sort_values('abscoef', ascending=False, inplace=True)


  34. Error Analysis


  35. Confusion Matrix
    120 PCA features
    With 20 false positives
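The matrix itself comes from comparing predicted labels to true labels; with scikit-learn (a sketch with made-up labels, where the Sitting/Standing confusion mirrors the slide):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true vs. predicted activity labels.
y_true = ["SITTING", "STANDING", "SITTING", "WALKING", "STANDING", "LAYING"]
y_pred = ["SITTING", "SITTING", "SITTING", "WALKING", "STANDING", "LAYING"]

# Rows are true labels, columns are predicted labels, in this fixed order.
labels = ["LAYING", "SITTING", "STANDING", "WALKING"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
# The off-diagonal cell (STANDING row, SITTING column) is the one misclassification.
```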


  36. Tried and Failed


  37. These didn’t work so well
    1. Scaling the data with StandardScaler: accuracy dropped to 0.911
    (the features are already normalized)
    2. Further feature reduction of the PCA or signal features
    a. It was a challenge to isolate specific signals; the
    combination of signals seems far more useful
    b. This held even when using the top 20 principal components
    3. Ridge: struggled to get this to work properly


  38. Next Steps


  39. With more time
    1. Horn's parallel analysis
    2. Optimize Lasso and Ridge: would a penalty improve the model?
    3. Dig further into t-SNE for feature selection & reduction
    4. Run experiment with my own phone
    a. Perform the six activities with a sensor recorder app
    b. Process the raw data, run the model and score it
    5. Try this with data from a smartwatch!



  41. References & credits
    https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones
    http://www.cis.fordham.edu/wisdm/public_files/sensorKDD-2010.pdf
    https://upcommons.upc.edu/bitstream/handle/2117/101769/IWAAL2012.pdf
    https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2013-11.pdf
    https://github.com/anas337/Human-Activity-Recognition-Using-Smartphones
    https://github.com/markdregan/K-Nearest-Neighbors-with-Dynamic-Time-Warping
    http://www.cis.fordham.edu/wisdm/dataset.php
    With code lovingly borrowed from many GA Jupyter and Kaggle notebooks
