Regression Analysis: The good, the bad and untold

Jalem Raj Rohit
September 01, 2017

Presented at PyData Delhi 2017

Transcript

  1. Regression Analysis: The good,
    the bad and untold

  2. INTRODUCTION
    I am Jalem Raj Rohit.
    I work on DevOps and Machine Learning full-time.
    I contribute to Julia, Python and Go libraries as
    volunteer work, and moderate the DevOps
    and Data Science sites of Stack Exchange.

  3. REGRESSION ANALYSIS
    ● The Good
    ● The Bad
    ● The Untold

  4. MOTIVATION
    ● Understanding and appreciating the beauty behind the simplicity of
    Regression Analysis
    ● How one can leverage the age-old practice of regression analysis in the
    Deep Learning world
    ● Doing data-centric data science rather than model-centric data science

  5. WHY NOT MODEL-CENTRIC?
    ● People change, and so does their data. Models become obsolete every day
    ● Advanced models require expensive resources
    ● A model might become obsolete, but data-driven domain knowledge
    doesn’t

  6. What exactly is Regression Analysis?
    ● Regression analysis is a statistical process for estimating the relationships among
    variables.
    ● It helps one understand how the typical value of the dependent variable changes
    when any one of the independent variables is varied, while the other independent
    variables are held fixed.
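
The definition above can be made concrete with a minimal sketch of ordinary least squares for a single predictor. The data (`hours`, `score`) and the helper name `fit_line` are made up for illustration and are not from the talk. The code uses plain Python so it runs without third-party packages.

```python
# Ordinary least squares for one predictor, in plain Python.
# The data are invented purely to illustrate the idea.

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariation of x and y, and variation of x, around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

hours = [1, 2, 3, 4, 5]        # independent variable (hypothetical)
score = [52, 55, 61, 64, 68]   # dependent variable (hypothetical)

intercept, slope = fit_line(hours, score)
print(f"score ~ {intercept:.2f} + {slope:.2f} * hours")
```

The slope is the estimated change in the typical value of the dependent variable for a one-unit change in the independent variable.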

  7. IT IS A CONCEPT
    IT IS NOT AN ALGORITHM

  8. BACK IN THE DAYS...
    ● Regression analysis was the first step for almost every data science problem
    ● Understand the relationship between variables or sets of variables
    ● Thus data-centric

  9. THESE DAYS...

  10. But but why?

  11. But but the data is already linearly seper….

  12. DEEP LEARNING
    ● GROUND-BREAKING ACCURACIES
    ● CAN MODEL ALMOST ANY COMPLEX DATASET

  13. SHORTCOMINGS OF DEEP LEARNING
    ● BLACK BOXES. (Cannot be used in high-risk domains)
    ● COMPUTATIONALLY EXPENSIVE

  14. BLACK BOXES
    ● NO ONE KNOWS WHAT THE DEEP LEARNING NETWORK DOES UNDER
    THE HOOD
    ● NOT FEASIBLE FOR HIGH-RISK DOMAINS LIKE FINANCE AND
    HEALTHCARE
    ● NOT EASY TO TWEAK THE MODELS

  15. REGRESSION ANALYSIS: THE GOOD
    ● EXPLAINABILITY
    ● HIGHER CONTROL ON THE VARIABLES
    ● HIGH FLEXIBILITY AND VARIETY

  16. EXPLAINABILITY
    ● The summary of a regression analysis is straightforward and easily interpretable
    ● It broadly contains 5 statistics

  17. REGRESSION ANALYSIS
    SUMMARY

  18. EXPLAINABILITY (GOOD AND UNTOLD)
    ● It all starts with a null hypothesis, typically that a variable has no effect
    (its coefficient is zero)
    ● P-value: The lower the value, the greater the statistical significance of the variable.
    It is the probability of seeing an effect at least as extreme as the one observed if the
    null hypothesis were true, i.e. if the variable were not relevant in explaining the
    dependent variable
    ● T-value: The larger the absolute t-value, the higher the confidence in rejecting the null
    hypothesis. It is the difference between the estimated value and the null value,
    measured in units of the estimate's standard error
    ● F-value: The higher the F-value, the farther the fitted model is from what the null
    hypothesis predicts. It is like a scaled-up version of the t-value because it
    can test more than one variable at a time
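
As a rough illustration of the t-value and F-value, the sketch below computes both by hand for a one-predictor regression, testing the null hypothesis that the slope is zero. The data and the function name `slope_t_and_f` are hypothetical. With a single predictor the F-value equals the square of the t-value, which is one way to see the F-test as the multi-variable generalisation of the t-test.

```python
import math

def slope_t_and_f(xs, ys):
    """Return (t_value, f_value) for H0: slope = 0 in y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    fitted = [intercept + slope * x for x in xs]
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # unexplained variation
    ssr = sum((f - mean_y) ** 2 for f in fitted)          # explained variation

    mse = sse / (n - 2)                  # residual variance, n - 2 degrees of freedom
    se_slope = math.sqrt(mse / sxx)      # standard error of the slope estimate
    t_value = (slope - 0.0) / se_slope   # distance from the null value in standard errors
    f_value = (ssr / 1) / mse            # one predictor, so 1 numerator degree of freedom
    return t_value, f_value

hours = [1, 2, 3, 4, 5]        # hypothetical data
score = [52, 55, 61, 64, 68]
t_value, f_value = slope_t_and_f(hours, score)
print(f"t = {t_value:.2f}, F = {f_value:.2f}, t^2 = {t_value ** 2:.2f}")
```

Turning the t-value or F-value into a p-value requires the cumulative distribution of the t or F distribution, which the Python standard library does not provide; a statistics package would normally be used for that step.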

  19. EXPLAINABILITY (contd.)
    ● R-squared: Goodness of fit. It is the fraction of the variance in the dependent
    variable that the fitted line explains
    ● Significance codes
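
A minimal sketch of both items, under stated assumptions: R-squared is computed as one minus the ratio of residual variation to total variation, and the significance codes follow the legend that R prints under a regression summary (`***`, `**`, `*`, `.`). The data and function names are hypothetical, and the handling of p-values that fall exactly on a threshold is a simplification.

```python
def r_squared(xs, ys):
    """R-squared of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))  # residual
    sst = sum((y - mean_y) ** 2 for y in ys)                               # total
    return 1 - sse / sst

def significance_code(p_value):
    """Map a p-value to the star notation used in R-style regression summaries."""
    for cutoff, code in [(0.001, "***"), (0.01, "**"), (0.05, "*"), (0.1, ".")]:
        if p_value < cutoff:
            return code
    return " "

hours = [1, 2, 3, 4, 5]        # hypothetical data
score = [52, 55, 61, 64, 68]
r2 = r_squared(hours, score)
print(f"R-squared = {r2:.4f}")
print(repr(significance_code(0.03)))
```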

  20. HIGHER CONTROL
    ● Each variable can be tweaked and tested while keeping the others constant
    ● Helps understand variable importance and the relationships between variables
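
The idea of varying one variable while keeping the others constant can be sketched with a two-predictor regression. Everything here is an assumption for illustration: the data are noise-free and generated from y = 10 + 2*x1 - 3*x2, and the helpers `solve`, `fit` and `predict` are hypothetical names. The fit uses the normal equations in plain Python.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(rows, ys):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefs, row):
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))

# Hypothetical, noise-free data generated from y = 10 + 2*x1 - 3*x2.
rows = [(1, 1), (2, 1), (3, 2), (4, 3), (5, 5), (6, 4)]
ys = [10 + 2 * x1 - 3 * x2 for x1, x2 in rows]

coefs = fit(rows, ys)
# Raise x1 by one unit while holding x2 fixed at 2.
effect = predict(coefs, (4, 2)) - predict(coefs, (3, 2))
print([round(c, 6) for c in coefs], round(effect, 6))
```

The change in the prediction equals the coefficient of `x1`, which is what makes each coefficient readable as the effect of one variable with the others held constant.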

  21. HIGH FLEXIBILITY AND VARIETY
    ● Many flavours of regression analysis are available, including polynomial regression
    and spline-based regression
    ● One can also model nonlinear relationships with these more advanced methods
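
Polynomial regression can be sketched as ordinary least squares on powers of the predictor. The example below is an illustration under assumptions: the data are noise-free and generated from y = 1 + 0.5*x^2, and `solve` and `fit_poly` are hypothetical helper names. A straight-line fit finds no trend in these data, while a quadratic fit recovers the curve.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_poly(xs, ys, degree):
    """Least-squares coefficients [c0, c1, ..., c_degree] of a polynomial in x."""
    design = [[float(x) ** p for p in range(degree + 1)] for x in xs]
    k = degree + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical, noise-free data generated from y = 1 + 0.5 * x**2.
xs = [-2, -1, 0, 1, 2]
ys = [1 + 0.5 * x ** 2 for x in xs]

line = fit_poly(xs, ys, degree=1)   # straight line: misses the curvature
quad = fit_poly(xs, ys, degree=2)   # quadratic: captures it
print("line:", [round(c, 6) for c in line])
print("quad:", [round(c, 6) for c in quad])
```

The model stays linear in its coefficients, so the same summary statistics remain available even though the fitted relationship is nonlinear in x.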

  22. THE BAD
    ● Very sensitive to outliers
    ● Cannot model extremely complex relationships
    ● The data needs to be studied properly before choosing a regression method. This is an
    art that takes practice to get good at
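
The sensitivity to outliers is easy to demonstrate for a least-squares fit. In this sketch the data are invented: five points lie exactly on y = 2x, and a single extreme point is then added. The helper name `ols_slope` is hypothetical.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Five points that lie exactly on y = 2x (hypothetical data).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]
clean_slope = ols_slope(xs, ys)

# Add one extreme observation and refit.
outlier_slope = ols_slope(xs + [6], ys + [40])

print(f"slope without outlier: {clean_slope:.2f}")
print(f"slope with outlier:    {outlier_slope:.2f}")
```

One added point out of six changes the estimated slope several-fold, because squared errors give large residuals a disproportionate weight.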

  23. DEEP LEARNING + REGRESSION ANALYSIS
    ● BETTER UNDERSTANDING OF THE DATA
    ● ENABLES DATA-CENTRIC DATA SCIENCE
    ● HELPS IN BETTER ARCHITECTURE CREATION AND SELECTION OF
    ACTIVATION FUNCTIONS

  24. TAKEAWAYS
    ● REGRESSION ANALYSIS IS NOT A REPLACEMENT FOR PREDICTION
    ALGORITHMS
    ● IT IS A CRITICAL MISSING STEP IN THE MODERN-DAY DATA SCIENCE
    PIPELINE
    ● MAKING DATA-CENTRIC DATA SCIENCE GREAT AGAIN
    ● BECOMING BETTER DATA SCIENTISTS/ML ENGINEERS

  25. THANK YOU
    ● Github: Dawny33
    ● Home: jrajrohit.me
