
Must-Have Projects for your Data Science Portfolio

Jovian
September 08, 2022



Transcript

  1. What is a portfolio project?
     • Large & unique real-world dataset
     • Demonstrates mastery over skills
     • Good code quality & documentation
     • Published online & linked in resume
  2. Why build portfolio projects?
     • Hands-on learning by doing
     • Help your resume stand out
     • Provide evidence of your skills
     • Acquire domain expertise quickly
  3. Where to find datasets?
     • Kaggle Datasets
     • Google Dataset Search
     • UCI ML Repository
     • World Bank
  4. How to build it?
     1. Pick a large real-world dataset
     2. Perform data preparation & cleaning
     3. Analyze & visualize processed data
     4. Ask questions & summarize insights
     5. Clean up, document & publish online
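The cleaning, analysis and question-answering steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the inline CSV, the column names and the helper functions are hypothetical stand-ins, and a real project on a large dataset would more likely use Pandas, as the example projects in this deck do.

```python
import csv
import io
import statistics
from collections import defaultdict

# Hypothetical sample standing in for a large real-world dataset (step 1).
RAW_CSV = """city,temperature_c
Pune,31.5
Pune,
Delhi,38.0
Delhi,36.0
delhi ,40.0
"""

def load_and_clean(text):
    """Step 2: parse rows, normalize labels, drop rows with missing values."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        city = row["city"].strip().title()
        value = row["temperature_c"].strip()
        if not value:  # skip rows with a missing measurement
            continue
        rows.append({"city": city, "temperature_c": float(value)})
    return rows

def mean_by_city(rows):
    """Steps 3-4: aggregate to answer 'which city is hottest on average?'"""
    groups = defaultdict(list)
    for row in rows:
        groups[row["city"]].append(row["temperature_c"])
    return {city: statistics.mean(vals) for city, vals in groups.items()}

rows = load_and_clean(RAW_CSV)
summary = mean_by_city(rows)
hottest = max(summary, key=summary.get)
print(summary, hottest)
```

Framing each aggregation as the answer to one explicit question keeps the published write-up focused on insights, not just on tables.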
  5. Music Listening Analysis
     • 73k+ rows, 19 columns
     • Analyzed with NumPy & Pandas; visualized with Matplotlib, Seaborn, Folium & Plotly
     • Rohan
  6. Weather & Renewable Energy Analysis
     • 5.63 million rows, 20+ columns
     • Analyzed with NumPy, Pandas & Seaborn
     • Prasanthi
  7. How to build it?
     1. Pick a SQL / CSV / Excel dataset
     2. Perform cleaning, analysis, pivots, etc.
     3. Import into Tableau / PowerBI / Looker
     4. Create interactive graphs to show insights
     5. Create and publish a dashboard online
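The pivot step above can be illustrated before any BI tool is involved. The sketch below, using only the standard library, sums a value by two categorical fields and writes the result as CSV text, a format that dashboard tools generally import. The records, field names and helper functions are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sales records; a real project would read these from SQL/CSV/Excel.
records = [
    {"region": "North", "quarter": "Q1", "sales": 120},
    {"region": "North", "quarter": "Q2", "sales": 150},
    {"region": "South", "quarter": "Q1", "sales": 90},
    {"region": "South", "quarter": "Q1", "sales": 30},
]

def pivot_sum(rows, index, column, value):
    """Build {index_value: {column_value: summed value}} like a pivot table."""
    table = defaultdict(lambda: defaultdict(int))
    for row in rows:
        table[row[index]][row[column]] += row[value]
    return {key: dict(cols) for key, cols in table.items()}

def to_csv(table, index_name):
    """Serialize the pivot to CSV text; missing cells are written as 0."""
    columns = sorted({col for cols in table.values() for col in cols})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([index_name] + columns)
    for key in sorted(table):
        writer.writerow([key] + [table[key].get(col, 0) for col in columns])
    return buffer.getvalue()

pivot = pivot_sum(records, "region", "quarter", "sales")
csv_text = to_csv(pivot, "region")
print(csv_text)
```

Doing the aggregation up front keeps the dashboard itself simple: each chart then reads from one small, already-shaped table.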
  8. How to build it?
     1. Pick real-world dataset from Kaggle
     2. Formulate the ML problem statement
     3. Analyze & prepare data for modeling
     4. Create features, train models & iterate
     5. Summarize, document & publish online
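The modeling loop in the steps above usually starts with a held-out split and a trivial baseline, so that every later model has something to beat. The following is a minimal sketch under stated assumptions: the data is synthetic, the one-feature least-squares fit stands in for whatever model a real project would train, and all function names are illustrative.

```python
import random
import statistics

def train_test_split(pairs, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then hold out the last fraction for testing."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_line(pairs):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Synthetic data that follows y = 3x + 2 exactly (an assumption for the demo).
data = [(x, 3 * x + 2) for x in range(20)]
train, test = train_test_split(data)

slope, intercept = fit_line(train)
baseline = statistics.mean([y for _, y in train])  # always predict the mean

test_y = [y for _, y in test]
model_mae = mean_absolute_error(test_y, [slope * x + intercept for x, _ in test])
baseline_mae = mean_absolute_error(test_y, [baseline] * len(test))
print(model_mae, baseline_mae)
```

Reporting the baseline next to the final score makes the improvement from feature engineering and tuning visible in the published summary.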
  9. Walmart Store Sales Prediction
     • 537K+ rows, 14 columns sales data
     • Linear regression, Random Forest, GBMs
     • $1,352 WMAE, Random Forest
     • Feature engineering and hyperparameter tuning
     • Scored Top 14% on Kaggle leaderboard
     • Nicolas
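WMAE (weighted mean absolute error) averages absolute errors with a per-row weight. The sketch below assumes the convention of weighting holiday weeks five times as heavily as other weeks; the slide does not state the weights, so treat `holiday_weight` and the sample numbers as assumptions.

```python
def weighted_mean_absolute_error(actual, predicted, is_holiday, holiday_weight=5.0):
    """WMAE = sum(w_i * |y_i - y_hat_i|) / sum(w_i).

    Rows flagged as holiday weeks get `holiday_weight` (assumed 5), others 1.
    """
    weights = [holiday_weight if flag else 1.0 for flag in is_holiday]
    total = sum(w * abs(a - p) for w, a, p in zip(weights, actual, predicted))
    return total / sum(weights)

# Hypothetical weekly sales in dollars.
actual = [1000.0, 2000.0, 1500.0]
predicted = [1100.0, 1900.0, 1800.0]
is_holiday = [False, False, True]

wmae = weighted_mean_absolute_error(actual, predicted, is_holiday)
print(round(wmae, 2))
```

Because the holiday row carries most of the weight, its larger error dominates the score, which is why the weighted value is higher than the plain mean absolute error on the same predictions.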
  10. Online Ad Click Probability Prediction
      • 1 million+ rows, 10 columns advertisement data
      • Feature selection, imputing, scaling, encoding
      • Linear regression, Random Forest, GBMs
      • Feature engineering and hyperparameter tuning
      • Best RMSE of 0.24857 for Random Forest
      • Deepa
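Imputing, scaling and encoding, as listed above, each have a simple core. The sketch below shows one common choice for each (median imputation, min-max scaling, one-hot encoding) in plain Python. The records and column names are hypothetical, and the slide does not say which variants the project actually used.

```python
import statistics

# Hypothetical ad records; None marks a missing value.
ads = [
    {"age": 25, "device": "mobile"},
    {"age": None, "device": "desktop"},
    {"age": 35, "device": "mobile"},
    {"age": 45, "device": "tablet"},
]

def impute_median(values):
    """Replace missing entries with the median of the known ones."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]

def min_max_scale(values):
    """Rescale to [0, 1]; assumes the column is not constant."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def one_hot(values):
    """Return the sorted category list and one 0/1 vector per value."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

ages = min_max_scale(impute_median([row["age"] for row in ads]))
categories, device_vectors = one_hot([row["device"] for row in ads])

# One numeric feature followed by the one-hot device columns.
features = [[age] + vec for age, vec in zip(ages, device_vectors)]
print(categories, features)
```

In a real pipeline the median, the min/max and the category list would be computed on the training split only and then reused on validation and test data.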
  11. Credit Card Default Payment Prediction
      • 150K+ rows, 11 columns credit behaviour data
      • Logistic regression, Random Forest, GBMs
      • 93.89% after 20-hour hyperparameter tuning
      • Scored Top 21% on Kaggle leaderboard
      • David
  12. Airline Accident Prevention using ML
      • 4.9 million training, 19 million test data
      • LightGBM, Random Forest, XGBoost
      • 0.712 log-loss after hyperparameter tuning
      • Scored Top 26% on Kaggle leaderboard
      • Rishabh
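Log-loss, the metric quoted above, is the average negative log of the probability a model assigned to the true class. The slide does not say whether the task is binary or multiclass, so this sketch takes one probability row per sample and works for either; the function name and the clipping constant are assumptions.

```python
import math

def log_loss(labels, prob_rows, eps=1e-15):
    """Mean negative log-likelihood of the true class.

    labels:    true class index for each sample
    prob_rows: predicted class probabilities for each sample
    eps:       clipping bound that avoids log(0) on overconfident predictions
    """
    total = 0.0
    for label, probs in zip(labels, prob_rows):
        p = min(max(probs[label], eps), 1 - eps)
        total += math.log(p)
    return -total / len(labels)

# An uninformative model that spreads probability evenly over the classes.
two_class = log_loss([0, 1], [[0.5, 0.5], [0.5, 0.5]])
three_class = log_loss([0, 2], [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]])
print(two_class, three_class)
```

The uniform-guess values (ln 2 for two classes, ln 3 for three) are useful reference points when judging whether a reported log-loss is good.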
  13. Must-Have DS & ML Projects
      • Exploratory Data Analysis
      • Business Intelligence Dashboard
      • Classical Machine Learning
  14. Optional DS & ML Projects
      • Web Scraping with Python
      • Deep Learning with PyTorch/TF
      • Web Development & Deployment
  15. About Jovian
      • Started in 2019 as DS & ML community
      • Offering 5 online courses & 1 bootcamp
      • 200,000+ learners from 180+ countries
      • Over 2 million video views on YouTube
  16. EDA on Police Violence in the United States
      • 51,000+ rows, 30+ columns
      • Analyzed with NumPy, Pandas, Matplotlib, Seaborn & Plotly
      • Distribution of deaths across the US
  17. Wine Price Prediction using ML
      • 137k+ rows, 14 columns
      • Linear Regression, Random Forest, GBMs
      • $9 RMSE, 92% accuracy on test dataset