
ML at Udaipur

Machine Learning introduction and glossaries at GDG Udaipur

Krunal Kapadiya

July 06, 2019

Transcript

  1. Agenda - Introduction to Machine Learning - Steps to get into ML - Problems in Machine Learning data - Ending note
  2. What is Machine Learning "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." - Tom M. Mitchell For example: I need a program that will tell me which tweets will get retweets. - Task (T): Classify a tweet that has not been published as going to get retweets or not. - Experience (E): A corpus of tweets for an account where some have retweets and some do not. - Performance (P): Classification accuracy, the number of tweets predicted correctly out of all tweets considered, as a percentage.
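The performance measure P here is plain arithmetic; a minimal sketch, where y_true and y_pred are hypothetical lists of actual and predicted "got retweets" labels:

     from sklearn.metrics import accuracy_score

     y_true = [1, 0, 1, 1, 0]   # hypothetical: which tweets actually got retweets
     y_pred = [1, 0, 0, 1, 0]   # hypothetical: what the program predicted
     print(accuracy_score(y_true, y_pred) * 100)  # 80.0 - tweets predicted correctly, as a percentage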
  3. What is Machine Learning - Data used - 2005: 130 exabytes - 2010: 1,200 exabytes - 2015: 7,900 exabytes - 2020: 40,900 exabytes (projected). 1 EB = 1000 PB = 1 million TB = 1 billion GB (100 crore GB).
  4. Applications of Machine Learning - Computer vision - Data prediction (e.g. stock market prediction) - Data segmentation (customer segmentation) - Anomaly detection - Sentiment analysis
  5. Steps in ML - Training and splitting data: 80% training set, 20% test set (of the total data).
  6. Steps in ML - Training and splitting data with validation: 70% training set, 15% validation set, 15% test set (of the total data).
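The 70/15/15 split can be produced with two successive train_test_split calls; a minimal sketch, assuming X and y already hold the features and labels:

     from sklearn.model_selection import train_test_split

     # Hold out 15% of the data as the final test set.
     X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
     # Split the remaining 85% so that another 15% of the original data becomes validation.
     X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.15 / 0.85, random_state=0)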
  7. Steps in ML
     Python
     from sklearn.ensemble import RandomForestClassifier
     from sklearn.model_selection import train_test_split
     import pandas as pd
     dataset = pd.read_csv('Social_Network_Ads.csv')
     # X = feature columns from dataset, y = the Purchased label column
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
     R
     library(caTools)
     dataset = read.csv('Social_Network_Ads.csv')
     split = sample.split(dataset$Purchased, SplitRatio = 0.75)
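The slide leaves X and y implicit; one way to derive them from the CSV, assuming it has feature columns named Age and EstimatedSalary and a Purchased label (the column names are an assumption, not confirmed by the deck):

     import pandas as pd

     dataset = pd.read_csv('Social_Network_Ads.csv')
     # Assumed column names; adjust to whatever the file actually contains.
     X = dataset[['Age', 'EstimatedSalary']].values  # feature matrix
     y = dataset['Purchased'].values                 # target labels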
  8. Steps in ML
     Python
     from sklearn.preprocessing import StandardScaler
     sc = StandardScaler()
     X_train = sc.fit_transform(X_train)  # fit the scaler on the training data only
     X_test = sc.transform(X_test)        # reuse the same scaling for the test data
     classifier = RandomForestClassifier()  # imported on the previous slide
     classifier.fit(X_train, y_train)
     R
     training_set[-3] = scale(training_set[-3])
     test_set[-3] = scale(test_set[-3])
     library(rpart)
     classifier = rpart(formula = Purchased ~ ., data = training_set)
  9. Steps in ML
     Python
     import pickle
     from sklearn.metrics import confusion_matrix
     y_pred = classifier.predict(X_test)
     cm = confusion_matrix(y_test, y_pred)
     pickle.dump(classifier, open('classifier_model_in_python.pkl', 'wb'))
     R
     # Making the Confusion Matrix
     y_pred = predict(classifier, newdata = test_set[-3], type = 'class')
     cm = table(test_set[, 3], y_pred)
     saveRDS(classifier, "classifier_model_in_R.rds")
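Not shown on the slide: the saved model can be loaded back later to make predictions. A minimal sketch for the Python side (the R model would be restored with readRDS in the same way):

     import pickle

     # Reload the classifier saved above and reuse it on new, already-scaled feature rows.
     with open('classifier_model_in_python.pkl', 'rb') as f:
         loaded_classifier = pickle.load(f)
     # y_new = loaded_classifier.predict(X_new)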
  10. Find apple or orange problem - Training Data
      Weight  Texture  Label
      150g    Bumpy    Orange
      170g    Bumpy    Orange
      140g    Smooth   Apple
      130g    Smooth   Apple
      (Weight and Texture are features; Label is the target.)
  11. Find apple or orange problem - Training Data: the same table as above, with each row shown as an example (features: Weight, Texture; label: Apple or Orange).
  12. Decision Tree - Find apple or orange problem: the tree asks questions such as "Weight = 150 g?" and "Texture = bumpy?", with Yes/No branches leading to the Orange and Apple leaves.
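The decision tree on the slide can be reproduced from the four training rows above; a minimal sketch with scikit-learn, where encoding Bumpy as 1 and Smooth as 0 is an arbitrary choice for illustration:

     from sklearn import tree

     # Training data from the slides: [weight in grams, texture], with Bumpy = 1, Smooth = 0.
     features = [[150, 1], [170, 1], [140, 0], [130, 0]]
     labels = ['Orange', 'Orange', 'Apple', 'Apple']

     fruit_classifier = tree.DecisionTreeClassifier()
     fruit_classifier.fit(features, labels)

     # Classify an unseen fruit: 160 g and bumpy.
     print(fruit_classifier.predict([[160, 1]]))  # ['Orange']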
  13. But, real problems are... • Insufficient quantity of training data • Non-representative training data • Poor-quality data • Irrelevant features • Overfitting • Underfitting
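One simple way to spot overfitting from the list above is to compare training and test accuracy of the classifier built on the earlier slides; a minimal sketch, assuming those variables are still in scope:

     from sklearn.metrics import accuracy_score

     train_acc = accuracy_score(y_train, classifier.predict(X_train))
     test_acc = accuracy_score(y_test, classifier.predict(X_test))
     print(train_acc, test_acc)  # a large gap (train much higher than test) suggests overfitting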
  14. How can I start? • Look at the dataset • Write down the columns and their correlations • Ask questions derived from the dataset • Exploratory analysis with visualization • Frame the problem • Create a solution by building a model
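A minimal starting point for the checklist above, assuming the data lives in a CSV file (your_dataset.csv is a placeholder name):

     import pandas as pd
     import matplotlib.pyplot as plt

     df = pd.read_csv('your_dataset.csv')  # placeholder file name
     print(df.head())                      # look at the dataset
     print(df.describe())                  # summary statistics per column
     print(df.corr(numeric_only=True))     # correlations between numeric columns
     df.hist(figsize=(10, 8))              # quick look at each column's distribution
     plt.show()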