

Robust Model Building: Cross-Validation, XGBoost, Leakage Prevention, and Logistic Regression

Presentation Title: Building Reliable Machine Learning Models: From Evaluation to Deployment

Description:
In this session, we focus on the essential principles and techniques for building machine learning models that are not only accurate in training environments but also reliable when applied to real-world data. We will cover:

Cross-Validation – A robust technique for evaluating model performance
XGBoost – A powerful gradient boosting algorithm for structured data
Data Leakage – A critical pitfall that can lead to overly optimistic results
Logistic Regression – Practical application using a clean ML pipeline
This talk emphasizes that machine learning is not a one-shot process, but an iterative cycle of building, testing, and refining models. We’ll discuss how to make key decisions, including:

Which features (variables) to include
Which model type to choose (e.g., Linear Regression, Decision Tree, Neural Network)
How to tune hyperparameters effectively
The ultimate goal is to build models that generalize well to new, unseen data. A central challenge we’ll address is:
How do we reliably measure if our model is good before deploying it?

Explore the full pipeline and notebook here:
Notebook: Telco Churn EDA + Logistic Regression + Cross-Validation


Phillip Ssempeebwa

May 16, 2025

Transcript

  1. Today, we'll dive into some crucial concepts for building machine learning models that are not just accurate on paper, but also reliable in the real world. We'll explore Cross-Validation for robust evaluation, the power of XGBoost, the critical pitfalls of Data Leakage, and then apply these ideas by building a Logistic Regression model.
     ✓ ML isn't a one-shot process. It involves cycles of building, testing, and refining.
     ✓ Key Decisions:
     • Which features (variables) to use?
     • Which model type (e.g., Linear Regression, Decision Tree, Neural Network)?
     • How to tune model parameters (hyperparameters)?
     ✓ Goal: Build a model that generalizes well to new, unseen data.
     ✓ Challenge: How do we reliably measure if our model is good before deploying it?
  2. The Problem with a Simple Train/Validation Split
     • Common Method: Split data -> Train on Training set -> Evaluate on Validation set.
     • Problem: Performance depends heavily on which specific rows end up in the validation set. A model might do well on one set of 1000 rows even if it would be inaccurate on a different 1000 rows.
     ✓ Imagine a small validation set (e.g., 10 rows) - model performance could be pure luck!
     ✓ Larger validation set = more reliable score, BUT less data for training = potentially worse model. (A quick sketch of this instability follows below.)
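To make the "luck" factor concrete, here is a minimal sketch (using synthetic data from scikit-learn's make_classification rather than the Telco churn set from the notebook): the same model's validation accuracy shifts simply because the random split changes.

```python
# Same data, same model -- only the train/validation split changes.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

for seed in range(5):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"seed={seed}: validation accuracy = {model.score(X_val, y_val):.3f}")
```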
  3. Introducing Cross-Validation
     01 Split: Divide the entire dataset into 'K' equal-sized folds (e.g., K=5).
     02 Iteration 1: Use Fold 1 as validation, train on Folds 2-5. Calculate score.
     03 Iteration 2: Use Fold 2 as validation, train on Folds 1, 3-5. Calculate score.
     04 Repeat: Repeat K times, using each fold as validation exactly once.
     05 Final Score: Average the scores from all K iterations.
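In scikit-learn, cross_val_score runs this whole loop in one call. A minimal sketch (again on synthetic data, not the talk's Telco dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# cv=5 performs the five iterations described above: each fold
# serves as the validation set exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Per-fold accuracy:", scores.round(3))
print(f"Final score (mean): {scores.mean():.3f}")
```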
  4. Why and When to Use Cross-Validation
     ✓ Benefits:
     • More Reliable Performance Estimate: Reduces the "luck" factor associated with a single split. The average score is usually a better indicator of generalization performance.
     • More Efficient Use of Data: All data points are used for both training and validation across the different folds.
     ✓ When to Use:
     • Small Datasets: Essential when data is limited, since setting aside a large single validation set isn't feasible.
     • Important Decisions: When the consequence of deploying a poor model is high, a more robust evaluation is crucial.
     • Hyperparameter Tuning: Provides a more stable metric for comparing different model settings (see the sketch after this list).
     ✓ Drawback:
     • Computationally More Expensive: Trains K models instead of just one. (Less of an issue for simpler models or with modern hardware.)
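For the hyperparameter-tuning case, here is a minimal sketch using scikit-learn's GridSearchCV; the grid of C values is an illustrative assumption, not something specified in the talk:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each candidate value of C is scored by 5-fold CV, so candidates are
# compared on an averaged metric rather than a single lucky split.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)
print("Best C:", search.best_params_["C"])
print(f"Best mean CV accuracy: {search.best_score_:.3f}")
```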
  5. Logistic Regression - Our Chosen Model
     ✓ What is it? A fundamental classification algorithm, used when the target variable is categorical (e.g., Yes/No, True/False, Spam/Not Spam).
     ✓ Core Idea: Models the probability that an input belongs to a particular category.
     • It uses a linear combination of inputs (like Linear Regression).
     • It then applies the Sigmoid (logistic) function to squash the output between 0 and 1 (representing a probability).
     ✓ Diagram: Show a simple graph with data points of two classes and a sigmoid curve separating them, or just the S-shaped sigmoid curve.
     ✓ Output: A probability score (e.g., 0.7 means 70% probability of belonging to the positive class). We typically set a threshold (e.g., 0.5) to make a final class prediction.
     ✓ Why Use It?
     • Simple, fast, and interpretable (we can understand the influence of features).
     • A good baseline model.
     • Outputs probabilities, which can be useful.
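A minimal sketch of the "clean ML pipeline" idea from the description, assuming a StandardScaler + LogisticRegression pipeline (the notebook's actual preprocessing may differ). Keeping preprocessing inside the Pipeline means the scaler is fit only on the training folds during cross-validation, which guards against one common form of data leakage:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),             # fit on training folds only
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")

# predict_proba returns the sigmoid-squashed probabilities; a 0.5
# threshold turns them into final class predictions.
pipe.fit(X, y)
print("P(positive) for first row:", pipe.predict_proba(X[:1])[0, 1].round(3))
```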
  6. XGBoost - The Powerhouse
     What is it? An advanced, highly efficient implementation of Gradient Boosting, another ensemble method (like Random Forest).
     01 Start with a simple initial model (e.g., predicting the average).
     02 Calculate the errors (residuals) made by the current model.
     03 Fit a new model (usually a tree) specifically to predict those errors.
     04 Add this new model to the ensemble, weighted to correct the previous errors.
     05 Repeat, with each new model focusing on the remaining mistakes.
     • Analogy: Like a team where each new member learns from and corrects the mistakes of the previous members.
     • Why XGBoost? Performance, speed, built-in regularization (prevents overfitting), handles missing values. Often wins Kaggle competitions.
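The boosting loop above can be sketched from scratch in a few lines. This is an illustrative regression version with squared error, not XGBoost itself (which adds regularization, second-order gradients, and many systems-level optimizations):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
pred = np.full_like(y, y.mean())          # 01: simple initial model
trees = []
for _ in range(50):                       # 05: repeat
    residuals = y - pred                  # 02: errors of current model
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # 03
    pred += learning_rate * tree.predict(X)   # 04: weighted correction
    trees.append(tree)

print(f"Training MSE after boosting: {np.mean((y - pred) ** 2):.4f}")
```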
  7. Key Parameters: XGBoost Parameters (Briefly)
     • n_estimators: How many models (trees) to build in sequence. (Too few = underfit, too many = overfit.)
     • learning_rate: How much each new model contributes. A smaller rate requires more n_estimators but can lead to better generalization.
     • early_stopping_rounds: Stops training if performance on a validation set doesn't improve for a specified number of rounds (prevents overfitting and finds the optimal n_estimators automatically).
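A minimal sketch wiring these three parameters together, assuming a recent xgboost version (1.6+, where early_stopping_rounds became a constructor argument) and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = XGBClassifier(
    n_estimators=500,          # upper bound; early stopping may use fewer
    learning_rate=0.05,        # smaller steps -> more trees, often better generalization
    early_stopping_rounds=20,  # stop if validation score stalls for 20 rounds
    eval_metric="logloss",
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("Trees actually used:", model.best_iteration + 1)
print(f"Validation accuracy: {model.score(X_val, y_val):.3f}")
```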