Kaggle Intro: Model Validation, Underfitting, and Overfitting

Model Validation

Model validation is the cornerstone of building a robust and reliable machine learning model. It is the rigorous assessment of how well your model performs on unseen data, mimicking real-world deployment. Done right, it exposes overfitting and bias and guides further model improvement.

Key aspects to consider:

  • Data:

    • Training, Validation, and Testing Sets: Clearly divide your data into distinct sets for training the model (learning patterns), validating its performance (fine-tuning hyperparameters), and final testing (unbiased assessment); see the split sketch after this list.
    • Representativeness: Ensure each set reflects the true distribution of your target problem to avoid misleading results.
    • Data Augmentation (if applicable): Artificially expand your dataset to enhance generalizability and robustness, especially in limited-data scenarios.
  • Metrics:

    • Choose wisely: Select metrics aligned with your problem's objective. For example, accuracy might be suitable for classification, while mean squared error is better for regression.
    • Multiple metrics: Consider using a combination of metrics to capture different aspects of performance, like precision, recall, and F1-score for imbalanced classes (see the metrics sketch after this list).
    • Calibration and Interpretation: Evaluate how well your model's confidence scores correspond to actual correctness, especially for critical applications.
  • Techniques:

    • K-Fold Cross-Validation: Randomly split your data into k folds, train on k-1 folds, validate on the remaining fold, and repeat k times. This provides a more robust estimate of performance than a single split (see the cross-validation sketch after this list).
    • Hold-out Validation: Allocate a fixed portion of your data for validation; this is simple, but it reduces the amount of data available for training.
    • Early Stopping: Halt training when validation performance stops improving, to prevent overfitting (sketched after this list).
    • Regularization: Introduce techniques like L1/L2 regularization or dropout to penalize complex models and reduce overfitting (sketched after this list).
  • Additional Considerations:

    • Computational Cost: Validation adds to training time, so balance thoroughness with efficiency based on your resource constraints.
    • Domain Knowledge: Leverage your expertise to interpret results, identify potential biases, and guide further model refinement.
    • Explainability and Fairness: If interpretability or fairness are crucial, consider techniques like LIME or SHAP to explain model predictions and mitigate potential biases.
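
For the three-way split, here is a minimal sketch with scikit-learn's train_test_split, assuming a feature matrix X and target y like those built in the examples below:

from sklearn.model_selection import train_test_split

# Carve off a held-out test set first (20% of the data)
train_val_X, test_X, train_val_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0)
# 0.25 of the remaining 80% yields a 60/20/20 train/validation/test split
train_X, val_X, train_y, val_y = train_test_split(
    train_val_X, train_val_y, test_size=0.25, random_state=0)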
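
To see why a single metric can mislead on imbalanced classes, here is a minimal sketch using scikit-learn's metric functions; the toy labels are made up for illustration:

from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels for an imbalanced binary problem (2 positives out of 10)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

# Accuracy alone looks decent here (8/10 correct); the class-sensitive
# metrics reveal how the model actually handles the rare positive class
print("Precision:", precision_score(y_true, y_pred))  # 0.5
print("Recall:   ", recall_score(y_true, y_pred))     # 0.5
print("F1-score: ", f1_score(y_true, y_pred))         # 0.5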
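
The cross-validation sketch, using scikit-learn's cross_val_score and again assuming X and y as in the examples below:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=0)
# 5-fold cross-validation: each fold serves once as the validation set.
# scikit-learn negates MAE so that higher scores are always better.
scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
print("MAE per fold:", -scores)
print("Mean MAE:", -scores.mean())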
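
Early stopping applies to iterative learners. As one example (not the only way to do it), scikit-learn's GradientBoostingRegressor can hold out part of the training data internally and stop adding trees once the validation score stops improving:

from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound on boosting rounds
    validation_fraction=0.1,  # hold out 10% of training data internally
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
model.fit(X, y)
# n_estimators_ shows how many rounds actually ran before stopping
print("Boosting rounds used:", model.n_estimators_)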
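
For L1/L2 regularization on a linear model, scikit-learn's Lasso and Ridge penalize large coefficients through the alpha parameter (a sketch assuming train_X and train_y as in the examples below; in practice the features may need scaling first):

from sklearn.linear_model import Lasso, Ridge

# Larger alpha means a stronger penalty, a simpler model, and lower variance
ridge = Ridge(alpha=1.0).fit(train_X, train_y)  # L2: shrinks coefficients
lasso = Lasso(alpha=1.0).fit(train_X, train_y)  # L1: can zero coefficients out
print("Nonzero Lasso coefficients:", (lasso.coef_ != 0).sum())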

Remember, model validation is an iterative process. Continuously evaluate, refine, and improve your model to ensure it generalizes well and meets your specific requirements.

Example with Python code

import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load data
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Drop rows with any missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                      'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

# Split data into training and validation data, for both features and target.
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# Get predicted prices on validation data and report the mean absolute error
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

Underfitting and Overfitting

Underfitting:

Imagine you're trying to learn a dance by only observing a few basic steps. This is akin to underfitting in machine learning. It occurs when a model is too simple to capture the underlying patterns and complexity of the data. This results in:

  • High bias: The model has strong assumptions that may not hold true for the entire dataset.
  • Low variance: The model's predictions are consistent but often inaccurate, leading to systematic errors.
  • Poor performance on unseen data: The model cannot generalize well to new data because it hasn't learned the true relationships.

Examples of underfitting:

  • Using a linear regression model for a highly non-linear dataset.
  • Using a decision tree with shallow depth for a complex problem.
  • Training a neural network with too few neurons or layers.

Overfitting:

Think of trying to mimic every minute detail of a dance move, even the tiniest twitch. This is analogous to overfitting. It happens when a model becomes too complex and memorizes the training data too closely, including noise and irrelevant details. This leads to:

  • Low bias: The model closely fits the training data, potentially capturing even noise and artifacts.
  • High variance: The model's predictions are highly sensitive to small changes in the data, leading to unstable and erratic behavior.
  • Poor performance on unseen data: The model cannot generalize well because it's adapted to the specific training data, not the underlying patterns.

Examples of overfitting:

  • Using a high-degree polynomial regression for a simple dataset.
  • Using a deep neural network with many layers and neurons for a small dataset.
  • Training a model for too many epochs without regularization.

Finding the sweet spot:

The goal is to find a model that strikes a balance between underfitting and overfitting. This is often achieved through:

  • Regularization: Techniques that penalize the model for being too complex and reduce variance.
  • Data augmentation: Increasing the training data size and diversity to help the model learn generalizable patterns.
  • Model selection: Choosing the right model complexity based on the data and problem.

By understanding these concepts, you can effectively diagnose and address underfitting and overfitting in your machine learning projects, leading to better-performing and more generalizable models.
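
Before the full worked example below, here is a minimal sketch of the diagnosis itself: compare training error with validation error across model complexity. A model that scores badly on both sets is underfitting; one that scores far better on training data than on validation data is overfitting. The synthetic data is made up for illustration:

import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data (illustration only)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=500)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

for depth in [1, 4, None]:  # too simple, balanced, unconstrained
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    print(f"max_depth={depth}: train MAE {train_mae:.3f}, val MAE {val_mae:.3f}")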

Example with Python code

import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'

home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE: {:,.0f}".format(val_mae))

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree with the given number of leaf nodes and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae


candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]

# Evaluate each candidate and keep the tree size with the lowest validation MAE
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)



# Build the final model with the optimal tree size found above
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)

# Fit the final model on all of the data, now that the
# hyperparameter has been chosen using the validation set
final_model.fit(X, y)

# Note: scoring the model on the same data it was fit on measures
# training error, not validation error
train_predictions = final_model.predict(X)
train_mae = mean_absolute_error(y, train_predictions)
print("Training MAE: {:,.0f}".format(train_mae))

See also

https://builtin.com/data-science/enumerate-zip-sorted-reversed-python
https://builtin.com/data-science/model-validation-test
https://builtin.com/data-science/model-fit
https://builtin.com/data-science/multiple-regression
