Model validation is the cornerstone of building a robust and reliable machine learning model. It is the rigorous assessment of how well your model performs on unseen data, mimicking real-world scenarios. Done right, it exposes overfitting and biases before deployment and guides further model improvement.
Key aspects to consider:
Data: Hold out data the model never sees during training (a validation or test set), and make sure it is representative of the data the model will face in production.
Metrics: Choose an evaluation metric that matches the task, such as mean absolute error (MAE) or root mean squared error for regression, or accuracy and F1 for classification.
Techniques: Common approaches include a simple train/validation split and k-fold cross-validation, which averages performance over several different splits.
Additional Considerations: Watch for data leakage between the training and validation sets, and re-validate whenever the data or the model changes.
Remember, model validation is an iterative process. Continuously evaluate, refine, and improve your model to ensure it generalizes well and meets your specific requirements.
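One validation technique worth knowing beyond the simple train/validation split used below is k-fold cross-validation. The following is a minimal sketch on synthetic data (the array values and model settings here are illustrative, not from the original examples):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: 200 rows, 3 features, nearly linear target
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

model = DecisionTreeRegressor(max_leaf_nodes=50, random_state=0)

# scoring="neg_mean_absolute_error" returns negated MAE (so higher is better);
# negate it back to get MAE per fold
scores = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
print("MAE per fold:", scores.round(3))
print("Mean MAE:", scores.mean().round(3))
```

Because every row serves in a validation fold exactly once, the averaged score is a more stable estimate of out-of-sample error than a single split, at the cost of fitting the model k times.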
# Data Loading Code Hidden Here
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load data
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Drop rows with any missing values
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                      'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

# Split data into training and validation data, for both features and target.
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# Get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
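For reference, MAE is simply the average absolute difference between predictions and actual values. A minimal hand computation (the prices here are made-up illustrative numbers) matches what `mean_absolute_error` returns:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Illustrative actual and predicted house prices
actual = np.array([150000.0, 200000.0, 250000.0])
predicted = np.array([145000.0, 210000.0, 240000.0])

# MAE = mean(|actual - predicted|)
mae_by_hand = np.mean(np.abs(actual - predicted))
print(mae_by_hand)  # absolute errors are 5000, 10000, 10000 -> mean ~ 8333.33
print(mean_absolute_error(actual, predicted))  # same value
```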
Underfitting:
Imagine you're trying to learn a dance by only observing a few basic steps. This is akin to underfitting in machine learning. It occurs when a model is too simple to capture the underlying patterns and complexity of the data. This results in:
- High error on both the training data and the validation data (high bias).
- Predictions that miss systematic patterns the data actually contains.
Examples of underfitting: fitting a straight line to clearly nonlinear data, or a decision tree limited to only a handful of leaf nodes.
Overfitting:
Think of trying to mimic every minute detail of a dance move, even the tiniest twitch. This is analogous to overfitting. It happens when a model becomes too complex and memorizes the training data too closely, including noise and irrelevant details. This leads to:
- Very low error on the training data but much higher error on the validation data (high variance).
- A model that fails to generalize to new, unseen examples.
Examples of overfitting: a decision tree grown until every leaf holds a single training example, or a model with far more parameters than the data can support.
Finding the sweet spot:
The goal is to find a model that strikes a balance between underfitting and overfitting. This is often achieved through:
- Tuning model complexity (for example, max_leaf_nodes of a decision tree) and comparing validation error across the candidates.
- Using cross-validation to get a more stable estimate of out-of-sample performance.
- Stopping where validation error is lowest, rather than where training error is lowest.
By understanding these concepts, you can effectively diagnose and address underfitting and overfitting in your machine learning projects, leading to better-performing and more generalizable models.
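The contrast between the two failure modes can be made concrete with a small sketch on synthetic data (the sine-shaped data and leaf counts are illustrative assumptions, not part of the Melbourne or Iowa examples): a 2-leaf tree underfits, an unconstrained tree overfits, and an intermediate size lands near the sweet spot.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: noisy sine wave, 400 samples
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=400)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

results = {}
for name, leaves in [("underfit", 2), ("balanced", 20), ("overfit", None)]:
    model = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=1)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    results[name] = (train_mae, val_mae)
    print(f"{name:9s} train MAE {train_mae:.3f}  val MAE {val_mae:.3f}")
```

The underfit tree shows high error everywhere; the unconstrained tree drives training error to essentially zero while its validation error stays stuck near the noise level; the mid-sized tree gives the best validation error.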
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]
# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)
# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE: {:,.0f}".format(val_mae))
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)
# Build the final model with the optimal tree size found above
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
# Fit the final model on all of the data, now that validation has chosen the tree size
final_model.fit(X, y)
# Note: predicting on the same data the model was fit to measures in-sample
# (training) error, not true validation error
predictions = final_model.predict(X)
mae = mean_absolute_error(y, predictions)
print("In-sample MAE: {:,.0f}".format(mae))