Training vs Testing vs Validation Sets (original) (raw)

Last Updated : 6 Jan, 2026

Training, validation and testing sets are three essential components in building reliable machine learning models. The training set teaches the model patterns, the validation set helps fine‑tune hyperparameters and prevent overfitting and the testing set evaluates how well the model performs on completely unseen data.

given_data

Training vs. Testing vs. Validation Sets

1. Training Set

The training set is the portion of the dataset used to fit the machine learning model. During training, the algorithm learns patterns, relationships and parameters (such as weights in neural networks or coefficients in regression models) directly from this data. The model repeatedly adjusts itself to minimize the training error using optimization techniques like gradient descent.

Use Cases

Advantages

Limitations

**Example:

Python `

from sklearn.datasets import fetch_openml from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression import pandas as pd

titanic = fetch_openml(name="titanic", version=1, as_frame=True) data = titanic.frame

data = data[["pclass", "sex", "age", "fare", "survived"]].copy() data["sex"] = data["sex"].map({"male": 0, "female": 1}) data["age"] = data["age"].fillna(data["age"].median()) data["fare"] = data["fare"].fillna(data["fare"].median())

X = data.drop("survived", axis=1) y = data["survived"]

X_train, X_temp, y_train, y_temp = train_test_split( X, y, test_size=0.3, random_state=42 )

model = LogisticRegression(max_iter=300) model.fit(X_train, y_train)

`

**Output:

Screenshot-2025-12-16-171644

2. Validation Set

The validation set is a separate subset of data used to tune model hyperparameters and make design decisions during training. Unlike the training set, it is not used to update model weights directly. Instead, it provides an unbiased estimate of model performance during development.

Use Cases

Advantages

Limitations

**Example:

Python `

X_val, X_test, y_val, y_test = train_test_split( X_temp, y_temp, test_size=0.5, random_state=42 )

validation_accuracy = model.score(X_val, y_val) print("Validation Accuracy:", validation_accuracy)

`

**Output:

Validation Accuracy: 0.8010204081632653

3. Testing Set

The testing set is a completely independent subset used to evaluate the final model’s performance after all training and tuning are complete. It simulates how the model will perform on unseen, real-world data and provides the most reliable estimate of generalization.

Use Cases

Advantages

Limitations

**Example:

Python `

test_accuracy = model.score(X_test, y_test) print("Test Accuracy:", test_accuracy)

`

**Output:

Test Accuracy: 0.817258883248731

Comparison

Let's compare them:

Aspect Training Set Validation Set Testing Set
Core Objective Learn patterns and fit model parameters Tune hyperparameters and select the best model Evaluate final model performance
Role in Learning Directly involved in model learning Indirect role (guides learning decisions) No role in learning
Parameter Updates Model weights are updated repeatedly Weights remain unchanged Weights remain unchanged
Hyperparameter Influence Does not guide hyperparameter selection Actively used for hyperparameter tuning Must not influence hyperparameters
Typical Usage Frequency Used many times across epochs/iterations Used multiple times during experimentation Ideally used only once
Performance Interpretation Indicates how well the model fits seen data Indicates whether the model is overfitting or underfitting Indicates real-world generalization
Risk if Overused Severe overfitting to training data Validation leakage and biased tuning Inflated and unreliable performance estimates
Relationship to Deployment Indirect, supports model creation Indirect, supports model refinement Direct indicator of deployment readiness
Typical Dataset Proportion Largest portion of the dataset Medium-sized portion Smallest portion