How to split a Dataset into Train and Test Sets using Python (original) (raw)

Last Updated : 7 Apr, 2026

To build and evaluate a machine learning model, the dataset must be divided into two parts i.e one for training the model and another for testing its performance. This process helps measure how well a model works on unseen data. This is done to properly assess how well the model will perform in real-world scenarios.

**Method 1: Splitting Dataset Using train_test_split()

The train_test_split() function from scikit-learn is the most common and easiest way to split a dataset.

Here:

from sklearn.model_selection import train_test_split import pandas as pd

data = { 'Age': [22, 25, 47, 52, 46], 'Salary': [25000, 32000, 48000, 52000, 50000], 'Purchased': [0, 1, 1, 0, 1] }

df = pd.DataFrame(data)

X = df[['Age', 'Salary']] y = df['Purchased']

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42 )

print( X_train,'\n', X_test) print( y_train,'\n', y_test)

`

**Output:

Screenshot-2026-02-03-165102

Output

This shows the splitting of our dataset. Now let's see our models accuracy using logistic regression model.

Python `

from sklearn.linear_model import LogisticRegression from sklearn.metrics import accuracy_score

Creating and training the model

model = LogisticRegression() model.fit(X_train, y_train)

Making predictions on test data

y_pred = model.predict(X_test)

Evaluating model performance

acc = accuracy_score(y_test, y_pred) print("Accuracy:", acc)

`

**Output:

Accuracy: 1.0

We can see our model is performing well after train and test split.

**Method 2: Manual Splitting Using Indexing

Manual splitting means dividing a dataset into training and testing parts without using built-in ML functions like train_test_split(). This approach gives full control over how data is shuffled and split.

Here:

import pandas as pd

data = { 'Age': [22, 25, 47, 52, 46], 'Salary': [25000, 32000, 48000, 52000, 50000], 'Purchased': [0, 1, 1, 0, 1] }

df = pd.DataFrame(data)

df = df.sample(frac=1).reset_index(drop=True)

split = int(0.8 * len(df))

train = df[:split] test = df[split:]

print(train) print(test)

`

**Output:

Screenshot-2026-02-03-161859

Output

**Method 3: Splitting Using NumPy

NumPy can also be used when working with arrays instead of DataFrames.

import numpy as np

arr = np.arange(20)

print("original array: ",arr)

train, test = np.split(arr, [16])

print("train: ",train) print("test: ", test)

`

**Output:

Screenshot-2026-02-03-165544

Output

Choosing the Right Split Ratio

Dataset Size Recommended Split
Small 70:30
Medium 80:20
Large 90:10

**Best Method to Use

**Common Mistakes to Avoid