
Learning Model Building in Scikit-learn

Last Updated : 29 Jan, 2025

Building machine learning models from scratch can be complex and time-consuming. However, with the right tools and frameworks, the process becomes significantly easier. Scikit-learn is one such tool: it makes machine learning model creation straightforward, providing user-friendly tools for tasks like classification, regression, clustering and more.

Scikit-learn is an open-source Python library that includes a wide range of machine learning models, along with pre-processing, cross-validation and visualization utilities, all accessible through a simple, consistent interface. Its simplicity and versatility make it a good choice for both beginners and experienced data scientists. In this article we will cover the essential features and techniques for building machine learning models using Scikit-learn.

**Installation of Scikit-learn**

Scikit-learn requires Python 3.8 or newer, along with NumPy and SciPy. Before installing scikit-learn, ensure that you have working installations of NumPy and SciPy; once you do, the easiest way to install scikit-learn is using pip:

```
pip install -U scikit-learn
```

(Prefix the command with `!` if you are running it inside a Jupyter notebook.)
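To confirm the installation succeeded, you can print the installed version:

```python
# check that scikit-learn is importable and print its version
import sklearn
print(sklearn.__version__)
```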

Let us get started with the modeling process now.

**Step 1: Load a Dataset**

A dataset is simply a collection of data. It generally has two main components: the **features** (the input variables, stored in the feature matrix X) and the **response or target** (the values we want to predict, stored in the response vector y).

**1. Loading an exemplar dataset:** Scikit-learn comes with a few built-in example datasets, such as the iris and digits datasets for classification and the diabetes dataset for regression (the Boston house prices dataset found in older tutorials was removed in scikit-learn 1.2).

Given below is an example of how we can load an exemplar dataset:

```python
# load the iris dataset as an example
from sklearn.datasets import load_iris
iris = load_iris()

# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

# store the feature and target names
feature_names = iris.feature_names
target_names = iris.target_names

# print the feature and target names of our dataset
print("Feature names:", feature_names)
print("Target names:", target_names)

# X and y are numpy arrays
print("\nType of X is:", type(X))

# print the first 5 input rows
print("\nFirst 5 rows of X:\n", X[:5])
```

**Output:**

```
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target names: ['setosa' 'versicolor' 'virginica']

Type of X is: <class 'numpy.ndarray'>

First 5 rows of X:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
```

**2. Loading an external dataset:** Now consider the case when we want to load an external dataset. For this we can use the **pandas** library to easily load and manipulate datasets.

For this, you can refer to our article on How to import a CSV file in pandas.
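As a minimal sketch (assuming a CSV file named `my_dataset.csv` with a column named `target`; both names are placeholders), loading an external dataset with pandas might look like this:

```python
import pandas as pd

# load the CSV file into a DataFrame (file name is a placeholder)
df = pd.read_csv("my_dataset.csv")

# split the DataFrame into a feature matrix and a target vector
# (assumes the target column is named "target")
X = df.drop(columns=["target"])
y = df["target"]
```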

**Step 2: Splitting the Dataset**

In machine learning we split the data into two parts: **training data** and **testing data**. The model is fitted on the training data, while the testing data is held back to evaluate the model's performance and accuracy on unseen data; with large datasets, which can be computationally expensive to work with, training on a subset also helps reduce computational cost.

**1. Load the Iris Dataset**

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
```

**2. Import and Use train_test_split to Split the Data**

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)
```

In this step we import **train_test_split** from **sklearn.model_selection**. This function splits the dataset into two parts: a training set and a testing set. Here `test_size=0.4` reserves 40% of the samples for testing, and `random_state=1` makes the split reproducible.

**3. Check the Shapes of the Split Data**

When splitting data into training and testing sets, verifying the shapes ensures that both sets have the correct proportions of data, avoiding potential errors in model training or evaluation.

```python
print("X_train Shape:", X_train.shape)
print("X_test Shape:", X_test.shape)
print("Y_train Shape:", y_train.shape)
print("Y_test Shape:", y_test.shape)
```

**Output:**

```
X_train Shape: (90, 4)
X_test Shape: (60, 4)
Y_train Shape: (90,)
Y_test Shape: (60,)
```

**Step 3: Handling Categorical Data**

It's important to **handle categorical data correctly** because machine learning algorithms typically require numerical input. If categorical data is not encoded, algorithms may misinterpret the categories, leading to incorrect results. This is why we encode categorical data into numerical values. **Scikit-learn** provides several techniques for encoding categorical variables.

**1. Label Encoding:** It converts each category into a unique integer. For example, in a column with categories like 'cat', 'dog' and 'bird', label encoding would convert them to 1, 2 and 0 respectively (LabelEncoder assigns codes in sorted order, so 'bird' → 0, 'cat' → 1, 'dog' → 2).

```python
from sklearn.preprocessing import LabelEncoder

categorical_feature = ['cat', 'dog', 'dog', 'cat', 'bird']

# fit the encoder and transform each category into an integer code
encoder = LabelEncoder()
encoded_feature = encoder.fit_transform(categorical_feature)

print("Encoded feature:", encoded_feature)
```

**Output:**

```
Encoded feature: [1 2 2 1 0]
```

This method is useful when the categorical values have an inherent order, like "Low", "Medium" and "High", but it can be problematic for unordered categories because models may read the integer codes as a ranking.
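For genuinely ordered categories, a minimal sketch using scikit-learn's OrdinalEncoder (which, unlike LabelEncoder, lets you specify the order explicitly) could look like this:

```python
from sklearn.preprocessing import OrdinalEncoder

# specify the order explicitly so that 'Low' < 'Medium' < 'High'
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])

sizes = [['Low'], ['High'], ['Medium'], ['Low']]  # one column, 2D input
print(encoder.fit_transform(sizes).ravel())  # [0. 2. 1. 0.]
```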

**2. One-Hot Encoding:** It creates binary columns for each category, where each column represents one category. For example, if you have a column with values 'cat', 'dog' and 'bird', one-hot encoding will create three new columns (one for each category), where each row has a 1 in the column corresponding to its category and 0s in the others.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

categorical_feature = ['cat', 'dog', 'dog', 'cat', 'bird']

# OneHotEncoder expects a 2D array, so reshape into a single column
categorical_feature = np.array(categorical_feature).reshape(-1, 1)

encoder = OneHotEncoder(sparse_output=False)
encoded_feature = encoder.fit_transform(categorical_feature)

print("OneHotEncoded feature:\n", encoded_feature)
```

**Output:**

```
OneHotEncoded feature:
[[0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
```

This method is useful for categorical variables without any inherent order, as it ensures that no numeric relationships are implied between the categories.

Other encoding techniques also exist, such as **mean (target) encoding**, which replaces each category with the mean of the target variable for that category.
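As a minimal sketch of mean encoding (using pandas, with hypothetical toy data and column names):

```python
import pandas as pd

# hypothetical toy data: a categorical feature and a numeric target
df = pd.DataFrame({
    "city":  ["A", "B", "A", "C", "B", "A"],
    "price": [100, 200, 150, 300, 250, 120],
})

# replace each category with the mean target value for that category
means = df.groupby("city")["price"].mean()
df["city_encoded"] = df["city"].map(means)

print(df)
```

In practice, the category means should be computed on the training split only, to avoid leaking target information into the test set.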

**Step 4: Training the Model**

Now it's time to train a model on our dataset. Scikit-learn provides a wide range of machine learning algorithms that share a unified, consistent interface for fitting, predicting and scoring. The example given below uses Logistic Regression.

**Note:** We will not go into the details of how the algorithm works, as we are only interested in its implementation here.

**1. Training Using Logistic Regression**

```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)
```

**2. Making Predictions on the Test Set**

```python
y_pred = log_reg.predict(X_test)
```

**3. Testing Accuracy**

```python
from sklearn import metrics

print("Logistic Regression model accuracy:",
      metrics.accuracy_score(y_test, y_pred))
```

**4. Making Predictions on New Data**

```python
# two new samples with the same four features as the iris data
sample = [[3, 5, 4, 2], [2, 3, 5, 4]]
preds = log_reg.predict(sample)
pred_species = [iris.target_names[p] for p in preds]
print("Predictions:", pred_species)
```

**Output:**

```
Logistic Regression model accuracy: 0.9666666666666667
Predictions: ['virginica', 'virginica']
```
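Because scikit-learn estimators share the same fit/predict interface, swapping in a different model changes only the constructor line. A minimal sketch using KNeighborsClassifier (chosen here purely as an illustration):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# same workflow as before: construct, fit, predict, score
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print("KNN model accuracy:", metrics.accuracy_score(y_test, y_pred_knn))
```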


Scikit-learn stands as one of the most important libraries in the field of machine learning, providing a straightforward yet powerful set of tools for building and deploying models. Whether you are a beginner or an experienced data scientist, it is a dependable choice for building machine learning models.