Binary classification using LightGBM (original) (raw)

Last Updated : 6 Sep, 2025

LightGBM (Light Gradient Boosting Machine) is an open-source gradient boosting framework designed for efficient and scalable machine learning. It is widely used for classification tasks, including binary classification and is optimized for speed and memory usage.

We will implement binary classification using LightGBM:

1. Installing Libraries

We will install LightGBM for classification tasks.

pip install lightgbm

2. Importing Libraries and Dataset

We will import the necessary Python libraries such as pandas, numpy, seaborn, matplotlib, sklearn and load the dataset.

You can download dataset from here.

Python `

import pandas as pd import numpy as np import seaborn as sb import matplotlib.pyplot as plt import lightgbm as lgb from sklearn.preprocessing import StandardScaler from sklearn.model_selection import train_test_split from sklearn.metrics import roc_auc_score from lightgbm import LGBMClassifier import warnings warnings.filterwarnings('ignore')

df = pd.read_csv('/content/diabetes.csv')

`

2.1. Previewing the Dataset

We will check the first few rows to understand the data structure.

df.head()

`

**Output:

head

Previewing the Dataset

2.2. Dataset Shape

We will check the dimensions of the dataset.

df.shape

`

**Output:

(768, 9)

2.3. Dataset Information

We will check the data types and null values.

df.info()

`

**Output:

info

Dataset Information

2.4. Descriptive Statistics

We will compute statistical summaries of numeric features.

df.describe()

`

**Output:

describe

Descriptive Statistics

3. Exploratory Data Analysis (EDA)

We will analyze patterns, distributions and relationships among features.

3.1. Class Distribution

We will visualize the distribution of the target variable Outcome.

temp = df['Outcome'].value_counts() plt.pie(temp.values, labels=temp.index.values, autopct='%1.1f%%') plt.title("Class Distribution") plt.show()

`

**Output:

piechart

Class Distribution

3.2. Correlation Matrix

We will check correlations between features.

sb.heatmap(df.corr() > 0.7, cbar=False, annot=True) plt.show()

`

**Output:

heatmap

Correlation Matrix

3.3. Feature Distributions

We will visualize individual feature's distributions.

num_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']

plt.figure(figsize=(15, 15)) for col in num_cols: plt.subplot(3, 3, num_cols.index(col)+1) sb.histplot(df[col], kde=True)

plt.tight_layout() plt.show()

`

**Output:

features

Feature Distributions

3.4. Count Plots

We will visualize categorical feature relationships with the target.

sb.countplot(data=df, x='Pregnancies', hue='Outcome') plt.show()

`

**Output:

count

Count Plots

Insights from the Diabetes dataset:

4. Data Preprocessing

We will prepare the dataset for LightGBM.

4.1. Splitting Features and Target

We will split the dataset into input features and target variable.

features = df.drop('Outcome', axis=1) target = df['Outcome']

X_train, X_val, Y_train, Y_val = train_test_split( features, target, test_size=0.2, random_state=2023 )

`

4.2. Feature Scaling

We will standardize the features to improve model learning.

scaler = StandardScaler() X_train = scaler.fit_transform(X_train) X_val = scaler.transform(X_val)

`

5. Dataset Preparation for LightGBM

We will convert arrays into LightGBM dataset objects for training.

train_data = lgb.Dataset(X_train, label=Y_train) test_data = lgb.Dataset(X_val, label=Y_val, reference=train_data)

`

6. Binary Classification Model Using LightGBM

We will define model parameters and train the classifier.

params = { 'objective': 'binary', 'metric': 'auc', 'boosting_type': 'gbdt', 'num_leaves': 31, 'learning_rate': 0.05, 'feature_fraction': 0.9 }

num_round = 100 bst = lgb.train(params, train_data, num_round, valid_sets=[test_data], early_stopping_rounds=10)

`

**Output:

lightgbm

Binary Classification Model Using LightGBM

7. Prediction and Evaluation

We will generate predictions and evaluate performance using ROC-AUC.

y_train = bst.predict(X_train) y_val = bst.predict(X_val)

y_train_class = (y_train > 0.5).astype(int) y_val_class = (y_val > 0.5).astype(int)

print("Training ROC-AUC: ", ras(Y_train, y_train)) print("Validation ROC-AUC: ", ras(Y_val, y_val))

`

**Output:

Training ROC-AUC: 1.0
Validation ROC-AUC: 0.6791463194067643