Gaussian Naive Bayes using Sklearn

Last Updated : 23 Jul, 2025

Gaussian Naive Bayes is a simple yet powerful machine learning algorithm used for classification tasks. It belongs to the Naive Bayes family of algorithms, which are built on Bayes' theorem. This post explains the Gaussian Naive Bayes classifier and walks through a detailed implementation in Python using the Sklearn (scikit-learn) module.

Naive Bayes classifiers are a family of probabilistic algorithms that apply Bayes' theorem with the strong (naive) assumption that every feature in the dataset is conditionally independent of every other feature, given the class label. Despite this simplifying assumption, Naive Bayes classifiers often perform well in a variety of real-world situations.

Gaussian Naive Bayes

Gaussian Naive Bayes (GNB) is a probabilistic classification algorithm founded on Bayes' theorem. It assumes that, given the class label, the features are conditionally independent and each follows a Gaussian distribution, which makes it especially suited to continuous data. During training, the algorithm calculates the mean and variance of each feature for every class. During prediction, it calculates the probability of each class for an instance and assigns the class with the highest probability. Because training only requires these per-class statistics, GNB is computationally efficient and can handle high-dimensional datasets. Naive Bayes classifiers in general are widely used for applications such as text classification and spam filtering, although those tasks typically rely on the multinomial or Bernoulli variants rather than the Gaussian one.

Bayes’ Theorem

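Bayes' theorem lets us calculate the probability of an event given that another, related event has occurred. The theorem is expressed mathematically as:

P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}

Where:

- P(A|B) is the posterior probability of A given that B has occurred.
- P(B|A) is the likelihood of B given that A is true.
- P(A) is the prior probability of A.
- P(B) is the marginal probability (evidence) of B.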
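As a worked example of Bayes' theorem, the sketch below plugs made-up numbers into the formula for a spam filter, where A is "the email is spam" and B is "the email contains the word offer". All three input probabilities are hypothetical and chosen only for illustration.

```python
# Hypothetical probabilities, chosen only for illustration.
p_spam = 0.2                # P(A): prior probability that an email is spam
p_offer_given_spam = 0.6    # P(B|A): "offer" appears in a spam email
p_offer_given_ham = 0.05    # P(B|not A): "offer" appears in a non-spam email

# P(B) via the law of total probability.
p_offer = p_offer_given_spam * p_spam + p_offer_given_ham * (1 - p_spam)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_offer = p_offer_given_spam * p_spam / p_offer

print(f"P(offer) = {p_offer:.2f}")
print(f"P(spam | offer) = {p_spam_given_offer:.2f}")
```

With these inputs, P(B) works out to 0.16 and the posterior to 0.75: seeing the word raises the probability of spam from the prior of 0.2 to 0.75.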

The Gaussian Naive Bayes classifier is one of several machine learning algorithms that can be used to tackle a wide range of problems. This article uses the well-known Scikit-Learn package (Sklearn) to walk readers who are new to data science and machine learning through the basic ideas of Gaussian Naive Bayes. We will go over the fundamental ideas, important vocabulary, and practical examples to help you grasp how the algorithm works.

Representation for Gaussian Naive Bayes

Gaussian Naive Bayes (GNB) uses Gaussian (normal) distributions to represent the probability distribution of the features within each class. For a dataset with m features and n classes, the representation consists of an estimated mean (μ) and variance (σ²) for every feature in every class, that is, m × n means and m × n variances.

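Mathematically, the Gaussian distribution for a feature X_i in class C_j is represented as follows:

P(X_{i} = x \mid C_{j}) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} e^{-\frac{(x-\mu_{ij})^2}{2\sigma_{ij}^2}}

Where,

- x is the observed value of feature X_i.
- μ_ij is the mean of feature X_i over the training samples in class C_j.
- σ_ij² is the variance of feature X_i over the training samples in class C_j.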
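The representation can be sketched from scratch in NumPy: estimate a mean and variance per feature and class, then score a new point by adding the log prior to the summed log Gaussian likelihoods of its features. The function names (`fit_gnb`, `log_gaussian`, `predict_gnb`) and the toy data are hypothetical, and unlike scikit-learn this sketch applies no variance smoothing, so it is meant only to illustrate the formula.

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate the prior, feature means and feature variances of each class."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances

def log_gaussian(x, mean, var):
    """Log of the Gaussian density, evaluated feature by feature."""
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

def predict_gnb(X, classes, priors, means, variances):
    """Pick the class with the largest log prior + summed log likelihood."""
    # Resulting shape is (n_samples, n_classes): one score per class.
    scores = np.stack([
        np.log(priors[k]) + log_gaussian(X, means[k], variances[k]).sum(axis=1)
        for k in range(len(classes))
    ], axis=1)
    return classes[np.argmax(scores, axis=1)]

# Hypothetical toy data: two features, two well-separated classes.
X_toy = np.array([[1.0, 2.1], [1.2, 1.9], [0.8, 2.0],
                  [3.0, 4.2], [3.2, 3.8], [2.9, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

classes, priors, means, variances = fit_gnb(X_toy, y_toy)
X_new = np.array([[1.1, 2.0], [3.1, 4.0]])
predictions = predict_gnb(X_new, classes, priors, means, variances)
print("Predictions:", predictions)
```

Working in log space avoids numerical underflow when many small likelihoods are multiplied together; the product of per-feature densities becomes a sum of their logarithms.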

Implementation of Gaussian Naive Bayes using Synthetic Dataset

Generating a Synthetic Dataset

We’ll start by creating a synthetic dataset suitable for classification. The make_classification function in Sklearn will be used to create a dataset with two features.

```python
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

# Generate a synthetic dataset
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=42)

# Visualize the dataset
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.title('Synthetic Dataset')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
```

**Output:**

[Figure: scatter plot titled "Synthetic Dataset", with Feature 1 on the x-axis and Feature 2 on the y-axis, points colored by class]

Training the Gaussian Naive Bayes Model

Now, we’ll train the Gaussian Naive Bayes model using the synthetic dataset.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Train the model
gnb.fit(X_train, y_train)

# Predict the labels for the test set
y_pred = gnb.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
```

**Output:**

Accuracy: 0.9666666666666667
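Because the test split here contains only 30 samples, a single accuracy figure can shift noticeably with the random split. One way to get a steadier estimate is k-fold cross-validation. The sketch below applies `cross_val_score` to the same synthetic data; the choice of 5 folds is an assumption and not part of the walkthrough above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Recreate the synthetic dataset so the example is self-contained.
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=42)

# Train and evaluate on 5 different train/test partitions of the data.
cv_scores = cross_val_score(GaussianNB(), X, y, cv=5, scoring='accuracy')

print('Fold accuracies:', cv_scores)
print(f'Mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})')
```

The spread across folds gives a rough sense of how much the accuracy depends on which samples land in the test set.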

Implementation of Gaussian Naive Bayes on Census Income Dataset

Importing Libraries

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
```

These imports cover the whole workflow: pandas loads and handles the data, LabelEncoder encodes the categorical columns as integers, train_test_split divides the data into training and testing sets, GaussianNB is the classifier, and accuracy_score assesses its accuracy.

Loading the Census Income Dataset

We’ll start by loading the Census Income dataset from the UCI Machine Learning Repository.

```python
# Load the Census Income dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                'marital-status', 'occupation', 'relationship', 'race', 'sex',
                'capital-gain', 'capital-loss', 'hours-per-week',
                'native-country', 'income']
census_data = pd.read_csv(url, names=column_names)

# Display the first few rows of the dataset
print(census_data.head())
```

**Output:**

       age         workclass  fnlwgt  education  education-num  \
    0   39         State-gov   77516  Bachelors             13
    1   50  Self-emp-not-inc   83311  Bachelors             13
    2   38           Private  215646    HS-grad              9
    3   53           Private  234721       11th              7
    4   28           Private  338409  Bachelors             13

           marital-status         occupation   relationship   race     sex  \
    0       Never-married       Adm-clerical  Not-in-family  White    Male
    1  Married-civ-spouse    Exec-managerial        Husband  White    Male
    2            Divorced  Handlers-cleaners  Not-in-family  White    Male
    3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
    4  Married-civ-spouse     Prof-specialty           Wife  Black  Female

       capital-gain  capital-loss  hours-per-week  native-country  income
    0          2174             0              40   United-States   <=50K
    1             0             0              13   United-States   <=50K
    2             0             0              40   United-States   <=50K
    3             0             0              40   United-States   <=50K
    4             0             0              40            Cuba   <=50K

Preprocessing the Data

Before we can train our model, we need to preprocess the data. This includes converting categorical variables into numerical values and normalizing the continuous variables.

```python
from sklearn.preprocessing import LabelEncoder

# Convert categorical variables to numerical values
le = LabelEncoder()
categorical_features = ['workclass', 'education', 'marital-status', 'occupation',
                        'relationship', 'race', 'sex', 'native-country', 'income']
for feature in categorical_features:
    census_data[feature] = le.fit_transform(census_data[feature])

# Normalize continuous variables to the range [0, 1] (min-max scaling)
continuous_features = ['age', 'fnlwgt', 'education-num',
                       'capital-gain', 'capital-loss', 'hours-per-week']
census_data[continuous_features] = census_data[continuous_features].apply(
    lambda x: (x - x.min()) / (x.max() - x.min()))

# Display the preprocessed data
print(census_data.head())
```

**Output:**

            age  workclass    fnlwgt  education  education-num  marital-status  \
    0  0.301370          7  0.044302          9       0.800000               4
    1  0.452055          6  0.048238          9       0.800000               2
    2  0.287671          4  0.138113         11       0.533333               0
    3  0.493151          4  0.151068          1       0.400000               2
    4  0.150685          4  0.221488          9       0.800000               2

       occupation  relationship  race  sex  capital-gain  capital-loss  \
    0           1             1     4    1       0.02174           0.0
    1           4             0     4    1       0.00000           0.0
    2           6             1     4    1       0.00000           0.0
    3           6             0     2    1       0.00000           0.0
    4          10             5     2    0       0.00000           0.0

       hours-per-week  native-country  income
    0        0.397959              39       0
    1        0.122449              39       0
    2        0.397959              39       0
    3        0.397959              39       0
    4        0.397959               5       0
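The normalization step uses the min-max formula (x - min) / (max - min), which maps each column onto the range [0, 1]. scikit-learn's `MinMaxScaler` applies the same formula, and the sketch below checks this on a small frame built from the `age` and `hours-per-week` values of the five rows printed earlier. Because the minimum and maximum here come from only those five rows, the scaled values differ from the ones in the full-dataset output.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Values taken from the first five rows of the dataset shown earlier.
toy = pd.DataFrame({'age': [39, 50, 38, 53, 28],
                    'hours-per-week': [40, 13, 40, 40, 40]})

# Manual min-max scaling, as in the preprocessing code.
manual = toy.apply(lambda x: (x - x.min()) / (x.max() - x.min()))

# The same transformation using scikit-learn's MinMaxScaler.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

print(manual)
print('Same result:', np.allclose(manual.to_numpy(), scaled.to_numpy()))
```

One design consideration: a scaler object can be fitted on the training split only and then reused on the test split, so that test data does not influence the scaling parameters. The walkthrough here scales the full dataset before splitting.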

Training the Gaussian Naive Bayes Model

With our data preprocessed, we can now train the Gaussian Naive Bayes model.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Extract features and labels
X = census_data.drop('income', axis=1)
y = census_data['income']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Train the model
gnb.fit(X_train, y_train)

# Predict the labels for the test set
y_pred = gnb.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
```

**Output:**

Accuracy: 0.8086805200122837

This example shows how to apply Gaussian Naive Bayes to the Census Income dataset. By following these steps, you can use the same approach to predict income levels from employment and demographic characteristics.
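Accuracy alone can be hard to interpret when one class is much more common than the other, which is the case for the Census Income labels (most rows fall in the <=50K class). The sketch below uses synthetic data as a stand-in so that it runs without downloading anything, with an assumed class split of roughly 75/25, and compares Gaussian Naive Bayes against a majority-class baseline. The variable names and the dataset parameters are illustrative assumptions, not part of the walkthrough above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data with an assumed ~75/25 class split.
X_demo, y_demo = make_classification(n_samples=2000, n_features=6,
                                     n_informative=4, n_redundant=0,
                                     weights=[0.75, 0.25], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=42)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
gnb_demo = GaussianNB().fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
gnb_acc = accuracy_score(y_te, gnb_demo.predict(X_te))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, gnb_demo.predict(X_te))

print(f'Baseline accuracy: {baseline_acc:.3f}')
print(f'GaussianNB accuracy: {gnb_acc:.3f}')
print('Confusion matrix:\n', cm)
```

Comparing against the baseline shows how much of the accuracy comes from the class imbalance itself, and the confusion matrix shows how errors are distributed between the two classes.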

Conclusion

In this article, we introduced the Gaussian Naive Bayes classifier and demonstrated its implementation using Scikit-Learn on both a synthetic dataset and the Census Income dataset. With the basics of the algorithm, its key terminology, and the steps above, you can apply Gaussian Naive Bayes to your own projects. This knowledge also serves as a foundation for more advanced machine learning concepts and techniques.