Generate an Explainability Report with the Automobile dataset using Contextual AI — Contextual AI documentation (original) (raw)

Contextual AI

This notebook demonstrates how to generate an explainability report using the Contextual AI compiler.

Motivation

When it comes to putting your machine learning model into production, one may need to consider the following:

  1. How can you compare newly trained models with previous versions?
  2. How can you present your new model’s results to non-technical users (e.g. managers, business developers) in an intuitive manner?
  3. How can you establish trust between your machine learning model and its end-users?

The Contextual AI project attempts to address these questions by providing interpretability and explanations throughout the lifecycle of a ML solution, including: 1. The data itself, allowing for robust data validation and intuitive visualisations 2. The training/validation processes of model, allowing users to understand how well the model trained and which features were important 3. The predictions, allowing for users to understand why a model exhibited particular behavior

For more details, please refer to our whitepaper

Steps

  1. Train a model on the automobile dataset
  2. Evaluate the model’s performance with Contextual AI report

1. Performance Model Training

import numpy as np import pandas as pd import warnings

1.1 Loading Data

data = pd.read_csv("Automobile_data.csv") data.describe

<bound method NDFrame.describe of symboling normalized-losses make fuel-type aspiration
0 3 ? alfa-romero gas std 1 3 ? alfa-romero gas std 2 1 ? alfa-romero gas std 3 2 164 audi gas std 4 2 164 audi gas std .. ... ... ... ... ... 200 -1 95 volvo gas std 201 -1 95 volvo gas turbo 202 -1 95 volvo gas std 203 -1 95 volvo diesel turbo 204 -1 95 volvo gas turbo

num-of-doors   body-style drive-wheels engine-location  wheel-base  ...  \

0 two convertible rwd front 88.6 ... 1 two convertible rwd front 88.6 ... 2 two hatchback rwd front 94.5 ... 3 four sedan fwd front 99.8 ... 4 four sedan 4wd front 99.4 ... .. ... ... ... ... ... ... 200 four sedan rwd front 109.1 ... 201 four sedan rwd front 109.1 ... 202 four sedan rwd front 109.1 ... 203 four sedan rwd front 109.1 ... 204 four sedan rwd front 109.1 ...

 engine-size  fuel-system  bore  stroke compression-ratio horsepower  \

0 130 mpfi 3.47 2.68 9.0 111 1 130 mpfi 3.47 2.68 9.0 111 2 152 mpfi 2.68 3.47 9.0 154 3 109 mpfi 3.19 3.4 10.0 102 4 136 mpfi 3.19 3.4 8.0 115 .. ... ... ... ... ... ... 200 141 mpfi 3.78 3.15 9.5 114 201 141 mpfi 3.78 3.15 8.7 160 202 173 mpfi 3.58 2.87 8.8 134 203 145 idi 3.01 3.4 23.0 106 204 141 mpfi 3.78 3.15 9.5 114

 peak-rpm city-mpg highway-mpg  price

0 5000 21 27 13495 1 5000 21 27 16500 2 5000 19 26 16500 3 5500 24 30 13950 4 5500 18 22 17450 .. ... ... ... ... 200 5400 23 28 16845 201 5300 19 25 19045 202 5500 18 23 21485 203 4800 26 27 22470 204 5400 19 25 22625

[205 rows x 26 columns]>

1.2 Quick Check

symboling 0 normalized-losses 0 make 0 fuel-type 0 aspiration 0 num-of-doors 0 body-style 0 drive-wheels 0 engine-location 0 wheel-base 0 length 0 width 0 height 0 curb-weight 0 engine-type 0 num-of-cylinders 0 engine-size 0 fuel-system 0 bore 0 stroke 0 compression-ratio 0 horsepower 0 peak-rpm 0 city-mpg 0 highway-mpg 0 price 0 dtype: int64

1.3 Feature quantity engineering

#normalized-losses a=data[data['normalized-losses']!='?'] b=(a['normalized-losses'].astype(int)).mean() data['normalized-losses']=data['normalized-losses'].replace('?',b).astype(int)

#num-of-doors a=data[data['body-style']=='sedan'] a['num-of-doors'].value_counts() a=data['num-of-doors'].map({'two':2,'four':4,'?':4}) data['num-of-doors']=a

#price a=data[data['price']!='?'] b=(a['price'].astype(int)).mean() data['price']=data['price'].replace('?',b).astype(int)

#horse power a=data[data['horsepower']!='?'] b=(a['horsepower'].astype(int)).mean() data['horsepower']=data['horsepower'].replace('?',b).astype(int)

#bore a=data[data['bore']!='?'] b=(a['bore'].astype(float)).mean() data['bore']=data['bore'].replace('?',b).astype(float)

#stroke a=data[data['stroke']!='?'] b=(a['stroke'].astype(float)).mean() data['stroke']=data['stroke'].replace('?',b).astype(float)

#peak-rpm a=data[data['peak-rpm']!='?'] b=(a['peak-rpm'].astype(float)).mean() data['peak-rpm']=data['peak-rpm'].replace('?',b).astype(float)

#num-of-cylinders a=data['num-of-cylinders'].map({'four':4,'five':5,'six':6,'?':4}) data['num-of-doors']=a

data.describe()

symboling normalized-losses num-of-doors wheel-base length width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
count 205.000000 205.000000 194.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000
mean 0.834146 122.000000 4.304124 98.756585 174.049268 65.907805 53.724878 2555.565854 126.907317 3.329751 3.255423 10.142537 104.253659 5125.369458 25.219512 30.751220 13207.126829
std 1.245307 31.681008 0.679289 6.021776 12.337289 2.145204 2.443522 520.680204 41.642693 0.270844 0.313597 3.972040 39.519219 476.979093 6.542142 6.886443 7868.768212
min -2.000000 65.000000 4.000000 86.600000 141.100000 60.300000 47.800000 1488.000000 61.000000 2.540000 2.070000 7.000000 48.000000 4150.000000 13.000000 16.000000 5118.000000
25% 0.000000 101.000000 4.000000 94.500000 166.300000 64.100000 52.000000 2145.000000 97.000000 3.150000 3.110000 8.600000 70.000000 4800.000000 19.000000 25.000000 7788.000000
50% 1.000000 122.000000 4.000000 97.000000 173.200000 65.500000 54.100000 2414.000000 120.000000 3.310000 3.290000 9.000000 95.000000 5200.000000 24.000000 30.000000 10595.000000
75% 2.000000 137.000000 4.000000 102.400000 183.100000 66.900000 55.500000 2935.000000 141.000000 3.580000 3.410000 9.400000 116.000000 5500.000000 30.000000 34.000000 16500.000000
max 3.000000 256.000000 6.000000 120.900000 208.100000 72.300000 59.800000 4066.000000 326.000000 3.940000 4.170000 23.000000 288.000000 6600.000000 49.000000 54.000000 45400.000000

1.4 Understanding the range of important features¶

import matplotlib.pyplot as plt plt.style.use('seaborn') import seaborn as sns

data[['engine-size','peak-rpm','curb-weight','horsepower','price','highway-mpg']].hist(figsize=(10,8),bins=6,color='g',linewidth='3',edgecolor='k') plt.tight_layout()

1.5 Getting the Price of car based on make

plt.rcParams['figure.figsize']=(23,10) ax = sns.boxplot(x="make", y="price", data=data,width=0.8,linewidth=5) ax.set_xlabel('Make of Car',fontsize=30) ax.set_ylabel('Price in $',fontsize=30) plt.title('Price of Car Based on Make',fontsize=40) ax.tick_params(axis='x',labelsize=20,rotation=90)

../../_images/tutorials_compiler_tutorial_automobile_13_0.png

1.6 Data split

from sklearn.model_selection import train_test_split

x = data[['curb-weight','engine-size','horsepower','width']] y = data['price']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

x_train.to_csv("train_data.csv", index=False) x_train.head()

curb-weight engine-size horsepower width
165 2265 98 112 64.0
41 2465 110 101 65.2
120 1967 90 68 63.8
161 2122 98 70 64.4
140 2240 108 73 63.8

1.7 Model Training

import pickle from sklearn.tree import DecisionTreeClassifier model = DecisionTreeClassifier() model.fit(x_train,y_train)

pkl = open('model.pkl', 'wb') pickle.dump(model, pkl) model = None

1.5 ML load model and evaluation

model_pkl = open('model.pkl', 'rb') model = pickle.load(model_pkl)

accuracy = round(model.score(x_train, y_train) * 100, 2) print("Model Accuracy: ", accuracy)


2. Involve Contextual AI complier

import os import sys sys.path.append('../../../') from xai.compiler.base import Configuration, Controller

2.1 Specify config file

json_config = 'basic-report.json'

2.2 Initial compiler controller with config

controller = Controller(config=Configuration(json_config)) print(controller.config)

{'name': 'Report for Automobile Dataset', 'overview': True, 'content_table': True, 'contents': [{'title': 'Feature Importance Analysis', 'desc': 'This section provides the analysis on feature', 'sections': [{'title': 'Feature Importance Ranking', 'component': {'_comment': 'refer to document section xxxx', 'class': 'FeatureImportanceRanking', 'attr': {'trained_model': 'model.pkl', 'train_data': 'train_data.csv'}}}]}, {'title': 'Data Statistics Analysis', 'desc': 'This section provides the analysis on data', 'sections': [{'title': 'Simple Data Statistic', 'component': {'_comment': 'refer to document section xxxx', 'class': 'DataStatisticsAnalysis', 'attr': {'data': 'Automobile_data.csv'}}}]}], 'writers': [{'class': 'Pdf', 'attr': {'name': 'automobile-basic-report'}}]}

2.2 Finally compiler render

../../../xai/data/helper.py:189: UserWarning: Warning: the feature [price] is suspected to be identifiable feature. [Examples]: ['13495', '16500', '16500', '13950', '17450']

'[Examples]: %s\n' % (column, col_data.tolist()[:5]))


Result

print("report generated : %s/automobile-basic-report.pdf" % os.getcwd())

report generated : /Users/i062308/Development/Explainable_AI/tutorials/compiler/automobile/automobile-basic-report.pdf