Machine Learning Pipeline for Multi-Class Text Classification

A Machine Learning (ML) pipeline is a sequence of steps that orchestrates the flow of data from preprocessing through model training to prediction. This paper presents the development of an ML pipeline based on Natural Language Processing (NLP) for multi-class text classification using the 20 newsgroups text dataset. The study compared the performance of six classifiers in Google Colab: Multinomial Naïve Bayes (MNB), Logistic Regression (LR), K Nearest Neighbors (KNN), Random Forest (RF), eXtreme Gradient Boosting (XGB), and Stochastic Gradient Descent (SGD). Experimental results show that the TF-IDF Vectorizer outperformed the Count Vectorizer in most cases. KNN had the lowest performance in most cases. MNB and SGD performed best, with accuracies of 76% and 74% and computation times of 10min 14s and 1h 28min 21s, respectively. The study suggests that accuracy could be improved using a hybrid model or a deep learning approach.
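The vectorizer-plus-classifier structure described above can be sketched as follows. This is a minimal illustration, not the study's actual code. It assumes scikit-learn, which the abstract does not name. The helper `build_pipeline`, the step names, and the six-sentence toy corpus are hypothetical stand-ins; the study itself used the 20 newsgroups dataset, which scikit-learn can load with `fetch_20newsgroups`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_pipeline(vectorizer, classifier):
    """Chain a text vectorizer and a classifier into a single estimator."""
    return Pipeline([("vectorizer", vectorizer), ("classifier", classifier)])


# Hypothetical toy corpus standing in for the 20 newsgroups data.
train_texts = [
    "the rocket launch reached orbit",
    "nasa plans a new space mission",
    "the goalie made a great save in the hockey game",
    "the team won the hockey playoff",
    "the new graphics card renders images fast",
    "image rendering with 3d graphics software",
]
train_labels = ["space", "space", "hockey", "hockey", "graphics", "graphics"]
test_texts = ["space mission to orbit", "hockey game tonight", "graphics rendering"]

# Vectorizer/classifier pairings, mirroring three of the combinations compared.
candidates = {
    "count+mnb": build_pipeline(CountVectorizer(), MultinomialNB()),
    "tfidf+mnb": build_pipeline(TfidfVectorizer(), MultinomialNB()),
    "tfidf+sgd": build_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0)),
}

# Fit each pipeline on raw text and collect its predictions for the test texts.
predictions = {}
for name, pipeline in candidates.items():
    pipeline.fit(train_texts, train_labels)
    predictions[name] = list(pipeline.predict(test_texts))
```

Because each pipeline takes raw text as input, swapping the vectorizer or the classifier is a one-line change, which is what makes a side-by-side comparison of this kind straightforward to set up.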