Utilizing Confusion Matrix and Accuracy for Testing Classification Models

Machine Learning-Assisted Exploratory Data Analysis

Kyounggu Yeo | Data Analyst


1. Introduction

We have been tasked with developing a system that can determine whether a cheese has a higher or lower fat content. To accomplish this, we will create a model through a process known as training in machine learning. The main objective of training is to produce a precise model that can accurately answer our questions most of the time. However, to train the model, we need to obtain data that we can use. This is where our journey begins.

We will be collecting data from the Canadian Cheeses dataset, which contains information on various aspects of cheese such as flavor and shape. For our purposes, we will focus on two straightforward factors: the type of milk used and the percentage of moisture content. We hope that by examining these two features alone, we can categorize our cheese samples into higher and lower fat levels. From this point on, we will refer to these features as milk type and moisture. We have gathered data from the following resources for our exploratory data analysis.

Data Source and Overview

The original dataset can be accessed on the Government of Canada's Open Government Portal, under Open Data and Agriculture categories.


The UBC Data Science faculty conducted data wrangling and cleaning on the original dataset and provided us with a modified version of it.

The dataset, called cheese_data, is stored in a .csv file as a table with 13 columns.

Data schema

The columns in the dataset are:

- cheeseId
- ManufacturerProvCode
- ManufacturingTypeEn
- MoisturePercent
- FlavorEn
- CharacteristicsEn
- Organic
- CategoryTypeEn
- MilkTypeEn
- MilkTreatmentTypeEn
- RindTypeEn
- CheeseName
- FatLevel

For our prediction model, we will only utilize three columns: MoisturePercent, MilkTypeEn, and FatLevel.

2. Preprocessing Data through Cleaning

After completing the setup, our initial task in machine learning is to gather data, which is a crucial step as the quality and quantity of the collected data will influence the accuracy of our predictive model. Our focus will be on collecting data regarding the milk type and moisture content for each cheese, which will enable us to create a table consisting of milk type, moisture content, and the cheese's fat content - high or low. This table will serve as our training data.

It's time for us to move on to the next phase of machine learning, which is data preparation. During this stage, we will load our data into a suitable location and prepare it for use in our machine learning training. Our first task will be to gather all of our data and randomize its order. Additionally, we will need to divide the data into two parts: the Train set, which will make up the majority of our dataset and be used to train our model, and the Test set, which will be used to evaluate the performance of our trained model.
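A minimal sketch of this step (the file name cheese_data.csv is an assumption):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the cleaned dataset (file name is an assumption)
cheese_df = pd.read_csv("cheese_data.csv")

# Feature vectors: every column except the target
X = cheese_df.drop(columns=["FatLevel"])
# Target variable: the fat level of each cheese
y = cheese_df["FatLevel"]

# Hold out 30% of the rows for testing; fix the random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)
```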

The code above creates feature vectors and a target variable from the "cheese_df" DataFrame. The feature vectors are stored in a DataFrame called "X", which includes all columns except for "FatLevel". The target variable is stored in a Series called "y", which includes only the "FatLevel" column.

The train_test_split() function is then used to split the data into training and testing sets, with a test size of 0.3 and a random state of 123. The training data is stored in "X_train" and "y_train", while the testing data is stored in "X_test" and "y_test".

3. Analysis of Data through Descriptive Statistics

We'll check to see if there are any noteworthy findings in the X_train dataset.
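A one-line check is enough here:

```python
# Rows and columns in the training set after the split
print(X_train.shape)
```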

The code above prints the shape of the training set feature vectors "X_train", which represents the number of rows and columns in the training set after splitting the data with the train_test_split() function.

The output will be in the form of a tuple (rows, columns).

4. Representation of Data through Visualization

At this point, it would be beneficial to create data visualizations to identify any significant relationships between variables and detect any data imbalances that may exist.
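One way to build such charts is with Altair; the library choice is an assumption based on the interactive tooltips described below. A sketch:

```python
import altair as alt

# Recombine the training features with the target for plotting
plot_df = X_train.assign(FatLevel=y_train)

moisture_chart = (
    alt.Chart(plot_df)
    .mark_circle()
    .encode(
        x=alt.X("MoisturePercent", title="Moisture (%)"),
        y=alt.Y("FatLevel", title="Fat level"),
        color="FatLevel",
        tooltip=["FatLevel"],
    )
    .properties(title="Moisture percentage vs. fat level")
    .interactive()
)
moisture_chart
```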

The chart is a scatter plot that shows the relationship between the moisture percentage and fat level of different types of cheese.

The x-axis represents the moisture percentage, while the y-axis represents the fat level. Each data point is colored based on its fat level, and hovering over a point will display the fat level value in a tooltip.

The chart is interactive, allowing the user to zoom and pan to explore the data. It also includes a title that describes the purpose of the chart.
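The companion chart for milk type follows the same pattern:

```python
# Same pattern, with milk type on the x-axis
milk_chart = (
    alt.Chart(plot_df)
    .mark_circle()
    .encode(
        x=alt.X("MilkTypeEn", title="Milk type"),
        y=alt.Y("FatLevel", title="Fat level"),
        color="FatLevel",
        tooltip=["FatLevel"],
    )
    .properties(title="Milk type vs. fat level")
    .interactive()
)
milk_chart
```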

The chart is a scatter plot that shows the relationship between the type of milk used and the fat level of different types of cheese.

The x-axis represents the type of milk used in the cheese, while the y-axis represents the fat level. Each data point is colored based on its fat level, and hovering over a point will display the fat level value in a tooltip.

5. Assessment of Classification Models in Machine Learning

The subsequent stage in our workflow is to select a suitable model from a variety of options available to researchers and data scientists.

In our case, we have only two features, milk type and moisture percentage, and a categorical target value indicating either "higher fat" or "lower fat", so we can employ a classification model to predict the target.

Consider everyday machine learning tasks such as deciding whether an email is spam or whether a transaction is fraudulent. The output for each scenario is binary, with Yes or No being the possible outcomes. Positive outcomes are typically assigned the value 1, while negative outcomes are represented by 0.

5-1. Dummy Classifier

A dummy classifier is a classifier that makes predictions using simple rules rather than a learned model. It is typically used as a baseline for comparison with more sophisticated models: it shows how well a model can do while ignoring the features entirely, giving us a point of reference for measuring the effectiveness of a more advanced model.
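A minimal sketch of the baseline setup:

```python
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Baseline that always predicts the most frequent class in the training set
dummy_clf = DummyClassifier(strategy="prior")

# 5-fold cross-validation, keeping train scores alongside the validation scores
dummy_scores = pd.DataFrame(
    cross_validate(dummy_clf, X_train, y_train, cv=5, return_train_score=True)
)
dummy_scores
```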

The code above creates a baseline model using the DummyClassifier from scikit-learn with the strategy parameter set to "prior". This strategy always predicts the most frequent class in the training set.

The cross_validate function from scikit-learn is used to calculate cross-validation scores for the baseline model. The function performs 5-fold cross-validation (cv=5) and returns training and testing scores for each fold.

The resulting scores are stored in a pandas DataFrame called dummy_scores. The DataFrame contains columns for the training and testing scores for each fold, as well as columns for the fit and score times.

"The score of our dummy classifier ranges from 65%"

5-2. Random Forest Classifier

A random forest classifier is an ensemble learning method that builds a large number of decision trees and combines their outputs to make more accurate predictions.

The Random Forest Classifier is a useful tool for quickly identifying significant information from large datasets. One of its key advantages is that it aggregates multiple decision trees to arrive at a solution, which can lead to better accuracy compared to other classification algorithms.
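A sketch of the model setup; the one-hot encoding of the milk type column is an assumption, since the trees need numeric inputs:

```python
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical milk type; pass the numeric moisture through
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["MilkTypeEn"]),
    ("passthrough", ["MoisturePercent"]),
)

rf_pipeline = make_pipeline(preprocessor, RandomForestClassifier(random_state=123))

rf_scores = pd.DataFrame(
    cross_validate(rf_pipeline, X_train, y_train, cv=5, return_train_score=True)
)
rf_scores
```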

"The score of our random forest classifier ranges from 81% to 86%."

5-3. Evaluating Classification Models

Consequently, the Random Forest Classifier model outperforms the Dummy Classifier model.

Machine learning algorithms have hyperparameters that can be adjusted to optimize their performance on a particular dataset. RandomizedSearchCV samples a fixed number of parameter settings rather than trying every combination, which keeps the search affordable even when the grid is large; we use it to tune our random forest.
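A sketch of the search; the candidate ranges and n_iter are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# Candidate values for the two hyperparameters we tune (ranges are assumptions)
param_dist = {
    "randomforestclassifier__n_estimators": np.arange(10, 100),
    "randomforestclassifier__max_depth": np.arange(1, 30),
}

random_search = RandomizedSearchCV(
    rf_pipeline,
    param_distributions=param_dist,
    n_iter=10,
    cv=5,
    random_state=123,
    n_jobs=-1,
)
random_search.fit(X_train, y_train)

print(random_search.best_params_)
print(random_search.best_score_)
```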

Hyperparameter tuning selects n_estimators = 21 and max_depth = 21 for the random forest classifier, with a best cross-validation accuracy score of 0.8133.

Once tuning is complete, RandomizedSearchCV gives us the refit pipeline object containing the best combination of hyperparameters.
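For example:

```python
best_model = random_search.best_estimator_

# Score the tuned pipeline on the held-out test set
best_model.score(X_test, y_test)
```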

The final test score on the best model is 0.8025889967637541.

At this point, we have trained a Random Forest Classifier model on the cheese dataset, tuned its hyperparameters using Randomized Search Cross Validation, and evaluated its performance using accuracy score and classification report. We also made predictions on the training and test sets.
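Those last steps could look like the following sketch (variable names are assumptions):

```python
from sklearn.metrics import classification_report

# Predictions on both splits
train_preds = best_model.predict(X_train)
test_preds = best_model.predict(X_test)

print(classification_report(y_test, test_preds))
```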

5-4. Confusion Matrix

A confusion matrix is a table used to evaluate the performance of a machine learning model for classification problems. It summarizes the number of correct and incorrect predictions made by the model on a set of test data.

The matrix consists of four counts: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).

[Figure: confusion matrix layout showing True Positives, False Positives, False Negatives, and True Negatives]
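scikit-learn can compute and plot the matrix for our tuned model directly; a sketch:

```python
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix of the tuned model's predictions on the test set
ConfusionMatrixDisplay.from_estimator(best_model, X_test, y_test)
```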

6. Accuracy

Accuracy is the simplest metric for measuring the performance of a classification model. It answers the question below:

What percentage of predictions did the model get right?

We know that True Positives and True Negatives are the outcomes when the expected result matches the model prediction. Their sum is the total number of correct outcomes. We divide this count by the total number of predictions to get the Accuracy.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Let’s do the calculation for our model:

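With scikit-learn this is a single call (a sketch; test_preds comes from the prediction step in Section 5-3):

```python
from sklearn.metrics import accuracy_score

# Accuracy = (TP + TN) / total number of predictions
accuracy_score(y_test, test_preds)
```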

Our model's accuracy is therefore roughly 80%.

7. Remarks

According to cross-validation, the RandomForestClassifier model performs better than the DummyClassifier model.

The final test score on the best model is 0.8025889967637541.

In addition, a CountVectorizer() transformer could be used in the pipeline with a LogisticRegression model (for example, on text columns such as FlavorEn), which might yield better outcomes than the current models.
