Kyounggu Yeo | Data Analyst
We have been tasked with developing a system that can determine whether a cheese has a higher or lower fat content. To accomplish this, we will build a model through a process known in machine learning as training. The objective of training is to produce a model that answers our question correctly most of the time. But before we can train anything, we need data to train on. This is where our journey begins.
We will be collecting data from the Canadian Cheeses dataset, which contains information on various aspects of cheese such as flavor and shape. For our purposes, we will focus on two straightforward factors: the type of milk used and the percentage of moisture content. We hope that by examining these two features alone, we can categorize our cheese samples into higher and lower fat levels. From this point on, we will refer to these features as milk type and moisture. We have gathered data from the following resources for our exploratory data analysis.
The original dataset can be accessed on the Government of Canada's Open Government Portal, under Open Data and Agriculture categories.
The UBC Data Science faculty conducted data wrangling and cleaning on the original dataset and provided us with a modified version of it.
The dataset, called cheese_data, is stored in a .csv file and is a table with 13 columns: CheeseId, ManufacturerProvCode, ManufacturingTypeEn, MoisturePercent, FlavourEn, CharacteristicsEn, Organic, CategoryTypeEn, MilkTypeEn, MilkTreatmentTypeEn, RindTypeEn, CheeseName, and FatLevel.
For our prediction model, we will only utilize three of these columns: MoisturePercent, MilkTypeEn, and FatLevel.
# Import the necessary libraries for EDA and modelling
import pandas as pd
import altair as alt
import numpy as np
from scipy.stats import lognorm, loguniform, randint
from sklearn import tree
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, make_scorer,
                             precision_score, recall_score)
# plot_confusion_matrix was removed in scikit-learn 1.2; on newer versions,
# use sklearn.metrics.ConfusionMatrixDisplay.from_estimator instead
from sklearn.metrics import plot_confusion_matrix
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_validate, train_test_split)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (FunctionTransformer, Normalizer, OneHotEncoder,
                                   StandardScaler, normalize, scale)
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier
# Import the cheese data file
cheese = pd.read_csv("data/cheese_data.csv")
After completing the setup, our initial task in machine learning is to gather data, which is a crucial step as the quality and quantity of the collected data will influence the accuracy of our predictive model. Our focus will be on collecting data regarding the milk type and moisture content for each cheese, which will enable us to create a table consisting of milk type, moisture content, and the cheese's fat content - high or low. This table will serve as our training data.
# Obtain data from the cheese dataset
cheese_df = cheese.drop(columns=['CheeseId', 'Organic', 'ManufacturerProvCode', 'ManufacturingTypeEn', 'FlavourEn', 'CharacteristicsEn', 'CategoryTypeEn', 'MilkTreatmentTypeEn', 'RindTypeEn', 'CheeseName']).dropna()
# Display the first few rows of the obtained data
cheese_df.head()
| | MoisturePercent | MilkTypeEn | FatLevel |
|---|---|---|---|
| 0 | 47.0 | Ewe | lower fat |
| 1 | 47.9 | Cow | lower fat |
| 2 | 54.0 | Cow | lower fat |
| 3 | 47.0 | Cow | lower fat |
| 4 | 49.4 | Cow | lower fat |
# Display the dimension of the cheese data
cheese_df.shape
(1027, 3)
It's time for us to move on to the next phase of machine learning: data preparation. During this stage we get the data into a form suitable for training. Our first task is to shuffle the data, then divide it into two parts: the train set, which makes up the majority of the dataset and is used to train the model, and the test set, which is used to evaluate the trained model's performance.
# Create feature vectors and target variable
X = cheese_df.drop(columns=["FatLevel"])
y = cheese_df["FatLevel"]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
# Print the head of the training set
X_train.head()
| | MoisturePercent | MilkTypeEn |
|---|---|---|
| 687 | 54.0 | Cow |
| 885 | 55.0 | Goat |
| 861 | 46.0 | Cow |
| 967 | 39.0 | Cow |
| 940 | 42.0 | Goat |
The code above creates feature vectors and a target variable from the "cheese_df" DataFrame. The feature vectors are stored in a DataFrame called "X", which includes all columns except for "FatLevel". The target variable is stored in a Series called "y", which includes only the "FatLevel" column.
The train_test_split() function is then used to split the data into training and testing sets, with a test size of 0.3 and a random state of 123. The training data is stored in "X_train" and "y_train", while the testing data is stored in "X_test" and "y_test".
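Since the two fat levels turn out not to be equally common (as we will see below), it can be worth stratifying the split so that both sets keep the same class proportions. A minimal sketch of that alternative, not used in the rest of this analysis:
# A stratified alternative to the split above (illustrative only)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)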
# Print the shape of the training set
X_train.shape
(718, 2)
We'll check to see if there are any noteworthy findings in the X_train dataset.
# Generate summary statistics for the dataframe
X_train.describe()
| | MoisturePercent |
|---|---|
| count | 718.000000 |
| mean | 47.083705 |
| std | 9.785916 |
| min | 20.000000 |
| 25% | 40.000000 |
| 50% | 46.000000 |
| 75% | 52.000000 |
| max | 92.000000 |
# Display information about the dataframe
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 718 entries, 687 to 111
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   MoisturePercent  718 non-null    float64
 1   MilkTypeEn       718 non-null    object
dtypes: float64(1), object(1)
memory usage: 16.8+ KB
The summary statistics show that the moisture content in the training set ranges from 20% to 92%, with a mean of about 47%.
The info() output confirms that the 718 training rows contain no missing values: one numeric column (MoisturePercent) and one categorical column (MilkTypeEn).
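Before visualizing, we can also quantify how balanced the target classes are in the training set:
# Check the class balance of the target in the training set
y_train.value_counts(normalize=True)
About 65% of the training cheeses are lower fat, a proportion worth keeping in mind when we build the baseline model below.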
At this point, it would be beneficial to create data visualizations to identify any significant relationships between variables and detect any data imbalances that may exist.
# Create a data visualization chart
cheese_plot1 = alt.Chart(cheese_df, width=500, height=300).mark_point().encode(
x='MoisturePercent',
y='FatLevel',
color='FatLevel',
tooltip=['FatLevel']
).interactive().properties(title='The relationship between moisture and fat level')
# Display the chart
cheese_plot1
The chart is a scatter plot that shows the relationship between the moisture percentage and fat level of different types of cheese.
The x-axis represents the moisture percentage, while the y-axis represents the fat level. Each data point is colored based on its fat level, and hovering over a point will display the fat level value in a tooltip.
The chart is interactive, allowing the user to zoom and pan to explore the data. It also includes a title that describes the purpose of the chart.
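Since interactive Altair charts do not render in a static report, a chart can also be exported as a standalone HTML file (the file name here is arbitrary):
# Save the interactive chart as a standalone HTML file
cheese_plot1.save('moisture_vs_fat.html')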
# Create a supporting data visualization chart
cheese_plot2 = alt.Chart(cheese_df, width=500, height=300).mark_point().encode(
x='MilkTypeEn',
y='FatLevel',
color='FatLevel',
tooltip=['FatLevel']
).interactive().properties(title='The relationship between milk type and fat level')
# Display the chart
cheese_plot2
The chart is a scatter plot that shows the relationship between the type of milk used and the fat level of different types of cheese.
The x-axis represents the type of milk used in the cheese, while the y-axis represents the fat level. Each data point is colored based on its fat level, and hovering over a point will display the fat level value in a tooltip.
The subsequent stage in our workflow is to select a suitable model from a variety of options available to researchers and data scientists.
In our case, since we have only two features, milk type and moisture percentage, and a categorical predicted (target) value indicating either "higher fat" or "lower fat", we can employ a classification model to predict the target value.
Our task is therefore binary classification: every cheese is labelled either "higher fat" or "lower fat". In binary problems, the positive outcome is typically assigned the value of 1, while the negative outcome is represented by 0.
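As a quick illustration of that convention (not required here, since scikit-learn works with the string labels directly):
# Illustrative only: encode 'higher fat' as the positive class (1)
y_train_binary = (y_train == 'higher fat').astype(int)
y_train_binary.head()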
# Identify the numeric and categorical columns
numeric_feats = ["MoisturePercent"]
categorical_feats = ["MilkTypeEn"]
# Preprocessing for numerical data: Impute missing values with median and scale features
numeric_transformer = make_pipeline(SimpleImputer(strategy='median'),
StandardScaler()
)
# Preprocessing for categorical data: impute missing values with the most frequent
# category, then one-hot encode (note: fill_value is only used when
# strategy='constant', so it has no effect here)
categorical_transformer = make_pipeline(SimpleImputer(strategy='most_frequent', fill_value='missing'),
OneHotEncoder(handle_unknown="ignore")
)
# Combine preprocessing steps using ColumnTransformer
preprocessor = make_column_transformer(
(numeric_transformer, numeric_feats),
(categorical_transformer, categorical_feats),
remainder="passthrough")
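As a sanity check, the preprocessor can be fitted on its own to confirm the shape of the transformed feature matrix: one scaled numeric column plus one one-hot column per milk type present in the training data (a sketch; the exact column count depends on which milk types appear in the split):
# Fit the preprocessor alone and inspect the transformed feature matrix
preprocessor.fit_transform(X_train).shape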
A dummy classifier is a type of classifier that makes predictions using a simple rule rather than learning from the features. It is typically used as a baseline for comparison: any more sophisticated model should at least beat a classifier that, for example, always predicts the most frequent class.
A dummy classifier therefore serves as a point of reference to measure the effectiveness of a more advanced model.
# Create the baseline model
dummy_clf = DummyClassifier(strategy="prior")
# Calculate cross-validation scores for the baseline model.
dummy_scores = pd.DataFrame(cross_validate(dummy_clf, X_train, y_train, cv=5, return_train_score=True))
dummy_scores
| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.004391 | 0.004534 | 0.652778 | 0.651568 |
| 1 | 0.002260 | 0.002185 | 0.652778 | 0.651568 |
| 2 | 0.002279 | 0.001087 | 0.652778 | 0.651568 |
| 3 | 0.003698 | 0.005693 | 0.650350 | 0.652174 |
| 4 | 0.002368 | 0.001239 | 0.650350 | 0.652174 |
The code above creates a baseline model using the DummyClassifier from scikit-learn with the strategy parameter set to "prior". This strategy always predicts the most frequent class in the training set.
The cross_validate function from scikit-learn is used to calculate cross-validation scores for the baseline model. The function performs 5-fold cross-validation (cv=5) and returns training and testing scores for each fold.
The resulting scores are stored in a pandas DataFrame called dummy_scores. The DataFrame contains columns for the training and testing scores for each fold, as well as columns for the fit and score times.
mean_dummy_training_score = dummy_scores['train_score'].mean()
mean_dummy_cv_score = dummy_scores['test_score'].mean()
mean_dummy_scores = pd.DataFrame({'Mean': ['training_score', 'cv_score'],
'Scores': [mean_dummy_training_score, mean_dummy_cv_score]
})
mean_dummy_scores
| | Mean | Scores |
|---|---|---|
| 0 | training_score | 0.651810 |
| 1 | cv_score | 0.651807 |
"The score of our dummy classifier ranges from 65%"
A random forest classifier is an ensemble learning method that builds a large number of decision trees and combines their predictions.
Aggregating many trees typically yields better accuracy and robustness than a single decision tree, which makes the random forest a good general-purpose classifier for tabular data like ours.
# Create the main pipeline with preprocessor and Random Forest Classifier
main_pipe = make_pipeline(preprocessor, RandomForestClassifier(random_state=77, class_weight='balanced'))
# Calculate the scores of the main pipeline using cross-validation
scores_df = pd.DataFrame(cross_validate(main_pipe, X_train, y_train, cv=5, return_train_score=True))
scores_df
| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.333828 | 0.036371 | 0.833333 | 0.862369 |
| 1 | 0.252244 | 0.022920 | 0.805556 | 0.865854 |
| 2 | 0.247132 | 0.022925 | 0.833333 | 0.871080 |
| 3 | 0.256660 | 0.023254 | 0.811189 | 0.866087 |
| 4 | 0.246029 | 0.034455 | 0.783217 | 0.871304 |
mean_rfc_training_score = scores_df['train_score'].mean()
mean_rfc_cv_score = scores_df['test_score'].mean()
mean_rfc_scores = pd.DataFrame({'Mean': ['training_score', 'cv_score'],
'Scores': [mean_rfc_training_score, mean_rfc_cv_score]
})
mean_rfc_scores
| | Mean | Scores |
|---|---|---|
| 0 | training_score | 0.867339 |
| 1 | cv_score | 0.813326 |
"The score of our random forest classifier ranges from 81% to 86%."
Consequently, the Random Forest Classifier model outperforms the Dummy Classifier model.
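For convenience, the two models' mean scores can be placed side by side, reusing the DataFrames built above:
# Combine the mean-score tables of both models for comparison
pd.concat([mean_dummy_scores.assign(Model='Dummy'),
           mean_rfc_scores.assign(Model='RandomForest')],
          ignore_index=True)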
Machine learning algorithms have hyperparameters that can be adjusted to optimize their performance on a particular dataset. RandomizedSearchCV samples a fixed number of candidate settings rather than trying every combination, which keeps the search cheap; for a small one-dimensional grid like ours it behaves much like an exhaustive grid search.
# Tuning the hyperparameters
param_grid = {
"randomforestclassifier__max_depth": range(1,101,10)
}
depth_search = RandomizedSearchCV(main_pipe, param_grid, cv=5, n_iter=5, return_train_score=True, random_state=77, verbose=2)
depth_search.fit(X_train, y_train)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV] END ...............randomforestclassifier__max_depth=21; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=21; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=21; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=21; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=21; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=11; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=11; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=11; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=11; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=11; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=91; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=91; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=91; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=91; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=91; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=61; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=61; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=61; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=61; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=61; total time= 0.3s
[CV] END ................randomforestclassifier__max_depth=1; total time= 0.2s
[CV] END ................randomforestclassifier__max_depth=1; total time= 0.2s
[CV] END ................randomforestclassifier__max_depth=1; total time= 0.2s
[CV] END ................randomforestclassifier__max_depth=1; total time= 0.3s
[CV] END ................randomforestclassifier__max_depth=1; total time= 0.2s
RandomizedSearchCV(cv=5,
estimator=Pipeline(steps=[('columntransformer',
ColumnTransformer(remainder='passthrough',
transformers=[('pipeline-1',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='median')),
('standardscaler',
StandardScaler())]),
['MoisturePercent']),
('pipeline-2',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='most_frequent')),
('onehotencoder',
OneHotEncoder(handle_unknown='ignore'))]),
['MilkTypeEn'])])),
('randomforestclassifier',
RandomForestClassifier(class_weight='balanced',
random_state=77))]),
n_iter=5,
param_distributions={'randomforestclassifier__max_depth': range(1, 101, 10)},
random_state=77, return_train_score=True, verbose=2)
# Store the grid search results in a dataframe
grid_results = pd.DataFrame(depth_search.cv_results_, columns=['mean_test_score', 'param_randomforestclassifier__max_depth', 'mean_fit_time', 'rank_test_score'])
grid_results = grid_results.sort_values(by='rank_test_score')
grid_results
| | mean_test_score | param_randomforestclassifier__max_depth | mean_fit_time | rank_test_score |
|---|---|---|---|---|
| 0 | 0.813326 | 21 | 0.280368 | 1 |
| 1 | 0.813326 | 11 | 0.275888 | 1 |
| 2 | 0.813326 | 91 | 0.263958 | 1 |
| 3 | 0.813326 | 61 | 0.262938 | 1 |
| 4 | 0.806352 | 1 | 0.200702 | 5 |
The optimal value found for the max_depth hyperparameter is 21, although depths 11, 61, and 91 tie with the same cross-validation score.
# Find the best parameters and scores
best_parameters = depth_search.best_params_
best_score = depth_search.best_score_
print("Best parameters:", best_parameters)
print("Best score:", best_score)
Best parameters: {'randomforestclassifier__max_depth': 21}
Best score: 0.8133255633255633
Based on the hyperparameter tuning, the best max_depth value for the random forest classifier is 21, with a best cross-validation accuracy of about 0.813.
# Find the best model
best_model = depth_search.best_estimator_
best_model
Pipeline(steps=[('columntransformer',
ColumnTransformer(remainder='passthrough',
transformers=[('pipeline-1',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='median')),
('standardscaler',
StandardScaler())]),
['MoisturePercent']),
('pipeline-2',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='missing',
strategy='most_frequent')),
('onehotencoder',
OneHotEncoder(handle_unknown='ignore'))]),
['MilkTypeEn'])])),
('randomforestclassifier',
RandomForestClassifier(class_weight='balanced', max_depth=21,
random_state=77))])
Upon completion of hyperparameter tuning, best_estimator_ gives us the pipeline refit on the full training set with the best combination of hyperparameters.
# Print classification report
report = classification_report(y_test, best_model.predict(X_test))
print(report)
              precision    recall  f1-score   support

  higher fat       0.74      0.65      0.69       105
   lower fat       0.83      0.88      0.86       204

    accuracy                           0.80       309
   macro avg       0.78      0.76      0.77       309
weighted avg       0.80      0.80      0.80       309
# Evaluate the accuracy score of the best model on the test set
test_score = best_model.score(X_test, y_test)
test_score
0.8025889967637541
The best model scores about 0.803 (80.3%) on the held-out test set.
# Make predictions using the best model
tr_pred = depth_search.predict(X_train)
ts_pred = depth_search.predict(X_test)
At this point, we have trained a Random Forest Classifier model on the cheese dataset, tuned its hyperparameters using Randomized Search Cross Validation, and evaluated its performance using accuracy score and classification report. We also made predictions on the training and test sets.
# Compute predicted class probabilities with the best model
tr_prob = best_model.predict_proba(X_train)
ts_prob = best_model.predict_proba(X_test)
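The probabilities can be inspected directly; each row sums to 1, with one column per class:
# Peek at the predicted probabilities for the first few test cheeses
pd.DataFrame(ts_prob, columns=best_model.classes_).head()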
A confusion matrix is a table used to evaluate the performance of a machine learning model for classification problems. It summarizes the number of correct and incorrect predictions made by the model on a set of test data.
The matrix consists of four terms: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN).
# Plot the confusion matrix for the best model on the test set
plot_confusion_matrix(best_model, X_test, y_test, values_format="d", cmap="Blues")
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f96427fc370>
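Note that plot_confusion_matrix was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions, ConfusionMatrixDisplay.from_estimator(best_model, X_test, y_test) produces the same plot. The raw counts can also be obtained without any plotting:
# Raw confusion-matrix counts (rows = true labels, columns = predictions,
# ordered as in best_model.classes_)
confusion_matrix(y_test, ts_pred, labels=best_model.classes_)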
Accuracy is the simplest metric for measuring the performance of a classification model. It answers the following question:
What percentage of predictions did the model get right?
We know that True Positives and True Negatives are the outcomes when the expected result matches the model prediction. Their sum is the total number of correct outcomes. We divide this count by the total number of predictions to get the Accuracy.
Let’s do the calculation for our model:
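From the classification report, recall × support gives the number of correct predictions per class: roughly 0.65 × 105 ≈ 68 correct "higher fat" predictions and 0.88 × 204 ≈ 180 correct "lower fat" predictions (approximate, since the report rounds to two decimals).
# Correct predictions (counts inferred from the report) over total test samples
(68 + 180) / 309  # ≈ 0.8026, matching best_model.score(X_test, y_test)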
Our model's accuracy can be stated as 80%.
According to cross-validation, the RandomForestClassifier model performs better than the DummyClassifier model (about 81% versus 65% mean validation accuracy).
On the held-out test set, the tuned model reaches about 80% accuracy.
As a possible next step, text columns from the original dataset such as FlavourEn or CharacteristicsEn could be encoded with the CountVectorizer() transformer and combined with a LogisticRegression model, to see whether they improve on the current results.
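A minimal sketch of that extension, assuming the FlavourEn text column is kept in the feature table instead of being dropped during data selection, and that its missing values are filled first (CountVectorizer cannot handle NaN):
# Hypothetical pipeline: add a bag-of-words encoding of the FlavourEn column
text_preprocessor = make_column_transformer(
    (numeric_transformer, numeric_feats),
    (categorical_transformer, categorical_feats),
    (CountVectorizer(), 'FlavourEn'),  # text columns are passed by name, not in a list
)
lr_pipe = make_pipeline(text_preprocessor, LogisticRegression(max_iter=1000))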