Kyounggu Yeo | Data Analyst
We have been tasked with developing a system that can determine whether a cheese has a higher or lower fat content. To accomplish this, we will create a model through a process known as training in machine learning. The main objective of training is to produce a precise model that can accurately answer our questions most of the time. However, to train the model, we need to obtain data that we can use. This is where our journey begins.
We will be collecting data from the Canadian Cheeses dataset, which contains information on various aspects of cheese such as flavor and shape. For our purposes, we will focus on two straightforward factors: the type of milk used and the percentage of moisture content. We hope that by examining these two features alone, we can categorize our cheese samples into higher and lower fat levels. From this point on, we will refer to these features as milk type and moisture. We have gathered data from the following resources for our exploratory data analysis.
The original dataset can be accessed on the Government of Canada's Open Government Portal, under Open Data and Agriculture categories.
The UBC Data Science faculty conducted data wrangling and cleaning on the original dataset and provided us with a modified version of it.
The dataset, called cheese_data, is a table with 13 columns, stored in a .csv file.
The columns in the dataset are: CheeseId, ManufacturerProvCode, ManufacturingTypeEn, MoisturePercent, FlavourEn, CharacteristicsEn, Organic, CategoryTypeEn, MilkTypeEn, MilkTreatmentTypeEn, RindTypeEn, CheeseName, FatLevel.
For our prediction model, we will only utilize three columns: MoisturePercent, MilkTypeEn, and FatLevel.
# Import the necessary libraries for EDA
import pandas as pd
import altair as alt
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC, SVR
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import (FunctionTransformer, Normalizer, OneHotEncoder, StandardScaler, normalize, scale)
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import plot_confusion_matrix  # removed in scikit-learn 1.2; newer releases provide ConfusionMatrixDisplay.from_estimator
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import make_scorer
from scipy.stats import lognorm, loguniform, randint
# Import the cheese data file
cheese = pd.read_csv("data/cheese_data.csv")
After completing the setup, our initial task in machine learning is to gather data, which is a crucial step as the quality and quantity of the collected data will influence the accuracy of our predictive model. Our focus will be on collecting data regarding the milk type and moisture content for each cheese, which will enable us to create a table consisting of milk type, moisture content, and the cheese's fat level (higher or lower). This table will serve as our training data.
# Obtain data from the cheese dataset
cheese_df = cheese.drop(columns=['CheeseId', 'Organic', 'ManufacturerProvCode', 'ManufacturingTypeEn', 'FlavourEn', 'CharacteristicsEn', 'CategoryTypeEn', 'MilkTreatmentTypeEn', 'RindTypeEn', 'CheeseName']).dropna()
# Display the first few rows of the obtained data
cheese_df.head()
| | MoisturePercent | MilkTypeEn | FatLevel |
|---|---|---|---|
| 0 | 47.0 | Ewe | lower fat |
| 1 | 47.9 | Cow | lower fat |
| 2 | 54.0 | Cow | lower fat |
| 3 | 47.0 | Cow | lower fat |
| 4 | 49.4 | Cow | lower fat |
# Display the dimension of the cheese data
cheese_df.shape
(1027, 3)
It's time for us to move on to the next phase of machine learning, which is data preparation. During this stage, we will load our data into a suitable location and prepare it for use in our machine learning training. Our first task will be to gather all of our data and randomize its order. Additionally, we will need to divide the data into two parts: the Train set, which will make up the majority of our dataset and be used to train our model, and the Test set, which will be used to evaluate the performance of our trained model.
# Create feature vectors and target variable
X = cheese_df.drop(columns=["FatLevel"])
y = cheese_df["FatLevel"]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
# Print the head of the training set
X_train.head()
| | MoisturePercent | MilkTypeEn |
|---|---|---|
| 687 | 54.0 | Cow |
| 885 | 55.0 | Goat |
| 861 | 46.0 | Cow |
| 967 | 39.0 | Cow |
| 940 | 42.0 | Goat |
The code above creates feature vectors and a target variable from the "cheese_df" DataFrame. The feature vectors are stored in a DataFrame called "X", which includes all columns except for "FatLevel". The target variable is stored in a Series called "y", which includes only the "FatLevel" column.
The train_test_split() function is then used to split the data into training and testing sets, with a test size of 0.3 and a random state of 123. The training data is stored in "X_train" and "y_train", while the testing data is stored in "X_test" and "y_test".
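To make the 70/30 split concrete, here is a minimal sketch that applies the same test_size and random_state to a stand-in list of 1027 row numbers (the list is a placeholder for the cleaned cheese rows, not the real data). It only illustrates how many rows land on each side.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 1027 cleaned cheese rows; only the row count matters here.
rows = list(range(1027))

train_rows, test_rows = train_test_split(rows, test_size=0.3, random_state=123)

# The test share is rounded up, and the remaining rows go to training.
print(len(train_rows), len(test_rows))
```

With 1027 rows this gives 718 training rows and 309 test rows, which matches the shapes reported in this notebook.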
# Print the shape of the training set
X_train.shape
(718, 2)
We'll check to see if there are any noteworthy findings in the X_train dataset.
# Generate summary statistics for the dataframe
X_train.describe()
| | MoisturePercent |
|---|---|
| count | 718.000000 |
| mean | 47.083705 |
| std | 9.785916 |
| min | 20.000000 |
| 25% | 40.000000 |
| 50% | 46.000000 |
| 75% | 52.000000 |
| max | 92.000000 |
# Display information about the dataframe
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 718 entries, 687 to 111
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   MoisturePercent  718 non-null    float64
 1   MilkTypeEn       718 non-null    object
dtypes: float64(1), object(1)
memory usage: 16.8+ KB
X_train.shape returns a tuple (rows, columns): after splitting the data with train_test_split(), the training set has 718 rows and 2 feature columns.
The describe() output summarizes the only numeric feature, MoisturePercent, and the info() output shows that neither column has missing values in the training set.
At this point, it would be beneficial to create data visualizations to identify any significant relationships between variables and detect any data imbalances that may exist.
# Create a data visualization chart
cheese_plot1 = alt.Chart(cheese_df, width=500, height=300).mark_point().encode(
    x='MoisturePercent',
    y='FatLevel',
    color='FatLevel',
    tooltip=['FatLevel']
).interactive().properties(title='The relationship between moisture and fat level')
# Display the chart
cheese_plot1
The chart is a scatter plot that shows the relationship between the moisture percentage and fat level of different types of cheese.
The x-axis represents the moisture percentage, while the y-axis represents the fat level. Each data point is colored based on its fat level, and hovering over a point will display the fat level value in a tooltip.
The chart is interactive, allowing the user to zoom and pan to explore the data. It also includes a title that describes the purpose of the chart.
# Create a supporting data visualization chart
cheese_plot2 = alt.Chart(cheese_df, width=500, height=300).mark_point().encode(
    x='MilkTypeEn',
    y='FatLevel',
    color='FatLevel',
    tooltip=['FatLevel']
).interactive().properties(title='The relationship between milk type and fat level')
# Display the chart
cheese_plot2
The chart is a scatter plot that shows the relationship between the type of milk used and the fat level of different types of cheese.
The x-axis represents the type of milk used in the cheese, while the y-axis represents the fat level. Each data point is colored based on its fat level, and hovering over a point will display the fat level value in a tooltip.
The subsequent stage in our workflow is to select a suitable model from a variety of options available to researchers and data scientists.
In our case, since we have only two features, milk type and moisture percentage, and a categorical predicted (target) value indicating either "higher fat" or "lower fat", we can employ a classification model to predict the target value.
Take into account the following machine learning tasks:
The output for each scenario is binary, with Yes or No being the possible outcomes. Positive outcomes are typically assigned the value of 1, while negative outcomes are represented by 0.
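As a small illustration of the 1/0 convention, the sketch below encodes a few FatLevel labels by hand. The sample labels are hypothetical, and treating "higher fat" as the positive class is an assumption made for this example. scikit-learn classifiers also accept the string labels directly, so this step is not required by the pipeline in this notebook.

```python
# Hypothetical sample of FatLevel labels, in the spelling used by the dataset.
fat_levels = ["lower fat", "higher fat", "lower fat", "lower fat", "higher fat"]

# Assumption: "higher fat" is treated as the positive class (1).
POSITIVE_LABEL = "higher fat"
encoded = [1 if level == POSITIVE_LABEL else 0 for level in fat_levels]

print(encoded)  # [0, 1, 0, 0, 1]
```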
# Identify the numeric and categorical columns
numeric_feats = ["MoisturePercent"]
categorical_feats = ["MilkTypeEn"]
# Preprocessing for numerical data: impute missing values with the median and scale features
numeric_transformer = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler()
)
# Preprocessing for categorical data: impute missing values with the most frequent value and one-hot encode
# Note: fill_value is only used with strategy='constant', so it has no effect here
categorical_transformer = make_pipeline(
    SimpleImputer(strategy='most_frequent', fill_value='missing'),
    OneHotEncoder(handle_unknown="ignore")
)
# Combine preprocessing steps using ColumnTransformer
preprocessor = make_column_transformer(
    (numeric_transformer, numeric_feats),
    (categorical_transformer, categorical_feats),
    remainder="passthrough")
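To see what these preprocessing steps produce, the sketch below runs an equivalent transformer on a hypothetical four-row sample with the same two feature columns. The toy values and the toy_preprocessor name are made up for illustration; the point is only the shape of the output, one scaled numeric column followed by one indicator column per milk type.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical four-row sample with the same two feature columns.
toy = pd.DataFrame({
    "MoisturePercent": [40.0, 50.0, 60.0, 50.0],
    "MilkTypeEn": ["Cow", "Goat", "Ewe", "Cow"],
})

toy_preprocessor = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
     ["MoisturePercent"]),
    (make_pipeline(SimpleImputer(strategy="most_frequent"),
                   OneHotEncoder(handle_unknown="ignore")),
     ["MilkTypeEn"]),
)

transformed = toy_preprocessor.fit_transform(toy)
# The result may come back sparse; convert so it prints as a plain array.
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

print(transformed.shape)  # one scaled column plus one column per milk type
print(transformed)
```

Because handle_unknown="ignore" is set, a milk type that appears only at prediction time is encoded as all zeros instead of raising an error.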
A dummy classifier is a type of classifier that makes predictions using simple rules, such as always predicting the most frequent class, rather than learning from the features.
It is typically used as a baseline: a more sophisticated model is only worth keeping if it clearly outperforms this point of reference.
# Create the baseline model
dummy_clf = DummyClassifier(strategy="prior")
# Calculate cross-validation scores for the baseline model.
dummy_scores = pd.DataFrame(cross_validate(dummy_clf, X_train, y_train, cv=5, return_train_score=True))
dummy_scores
| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.004391 | 0.004534 | 0.652778 | 0.651568 |
| 1 | 0.002260 | 0.002185 | 0.652778 | 0.651568 |
| 2 | 0.002279 | 0.001087 | 0.652778 | 0.651568 |
| 3 | 0.003698 | 0.005693 | 0.650350 | 0.652174 |
| 4 | 0.002368 | 0.001239 | 0.650350 | 0.652174 |
The code above creates a baseline model using the DummyClassifier from scikit-learn with the strategy parameter set to "prior". This strategy always predicts the most frequent class in the training set.
The cross_validate function from scikit-learn is used to calculate cross-validation scores for the baseline model. The function performs 5-fold cross-validation (cv=5) and returns training and testing scores for each fold.
The resulting scores are stored in a pandas DataFrame called dummy_scores. The DataFrame contains columns for the training and testing scores for each fold, as well as columns for the fit and score times.
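The behaviour described above can be reproduced on synthetic labels. The sketch below assumes a hypothetical 65/35 class balance, roughly what the baseline scores suggest, and uses an all-zero feature column because the dummy model ignores the features. It shows which arrays cross_validate returns and that the baseline score tracks the majority-class share.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Hypothetical labels with roughly the class balance seen above (65% / 35%).
y_toy = np.array(["lower fat"] * 65 + ["higher fat"] * 35)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="prior")
results = cross_validate(baseline, X_toy, y_toy, cv=5, return_train_score=True)

print(sorted(results))               # names of the arrays returned per fold
print(results["test_score"].mean())  # close to the majority-class share
```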
mean_dummy_training_score = dummy_scores['train_score'].mean()
mean_dummy_cv_score = dummy_scores['test_score'].mean()
mean_dummy_scores = pd.DataFrame({'Mean': ['training_score', 'cv_score'],
'Scores': [mean_dummy_training_score, mean_dummy_cv_score]
})
mean_dummy_scores
| | Mean | Scores |
|---|---|---|
| 0 | training_score | 0.651810 |
| 1 | cv_score | 0.651807 |

The dummy classifier scores about 65% on both the training folds and the cross-validation folds.
A random forest classifier is an ensemble learning method that builds a large number of decision trees and combines their predictions.
The Random Forest Classifier is a useful tool for quickly identifying significant information from large datasets. One of its key advantages is that it aggregates multiple decision trees to arrive at a solution, which often leads to better accuracy than a single decision tree.
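The aggregation step can be checked directly in scikit-learn, where a forest's predicted class probabilities are the average of its trees' probabilities. The sketch below uses synthetic data from make_classification and a small forest of 10 trees; both are choices made for illustration and are unrelated to the cheese dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data, used only to illustrate the mechanics.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree.
tree_average = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0
)

print(len(forest.estimators_))  # number of trees in the ensemble
print(np.allclose(tree_average, forest.predict_proba(X_demo)))
```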
# Create the main pipeline with preprocessor and Random Forest Classifier
main_pipe = make_pipeline(preprocessor, RandomForestClassifier(random_state=77, class_weight='balanced'))
# Calculate the scores of the main pipeline using cross-validation
scores_df = pd.DataFrame(cross_validate(main_pipe, X_train, y_train, cv=5, return_train_score=True))
scores_df
| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.333828 | 0.036371 | 0.833333 | 0.862369 |
| 1 | 0.252244 | 0.022920 | 0.805556 | 0.865854 |
| 2 | 0.247132 | 0.022925 | 0.833333 | 0.871080 |
| 3 | 0.256660 | 0.023254 | 0.811189 | 0.866087 |
| 4 | 0.246029 | 0.034455 | 0.783217 | 0.871304 |
mean_rfc_training_score = scores_df['train_score'].mean()
mean_rfc_cv_score = scores_df['test_score'].mean()
mean_rfc_scores = pd.DataFrame({'Mean': ['training_score', 'cv_score'],
'Scores': [mean_rfc_training_score, mean_rfc_cv_score]
})
mean_rfc_scores
| | Mean | Scores |
|---|---|---|
| 0 | training_score | 0.867339 |
| 1 | cv_score | 0.813326 |

The random forest classifier averages about 87% on the training folds and about 81% on the cross-validation folds.
Consequently, the Random Forest Classifier model outperforms the Dummy Classifier model, whose mean cross-validation score is about 65%.
Machine learning algorithms have hyperparameters that can be adjusted to optimize their performance on a particular dataset. RandomizedSearchCV tries a fixed number of randomly sampled candidate values instead of every possible one, which keeps the search inexpensive; here it samples 5 of the 10 candidate max_depth values.
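The sampling step can be sketched on its own with scikit-learn's ParameterSampler, which draws candidate settings from a parameter space. The snippet below only illustrates how 5 distinct depths are drawn from the 10 candidates; it fits no model, and the variable names are made up for this example.

```python
from sklearn.model_selection import ParameterSampler

# The ten candidate depths: 1, 11, 21, ..., 91.
candidate_depths = list(range(1, 101, 10))
param_space = {"randomforestclassifier__max_depth": candidate_depths}

# Draw 5 distinct settings, mirroring n_iter=5 in the randomized search.
sampled = list(ParameterSampler(param_space, n_iter=5, random_state=77))
sampled_depths = [s["randomforestclassifier__max_depth"] for s in sampled]

print(len(candidate_depths), sampled_depths)
```

With only ten candidates, an exhaustive GridSearchCV would also be affordable; the randomized search halves the number of fits at the cost of possibly skipping the best depth.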
# Tuning the hyperparameters
param_grid = {
"randomforestclassifier__max_depth": range(1,101,10)
}
depth_search = RandomizedSearchCV(main_pipe, param_grid, cv=5, n_iter=5, return_train_score=True, random_state=77, verbose=2)
depth_search.fit(X_train, y_train)
Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV] END ...............randomforestclassifier__max_depth=21; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=21; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=21; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=21; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=21; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=11; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=11; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=11; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=11; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=11; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=91; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=91; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=91; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=91; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=91; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=61; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=61; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=61; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=61; total time= 0.3s
[CV] END ...............randomforestclassifier__max_depth=61; total time= 0.3s
[CV] END ................randomforestclassifier__max_depth=1; total time= 0.2s
[CV] END ................randomforestclassifier__max_depth=1; total time= 0.2s
[CV] END ................randomforestclassifier__max_depth=1; total time= 0.2s
[CV] END ................randomforestclassifier__max_depth=1; total time= 0.3s
[CV] END ................randomforestclassifier__max_depth=1; total time= 0.2s
RandomizedSearchCV(cv=5, estimator=Pipeline(steps=[('columntransformer', ColumnTransformer(remainder='passthrough', transformers=[('pipeline-1', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), ['MoisturePercent']), ('pipeline-2', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), ['MilkTypeEn'])])), ('randomforestclassifier', RandomForestClassifier(class_weight='balanced', random_state=77))]), n_iter=5, param_distributions={'randomforestclassifier__max_depth': range(1, 101, 10)}, random_state=77, return_train_score=True, verbose=2)
# Store the randomized search results in a dataframe
grid_results = pd.DataFrame(depth_search.cv_results_, columns=['mean_test_score', 'param_randomforestclassifier__max_depth', 'mean_fit_time', 'rank_test_score'])
grid_results = grid_results.sort_values(by='rank_test_score')
grid_results
| | mean_test_score | param_randomforestclassifier__max_depth | mean_fit_time | rank_test_score |
|---|---|---|---|---|
| 0 | 0.813326 | 21 | 0.280368 | 1 |
| 1 | 0.813326 | 11 | 0.275888 | 1 |
| 2 | 0.813326 | 91 | 0.263958 | 1 |
| 3 | 0.813326 | 61 | 0.262938 | 1 |
| 4 | 0.806352 | 1 | 0.200702 | 5 |
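Four of the five sampled max_depth values (21, 11, 91, and 61) tie for the best mean cross-validation score of 0.813326, and max_depth=21 is listed first among them. Only max_depth=1 scores lower, at 0.806352.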
# Find the best parameters and scores
best_parameters = depth_search.best_params_
best_score = depth_search.best_score_
print("Best parameters:", best_parameters)
print("Best score:", best_score)
Best parameters: {'randomforestclassifier__max_depth': 21}
Best score: 0.8133255633255633
Based on the hyperparameter tuning, the best max_depth value for the random forest classifier is 21, and the best cross-validation accuracy score achieved is 0.8133255633255633.
# Find the best model
best_model = depth_search.best_estimator_
best_model
Pipeline(steps=[('columntransformer', ColumnTransformer(remainder='passthrough', transformers=[('pipeline-1', Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')), ('standardscaler', StandardScaler())]), ['MoisturePercent']), ('pipeline-2', Pipeline(steps=[('simpleimputer', SimpleImputer(fill_value='missing', strategy='most_frequent')), ('onehotencoder', OneHotEncoder(handle_unknown='ignore'))]), ['MilkTypeEn'])])), ('randomforestclassifier', RandomForestClassifier(class_weight='balanced', max_depth=21, random_state=77))])
Once hyperparameter tuning is complete, best_estimator_ gives us the pipeline object refitted with the best combination of hyperparameters found by the search.
# Print classification report
report = classification_report(y_test, best_model.predict(X_test))
print(report)
              precision    recall  f1-score   support
  higher fat       0.74      0.65      0.69       105
   lower fat       0.83      0.88      0.86       204
    accuracy                           0.80       309
   macro avg       0.78      0.76      0.77       309
weighted avg       0.80      0.80      0.80       309
# Evaluate the accuracy score of the best model on the test set
test_score = best_model.score(X_test, y_test)
test_score
0.8025889967637541
The final test score on the best model is 0.8025889967637541.
# Make predictions using the best model
tr_pred = depth_search.predict(X_train)
ts_pred = depth_search.predict(X_test)
At this point, we have trained a Random Forest Classifier model on the cheese dataset, tuned its hyperparameters using Randomized Search Cross Validation, and evaluated its performance using accuracy score and classification report. We also made predictions on the training and test sets.
# Predict class probabilities with the best model
tr_prob = best_model.predict_proba(X_train)
ts_prob = best_model.predict_proba(X_test)
A confusion matrix is a table used to evaluate the performance of a machine learning model for classification problems. It summarizes the number of correct and incorrect predictions made by the model on a set of test data.
The matrix consists of four terms: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN).
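The four terms can be read off scikit-learn's confusion_matrix output. The sketch below uses seven hypothetical true and predicted labels, and it assumes "higher fat" is the positive class, which is why that label is listed second in the labels argument.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for seven cheeses.
y_true = ["higher fat", "higher fat", "higher fat",
          "lower fat", "lower fat", "lower fat", "lower fat"]
y_pred = ["higher fat", "higher fat", "lower fat",
          "lower fat", "lower fat", "lower fat", "higher fat"]

# Assumption: "higher fat" is the positive class, so it is listed second.
matrix = confusion_matrix(y_true, y_pred, labels=["lower fat", "higher fat"])
tn, fp, fn, tp = matrix.ravel()

print(matrix)  # rows are true labels, columns are predicted labels
print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
```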
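# Plot the confusion matrix for the best model on the test set
# (plot_confusion_matrix was removed in scikit-learn 1.2; newer releases provide ConfusionMatrixDisplay.from_estimator)
plot_confusion_matrix(best_model, X_test, y_test, values_format="d", cmap="Blues")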
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7f96427fc370>
Accuracy is the simplest metric to measure the performance of a classification model. It answers the below question:
What percentage of predictions did the model get right?
We know that True Positives and True Negatives are the outcomes when the expected result matches the model prediction. Their sum is the total number of correct outcomes. We divide this count by the total number of predictions to get the Accuracy.
Let’s do the calculation for our model:
Our model's accuracy can be stated as 80%.
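The calculation can be sketched as follows. The four counts are reconstructed from the classification report above (support, recall, and overall accuracy) rather than read from the original plot, so treat them as an inference; "higher fat" is again assumed to be the positive class.

```python
# Counts reconstructed from the classification report above (an inference,
# not read off the original plot); "higher fat" is taken as the positive class.
tp = 68    # higher fat cheeses predicted as higher fat
fn = 37    # higher fat cheeses predicted as lower fat
fp = 24    # lower fat cheeses predicted as higher fat
tn = 180   # lower fat cheeses predicted as lower fat

total = tp + fn + fp + tn          # 309 test rows
accuracy = (tp + tn) / total       # correct predictions / all predictions

print(total, round(accuracy, 4))   # 309 0.8026
```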
According to cross-validation, the RandomForestClassifier model performs better than the DummyClassifier model.
The final test score on the best model is 0.8025889967637541.
As a possible next step, the CountVectorizer() transformer could be added to the pipeline with a LogisticRegression model, and the results compared with the current models to see whether they improve.