
Coding a Naive Bayes Classifier in Python

The Naive Bayes classifier is simple, effective, and one of the most efficient classifiers available. It's often a good first model to throw at a problem thanks to its speed and its ability to train effectively from a small sample of data.


It utilises Bayes' Theorem together with some naive independence assumptions between the attributes and, even though these assumptions almost never hold, it performs surprisingly well on complex problems. Feel free to follow this link for a deep dive on why Naive Bayes works well despite not satisfying its assumptions.


Today our goal will be to code a Gaussian Naive Bayes classifier from scratch in Python! We'll start by taking a brief look at the mathematical theory, then we will look at how the theory can be turned into an implementation, and finally we will carry out this implementation and test it on a simple dataset.


Today we will:

  1. Take a brief look at the mathematical theory behind Naive Bayes
  2. Work out how that theory can be turned into an algorithm
  3. Implement the algorithm from scratch in Python
  4. Test our implementation on a simple dataset


[Image: a Peruvian desert]
Naive Bayes Classifier coded from scratch in Python. Photo by Frank_am_Main, some rights reserved.


Theory


The Naive Bayes learning algorithms are derived through the use of Bayes' Theorem together with conditional independence assumptions. We'll come to the 'conditional' part soon, but for now, independence just means that the attributes (columns of the dataset) do not influence each other at all.


Let's take a look at Bayes' Theorem. It states that for events $A$ and $B$:


$$P(A|B) = \dfrac{P(B|A)\,P(A)}{P(B)}, \quad \text{where } P(B) \neq 0$$

Our goal is to determine the probability that, when given some data, it belongs to one of the possible classes. We can see that by setting A = class and B = data, Bayes' Theorem becomes applicable to our problem: $P(\text{class}|\text{data})$ is the probability that the data belongs to the class. Perfect!


However, a piece of data in a dataset will be represented by a feature vector: a vector where the $i$th element corresponds to the value in the $i$th column of the dataset. So we could say:


$$\text{data} = (x_1, x_2, \cdots, x_n)$$

So given a particular $\text{data}$ vector, the probability that it belongs to the class $c$ is:


$$P(c|x_1, x_2, \cdots, x_n) = \dfrac{P(x_1, x_2, \cdots, x_n|c)\,P(c)}{P(x_1, x_2, \cdots, x_n)}$$

The naive conditional independence assumption states that, given a class $c$, all of the features in the data (each of the $x_i$) are independent. That is to say:


$$P(x_1, x_2, \cdots, x_n|c) = \prod\limits_{i=1}^{n}P(x_i|c)$$

So we can substitute this into the previous equation to obtain:


$$P(c|x_1, x_2, \cdots, x_n) = \dfrac{P(c)\prod_{i=1}^{n}P(x_i|c)}{P(x_1, x_2, \cdots, x_n)}$$

But then, given some fixed $(x_1, x_2, \cdots, x_n)$, the quantity $P(x_1, x_2, \cdots, x_n)$ is a constant, so we can further simplify:


$$P(c|x_1, x_2, \cdots, x_n) \propto P(c)\prod\limits_{i=1}^{n}P(x_i|c)$$

Now given a feature vector, this formula can calculate the probability that it belongs to each class... well, not exactly. When we simplified by removing the denominator, we stopped representing the actual probability value. However, for our purposes all we need is the proportionality: we compute all the relevant quantities and then compare them to find the largest. As such, the natural classification rule is to assign the feature vector (data) to the class with the highest calculated value. If we denote the predicted class as $\hat{c}$, then this rule can be written as:


$$\hat{c} = \underset{c}{\mathrm{argmax}}\; P(c)\prod\limits_{i=1}^{n}P(x_i|c)$$

This is the formula we need to develop an algorithm. If we can figure out how to calculate the right hand side, then we have the predicted class for a data point.
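To make the classification rule concrete before we start building it properly, here is a tiny, purely illustrative snippet; the scores are made-up numbers standing in for the right-hand side evaluated for three hypothetical classes:

# Hypothetical unnormalised scores, one per class (not real probabilities)
scores = {'classA': 0.002, 'classB': 0.035, 'classC': 0.0007}

# The predicted class is simply the key with the largest score
predictedClass = max(scores, key=scores.get)
print(predictedClass)   # classB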


Developing The Algorithm


Our sole task for the algorithm is to figure out how to evaluate:


$$\underset{c}{\mathrm{argmax}}\; P(c)\prod\limits_{i=1}^{n}P(x_i|c)$$

$P(c)$ is the probability of the class $c$. We can calculate this by counting how many samples in our dataset belong to class $c$, and then dividing this by the total number of samples in the dataset.


To evaluate each term in the product we need to understand how to calculate the probability of observing a given value $x_i$. This is where the various Naive Bayes models differ, because we have to make an assumption about the distribution of $P(x_i|c)$. In our case we will assume a Gaussian distribution (which gives rise to the Gaussian Naive Bayes model), but other applications may use a Multinomial or Bernoulli distribution, for example. The Gaussian distribution is also known as the Normal distribution or, informally, the 'bell curve'.


So to determine $P(x_i|c)$ we need to understand how to use the Gaussian probability density function, which tells us the relative likelihood (density) of observing a given real number. For a value $x$ this is:


$$f(x) = \dfrac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$

However, you may be wondering how we make this value conditional on the class $c$, as we require. That problem is easily solved by using the above equation but substituting in different values of $\sigma$ (the standard deviation of the class) and $\mu$ (the mean of the class), corresponding to whichever class we are considering.


So if we had three different classes A, B, and C, each with four columns, and we were trying to classify a data point $x$, we would need to individually calculate the mean and standard deviation for each column in A, then in B, then in C, and store all twenty-four of these values separately (3 classes × 4 columns per class × 2 values per column = 24). Then we could plug the appropriate parameters into the formula above and find the probability of observing $x$, conditional on each class.


Let's quickly remind ourselves how to calculate the mean and standard deviation:


For a set of values $x_i$, the mean is given by:


$$\mu = \dfrac{\sum_{i=1}^{N}x_i}{N}$$

The mean is then used to calculate the Bessel-corrected (sample) standard deviation, which is given by:


$$\sigma = \sqrt{\dfrac{1}{N-1}\sum\limits_{i=1}^{N}(x_i-\mu)^{2}}$$

Let's combine everything we've talked about into a numbered list:


  1. Group all samples in the dataset by class
  2. Within each class, for each feature (column), calculate the mean, standard deviation, and sample count (this number is needed to calculate the proportion of samples in each class, $P(c)$), from the samples present. Calculating these parameters encompasses the training process for the model.
  3. For new data, use the mean and standard deviation corresponding to each feature within each class in turn to calculate the conditional probability that this data belongs to each class (this uses the formula above). We'll then need to compare these probabilities between all classes and choose the class with the highest probability associated with it. This will be our model's final prediction.
  4. Not part of the algorithm, but we should implement a model evaluation function to ensure we've implemented the algorithm correctly.

This whole list encompasses the algorithm we will be coding!


Are We Using Libraries?


Short answer: yes, but only pandas and sklearn.model_selection.train_test_split


This is an important question because it affects where the challenge lies in this project. The goal I've chosen is to code a Gaussian Naive Bayes classifier by hand. For this reason I don't want to also focus on coding train/test splits, CSV readers, or data manipulation algorithms by hand; those could be separate projects. As such, I will import pandas and scikit-learn's train_test_split function. Doing the whole project without libraries is very possible, but I leave that as a challenge for you!


Loading The Test Dataset


If we’re coding a classifier then we need some data to test it! As I just mentioned, the focus of the project is coding a classifier by hand so we don’t want to focus much on the data or have to do any real processing with it. As such the best choice is to use the Iris Flower Dataset because it is very small and very simple. Let’s load it up and take a peek at the data.


import pandas as pd

dataFilepath = 'YOUR FILEPATH/iris.data'
columns = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pd.read_csv(dataFilepath, names=columns)

print(dataset.head())

This gives us the first five lines of data:


   sepal-length  sepal-width  petal-length  petal-width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Now we know the basic structure of the set. We have four numerical attributes along with the ‘class’ column.


Let’s check the number of data points we have, and also how the classes are distributed.


...
print(dataset.shape)
print(dataset['class'].value_counts())

This outputs:


(150, 5)
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50

So we have 150 data points in the full set, and the classes are completely balanced; each class contains 50 members.


Great! Now we have some data and can start producing the algorithm!


Implementing The Algorithm


In order to start developing and testing our algorithm, I’m going to first trim down the dataset into a ‘toy’ size, consisting of just the first five elements from each class. This lets us see an entire (toy) dataset at once and allows us to better understand what our code is doing.


...
smallDataset = pd.concat([dataset[0:5], dataset[50:55], dataset[100:105]]).reset_index(drop=True)
print(smallDataset)

This shows our full new ‘toy’ set:


    sepal-length  sepal-width  petal-length  petal-width            class
0            5.1          3.5           1.4          0.2      Iris-setosa
1            4.9          3.0           1.4          0.2      Iris-setosa
2            4.7          3.2           1.3          0.2      Iris-setosa
3            4.6          3.1           1.5          0.2      Iris-setosa
4            5.0          3.6           1.4          0.2      Iris-setosa
5            7.0          3.2           4.7          1.4  Iris-versicolor
6            6.4          3.2           4.5          1.5  Iris-versicolor
7            6.9          3.1           4.9          1.5  Iris-versicolor
8            5.5          2.3           4.0          1.3  Iris-versicolor
9            6.5          2.8           4.6          1.5  Iris-versicolor
10           6.3          3.3           6.0          2.5   Iris-virginica
11           5.8          2.7           5.1          1.9   Iris-virginica
12           7.1          3.0           5.9          2.1   Iris-virginica
13           6.3          2.9           5.6          1.8   Iris-virginica
14           6.5          3.0           5.8          2.2   Iris-virginica

Let’s use this to start going through our numbered list.


1. Group Samples By Class


To group samples by the ‘class’ column, I will be using the pandas ‘groupby()’ function. This will initially give us a dictionary of indices:


...
dataByClass = smallDataset.groupby('class').groups
print(dataByClass)

Outputs:


{'Iris-setosa': [0, 1, 2, 3, 4], 
 'Iris-versicolor': [5, 6, 7, 8, 9], 
 'Iris-virginica': [10, 11, 12, 13, 14]}

So we will now need to access the actual values stored at these indices. Let's combine all this into a function:


def SeparateClasses(dataset):
    # Function separates the classes in a dataset, saving the result as a dictionary with classes as the keys.
    separateClassesDict = {}
    dataByClass = dataset.groupby('class')

    for group, data in dataByClass:
        separateClassesDict[group] = data

    return separateClassesDict

...

dataByClass = SeparateClasses(smallDataset)

for x in dataByClass.keys():
    print(x)
    print(dataByClass[x])

This now outputs:


Iris-setosa
   sepal-length  sepal-width  petal-length  petal-width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
Iris-versicolor
   sepal-length  sepal-width  petal-length  petal-width            class
5           7.0          3.2           4.7          1.4  Iris-versicolor
6           6.4          3.2           4.5          1.5  Iris-versicolor
7           6.9          3.1           4.9          1.5  Iris-versicolor
8           5.5          2.3           4.0          1.3  Iris-versicolor
9           6.5          2.8           4.6          1.5  Iris-versicolor
Iris-virginica
    sepal-length  sepal-width  petal-length  petal-width           class
10           6.3          3.3           6.0          2.5  Iris-virginica
11           5.8          2.7           5.1          1.9  Iris-virginica
12           7.1          3.0           5.9          2.1  Iris-virginica
13           6.3          2.9           5.6          1.8  Iris-virginica
14           6.5          3.0           5.8          2.2  Iris-virginica

You can see how each class now holds its samples in separate datasets. This is exactly what we wanted to achieve so let’s move to the next objective!


2. Calculating Mean and Standard Deviation For Columns


The natural place to start would be coding functions to calculate mean and standard deviation. Refer to the equations above if you need a refresher on the formulas. Here are the two functions:


import math

def Mean(data):
    return sum(data)/len(data)

def Std(data):
    mu = Mean(data)
    newList = [(x-mu)**2 for x in data]
    return math.sqrt((1/(len(data)-1))*sum(newList))

These follow precisely from the formulas so hopefully the implementation makes sense. For clarity, I’ve separated out a variable to hold the list comprehension in the ‘Std’ function. This list just holds the squared, mean-corrected values of the input data. For robustness, I should do checks for things like empty lists, but since this is just a personal project I’m not worrying about covering all cases. You can try coding this check yourself!
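If you do want to add those checks, a minimal sketch (my own addition; raising a ValueError is just one reasonable choice of behaviour) might look like this:

def Mean(data):
    # Guard against empty input before dividing by len(data)
    if len(data) == 0:
        raise ValueError('Mean requires at least one value')
    return sum(data)/len(data)

def Std(data):
    # The Bessel correction divides by N-1, so we need at least two values
    if len(data) < 2:
        raise ValueError('Std requires at least two values')
    mu = Mean(data)
    newList = [(x-mu)**2 for x in data]
    return math.sqrt((1/(len(data)-1))*sum(newList))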


Let’s test them to make sure they work as intended:


print('Mean: ' + str(Mean([1,2,3,4,5,6,7,8])))
print('Std: ' + str(Std([1,2,3,4,5,6,7,8])))

This outputs:


Mean: 4.5
Std: 2.449489742783178

If we calculate these statistics by hand we see the same numbers, hence our implementation seems correct.
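As an extra, optional cross-check, Python's built-in statistics module (not used anywhere else in this project) provides statistics.mean and statistics.stdev, where stdev is the sample (Bessel-corrected) standard deviation, so it should agree with our functions:

import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8]
print(statistics.mean(data))   # 4.5
print(statistics.stdev(data))  # 2.449489742783178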


Now we want to call these functions to calculate the mean and standard deviation for each column of each class. Moreover, we want to keep a count of how many samples are contained in each class so that we can calculate the class probability. Below is a more graphical explanation. Note that I just show the ‘Iris-setosa’ class at the top but the same happens to all three classes:


[Image: printout of the first few rows of the 'Iris-setosa' class, with arrows linking each column to a list of its mean, standard deviation and sample count]

[Image: lists of mean, standard deviation and sample count for each column of each of the three classes in the set]

The lists here are [mean, standard deviation, number of samples].


The second image shows what our trained model will look like. It is a dictionary where each key represents a class and maps to four lists (one per attribute), each containing the mean, standard deviation and sample count of a different column in that class.


So our first job is to write a function that takes in a single class and calculates, for each column, the mean, standard deviation and sample count. The following code handles this task:


def CalculateColumnStatsForGivenClass(singleClassDataset):
    # Function calculates the mean, std and sample count for each column, given a single class
    requiredList = []

    # Get all the column names in this class as a list
    cols = list(singleClassDataset.columns)

    # Loop through the list of column names, calculating mean, std, and sample count
    for x in list(cols[0:len(cols)-1]): # -1 because we don't want to include the column for 'class'
        requiredList.append([Mean(singleClassDataset[x]), Std(singleClassDataset[x]), len(singleClassDataset[x])])
    return requiredList

Here I am simply looping through all the columns in the given set and, for each one, appending a list of the stats we want to ‘requiredList’.


Let’s test this out on the ‘Iris-versicolor’ class:


...
dataByClass = SeparateClasses(smallDataset)
classStats = CalculateColumnStatsForGivenClass(dataByClass['Iris-versicolor'])
print(classStats)

This outputs:


[[6.459999999999999, 0.594138031100518, 5], 
 [2.9200000000000004, 0.38340579025361643, 5], 
 [4.540000000000001, 0.3361547262794323, 5], 
 [1.44, 0.08944271909999157, 5]]

This seems good! Recall that the first sub-list here corresponds to the mean, standard deviation and sample count of the first attribute (column) of ‘Iris-versicolor’ (which is ‘sepal-length’), and the second sub-list is these stats for the second attribute (which is ‘sepal-width’), and so on for all four attributes.


Now we need to call this function on every class and save each list produced in a dictionary. This will produce the full model. Let’s write another function to do that:


def TrainModel(dataset):
    # Function produces a dictionary representing the Gaussian naive bayes model.
    result = {}

    # First separate the classes
    dataSeparatedByClassDict = SeparateClasses(dataset)

    # Loop through each class
    for x in list(dataSeparatedByClassDict.keys()):
        # Calculate the mean and std for each attribute, within each class
        result[x] = CalculateColumnStatsForGivenClass(dataSeparatedByClassDict[x])
    return result

This code loops through all the keys in the dictionary (i.e. all the classes in the dataset) and produces a new dictionary with the same keys, where each key now maps to its class's lists of [mean, standard deviation, sample count] rather than to the raw data.


So let's call this on our small dataset:


...
model = TrainModel(smallDataset)
for x in model.keys():
    print(x)
    print(model[x])

This outputs:


Iris-setosa
[[4.859999999999999, 0.2073644135332772, 5], [3.28, 0.2588435821108957, 5], [1.4, 0.07071067811865474, 5], [0.2, 0.0, 5]]
Iris-versicolor
[[6.459999999999999, 0.594138031100518, 5], [2.9200000000000004, 0.38340579025361643, 5], [4.540000000000001, 0.3361547262794323, 5], [1.44, 0.08944271909999157, 5]]
Iris-virginica
[[6.4, 0.4690415759823429, 5], [2.98, 0.21679483388678789, 5], [5.680000000000001, 0.35637059362410944, 5], [2.1, 0.2738612787525831, 5]]

I’ll put the full ‘smallDataset’ here as well so you can compare where these numbers have come from.


    sepal-length  sepal-width  petal-length  petal-width            class
0            5.1          3.5           1.4          0.2      Iris-setosa
1            4.9          3.0           1.4          0.2      Iris-setosa
2            4.7          3.2           1.3          0.2      Iris-setosa
3            4.6          3.1           1.5          0.2      Iris-setosa
4            5.0          3.6           1.4          0.2      Iris-setosa
5            7.0          3.2           4.7          1.4  Iris-versicolor
6            6.4          3.2           4.5          1.5  Iris-versicolor
7            6.9          3.1           4.9          1.5  Iris-versicolor
8            5.5          2.3           4.0          1.3  Iris-versicolor
9            6.5          2.8           4.6          1.5  Iris-versicolor
10           6.3          3.3           6.0          2.5   Iris-virginica
11           5.8          2.7           5.1          1.9   Iris-virginica
12           7.1          3.0           5.9          2.1   Iris-virginica
13           6.3          2.9           5.6          1.8   Iris-virginica
14           6.5          3.0           5.8          2.2   Iris-virginica

This brings us to the end of the second objective. We can now train the parameters of our Naive Bayes model! Here is the full code so far:


import pandas as pd
import math

def SeparateClasses(dataset):
    # Function separates the classes in a dataset, saving the result as a dictionary with classes as the keys.
    separateClassesDict = {}
    dataByClass = dataset.groupby('class')

    for group, data in dataByClass:
        separateClassesDict[group] = data

    return separateClassesDict

def Mean(data):
    return sum(data)/len(data)

def Std(data):
    mu = Mean(data)
    newList = [(x-mu)**2 for x in data]
    return math.sqrt((1/(len(data)-1))*sum(newList))

def CalculateColumnStatsForGivenClass(singleClassDataset):
    # Function calculates the mean, std and sample count for each column, given a single class
    requiredList = []

    # Get all the column names in this class as a list
    cols = list(singleClassDataset.columns)

    # Loop through the list of column names, calculating mean, std, and sample count
    for x in list(cols[0:len(cols)-1]): # -1 because we don't want to include the column for 'class'
        requiredList.append([Mean(singleClassDataset[x]), Std(singleClassDataset[x]), len(singleClassDataset[x])])
    return requiredList

def TrainModel(dataset):
    # Function produces a dictionary representing the Gaussian naive bayes model.
    result = {}

    # First separate the classes
    dataSeparatedByClassDict = SeparateClasses(dataset)

    # Loop through each class
    for x in list(dataSeparatedByClassDict.keys()):
        # Calculate the mean and std for each attribute, within each class
        result[x] = CalculateColumnStatsForGivenClass(dataSeparatedByClassDict[x])
    return result

if __name__ == '__main__':
    # Load the dataset
    dataFilepath = 'YOUR FILEPATH/iris.data'
    columns = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    dataset = pd.read_csv(dataFilepath, names=columns)

    # Create toy dataset
    smallDataset = pd.concat([dataset[0:5], dataset[50:55], dataset[100:105]]).reset_index(drop=True)

    # Train the model on the dataset
    model = TrainModel(smallDataset)
    for x in model.keys():
        print(x)
        print(model[x])

3. Calculating Classification Probability for New Data


To calculate the probability that new data belongs to each class we need to finally invoke the theory we looked at previously. In particular, we are interested in calculating:


$$\underset{c}{\mathrm{argmax}}\; P(c)\prod\limits_{i=1}^{n}P(x_i|c)$$

The first term, $P(c)$, is the probability of observing a class, which we can calculate as the proportion of each class in the set. In our case we know this value is about 33% for every class, since they are balanced, but we will code it to handle the general case and assume we don't know this. Here is a function that achieves this:


def CalculateProbabilityOfClassC(c, model):
    # Function computes the probability of a class c, using counts of how many samples are in each class.
    numberOfSamplesInClassC = model[c][0][2]
    numberOfSamplesInTotal = sum([model[x][0][2] for x in list(model.keys())])
    return numberOfSamplesInClassC/numberOfSamplesInTotal

There are multiple layers of indexing here, which is not the most intuitive, so let me break it down. Recall that a given key in the 'model' dictionary maps to a list that looks like this:


[[6.459999999999999, 0.594138031100518, 5],
 [2.9200000000000004, 0.38340579025361643, 5],
 [4.540000000000001, 0.3361547262794323, 5],
 [1.44, 0.08944271909999157, 5]]

So this example list would correspond to model[c] in the function. Since we are interested in how many samples of this class exist, we want to access the 3rd element of each sub-list (i.e. 5). All of these 3rd elements are the same, so I just take the first sub-list; extracting the number 5 therefore corresponds to model[c][0][2].
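As a concrete illustration of that indexing (using rounded values from the printout above):

# Stats list for one class: one [mean, std, count] sub-list per column
exampleClassStats = [[6.46, 0.59, 5],
                     [2.92, 0.38, 5],
                     [4.54, 0.34, 5],
                     [1.44, 0.09, 5]]

print(exampleClassStats[0])      # [6.46, 0.59, 5], the first column's stats
print(exampleClassStats[0][2])   # 5, the sample count for this class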


Then we do this same process within a list comprehension so that we can add up how many samples there are over all classes in the whole dataset. For our small dataset, I will print what that list looks like:


[5, 5, 5]

So summing the elements of this list tells us how many samples are in the whole set.


Finally we just divide these numbers to find the proportion of samples belonging to class c in the whole set.


Let’s test an example:


...
model = TrainModel(smallDataset)
print(CalculateProbabilityOfClassC('Iris-setosa', model))

Outputs:


0.3333333333333333

As expected!


Now we need to cover the second term which is the scary-looking product. Fortunately this is relatively straightforward! Since we are implementing a Gaussian Naive Bayes model, we will calculate this using a Gaussian PDF. The calculation will require implementing the following formula:


$$P(x_i|c) = \dfrac{1}{\sigma_{c_i} \sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_i-\mu_{c_i}}{\sigma_{c_i}}\right)^{2}}$$

Here $x$ is a new row of data, $x_i$ is the value in the $i$th attribute (column) of that data, $c$ is a given class, $\mu_{c_i}$ is the mean of the values in the $i$th attribute of class $c$, and $\sigma_{c_i}$ is the standard deviation of the values in the $i$th attribute of class $c$.


Let's begin by implementing a function to calculate a Gaussian PDF:


def GaussianPDF(number, mu, sigma):
    return (1/(sigma*math.sqrt(2*math.pi)))*math.exp(-0.5*((number - mu)/sigma)**2)

As a sanity check we can plot this function over some range and it should produce the typical ‘bell curve’ shape:


...
import numpy
import matplotlib.pyplot as plt
...
if __name__ == '__main__':
    ...
    values = numpy.linspace(-5,5,100)
    pdfvalues = [GaussianPDF(x,0,1) for x in values]
    plt.plot(values, pdfvalues)
    plt.show()

[Image: plot of our generated Gaussian PDF values, showing the expected bell curve]

This is the shape we expect so it is likely we have implemented the formula correctly.
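If you happen to have SciPy installed (entirely optional, it isn't used anywhere else in this project), another quick sanity check is to compare our function against scipy.stats.norm.pdf at a few points:

from scipy.stats import norm

for x in [-2.0, 0.0, 1.5]:
    # The two printed values should agree up to floating-point error
    print(GaussianPDF(x, 0, 1), norm.pdf(x, loc=0, scale=1))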


Now we need a function that combines everything together. We need the Gaussian PDF value for every $x_i$, as we talked about above. Then we must multiply all these values together, remembering to also include the $P(c)$ term. Here is a function that achieves this:


def PredictClass(model, newData):
    # This function uses the trained model to make a prediction on one new sample of data
    probabilityOfBelongingToEachClassDict = {}

    classes = list(model.keys())

    # Loop through all the possible classes the data could belong to
    for c in classes:
        listOfProbabilitiesToMultiply = []

        # First calculate the probability of the class (proportion of that class in training data)
        listOfProbabilitiesToMultiply.append(CalculateProbabilityOfClassC(c,model))

        # Loop through all attributes of new data and calculate the gaussian pdf value using relevant mean and std.
        for i in range(0, len(newData)):
            listOfProbabilitiesToMultiply.append(GaussianPDF(newData[i], model[c][i][0], model[c][i][1]))

        # Multiply all calculated values together to form probability data belongs to each class
        probabilityOfBelongingToEachClassDict[c] = math.prod(listOfProbabilitiesToMultiply)

    keys = list(probabilityOfBelongingToEachClassDict.keys())
    values = list(probabilityOfBelongingToEachClassDict.values())

    # Choose the class with highest probability associated with it
    predictedClass = keys[values.index(max(values))]

    return predictedClass

We initially calculate the probability of the class $c$, using the function we wrote earlier, and append it to a list. Then we go through each attribute of the new row of data, calculate $P(x_i|c)$, and append these to the same list. We then multiply the whole list together to produce the number we will use to estimate the probability that the data belongs to that class, and repeat this process over all classes. Finally we find the class corresponding to the highest calculated value (remember, not an actual probability), and choose this as our prediction.


Let's test it! Testing requires producing a train/test split from our full dataset so that we can verify whether the model is able to predict properly or not. Let's import scikit-learn's train_test_split function for this:


from sklearn.model_selection import train_test_split
...
if __name__ == '__main__':
    # Load the dataset
    dataFilepath = 'YOUR FILEPATH/iris.data'
    columns = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    dataset = pd.read_csv(dataFilepath, names=columns)

    # Separate out the target variable
    X = dataset.drop('class', axis = 1)
    y = dataset['class']

    # Create an 80/20 train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

    # Concatenate the training sets
    trainingSet = pd.concat([X_train, y_train], axis = 1)

Here I have created a split where 20% (30 samples) of the data is for testing, and 80% (120 samples) is for training. I concatenate the X_train and y_train sets back together because I implemented my functions assuming that the ‘class’ column was included in the data.
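As a quick, optional check of those sizes:

print(len(X_train), len(X_test))   # expect 120 30 for the 150-sample Iris set with test_size=0.2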


So let’s create the model and train it on the training set:


...
model = TrainModel(trainingSet)

We can have a look at it to recall the structure of the model too:


...
for x in model.keys():
    print(x)
    print(model[x])

Outputs:


Iris-setosa
[[5.02051282051282, 0.36431586004995814, 39], [3.4025641025641025, 0.381470271647932, 39], [1.4615384615384615, 0.144395991735054, 39], [0.2384615384615384, 0.10910011002788757, 39]]
Iris-versicolor
[[5.886486486486487, 0.5207698159467398, 37], [2.7621621621621624, 0.3268973261772905, 37], [4.216216216216216, 0.486206125420481, 37], [1.324324324324324, 0.2046750885962722, 37]]
Iris-virginica
[[6.638636363636363, 0.6310625624426647, 44], [2.9886363636363638, 0.33216408742143666, 44], [5.5659090909090905, 0.5489707840822666, 44], [2.0318181818181817, 0.2567894144676992, 44]]

Let’s test the model on the first sample in the test set. But first we can look at that sample:


...
print(X_test.iloc[0])
print(y_test.iloc[0])

Outputs:


sepal-length    5.8
sepal-width     2.8
petal-length    5.1
petal-width     2.4
 
Iris-virginica

So now we can finally test our ‘PredictClass’ function, as follows:


...
model = TrainModel(trainingSet)
predictedClass = PredictClass(model, list(X_test.iloc[0]))   # pass the row as a plain list of values
print(predictedClass)

This outputs:


Iris-virginica

It worked! Out of curiosity, I can modify the ‘PredictClass’ function to instead output the full dictionary of estimated probabilities. That looks like this:


{'Iris-setosa': 8.48935365055018e-225,
 'Iris-versicolor': 8.701921085911454e-08,
 'Iris-virginica': 0.027607489541751216}
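If you'd like to reproduce this yourself, one minimal way (a sketch of my own; PredictClassWithScores is a made-up name) is a variant of 'PredictClass' that returns the whole dictionary instead of just the winning class:

def PredictClassWithScores(model, newData):
    # Same logic as PredictClass, but returns the full dictionary of unnormalised scores
    probabilityOfBelongingToEachClassDict = {}
    for c in model.keys():
        listOfProbabilitiesToMultiply = [CalculateProbabilityOfClassC(c, model)]
        for i in range(0, len(newData)):
            listOfProbabilitiesToMultiply.append(GaussianPDF(newData[i], model[c][i][0], model[c][i][1]))
        probabilityOfBelongingToEachClassDict[c] = math.prod(listOfProbabilitiesToMultiply)
    return probabilityOfBelongingToEachClassDict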

The differences are enormous orders of magnitude apart! I believe this is because some of the values we calculate are very small, and multiplying many small numbers together produces absurdly small numbers. In any case, we don't really care about the magnitude of these numbers; we just want to know which one is the largest.


As a quick aside, since these numbers can be incredibly small we risk errors in floating-point precision, such as underflow. This may cause the result of our calculation to be set to 0. To avoid this, some implementations of Naive Bayes methods will instead use log-probabilities, and may also incorporate smoothing functions, such as Laplace smoothing, to avoid results of 0. I won’t be coding these here but I wanted to bring up the concepts.
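To give a rough idea of the log-probability trick without implementing it fully, a sketch of a log-space version of the prediction step might look like the following (PredictClassLog is a hypothetical name; note that math.log would still fail on a density of exactly 0, which is precisely where smoothing comes in):

def PredictClassLog(model, newData):
    # Sum log-probabilities instead of multiplying raw probabilities, reducing the risk of underflow
    logScores = {}
    for c in model.keys():
        total = math.log(CalculateProbabilityOfClassC(c, model))
        for i in range(0, len(newData)):
            total += math.log(GaussianPDF(newData[i], model[c][i][0], model[c][i][1]))
        logScores[c] = total
    # The class with the largest log-score is also the class with the largest score
    return max(logScores, key=logScores.get)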


We’ve done great so far! The last thing to implement is an evaluation function.


4. Calculating our Model’s Accuracy


We can use classification accuracy as a metric here because the classes in our set are balanced. Hence, let’s write a function that predicts the class for each sample in the test set and then checks against ‘y_test’ to see if it was correct.


def EvaluateModelAccuracy(model, X_test, y_test):
    # This function uses the trained model to make predictions over a test set, then calculates the accuracy
    predictions = []

    # Loop through all the test data
    for i in range(0,len(X_test)):

        # Make prediction for each new data point
        currentPrediction = PredictClass(model, list(X_test.iloc[i]))

        # Save whether or not the prediction was correct
        if currentPrediction == y_test.iloc[i]:
            predictions.append(1)
        else:
            predictions.append(0)

    # Calculate accuracy
    accuracy = sum(predictions)/len(X_test)
    return accuracy

Now if we call this as follows:


...
if __name__ == '__main__':
    # Load the dataset
    dataFilepath = 'YOUR FILEPATH/iris.data'
    columns = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    dataset = pd.read_csv(dataFilepath, names=columns)

    # Separate out the target variable
    X = dataset.drop('class', axis = 1)
    y = dataset['class']

    # Create a train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

    # Concatenate the target training variable back onto the training set
    trainingSet = pd.concat([X_train, y_train], axis = 1)

    # Train the model on training set
    model = TrainModel(trainingSet)

    # Use the trained model to make predictions on test set, and calculate accuracy of predictions
    modelAccuracy = EvaluateModelAccuracy(model, X_test, y_test)

    print(modelAccuracy)

We get an accuracy rating of:


0.9666666666666667

This is a fantastic result! We must of course remember that we're operating on a very small set (just 30 test samples), so the score is not completely reliable, but it still shows our model has a good level of skill. Given the class distribution, a model with no skill would achieve a baseline accuracy of around 33%, and we're way above that!
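If you want to compute that no-skill baseline explicitly from the test labels, one quick, optional way is to take the proportion of the most common class in y_test, which is the accuracy you'd get by always guessing that class:

baseline = y_test.value_counts(normalize=True).max()
print(baseline)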


And with that, we’re done coding the model and evaluating its performance 🙂


Here is the full code:


import pandas as pd
import math
from sklearn.model_selection import train_test_split

def Mean(data):
    return sum(data)/len(data)

def Std(data):
    mu = Mean(data)
    newList = [(x-mu)**2 for x in data]
    return math.sqrt((1/(len(data)-1))*sum(newList))

def GaussianPDF(number, mu, sigma):
    return (1/(sigma*math.sqrt(2*math.pi)))*math.exp(-0.5*((number - mu)/sigma)**2)

def CalculateProbabilityOfClassC(c, model):
    # Function computes the probability of a class c, using counts of how many samples are in each class.
    numberOfSamplesInClassC = model[c][0][2]
    numberOfSamplesInTotal = sum([model[x][0][2] for x in list(model.keys())])
    return numberOfSamplesInClassC/numberOfSamplesInTotal

def SeparateClasses(dataset):
    # Function separates the classes in a dataset, saving the result as a dictionary with classes as the keys.
    separateClassesDict = {}
    dataByClass = dataset.groupby('class')

    for group, data in dataByClass:
        separateClassesDict[group] = data

    return separateClassesDict

def CalculateColumnStatsForGivenClass(singleClassDataset):
    # Function calculates the mean, std and sample count for each column, given a single class
    requiredList = []

    # Get all the column names in this class as a list
    cols = list(singleClassDataset.columns)

    # Loop through the list of column names, calculating mean, std, and sample count
    for x in list(cols[0:len(cols)-1]): # -1 because we don't want to include the column for 'class'
        requiredList.append([Mean(singleClassDataset[x]), Std(singleClassDataset[x]), len(singleClassDataset[x])])
    return requiredList

def TrainModel(dataset):
    # Function produces a dictionary representing the Gaussian naive bayes model.
    result = {}

    # First separate the classes
    dataSeparatedByClassDict = SeparateClasses(dataset)

    # Loop through each class
    for x in list(dataSeparatedByClassDict.keys()):
        # Calculate the mean and std for each attribute, within each class
        result[x] = CalculateColumnStatsForGivenClass(dataSeparatedByClassDict[x])
    return result

def PredictClass(model, newData):
    # This function uses the trained model to make a prediction on one new sample of data
    probabilityOfBelongingToEachClassDict = {}

    classes = list(model.keys())

    # Loop through all the possible classes the data could belong to
    for c in classes:
        listOfProbabilitiesToMultiply = []

        # First calculate the probability of the class (proportion of that class in training data)
        listOfProbabilitiesToMultiply.append(CalculateProbabilityOfClassC(c,model))

        # Loop through all attributes of new data and calculate the gaussian pdf value using relevant mean and std.
        for i in range(0, len(newData)):
            listOfProbabilitiesToMultiply.append(GaussianPDF(newData[i], model[c][i][0], model[c][i][1]))

        # Multiply all calculated values together to form probability data belongs to each class
        probabilityOfBelongingToEachClassDict[c] = math.prod(listOfProbabilitiesToMultiply)

    keys = list(probabilityOfBelongingToEachClassDict.keys())
    values = list(probabilityOfBelongingToEachClassDict.values())

    # Choose the class with highest probability associated with it
    predictedClass = keys[values.index(max(values))]

    return predictedClass

def EvaluateModelAccuracy(model, X_test, y_test):
    # This function uses the trained model to make predictions over a test set, then calculates the accuracy
    predictions = []

    # Loop through all the test data
    for i in range(0,len(X_test)):

        # Make prediction for each new data point
        currentPrediction = PredictClass(model, list(X_test.iloc[i]))

        # Save whether or not the prediction was correct
        if currentPrediction == y_test.iloc[i]:
            predictions.append(1)
        else:
            predictions.append(0)

    # Calculate accuracy
    accuracy = sum(predictions)/len(X_test)
    return accuracy

if __name__ == '__main__':
    # Load the dataset
    dataFilepath = 'YOUR FILEPATH/iris.data'
    columns = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    dataset = pd.read_csv(dataFilepath, names=columns)

    # Separate out the target variable
    X = dataset.drop('class', axis = 1)
    y = dataset['class']

    # Create a train/test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

    # Concatenate the target training variable back onto the training set
    trainingSet = pd.concat([X_train, y_train], axis = 1)

    # Train the model on training set
    model = TrainModel(trainingSet)

    # Use the trained model to make predictions on test set, and calculate accuracy of predictions
    modelAccuracy = EvaluateModelAccuracy(model, X_test, y_test)

    print(modelAccuracy)

Extension Tasks


We have successfully coded all the basic functions of a Gaussian Naive Bayes classifier; however, there are a variety of extensions you could choose to implement at this point. Some ideas that came up along the way:

  1. Replace the remaining library calls by coding the CSV reader, train/test split and data manipulation by hand
  2. Add robustness checks to the functions, for example handling empty lists in 'Mean' and 'Std'
  3. Switch the prediction step to log-probabilities and add a smoothing scheme such as Laplace smoothing



Conclusion


We have covered the theory behind the Naive Bayes methods and even implemented one by hand in Python! We used a very simple dataset as a way to prove our implementation works, and we achieved 96% classification accuracy!


If you’ve not already started, I encourage you to try implementing this model yourself in Python! Or if you’d rather study and play around with my code, you can copy it from above or find it at this GitHub link.


Thanks for reading, hope you enjoyed!

