Gaussian Mixture Models with Scikit-learn in Python

In simpler statistical models, we usually assume that our data come from a single distribution. For example, to model height, we could assume that each observation comes from a single Gaussian distribution with some mean and variance. However, we often find ourselves in scenarios where this assumption does not hold and the data is more complex. Staying with the height example, we can easily see that the heights of men and women can come from two different Gaussian distributions (with different means).

Gaussian Mixture Models

Mixture models are an extremely useful statistical/ML technique for such problems. Mixture models work under the assumption that each observation in a data set comes from one of several underlying distributions. Gaussian mixture models assume that each observation comes from one of a number of Gaussian distributions, each with its own mean and variance. Fitting a Gaussian mixture model to the data amounts to estimating the parameters of these Gaussian distributions from the data.
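To make this generative assumption concrete, here is a minimal sketch (an addition to this post, with made-up height parameters) that draws samples from a two-component Gaussian mixture by first picking a component and then sampling from that component's Gaussian:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture parameters: weights, means and standard deviations
weights = [0.5, 0.5]      # mixing proportions, must sum to 1
means = [165.0, 178.0]    # hypothetical average heights of two subpopulations
stds = [6.0, 7.0]

# Generative process: pick a component for each sample, then draw from its Gaussian
components = rng.choice(2, size=1000, p=weights)
samples = rng.normal(loc=np.take(means, components),
                     scale=np.take(stds, components))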

This post uses simulated data with distinct clusters to illustrate how to fit a Gaussian mixture model with scikit-learn in Python.

Let us load the libraries we need. In addition to pandas, seaborn and numpy, we use a couple of modules from scikit-learn.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
import numpy as np
sns.set_context("talk", font_scale=1.5)

Simulating clustered data

We will use the make_blobs function from sklearn.datasets to create a simulated data set with 4 distinct clusters. The argument centers=4 specifies four clusters. We also control how spread out the clusters are with the cluster_std argument.

X, y = make_blobs(n_samples=500,
                  centers=4,
                  cluster_std=2,
                  random_state=2021)

The make_blobs function returns the simulated data as a NumPy array and the cluster labels as a vector. Let us store the data as a Pandas data frame.

data = pd.DataFrame(X)
data.columns = ["X1", "X2"]
data["cluster"] = y
data.head()

Our simulated data looks like this.

          X1        X2  cluster
0  -0.685085  4.217225        0
1  11.455507 -5.728207        2
2   2.230017  5.938229        0
3   3.705751  1.875764        0
4  -3.478871 -2.518452        1

Let's visualize the simulated data with a Seaborn scatter plot, coloring the data points by their cluster labels.

plt.figure(figsize=(9,7))
sns.scatterplot(data=data,
                x="X1",
                y="X2",
                hue="cluster",
                palette=["red", "blue", "green", "purple"])
plt.savefig("Data_for_fitting_Gaussian_Mixture_Models_Python.png",
            format="png", dpi=150)

We can clearly see that our data comes from four clusters.

Simulated data for fitting Gaussian mixture models in Python

Fitting a Gaussian mixture model with scikit-learn's GaussianMixture() function

We can use scikit-learn's GaussianMixture() function to fit a mixture model to our data. One of the key parameters to choose when fitting a Gaussian mixture model is the number of clusters in the data set.

For this example, let us build a Gaussian mixture model with 3 clusters. Since we simulated the data with four clusters, we know this is wrong, but let us fit the model anyway.

gmm = GaussianMixture(3,
                      covariance_type="full",
                      random_state=0).fit(data[["X1", "X2"]])

For the identified clusters, we can inspect the estimated mean locations via the means_ attribute of the fitted GaussianMixture object.

gmm.means_

array([[-2.16398445,  4.84860401],
       [ 9.97980069, -7.42299498],
       [-7.28420067, -3.86530606]])
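Besides the means, the fitted model also stores the estimated covariance matrices and mixing weights of the components. These are available through the covariances_ and weights_ attributes of GaussianMixture:

gmm.covariances_   # one 2x2 covariance matrix per component (covariance_type="full")
gmm.weights_       # estimated mixing proportion of each component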

The predict() function can be used to predict the labels of data points. Here, we obtain the predicted labels for the input data.

labels = gmm.predict(data[["X1", "X2"]])

Let's add the predicted labels to our data frame.

data["predicted_cluster"] = labels
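Since the Gaussian mixture model is probabilistic, we can also obtain soft assignments, i.e., the probability of each data point belonging to each component, with the predict_proba() method (a quick aside that is not part of the original workflow):

# Posterior probability of each component for every data point; each row sums to 1
probs = gmm.predict_proba(data[["X1", "X2"]])
probs[:3].round(3)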

And then visualize the data, coloring the data points by their predicted labels.

plt.figure(figsize=(9,7))
sns.scatterplot(data=data,
                x="X1",
                y="X2",
                hue="predicted_cluster",
                palette=["red", "blue", "green"])
plt.savefig("fitting_Gaussian_Mixture_Models_with_3_components_scikit_learn_Python.png",
            format="png", dpi=150)

We can clearly see that the three-cluster model fits poorly: it has merged two of the true clusters into one.
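Because we simulated the data, we can also quantify the mismatch by comparing the predicted labels with the true cluster labels, for example with the adjusted Rand index from sklearn.metrics (an extra sanity check, not part of the original post):

from sklearn.metrics import adjusted_rand_score

# 1.0 means perfect agreement with the true clustering (up to label permutation)
adjusted_rand_score(data["cluster"], data["predicted_cluster"])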

Gaussian mixture model fit with 3 components

Determining the number of clusters in the data by model comparison

The main problem is often that we do not know the number of clusters in the data set beforehand, yet the number of clusters needs to be right. One way to deal with this is to fit Gaussian mixture models over a range of cluster numbers, say from 1 to 20.

And then do a model comparison to find the model that best fits the data: for example, does a Gaussian mixture model with 4 clusters fit better than one with 3 clusters? We can then pick the model whose number of clusters best describes the data.

AIC or BIC scores are commonly used to compare models and select the best-fitting one. To be clear, either one of the two scores is sufficient for model comparison. In this post we compute both, just to see how they behave.

So let us fit Gaussian mixture models with different numbers of clusters.

n_components = np.arange(1, 21)
models = [GaussianMixture(n, covariance_type="full", random_state=0).fit(X)
          for n in n_components]

models[0:5]

[GaussianMixture(random_state=0),
 GaussianMixture(n_components=2, random_state=0),
 GaussianMixture(n_components=3, random_state=0),
 GaussianMixture(n_components=4, random_state=0),
 GaussianMixture(n_components=5, random_state=0)]

We can easily calculate the AIC/BIC scores with scikit-learn. Here we use one of the models and calculate the BIC and AIC scores.

models[0].bic(X)
6523.618150329507

models[0].aic(X)
6502.545109837397
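For reference, these scores follow the usual definitions, AIC = 2k - 2 ln L and BIC = k ln(N) - 2 ln L, where k is the number of free parameters, N the number of samples, and ln L the maximized log-likelihood. As a rough sketch, here is how one could compute them by hand from a fitted model; the parameter count below assumes covariance_type="full" and is an illustration, not scikit-learn's internal code:

def manual_aic_bic(gmm, X):
    # score() returns the average per-sample log-likelihood
    log_l = gmm.score(X) * X.shape[0]
    n, d = X.shape
    k = gmm.n_components
    # free parameters: means + full covariance entries + mixing weights
    n_params = k * d + k * d * (d + 1) // 2 + (k - 1)
    aic = 2 * n_params - 2 * log_l
    bic = n_params * np.log(n) - 2 * log_l
    return aic, bic

Calling manual_aic_bic(models[0], X) should closely match the values above.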

To compare the variation of the BIC/AIC scores as a function of the number of components used in the Gaussian mixture model, we create a data frame with the BIC and AIC scores and the number of components.

gmm_model_comparisons = pd.DataFrame({"n_components": n_components,
                                      "BIC": [m.bic(X) for m in models],
                                      "AIC": [m.aic(X) for m in models]})

gmm_model_comparisons.head()
n_components BIC AIC
0 1 6523.618150 6502.545110
1 2 6042.308396 5995.947707
2 3 5759.725951 5688.077613
3 4 5702.439121 5605.503135
4 5 5739.478377 5617.254742

We can now make a line plot of the AIC/BIC scores against the number of components.

plt.figure(figsize=(8,6))
sns.lineplot(data=gmm_model_comparisons[["BIC", "AIC"]])
plt.xlabel("Number of clusters")
plt.ylabel("Score")
plt.savefig("GMM_model_comparison_with_AIC_BIC_Scores_Python.png",
            format="png", dpi=150)

We find that BIC and AIC have the lowest value when the number of components is 4. Therefore, the model with n=4 is the best.
GMM model comparison with AIC and BIC scores
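Instead of reading the minimum off the plot, we can also pick the best number of components programmatically (a small convenience addition):

# number of components with the lowest BIC
best_n = gmm_model_comparisons.loc[gmm_model_comparisons["BIC"].idxmin(),
                                   "n_components"]
print(best_n)   # 4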

Now that we know the number of components needed to fit the model, we can build the model and extract the predicted labels for visualization.

n = 4
gmm = GaussianMixture(n, covariance_type="full", random_state=0).fit(data[["X1", "X2"]])
labels = gmm.predict(data[["X1", "X2"]])
data["predicted_cluster"] = labels

Finally, let us make a scatter plot with Seaborn, coloring the data points by the predicted labels.

plt.figure(figsize=(9,7))
sns.scatterplot(data=data,
                x="X1",
                y="X2",
                hue="predicted_cluster",
                palette=["red", "blue", "green", "purple"])
plt.savefig("fitting_Gaussian_Mixture_Models_with_4_components_scikit_learn_Python.png",
            format="png", dpi=150)

Gaussian mixture model fit with 4 components
