Local Fisher Discriminant Analysis on Beer Style Clustering
AUTHORS
Yuan Tang
Data Scientist
Jason Cohen
Founder & Chief Executive Officer
Abstract
At Gastrograph AI, we apply various metric learning methods as a preprocessing step. LFDA (Local Fisher Discriminant Analysis) is one of the best-performing supervised dimensionality-reducing metric learning methods among them. It automatically learns a transformation matrix that captures the characteristics of the original data, including possible multimodality within the existing clusters. In this paper, we introduce LFDA as well as a nonlinear version of LFDA called Kernel LFDA. We apply these methods to beer style clustering, present 3D visualizations of the metric-transformed beer data, and then compare metric learning to standard statistical sensory methods.
1. Motivation
Traditionally, when dealing with data whose attributes are on different scales, one has to center and scale the data set so that each attribute plays an equal role in subsequent analysis and model building. However, certain features can be more important than others for different tasks. For example, in the context of facial recognition, attributes collected from around the face are more important than those collected from the clothes, body position, and so on. Hence the need for supervised scaling of different attributes.
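The routine unsupervised step described above (centering and scaling every attribute equally) can be sketched as follows; the data here is synthetic, standing in for attributes measured on very different scales:

```python
import numpy as np

# Synthetic data: three attributes on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 100.0, 0.5], scale=[1.0, 20.0, 0.1], size=(200, 3))

# Center and scale each attribute to zero mean and unit variance,
# so no attribute dominates purely because of its units.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

After this step every attribute contributes equally, which is exactly the limitation supervised scaling addresses.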
At Gastrograph AI, we collect beer sensory data through our mobile app from many different reviewers. Each sensory review contains 24 flavor attributes such as sugar, bitterness, and spices. To classify beer styles, we have to generate different scaling standards for these flavor attributes, since for some styles certain flavors are more important than others.
Metric learning comes in handy in this situation: it learns a distance function tuned to a particular task. One of the metric learning algorithms we use is local Fisher discriminant analysis (LFDA), which performs supervised learning of a transformation/distance matrix that can be used to transform the original data set according to the metrics it has learned for different attributes, in our case the 24 flavor attributes.
2. Introduction
Fisher discriminant analysis (FDA) is a popular choice for reducing the dimensionality of a data set. It maximizes between-class scatter while minimizing within-class scatter. It works well in practice but does not account for multimodality. Multimodality arises in many applications, such as disease diagnosis, where there may be multiple causes for a particular disease; in this situation, FDA cannot capture the multimodal characteristics of the clusters. To deal with multimodality, locality-preserving projection (LPP) preserves the local structure of the data: it keeps nearby data pairs in the original data space close in the embedding space, so that multimodal data can be embedded without losing its local structure.
Later, a dimensionality reduction method called local Fisher discriminant analysis (LFDA) was proposed to combine the advantages of FDA and LPP: between-class separability is maximized, within-class separability is minimized, and the local structure of the data is preserved. Furthermore, with the help of the kernel trick, LFDA can be extended to handle nonlinear dimensionality reduction. We discuss both methods in this paper.
3. Theoretical Background
3.1 Fisher Linear Discriminant
Let S^{(w)} and S^{(b)} be the within-class scatter matrix and the between-class scatter matrix, defined by the following:

S^{(w)} = Σ_{i=1}^{c} Σ_{j : y_j = i} (x_j − μ_i)(x_j − μ_i)^T

and

S^{(b)} = Σ_{i=1}^{c} n_i (μ_i − μ)(μ_i − μ)^T,

where μ_i is the mean of the samples in class i, μ is the mean of all samples, and n_i is the number of samples in class i, such that:

μ_i = (1/n_i) Σ_{j : y_j = i} x_j,   μ = (1/n) Σ_{j=1}^{n} x_j.
The FDA transformation matrix, denoted as T_{FDA}, is defined as follows:

T_{FDA} = argmax_{T} tr[(T^T S^{(w)} T)^{−1} T^T S^{(b)} T].

In other words, T is found so that between-class scatter is maximized and within-class scatter is minimized. In order to optimize this, T_{FDA} can be found by the following:

T_{FDA} = (ϕ_1 | ϕ_2 | ⋯ | ϕ_m),

where {ϕ_i}_{i=1}^{d} are the generalized eigenvectors associated with the generalized eigenvalues λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_d of the generalized eigenvalue problem:

S^{(b)} ϕ = λ S^{(w)} ϕ.
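As an illustration, the FDA eigenvalue problem above can be solved with plain numpy by reducing the generalized problem to a standard one; the function name and synthetic test setup are ours, and a production implementation would be more careful about numerical conditioning:

```python
import numpy as np

def fda_transform(X, y, n_components):
    """Illustrative FDA: find directions maximizing between-class scatter
    relative to within-class scatter."""
    n, d = X.shape
    mu = X.mean(axis=0)
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_w += (Xc - mu_c).T @ (Xc - mu_c)            # within-class scatter
        diff = (mu_c - mu).reshape(-1, 1)
        S_b += Xc.shape[0] * (diff @ diff.T)          # between-class scatter
    # Reduce S_b phi = lam S_w phi to a standard eigenproblem;
    # a small ridge keeps S_w invertible.
    lam, phi = np.linalg.eig(np.linalg.solve(S_w + 1e-6 * np.eye(d), S_b))
    order = np.argsort(np.real(lam))[::-1]            # largest eigenvalues first
    return np.real(phi[:, order[:n_components]])      # columns = top eigenvectors
```

Projecting data as `X @ fda_transform(X, y, m)` yields the m-dimensional FDA embedding.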
3.2 Locality-Preserving Projection
Let A be the affinity matrix whose (i,j)-th element is the affinity between two points x_i and x_j; the elements of A are larger if two points are closer to each other. We define A_{i,j} = 1 if x_j is among the k nearest neighbors of x_i or vice versa, and A_{i,j} = 0 otherwise. The LPP transformation matrix can then be defined as the following:

T_{LPP} = argmin_{T} (1/2) Σ_{i,j=1}^{n} A_{i,j} ‖T^T x_i − T^T x_j‖²

subject to

T^T X D X^T T = I,

where D is the diagonal matrix with i-th diagonal element being

D_{i,i} = Σ_{j=1}^{n} A_{i,j}.

In other words, T_{LPP} is found such that nearby data pairs in the original data space are kept close in the embedding space. T_{LPP} is given by

T_{LPP} = (ϕ_d | ϕ_{d−1} | ⋯ | ϕ_{d−m+1}),

where {ϕ_i}_{i=1}^{d} are the eigenvectors associated with the eigenvalues λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_d of the following generalized eigenvalue problem:

X L X^T ϕ = λ X D X^T ϕ,

where L = D − A is the graph Laplacian; since the objective is minimized, the eigenvectors associated with the smallest eigenvalues are retained.
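The k-nearest-neighbor affinity matrix A and the degree matrix D defined above can be sketched with numpy as follows; the function name is ours, and the brute-force distance computation is for illustration only:

```python
import numpy as np

def knn_affinity(X, k=5):
    """Symmetric k-NN affinity: A[i, j] = 1 if x_j is among the k nearest
    neighbors of x_i or vice versa, and 0 otherwise."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)              # a point is not its own neighbor
    A = np.zeros((n, n))
    nn = np.argsort(d2, axis=1)[:, :k]        # indices of k nearest neighbors
    for i in range(n):
        A[i, nn[i]] = 1.0
    A = np.maximum(A, A.T)                    # "or vice versa" symmetrizes A
    D = np.diag(A.sum(axis=1))                # degree matrix for the LPP constraint
    return A, D
```

With A and D in hand, L = D − A gives the Laplacian used in the eigenvalue problem.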
3.3 Local Fisher Discriminant Analysis
Let us combine the concepts of FDA and LPP to define local Fisher discriminant analysis. The local within-class scatter matrix S̃^{(w)} and the local between-class scatter matrix S̃^{(b)} are defined as

S̃^{(w)} = (1/2) Σ_{i,j=1}^{n} W̃^{(w)}_{i,j} (x_i − x_j)(x_i − x_j)^T,

S̃^{(b)} = (1/2) Σ_{i,j=1}^{n} W̃^{(b)}_{i,j} (x_i − x_j)(x_i − x_j)^T,

where

W̃^{(w)}_{i,j} = A_{i,j}/n_c if y_i = y_j = c, and 0 otherwise,

W̃^{(b)}_{i,j} = A_{i,j}(1/n − 1/n_c) if y_i = y_j = c, and 1/n otherwise,

with n_c denoting the number of samples in class c. The LFDA transformation matrix T_{LFDA} is then defined as the following:

T_{LFDA} = argmax_{T} tr[(T^T S̃^{(w)} T)^{−1} T^T S̃^{(b)} T].
In other words, the transformation matrix T_{LFDA} is found so that data pairs that are near each other in the same class become closer and data pairs in different classes become far away from each other. Note that data pairs in the same class that are far away from each other do not come closer.
In order to efficiently compute this LFDA transformation matrix, one can look into the references in the last section of the paper.
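As an illustration only (not the optimized routine from the references), a naive numpy sketch of LFDA, with affinity weights following Sugiyama (2007) and a simple same-class k-NN affinity, might look like this; the function name is ours:

```python
import numpy as np

def lfda_transform(X, y, n_components, k=5):
    """Naive LFDA sketch: affinity-weighted local scatter matrices, then the
    same generalized eigenvalue problem as FDA. O(n^2) -- illustration only."""
    n, d = X.shape
    S_w = np.zeros((d, d))
    S_b = np.zeros((d, d))
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    for i in range(n):
        n_c = int(np.sum(y == y[i]))
        # Distance to the k-th nearest same-class neighbor (index 0 is self).
        kth = np.sort(d2[i][y == y[i]])[min(k, n_c - 1)]
        for j in range(n):
            if i == j:
                continue
            if y[i] == y[j]:
                a = 1.0 if d2[i, j] <= kth else 0.0   # simple k-NN affinity
                w_w = a / n_c
                w_b = a * (1.0 / n - 1.0 / n_c)
            else:
                w_w = 0.0
                w_b = 1.0 / n
            diff = (X[i] - X[j]).reshape(-1, 1)
            S_w += 0.5 * w_w * (diff @ diff.T)        # local within-class scatter
            S_b += 0.5 * w_b * (diff @ diff.T)        # local between-class scatter
    lam, phi = np.linalg.eig(np.linalg.solve(S_w + 1e-6 * np.eye(d), S_b))
    order = np.argsort(np.real(lam))[::-1]
    return np.real(phi[:, order[:n_components]])
```

Note how far-apart same-class pairs receive zero within-class weight, so they are not pulled together, exactly as described above.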
3.4 Kernel LFDA
LFDA can be nonlinearized by the kernel trick, which enables the algorithm to operate in a high-dimensional feature space while avoiding the computation of the coordinates of the data in that space. It works by computing the inner products between all pairs of data points in the feature space, thereby greatly reducing the computational cost.
One particular choice of kernel function is the Gaussian kernel, defined as the following:

K(x, x′) = exp(−‖x − x′‖² / (2σ²))

with σ > 0. K is now the kernel matrix whose (i,j)-th element is given by

K_{i,j} = K(x_i, x_j),

which we then use to perform the regular LFDA.
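A minimal numpy sketch of the Gaussian kernel matrix above (the function name is ours):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)), sigma > 0."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

The bandwidth σ controls how quickly affinity decays with distance; in practice it is tuned per data set.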
4. Example Results in Beer Data
Below is a 3D plot of the metric-transformed beer data using LFDA. To make the visualization intuitive, we reduced the dimension to three; each color represents a distinct style of beer. Note that it is very hard to transform the beer data into linearly separable regions, at least within three dimensions. Empirically, separating high-dimensional data (about 28 different attributes) into clusters in a low-dimensional space is difficult even with LFDA, since the relationships are clearly not linear. Although the effect of LFDA is hard to visualize in only three dimensions, it has proven to greatly improve the accuracy of machine learning models at our company.
The next plot is a 3D visualization of the metric-transformed beer data using KLFDA. Each color represents a different style, with the style name as the label. Note that the clusters for the different styles are well separated nonlinearly.
5. Comparisons to Standard Statistical Sensory Methods
Many other statistical methods are currently used in food sensory research, for example in studies evaluating flavor, texture, flavor release, and the chemical senses. The most popular in practice is principal component analysis (PCA), a statistical method that combines correlated variables into new variables, called principal components, which replace the original variables, reducing the dimension of the data set while retaining its main characteristics. However, PCA does not preserve the multimodality of the original data set. For example, when classifying beer styles where some beers are flawed or contaminated by different causes, a PCA transformation may cause the flawed beer data to be interpreted as a different style.
Fortunately, as we explained earlier, metric learning methods, particularly LFDA, preserve the multimodality of the data. Intuitively, LFDA keeps data points that belong to the same style close to each other while maintaining the local multimodal structure: beer data contaminated by different flaws remain in distinct subclusters separated by some distance.
Additionally, in practice PCA sometimes produces principal components that explain only a small amount of the variance in the original data set. For example, if the original data set has 6 dimensions and we reduce it to 3 using PCA, the resulting 3 principal components might fail to capture important features and variance in the original data. Using such a PCA result can lead to very poor accuracy for machine learning models, because the preprocessing loses much of the essential information in the original features. Metric learning methods such as LFDA, on the other hand, can enhance the distinctive characteristics of the original data set and pull data points with similar characteristics close to each other.
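One simple diagnostic for the issue above is the fraction of total variance that the retained components explain, which can be computed from the singular values of the centered data; this sketch (function name and synthetic data are ours) shows the idea:

```python
import numpy as np

def pca_explained_variance(X, n_components):
    """Fraction of total variance captured by the top principal components,
    computed from the singular values of the centered data matrix."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    var = s ** 2                              # proportional to PC variances
    return var[:n_components].sum() / var.sum()
```

A low value signals that the retained components discard much of the structure, the failure mode described above.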
Furthermore, PCA normalizes the original data features to a common scale. Dijksterhuis explored the PCA approach in greater detail by studying time-intensity bitterness curves (Dijksterhuis, 1993). He noted that PCA did not discriminate between different bitter stimuli: when dimension reduction is performed on a common scale, stimuli of very different importance can end up having the same amount of effect on perceived bitterness.
6. Significance and Future Goals
Metric learning methods, especially the LFDA and KLFDA covered in this paper, can be very useful for dimension reduction, visualization, denoising, and preprocessing of data in many domains, in particular computer vision and signal processing. They have also proven to greatly improve the accuracy of beer flaw detection at our company.
However, these methods are supervised in the sense that they rely on known class labels. In the future, we will keep experimenting with and applying these methods on other data sets for different tasks. We will also explore and implement other metric learning methods, especially unsupervised ones. Unsupervised methods are preferable because some style labels are assigned incorrectly by producers, in the sense that the labels do not reflect the real flavors of the beer.
7. References

Dijksterhuis, Garmt B. "Principal Component Analysis of Time-Intensity Bitterness Curves." Journal of Sensory Studies 8.4 (1993): 317-328.

Sugiyama, Masashi. "Local Fisher Discriminant Analysis." Gatsby Tea Talk (2006).

Sugiyama, Masashi. "Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis." Journal of Machine Learning Research 8 (2007): 1027-1061.