Clustering for more effective model training

Stuart Colianni, Quantitative Researcher at Lucena Research

The Goal

Producing a good model for stock forecasting is never easy. Many quants initialize their favorite classifier and train directly on raw data, only to learn after many frustrating iterations that monumental obstacles remain. As the saying goes, “garbage in, garbage out”. For this reason, data pre-processing steps such as cleaning, normalization, and feature selection are essential prerequisites for creating predictive models. AI models in general, and deep learning models in particular, exhibit an insatiable appetite for data. The reality, however, is that data is sparse and not always available in sufficient quantity for quality deep learning research. In this blog post, we will examine clustering assets and the important role it can play in expanding quality training data sets while improving the signal-to-noise ratio. Finally, we will discuss why coarse categorizations such as GICS sectors are unable to provide the same data-driven insights as clustering by inherent descriptive factors.

What is Clustering?

Clustering is an unsupervised learning technique that seeks to segment a dataset into distinct groups. A “good” cluster is generally defined as one whose members are more similar to each other than to out-of-cluster data. Clustering is therefore a tool to identify naturally occurring groupings within a dataset.

Clustering the S&P 500

Training models for financial markets is tricky. Companies exist in different industries, produce different products, depend on different raw materials, are guided by different management philosophies, etc. Because of these factors, a great deal of diversity exists within any sizable universe of stocks.

Clustering to the rescue! By leveraging clustering, we can uncover groups of stocks whose price behaviors exhibit a great deal of similarity. In Figure 1 below, these relationships are represented spatially in a spring force graph: the connections between stocks whose prices are highly correlated cause tight-knit clusters to form. Additionally, the colors correspond to groupings determined via K-Means clustering.

Figure 1: K-Means clustering applied to the S&P 500

Certain stocks form well-defined, monochromatic clusters. This demonstrates that the K-Means algorithm was clearly successful at identifying and labeling key structures of interest. As for the nebulous clusters that “bleed” together and lack distinct boundaries – we will return to those in a minute!
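To make the idea concrete, here is a minimal sketch of clustering stocks by how their returns co-move. It is not the pipeline used to produce the figures: the return series are synthetic, the symbols `A0`...`B3` are made up, each stock is represented by its row of pairwise correlations (an assumed feature choice), and the K-Means routine is a plain-Python Lloyd's algorithm with deterministic farthest-first seeding so that the example is reproducible. In practice a library implementation such as scikit-learn's `KMeans` would be the usual choice.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length return series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iters=100):
    """Lloyd's algorithm; farthest-first seeding is an assumption for determinism."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Seed the next centroid with the point farthest from the current seeds.
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Synthetic daily returns: two hidden factors, four made-up stocks per factor.
rng = random.Random(42)
n_days = 250
factor_a = [rng.gauss(0, 0.01) for _ in range(n_days)]
factor_b = [rng.gauss(0, 0.01) for _ in range(n_days)]
returns = {}
for i in range(4):
    returns[f"A{i}"] = [f + rng.gauss(0, 0.003) for f in factor_a]
    returns[f"B{i}"] = [f + rng.gauss(0, 0.003) for f in factor_b]

symbols = sorted(returns)
# Feature vector for each stock: its correlation with every stock in the universe.
features = [[pearson(returns[s], returns[t]) for t in symbols] for s in symbols]
labels, centroids = kmeans(features, k=2)
cluster_of = dict(zip(symbols, labels))
```

Stocks driven by the same hidden factor end up with nearly identical correlation profiles, so they receive the same label. Real price data is far noisier, which is where the ambiguous clusters discussed next come from.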

Figure 2: Stock symbols associated with the pink cluster

Zooming in, we examine the stock symbols associated with elements of the pink cluster. Here we see company names like Duke Energy (DUK), Exelon (EXC), Southern Company (SO), Pacific Gas & Electric (PCG), and the list goes on. This cluster clearly corresponds to energy utilities.

Zooming in on the green cluster, we see names like American Express (AXP), JPMorgan Chase & Co. (JPM), and PNC Financial Services Group (PNC). This cluster clearly corresponds to banking and financial services.

Figure 3: Stock symbols associated with the green cluster

Ambiguous Clusters

As noted earlier, clustering often produces non-distinct boundaries that “bleed” together. What is the best approach to handle ambiguous groupings? By eliminating data that contributes to the problematic boundaries, the remaining groupings will become more precisely defined. Here we shall examine two approaches: removing problematic clusters directly, and removing problematic data points.

First, we examine removing troublesome data on a per-cluster basis. Knowing which clusters to remove requires a heuristic by which the “health” of a cluster can be analyzed. In our example, the health of a cluster is measured using a metric that captures the closeness of points to their assigned versus non-assigned centroids. After the unhealthy clusters are removed, the K-Means algorithm is reapplied to produce new groupings. This process is repeated until convergence.
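The exact health metric is not spelled out here, so the following is only one plausible reading of it: for each point, take the ratio of its distance to its assigned centroid over its distance to the nearest other centroid, then average that ratio per cluster. The function names, the toy coordinates, and the 0.5 cutoff are all assumptions chosen for illustration.

```python
import math


def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def point_ratio(point, label, centroids):
    """Distance to the assigned centroid divided by distance to the nearest other one."""
    own = dist(point, centroids[label])
    other = min(dist(point, c) for i, c in enumerate(centroids) if i != label)
    return own / other if other > 0 else float("inf")


def cluster_health(points, labels, centroids):
    """Mean ratio per cluster: near 0 is tight and isolated, near 1 is ambiguous."""
    ratios = {}
    for point, label in zip(points, labels):
        ratios.setdefault(label, []).append(point_ratio(point, label, centroids))
    return {label: sum(r) / len(r) for label, r in ratios.items()}


# Toy example: clusters 0 and 1 are tight, cluster 2 straddles the gap between them.
points = [(-0.5, 0.0), (0.5, 0.0), (9.5, 0.0), (10.5, 0.0), (3.0, 0.0), (7.0, 0.0)]
labels = [0, 0, 1, 1, 2, 2]
centroids = [(0.0, 0.0), (10.0, 0.0), (5.0, 0.0)]

health = cluster_health(points, labels, centroids)
# Hypothetical cutoff: clusters scoring above it are candidates for removal.
unhealthy = sorted(label for label, score in health.items() if score > 0.5)
```

In this toy case only cluster 2 exceeds the cutoff. Dropping its members and rerunning K-Means on the remainder is one round of the per-cluster procedure described above.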

Figure 5: Removing bad clusters improves the K-Means clustering of the S&P 500

As can be seen in Figure 5 above, removing bad clusters greatly improves the quality of the remaining data. Compared to Figure 1, the resulting groupings are more separable with far less “bleeding” between clusters.

An alternative approach is to evaluate and remove data on a point-by-point basis. The general procedure is analogous to the one described above, except that the heuristic is used to evaluate points individually. Once again, after the problematic data points are removed, K-Means clustering is reapplied to produce new groupings. This process is repeated until convergence.

Figure 6: Removing bad data points improves the K-Means clustering of the S&P 500

As can be seen in Figure 6 above, iteratively removing bad data points greatly improves the quality of the remaining data. Compared to both Figure 1 and Figure 5, the resulting groupings have greater separability, as represented by the distance between clusters. This approach has the added benefit of being less “heavy-handed” than eliminating data on a per-cluster basis. Because only problematic data points are removed, far less useful information is erroneously excluded from future rounds of clustering.
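The point-by-point loop can be sketched as follows. This is an illustration under stated assumptions rather than the procedure behind Figure 6: the ambiguity score (distance to the assigned centroid divided by distance to the nearest other centroid), the 0.5 threshold, the function names, and the toy coordinates are all hypothetical, and the K-Means routine is a small plain-Python version with deterministic seeding.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iters=100):
    """Lloyd's algorithm with deterministic farthest-first seeding (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def ambiguity(point, label, centroids):
    """Own-centroid distance over nearest-other-centroid distance (assumed heuristic)."""
    own = math.sqrt(sq_dist(point, centroids[label]))
    other = min(math.sqrt(sq_dist(point, c)) for i, c in enumerate(centroids) if i != label)
    return own / other if other > 0 else float("inf")


def prune_points(points, k, threshold=0.5, max_rounds=20):
    """Recluster and drop ambiguous points until a round removes nothing."""
    kept = list(range(len(points)))
    for _ in range(max_rounds):
        subset = [points[i] for i in kept]
        labels, centroids = kmeans(subset, k)
        survivors = [
            i for i, p, lab in zip(kept, subset, labels)
            if ambiguity(p, lab, centroids) <= threshold
        ]
        # Stop on convergence, or if too few points remain to form k clusters.
        if len(survivors) == len(kept) or len(survivors) < k:
            break
        kept = survivors
    return kept


# Two tight groups plus two points that sit between them.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (5, 5.2), (6, 5.5),
]
kept = prune_points(points, k=2)
final_labels, final_centroids = kmeans([points[i] for i in kept], k=2)
```

Here the first round drops only the two in-between points, and the second round removes nothing, so the loop stops with both tight groups intact.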

The Challenge in Clustering

Although we target similar price behavior, correlation without causation is analogous to overfitting: correlated behavior identified in sample may not persist out of sample. The trick is to look at how correlated stocks react to certain events together. In other words, we look at how an inflection in a fundamental feature of one stock moves not only that particular stock but also other potential cluster members. For example, we often hear the term “sympathy” in stock price action: an earnings surprise for one stock may affect related stocks that have yet to report their earnings.

Benefits of Clustering

There are several advantages to clustering data prior to training models. First and foremost, creating clusters reduces noise and improves data quality. By excluding out-of-cluster stocks when training a model, we obtain a far stronger signal-to-noise ratio in the underlying data. Additionally, because a cluster consists of multiple stock symbols, models have a larger set of data to train on and can generalize more effectively. Finally, training becomes more efficient, as each model uses only a relevant subset of stocks.
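As a small illustration of the pooling idea, the sketch below groups per-symbol training rows into one training set per cluster. The function name, the cluster names, the symbol `XYZ`, and all feature and target values are placeholders invented for the example, not real data.

```python
from collections import defaultdict


def pool_by_cluster(samples, cluster_of):
    """Merge per-symbol rows into one training set per cluster.

    samples:    dict of symbol -> list of (features, target) rows
    cluster_of: dict of symbol -> cluster id; symbols without an entry
                (for example, those removed as ambiguous) are skipped
    """
    pooled = defaultdict(list)
    for symbol, rows in samples.items():
        cluster = cluster_of.get(symbol)
        if cluster is None:
            continue
        pooled[cluster].extend(rows)
    return dict(pooled)


# Placeholder rows: ([feature_1, feature_2], target). Values are made up.
samples = {
    "DUK": [([0.1, 0.2], 1), ([0.0, 0.3], 0)],
    "EXC": [([0.2, 0.1], 1), ([0.1, 0.1], 1)],
    "JPM": [([0.9, 0.4], 0), ([0.8, 0.5], 1)],
    "XYZ": [([0.5, 0.5], 0)],  # hypothetical symbol with no cluster assignment
}
cluster_of = {"DUK": "pink", "EXC": "pink", "JPM": "green"}

pooled = pool_by_cluster(samples, cluster_of)
# One model would then be trained per entry of `pooled`.
```

Each per-cluster model sees the rows of every member symbol, which is where the larger training set comes from, while unassigned symbols contribute nothing.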

A second advantage is that clustering is a data-driven approach. The resulting groupings therefore sidestep the pitfalls of depending on human-determined labels. For example, a company like Tesla might be in the same GICS sector as GM and Ford, but have price behavior more closely resembling that of a technology company like Amazon or Apple. Clustering allows these relationships to be determined in an automated fashion.

Conclusion

Leveraging clustering is an excellent approach to improve data quality and reduce noise. As such, it is a valuable tool to include in the model-building lifecycle to decrease training time and increase predictive performance. Applied on a rolling basis, clustering provides a dynamic procedure to capture the ever-changing relationships and correlations between assets. Finally, while our examples focused primarily on price behavior, the benefits of clustering are easily extended to capture inter-asset relationships based on other technical, fundamental, or alternative data features. For this reason, clustering is a powerful tool to help an analyst organize stocks in a meaningful manner.
