Feature Engineering for Machine Learning Stock Forecasting

Erez Katz, CEO and Co-founder Lucena Research

Extracting Insights From Data for Machine Learning Trading

As machine learning and data science continue to evolve at a high rate, many struggle with where to place the human element in the predictive modeling lifecycle. Those of us who deal with big data and the applications of deep-learning technology are often conflicted about where human intellect and skills fit.

On one hand, if you insert human bias into the discovery phase of machine learning, you are bound to limit your scope to models discoverable and understood by humans. On the other hand, if you truly set out to unleash the power of deep learning by letting it run free of guidelines, you risk never discovering any commercially viable models.

Feature engineering is one area where the distinction between tasks suitable for humans vs. machines is not always clear. In this article I hope to underscore the importance of deploying a hybrid human-machine blend for effective feature engineering. Further, I hope that you will realize the importance of domain knowledge and human intellect in the discovery and construction of successful predictive models.

What Is Feature Engineering?

If you look up the definition of feature engineering on Wikipedia, you will find little help.

Let me take a stab here:

In the process of ingesting and integrating big data into quantitative research, feature engineering is the step of converting raw data into explanatory variables, or features. The purpose of introducing features is to break the raw data into informative timeseries representations that can hint at expected future values.

Example: Imagine you were presented with weekly timeseries data that represents the US retail trade and food services dollar volume over time, from 1992 to present.

To the naked eye, two seasonality trends (weekly and monthly) are visible, along with an upward trend over time. While the raw data on its own seems fairly predictive of future values, it would be nice to create a new feature (a feature-engineered timeseries) that converts the raw data into a more predictive timeseries that accounts for both the weekly and monthly seasonality.

An exponentially weighted moving average (EWMA) converts a raw timeseries into a timeseries of averages in which recent values are given more weight than older values.
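As a minimal sketch of the idea, the recursion below computes an EWMA in plain Python. The smoothing factor `alpha` and the sample readings are illustrative assumptions, not values taken from the retail series.

```python
def ewma(values, alpha=0.3):
    """Exponentially weighted moving average.

    Each output is alpha * current value + (1 - alpha) * previous average,
    so recent observations carry more weight than older ones.
    `alpha` is an assumed smoothing factor; tune it for your own data.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))  # seed with the first observation
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical readings, for illustration only
raw = [100, 102, 101, 105, 110, 108]
print(ewma(raw, alpha=0.5))
```

The same recursion is available in pandas as `Series.ewm(alpha=alpha, adjust=False).mean()`, which is usually the more convenient choice once the data is already in a DataFrame.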

By determining the expected feature value of raw data, the machine can more easily infer correlation to important key performance indicators such as producer price index (PPI), or asset prices such as XRT (SPDR S&P Retail ETF).

Simply stated, the machine can forecast future values based on the following:

– Examine trend – Observe timeseries trends over time and predict expected behavior.
– Examine anomalies – Observe how significantly a new value in a timeseries deviates from its expected behavior.

For example, in the above trend line, if next month’s value falls well below the expected EWMA value, we may infer a negative surprise in next month’s CPI or PPI readings.
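One way to turn "falls well below the expected value" into a number is a surprise score: the gap between the new observation and the EWMA of prior observations, scaled by how noisy that gap has been recently. The sketch below assumes pandas; the function name `surprise_score` and the `span` of 12 periods are hypothetical choices made for illustration.

```python
import pandas as pd


def surprise_score(series: pd.Series, span: int = 12) -> pd.Series:
    """Deviation of each value from its EWMA-based expectation, in volatility units.

    `span` is an assumed look-back; it is not prescribed by the article.
    """
    # Expectation for period t uses data only through t-1 (shift avoids look-ahead)
    expected = series.ewm(span=span, adjust=False).mean().shift(1)
    residual = series - expected
    # Scale by the standard deviation of past residuals, again excluding period t
    scale = residual.rolling(window=span).std().shift(1)
    return residual / scale


# Hypothetical series: a stable pattern followed by a sharp drop
values = pd.Series([100 + (i % 3) for i in range(40)] + [80], dtype=float)
print(surprise_score(values).tail(3))
```

A strongly negative score on the latest period is the kind of anomaly feature a model could relate to a downside surprise in a macro reading.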

There are more sophisticated methods to feature engineer raw data in order to denoise or deseasonalize a raw timeseries. A worthy mention is the Holt-Winters method (triple exponential smoothing), which combines three exponentially smoothed components (level, trend, and seasonality) to predict the next values in a highly seasonal timeseries such as the one depicted above.

Applying Holt-Winters transformation on the raw data. See forecast vs. observed (purple vs. blue) timeseries.
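For readers who want to see the mechanics, here is a minimal sketch of the additive form of Holt-Winters in plain Python. It tracks a level, a trend, and one seasonal offset per position in the cycle, each updated by exponential smoothing. The smoothing parameters, the initialization scheme, and the sample data are assumptions made for illustration; this is not the code behind the chart above.

```python
def holt_winters_additive(y, season_len, alpha=0.3, beta=0.1, gamma=0.2, horizon=1):
    """Additive Holt-Winters. Returns (one-step-ahead fitted values, forecasts).

    alpha, beta, gamma are assumed smoothing factors for level, trend, seasonality.
    """
    m = season_len
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    # Simple initialization from the first two seasons (one of several possible schemes)
    first_mean = sum(y[:m]) / m
    second_mean = sum(y[m:2 * m]) / m
    trend = (second_mean - first_mean) / m
    # Seasonal offsets: first season minus its own trend line
    seasonal = [y[i] - (first_mean + (i - (m - 1) / 2) * trend) for i in range(m)]
    # Level just before the first observation
    level = first_mean - ((m - 1) / 2 + 1) * trend

    fitted = []
    for t, obs in enumerate(y):
        s = seasonal[t % m]
        fitted.append(level + trend + s)  # forecast made before seeing obs
        prev_level = level
        level = alpha * (obs - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (obs - level) + (1 - gamma) * s

    n = len(y)
    forecasts = [level + h * trend + seasonal[(n + h - 1) % m]
                 for h in range(1, horizon + 1)]
    return fitted, forecasts


# Hypothetical series: linear growth plus a repeating 4-period pattern
pattern = [3, -1, -2, 0]
series = [10 + 0.5 * t + pattern[t % 4] for t in range(24)]
fitted, forecasts = holt_winters_additive(series, season_len=4, horizon=4)
print(forecasts)
```

In practice a maintained library implementation is preferable. The gap between the `fitted` values and the observed values can itself serve as an anomaly feature.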

Can Derived Features Be Generated Algorithmically?

The features described above are derivatives of raw timeseries values (also called primitives) and can be generated algorithmically. As you can imagine, there can be a great many configurations of technical features derived from raw data, and there are plenty of open source libraries, such as scikit-learn (sklearn), for feature manipulation and feature extraction. The question remains: where does the human element lie?

There is an earlier step in feature engineering that precedes the tabular timeseries format described above. This step involves identifying and combining multiple independent data sets in a meaningful way for machine learning research.

Although deep learning is fully capable of finding synergies between orthogonal data sources, making the proper datasets available to the learner is where human domain knowledge and intellect can come in handy.

How Human Intellect and Domain Knowledge Are Used In Feature Engineering

Take, for example, credit card transaction data that has been anonymized and aggregated. Imagine we have timeseries data that describes the total Visa spend by billing zip code for every day in the past 10 years. If we total the amount spent every day across all zip codes, we can see how consumer spending fluctuates over time and determine whether current Visa spending is higher than expected (seasonally adjusted).
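A rough sketch of that aggregation in pandas follows. The column names (`date`, `zip`, `spend`) are hypothetical, and the "expected" value here is deliberately simple: the average spend on the same weekday over the prior few weeks, which adjusts only for day-of-week seasonality and not for holidays or annual cycles.

```python
import pandas as pd


def daily_total_spend(transactions: pd.DataFrame) -> pd.Series:
    """Sum spend across all zip codes for each calendar day.

    Assumes hypothetical columns: 'date', 'zip', 'spend'.
    """
    tx = transactions.assign(date=pd.to_datetime(transactions["date"]))
    daily = tx.groupby("date")["spend"].sum()
    return daily.asfreq("D")  # one row per day; missing days become NaN


def spend_vs_expected(daily: pd.Series, weeks: int = 4) -> pd.Series:
    """Ratio of each day's spend to the mean of the same weekday in prior weeks."""
    lags = pd.concat([daily.shift(7 * k) for k in range(1, weeks + 1)], axis=1)
    expected = lags.mean(axis=1, skipna=False)  # NaN until enough history exists
    return daily / expected


# Hypothetical data: two zip codes, five weeks, spend rising through the week
dates = pd.date_range("2024-01-01", periods=35, freq="D")
rows = [{"date": d, "zip": z, "spend": 100.0 + 10.0 * d.dayofweek}
        for d in dates for z in ("10001", "94105")]
ratio = spend_vs_expected(daily_total_spend(pd.DataFrame(rows)))
print(ratio.tail())
```

A ratio above 1 means spending ran ahead of its recent same-weekday norm; a ratio below 1 means it lagged.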

That information is meaningful not only for Visa stock itself but also for forecasting macro features such as PPI or CPI. Now imagine we wanted to get a little more granular and determine what people spent their money on.

We may want to cross-reference the billing zip codes with a new, unrelated data set that measures the average household income by zip code. If we were to break household income into five classes (low income, mid income, upper middle income, high income, and ultra high income), we could determine whether excess spending is mainly attributable to high earners, which may have a direct impact on consumer discretionary sales vs. consumer staples. This type of identification of additional data sets, cross-referencing them, and extrapolating new features requires human intellect and domain knowledge.
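The cross-referencing step can be sketched as a join followed by a bucketing of income. Everything named here is an assumption for illustration: the column names, the function `spend_by_income_class`, and especially the dollar thresholds that separate the five classes, which a domain expert would set.

```python
import pandas as pd

INCOME_LABELS = ["Low", "Mid", "Upper middle", "High", "Ultra high"]
# Hypothetical thresholds in dollars; chosen only to make the example concrete
INCOME_EDGES = [0, 40_000, 75_000, 125_000, 250_000, float("inf")]


def spend_by_income_class(spend: pd.DataFrame, income: pd.DataFrame) -> pd.DataFrame:
    """Daily spend per income class.

    Assumes spend has columns 'date', 'zip', 'spend' and income has
    'zip', 'avg_household_income' (hypothetical names).
    """
    income = income.copy()
    income["income_class"] = pd.cut(
        income["avg_household_income"], bins=INCOME_EDGES, labels=INCOME_LABELS
    ).astype(object)
    # Left join keeps every spend row; zips without income data get NaN
    # and are dropped by the groupby below
    merged = spend.merge(income[["zip", "income_class"]], on="zip", how="left")
    totals = merged.groupby(["date", "income_class"])["spend"].sum()
    table = totals.unstack(fill_value=0)
    return table.reindex(columns=INCOME_LABELS, fill_value=0)


# Hypothetical inputs
spend = pd.DataFrame({
    "date": ["2024-01-01"] * 3 + ["2024-01-02"] * 3,
    "zip": ["A", "B", "C"] * 2,
    "spend": [10.0, 20.0, 30.0, 11.0, 21.0, 31.0],
})
income = pd.DataFrame({
    "zip": ["A", "B", "C"],
    "avg_household_income": [30_000, 150_000, 200_000],
})
print(spend_by_income_class(spend, income))
```

Each column of the result is a new timeseries feature, so the share of spending coming from the top classes can be tracked over time and compared against consumer discretionary and consumer staples performance.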

Feature Engineering in the Data Validation Process

Even with recent advancements in machine learning and data science, we still seem to be light-years away from throwing a bunch of data at a deep learning network and expecting it to come up with a meaningful outcome.

Feature engineering combines art and science. It is a vital step within a broader, multi-step process of applying data science and machine learning for predictive analytics.

The above represents our process at Lucena of conditioning data for machine learning research. Feature engineering normally occurs on the entire data set before deploying any machine learning disciplines.

Watch more about the data validation process in “The Journey of an Alternative Data Signal”

How to Maximize Feature Engineering

For maximum benefit, one must consider a combination of domain expertise, business acumen, and strong technical skills suited to automating feature extraction, feature creation, and feature selection. At the end of such a process, only a subset of selected features is presented to the deep learning network for training. The combination of feature data with labeled data (corresponding outcomes) is the oil that fuels the AI learning process.
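To make the pairing of features with labels concrete, the sketch below builds a training table in which each row's label is the forward return of an asset, then keeps the features most correlated with that label. Both helper names and the choice of forward return as the outcome are assumptions for illustration. A linear correlation filter is only a simple stand-in for feature selection, since it will miss the non-linear relationships a network may later exploit.

```python
import pandas as pd


def build_training_table(features: pd.DataFrame, prices: pd.Series,
                         horizon: int = 5) -> pd.DataFrame:
    """Align each feature row with the return realized over the next `horizon` periods."""
    # shift(-horizon) looks forward, so the label is an outcome, never an input
    label = (prices.shift(-horizon) / prices - 1).rename("forward_return")
    table = features.join(label, how="inner")
    return table.dropna()  # drops the final rows, whose outcome is not yet known


def select_top_features(table: pd.DataFrame, label_col: str = "forward_return",
                        k: int = 3) -> list:
    """Keep the k features with the highest absolute correlation to the label."""
    corr = table.drop(columns=label_col).corrwith(table[label_col]).abs()
    return corr.sort_values(ascending=False).head(k).index.tolist()
```

Assuming `features` and `prices` share the same index, `select_top_features(build_training_table(features, prices))` returns the column names to pass on to training.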

A model is trained by traversing through many thousands of iterations looking to identify the non-linear relationships between the features and the outcome over time. Seasonally adjusted, normalized, and uniformly distributed features are just a few examples of how base features transform into something more meaningful for machine learning research.
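Two of those transformations can be sketched in a few lines of pandas. The function names and the 252-period window (roughly one trading year of daily data) are assumptions. Note that the percentile-rank version below ranks against the full sample for simplicity; in live research the rank would be computed on trailing data only, to avoid look-ahead.

```python
import pandas as pd


def rolling_zscore(feature: pd.Series, window: int = 252) -> pd.Series:
    """Normalize a feature against its own trailing mean and standard deviation."""
    mean = feature.rolling(window).mean()
    std = feature.rolling(window).std()
    return (feature - mean) / std


def to_uniform(feature: pd.Series) -> pd.Series:
    """Map values to (0, 1] by percentile rank, giving a roughly uniform distribution.

    Uses the full sample, which is fine for illustration but leaks future
    information if applied to a backtest as-is.
    """
    return feature.rank(pct=True)


# Hypothetical feature values
raw = pd.Series([10.0, 30.0, 20.0, 40.0])
print(to_uniform(raw).tolist())
print(rolling_zscore(pd.Series([1.0, 2.0, 3.0, 4.0]), window=3).tolist())
```

Putting features on a common scale this way keeps a single large-magnitude input from dominating training.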

Ultimately, it all starts with sound, human-driven feature engineering before we pass the baton to the machines to complete the job.
