Data preparation can make or break a machine learning project. It is often estimated that around 80% of the effort in an AI project goes into data collection and preparation. In our “feature engineering” webinar series, we covered ways to prepare and enrich your dataset to make it ready for training a machine learning model. You can watch the full webinar replay here. We’ve put together this blog post as a summary of the topics we tackled.
Feature engineering is the process of augmenting a dataset by computing new dimensions from existing series. There are many ways to go about engineering new features; we have narrowed them down to three approaches:
- From insight: based on economics or fundamental analysis knowledge, empirically discovered leading indicators, experience with screening financial data, or just a hunch, combine series to derive new series. For instance, you can create relative baskets by grouping prices by regions to obtain an “America index” or “Asia index”. You can further take the ratio of these baskets to add a dimension of relative movements across geographies.
- From endogenous transformations: turn one series into multiple series that give information about its past behaviour. Introduce lags to add a sense of chronology, differences to materialise absolute past performance on different time horizons or ratios for relative past performance. Use technical analysis to enrich your features.
- From exogenous transformations: compute dissimilarity measures between pairs of series (e.g. cross-correlations for linear relationships, dynamic time warping for non-linear relationships). Use clustering to compare baskets rather than pairs.
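As an illustration of the endogenous transformations above, here is a minimal pandas sketch (assuming a Python workflow; the function name and lag horizons are our own, not from the webinar) that turns one price series into lag, difference, and ratio features:

```python
import pandas as pd

def add_endogenous_features(prices: pd.Series, lags=(1, 5, 21)) -> pd.DataFrame:
    """Derive lag, difference, and ratio features from a single series.

    Every feature at date T uses only values at T or earlier, so no
    forward-looking bias is introduced.
    """
    feats = pd.DataFrame({"price": prices})
    for k in lags:
        feats[f"lag_{k}"] = prices.shift(k)             # value k periods ago
        feats[f"diff_{k}"] = prices - prices.shift(k)   # absolute change over k periods
        feats[f"ratio_{k}"] = prices / prices.shift(k)  # relative change over k periods
    return feats

prices = pd.Series([100.0, 102.0, 101.0, 104.0, 103.0, 105.0])
feats = add_endogenous_features(prices, lags=(1, 2))
```

The first few rows will contain NaNs where not enough history exists yet; dropping or masking them before training is up to you.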
One point worth highlighting: extreme care is required when adding features, as the process is prone to introducing forward-looking biases (a.k.a. data leakage in data science jargon). When adding a feature for a data point at date T, you can only use data at dates prior to T to create your feature. It is unfortunately very easy to breach this rule; something as innocuous as computing an average over the full sample will do it. Depending on your machine learning model, you may also have to normalise your data so that the model can learn. In such cases, normalisation can introduce look-ahead bias too if conducted the naive way.
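To make the normalisation pitfall concrete, here is a small pandas sketch contrasting a naive z-score (which leaks future information through the full-sample mean and standard deviation) with an expanding-window version that only uses data available at each date. The function names are illustrative:

```python
import pandas as pd

def leaky_zscore(s: pd.Series) -> pd.Series:
    # WRONG for time series: the full-sample mean/std let every
    # point "see" the future
    return (s - s.mean()) / s.std()

def safe_zscore(s: pd.Series) -> pd.Series:
    # At each date T, statistics are computed from data up to T only
    return (s - s.expanding().mean()) / s.expanding().std()

s = pd.Series([1.0, 2.0, 4.0, 3.0, 5.0, 2.0])
```

A quick sanity check: if you change the last observation, the safe version leaves all earlier normalised values untouched, whereas the leaky version rewrites history.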
Machine Learning Models
When it comes to machine learning, it matters a great deal to select the model that fits your objectives. Consequently, the first step is to define what you want your model to learn. Is your task supervised, meaning that you will show the model some target values to match, or unsupervised, i.e. the model has to find a way to group data in meaningful clusters?
In our webinar, we looked at supervised learning. There are many supervised models, some of the most famous being neural networks, random forests, XGBoost and their derivatives. We also decided to perform a classification task: learning to predict the direction of trends rather than their exact magnitude, which would be a regression problem. For this purpose, tree-based models can do the job. They have a few advantages over other techniques: they are fast to train, they do not require extra processing of the data (such as normalisation), and they are relatively easy to fine-tune yet powerful. Random forests in particular are quite resistant to overfitting, although XGBoost is more prone to it. And finally, they can be explained.
Overall, they have a lot of positive attributes, making them a good starting point for building models. One of their drawbacks, though, is that they are unable to extrapolate beyond the range of the training data. However, as we are only predicting directions, they are fit for purpose.
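As a toy illustration of a tree-based direction classifier, here is a hypothetical scikit-learn sketch on synthetic data (the features and their relationship to the target are invented for the example; note the chronological train/test split, since time series should never be shuffled before splitting):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Two made-up features, e.g. yesterday's return and a momentum signal
X = rng.normal(size=(500, 2))
# Direction target: 1 if a noisy combination of the features is positive
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500) > 0).astype(int)

# Chronological split: train on the earliest 400 points, test on the rest
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)  # fraction of directions predicted correctly
```

No normalisation step is needed before fitting, which is one of the conveniences of tree-based models mentioned above.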
Another critical point in supervised learning is defining what to learn. At the end of the day, a model only learns what it is told to learn. Precisely defining targets is therefore less trivial than it sounds, especially for financial series. Take the 1-day change as a target, for instance, and you will give tree-based models a very difficult task. And, echoing the earlier warning, take the 1-day change but forget to lag it and there is a high probability you have introduced a forward-looking bias, particularly if your time series belong to different time zones.
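To make the target definition concrete, here is a small pandas sketch building a next-day direction target. The target deliberately looks one step forward with shift(-1), because it is the quantity the model must predict; the feature, by contrast, only uses data up to date T. The price series is a made-up example:

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 101.0, 99.5, 100.5, 102.0])

# Feature known at date T: the 1-day change, using T and T-1 only
change = prices.diff()

# Target at date T: direction of the NEXT day's move; built with
# shift(-1), so it intentionally looks forward -- this is what we predict
target = np.sign(prices.shift(-1) - prices)

# Align feature and target, dropping rows without enough history/future
df = pd.DataFrame({"change": change, "target": target}).dropna()
```

Mixing these two roles up, e.g. feeding the unshifted next-day change in as a feature, is exactly the kind of leakage discussed earlier.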
One advantage of tree-based models is that they can be interpreted, which is a hot topic around ML models. Both the Random Forest and XGBoost packages allow you to display features by order of importance in your model. This is typically done by comparing the impurity of the data (typically Gini or entropy) before and after splitting the dataset on a particular feature: the bigger the drop, the greater the feature's explanatory power. Thanks to this tool, you can refine your model by excluding unnecessary features. Often, you can reach the same level of performance with a much lighter model, which in turn allows you to train it faster or to combine it with other models more easily.
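Here is a hypothetical sketch of impurity-based feature importance with scikit-learn, on synthetic data where only the first two columns carry signal (the column roles are invented for the example):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Four synthetic features: only the first two drive the target,
# the last two are pure noise
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = clf.feature_importances_  # normalised to sum to 1
```

Inspecting `importances`, the two informative columns should dominate the ranking, and the noise columns become natural candidates for removal when lightening the model.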
Take a deeper look at our videos to implement feature engineering step by step. Start a new thread on the forum or get in touch with us on AlphaChat should you want to discuss the subject; we’ll be happy to take your strategy one step further.