In a machine learning strategy, we choose which features to give the model but don't choose how to sift and combine those features to generate alpha, as that is the model's job. In that respect, the machine learning model does some of our research work for us. However, we must still choose which model to use and what hyper-parameters to set; hyper-parameters give us some control over the model's inner workings.
Here, we will use a random forest model from scikit-learn. Most scikit-learn algorithms offer variants for regression problems (where we want to predict continuous variables) and classification problems (where we want to predict discrete variables). We want a classification model since the target we are predicting is a boolean (1 or 0) indicating whether the stock's return is above the cross-sectional median.
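To make the target concrete, here is a minimal sketch of how such a 1-or-0 target could be computed with pandas, assuming a hypothetical DataFrame of returns with dates as the index and securities as the columns (this is not the demo strategy's actual code):
import pandas as pd

# hypothetical forward returns: dates as index, securities as columns
returns = pd.DataFrame({
    "FI12345": [0.02, -0.01, 0.03],
    "FI23456": [0.01, 0.02, -0.02],
    "FI34567": [-0.03, 0.00, 0.01],
})

# cross-sectional median return for each date
median_returns = returns.median(axis=1)

# 1 if the stock's return is above that date's median, else 0
targets = returns.gt(median_returns, axis=0).astype(int)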
To use a random forest classification model in our walk-forward optimization, we instantiate the model with the desired hyper-parameters and use joblib to save the model to disk:
from sklearn.ensemble import RandomForestClassifier
import joblib

# instantiate the model with our chosen hyper-parameters
clf = RandomForestClassifier(n_estimators=300, max_depth=20, n_jobs=-1, verbose=0, random_state=0)

# serialize the unfitted model to disk for the walk-forward optimization to use
joblib.dump(clf, "raf_model.joblib")
['raf_model.joblib']
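Continuing the session above, a quick sanity check (not required by the strategy) is to load the saved model back with joblib and inspect it before the backtest uses it:
model = joblib.load("raf_model.joblib")
print(type(model).__name__)   # RandomForestClassifier
print(model.n_estimators)     # 300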
A description of the available hyper-parameters is beyond the scope of this tutorial; consult the scikit-learn user guide. However, note the random_state parameter: if it is omitted, fitting the model is non-deterministic, which means that re-running two identical backtests may yield different results. As this can be confusing, we set random_state. Alternatively, you could omit it to see if your model is robust to the introduction of randomness.
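To illustrate the effect of random_state, here is a small, self-contained sketch on toy data (not the strategy's features): two forests fitted with the same random_state produce identical predictions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy data for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf_a = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
clf_b = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# identical hyper-parameters and random_state -> identical predictions
print((clf_a.predict(X) == clf_b.predict(X)).all())  # True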
Now that we've selected a scikit-learn model, let's highlight one other section of the demo strategy. By default, MoonshotML generates predictions by calling the model's predict() method. A classification model predicts which class each sample belongs to: in our case, whether the stock's return will be above or below the median (1 or 0). However, some models, including the RandomForestClassifier we have chosen, also provide a predict_proba() method, which instead predicts the probability that a sample belongs to a class; in our case, the probability that a stock's return will be above the median. With only two classes (1 or 0), predict_proba() gives us more information than predict() because it assigns not just a class but the probability of belonging to class label 1.
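The difference is easiest to see on toy data (again, not the strategy's actual features): predict() returns one class label per sample, while predict_proba() returns one probability per class, with columns ordered as in clf.classes_.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.RandomState(0).normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

print(clf.classes_)              # [0 1]
print(clf.predict(X[:2]))        # e.g. [1 0]
print(clf.predict_proba(X[:2]))  # e.g. [[0.04 0.96]
                                 #       [0.98 0.02]]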
To use predict_proba(), we override the model as shown below so that when MoonshotML calls model.predict(), it actually runs model.predict_proba(). (Overriding methods and functions at runtime like this is called monkey-patching.) We place this code in prices_to_features so that the override is in place before predictions are made.
...
# when Moonshot calls predict(), we want it to actually call predict_proba()
# see https://www.quantrocket.com/docs/#ml-predict-probabilities
if self.model:
    self.model.predict = self.model.predict_proba
return features, targets
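For context, here is a condensed sketch of where this snippet might sit inside a MoonshotML strategy. The class name, database code, and feature logic are purely illustrative; only the predict_proba override mirrors the snippet above.
from moonshot import MoonshotML

class DemoMLStrategy(MoonshotML):

    CODE = "demo-ml"       # illustrative strategy code
    DB = "demo-prices"     # illustrative price database

    def prices_to_features(self, prices):
        closes = prices.loc["Close"]
        returns = closes.pct_change()

        # illustrative feature: prior day's return
        features = {}
        features["returns_1d"] = returns.shift(1)

        # 1 if the stock's return is above the cross-sectional median, else 0
        targets = returns.gt(returns.median(axis=1), axis=0).astype(int)

        # when Moonshot calls predict(), we want it to actually call predict_proba()
        if self.model:
            self.model.predict = self.model.predict_proba

        return features, targets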