The file kitchensink_ml.py contains the strategy code.
Due to the large number of features, the strategy's prices_to_features
calls a variety of helper methods to create the various categories of features. Not only does this improve code readability but it also allows intermediate DataFrames to be garbage-collected more frequently, reducing memory usage.
The method add_fundamental_features
adds various fundamental values and ratios. For each fundamental field, we choose to rank the stocks and use the rank as the feature, rather than the raw fundamental value. This is meant to ensure more uniform scaling of features. For example:
features["enterprise_multiples_ranks"] = enterprise_multiples.rank(axis=1, pct=True).fillna(0.5)
The parameter pct=True
causes Pandas to rank the stocks along a continuum from 0 to 1, nicely scaling the data. We use fillna(0.5)
to place NaNs at the center rather than at either extreme, so that the model does not interpret them as having either a good or bad rank.
The method add_quality_features
adds additional fundamental features related to quality as defined in the Piotroski F-Score. We add the nine individual F-score components as well as the daily F-score ranks.
The method add_price_and_volume_features
adds a number of features derived from price and volume including:
The method add_technical_indicator_features
calculates several technical indicators for each stock in the universe:
Each indicator can have a value between 0 and 1. In the case of Bollinger Bands, where the price could exceed the band, resulting in a value less than 0 or greater than 1, we choose to winsorize the price at the upper and lower bands in order to keep the range between 0 and 1.
winsorized_closes = closes.where(closes > lower_bands, lower_bands).where(closes < upper_bands, upper_bands)
The method add_securities_master_features
adds a few features from the securities master database: whether the stock is an ADR, and what sector it belongs to. Note that sectors must be one-hot encoded, resulting in a boolean DataFrame for each sector indicating whether the stock belongs to that particular sector. See the usage guide for more on one-hot encoding.
The method add_market_features
adds several market-wide features to help the model know what is happening in the broader market, including:
The first 3 of these features are derived from the index data collected from IBKR. We query this database for the date range of our prices DataFrame. (Note that we identified the database as the BENCHMARK_DB
so that we can use SPY as the backtest benchmark, see the usage guide for more on benchmarks.)
# Get prices for SPY, VIX, TRIN-NYSE
market_prices = get_prices(self.BENCHMARK_DB,
fields="Close",
start_date=closes.index.min(),
end_date=closes.index.max())
market_closes = market_prices.loc["Close"]
Using SPY as an example, we extract the Series of SPY prices from the DataFrame and perform our calculations.
# Is S&P above its 200-day?
spy_closes = market_closes[self.SPY_SID]
spy_200d_mavg = spy_closes.rolling(200).mean()
spy_above_200d = (spy_closes > spy_200d_mavg).astype(int)
Now that we have a Series indicating whether SPY is above its moving average, we need to reshape the Series like our prices DataFrame, so that the SPY indicator is provided to the model as a feature for each stock. First, we reindex the Series like the prices DataFrame, in case there are any differences in dates between the two data sources (we don't expect there to be a difference but it is possible when using data from different sources). Then, we use apply
to broadcast the Series along each column (i.e. each security) of the prices DataFrame:
# Must reindex like closes in case indexes differ
spy_above_200d = spy_above_200d.reindex(closes.index, method="ffill")
features["spy_above_200d"] = closes.apply(lambda x: spy_above_200d)
The McClellan oscillator is a market breadth indicator which we calculate using the Sharadar data, counting the daily advancers and decliners then calculating the indicator from these Series:
# McClellan oscillator
total_issues = closes.count(axis=1)
returns = closes.pct_change()
advances = returns.where(returns > 0).count(axis=1)
declines = returns.where(returns < 0).count(axis=1)
net_advances = advances - declines
pct_net_advances = net_advances / total_issues
ema_19 = pct_net_advances.ewm(span=19).mean()
ema_39 = pct_net_advances.ewm(span=39).mean()
mcclellan_oscillator = (ema_19 - ema_39) * 10
# Winsorize at 50 and -50
mcclellan_oscillator = mcclellan_oscillator.where(mcclellan_oscillator < 50, 50).where(mcclellan_oscillator > -50, -50)
As with the SPY indicator, we lastly broadcast the Series with apply
to shape the indicator like the prices DataFrame:
features["mcclellan_oscillator"] = closes.apply(lambda x: mcclellan_oscillator).fillna(0)
Having created all of our features, in prices_to_features
we create our targets by asking the model to predict the one-week forward return:
def prices_to_features(self, prices: pd.DataFrame):
...
# Target to predict: next week return
one_week_returns = (closes - closes.shift(5)) / closes.shift(5).where(closes.shift(5) > 0)
targets = one_week_returns.shift(-5)
...
The features and targets will be fed to the machine learning model during training. During backtesting or live trading, the features (but not the targets) will be fed to the machine learning model to generate predictions. The model's predictions will in turn be fed to the predictions_to_signals
method, which creates buy signals for the 10 stocks with the highest predicted return and sell signals for the 10 stocks with the lowest predicted return, provided they have adequate dollar volume:
We choose to train our model on all securities regardless of dollar volume but only trade securities with adequate dollar volume. We could alternatively have chosen to only train on the set of liquid securities we were willing to trade.
def predictions_to_signals(self, predictions: pd.DataFrame, prices: pd.DataFrame):
...
# Buy (sell) stocks with best (worst) predicted return
have_best_predictions = predictions.where(have_adequate_dollar_volumes).rank(ascending=False, axis=1) <= 10
have_worst_predictions = predictions.where(have_adequate_dollar_volumes).rank(ascending=True, axis=1) <= 10
signals = have_best_predictions.astype(int).where(have_best_predictions, -have_worst_predictions.astype(int).where(have_worst_predictions, 0))
...
Capital is divided equally among the signals, with weekly rebalancing:
def signals_to_target_weights(self, signals: pd.DataFrame, prices: pd.DataFrame):
# Allocate equal weights
daily_signal_counts = signals.abs().sum(axis=1)
weights = signals.div(daily_signal_counts, axis=0).fillna(0)
# Rebalance weekly
# Resample daily to weekly, taking the first day's signal
# For pandas offset aliases, see https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
weights = weights.resample("W").first()
# Reindex back to daily and fill forward
weights = weights.reindex(prices.loc["Close"].index, method="ffill")
...