By default, a pipeline performs computations on every asset in the bundle. As we learned in an earlier lesson, an optional screen argument (consisting of a Filter) can be applied that limits the pipeline output to a subset of assets. Under the hood, screens are applied as the last step of a pipeline computation. If we screen for assets with dollar volume above $1MM, the pipeline will compute dollar volume for every asset in the bundle, then filter out any asset/date combinations where dollar volume falls below the threshold.
This means that we are often performing computations on assets that don't ultimately interest us. Often, this extra computational work is necessary because we don't know in advance which assets will pass the screen. We don't know which assets have high dollar volume until we compute dollar volume for all assets. However, sometimes we do know in advance that certain assets can be excluded. If we are screening for certain kinds of stocks but our bundle includes stocks and ETFs, does it make sense to perform computations on the ETFs? Wouldn't it be better to exclude the ETFs entirely?
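The cost of applying a screen last can be illustrated with a toy model in plain Python. This is only a sketch, not Zipline code: the `expensive_factor` function, the asset names, and the dollar volume figures are all made up for illustration, and the counter simply records how many assets the computation touches.

```python
# Toy model (not Zipline): a screen filters results only after every asset is computed.
calls = {"count": 0}

def expensive_factor(dollar_volume):
    """Stand-in for any per-asset computation; counts how often it runs."""
    calls["count"] += 1
    return dollar_volume  # here the "factor" is just dollar volume itself

# Hypothetical bundle: asset -> dollar volume for a single date
bundle = {"AAA": 5_000_000, "BBB": 250_000, "CCC": 1_200_000, "DDD": 40_000}

# Step 1: compute the factor for every asset in the bundle
computed = {asset: expensive_factor(dv) for asset, dv in bundle.items()}

# Step 2: apply the screen as the last step
screened = {asset: value for asset, value in computed.items() if value > 1_000_000}

print(sorted(screened))   # assets that pass the screen
print(calls["count"])     # number of assets the factor was computed for
```

Even though only some assets survive the screen, the counter shows that the factor ran once for every asset in the toy bundle.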
One way to exclude ETFs would be to use masking. We could create a filter that returns False for ETFs and pass that filter as the mask argument to any factors we want to use. However, a better approach in this case is to exclude ETFs from the initial universe that our pipeline considers. This can be done with the initial_universe parameter to the Pipeline class:
from zipline.pipeline import Pipeline, master

# SecuritiesMaster.Etf is a boolean column, and the unary operator (~)
# negates it
are_not_etfs = ~master.SecuritiesMaster.Etf.latest

pipeline = Pipeline(
    initial_universe=are_not_etfs
)
In this example, we import the SecuritiesMaster Dataset (which points to QuantRocket's securities master database), create a filter that negates the Etf column (a boolean column indicating whether the asset is an ETF), and pass the filter to our Pipeline as initial_universe. Any columns we add to the above Pipeline will only be computed on assets that are not ETFs. ETFs will not even be loaded into the Pipeline workspace, resulting in a speed improvement compared to using screen or mask.
The filter passed to initial_universe can derive from any column of the SecuritiesMaster Dataset. The filter can combine multiple columns as long as they are ANDed together using & (filters ORed together with | are not supported for initial_universe). In the following example, we limit the initial universe to common stocks (thus excluding not only ETFs but also REITs, ADRs, preferred stocks, LPs, etc.) and, for stocks that have multiple share classes, we limit the universe to the primary share class:
# Equities listed as common stock (not preferred stock, ETF, ADR, LP, etc)
common_stock = master.SecuritiesMaster.usstock_SecurityType2.latest.eq('Common Stock')

# Filter for primary share equities; primary shares can be identified by a
# null usstock_PrimaryShareSid field (i.e. no pointer to a primary share)
is_primary_share = master.SecuritiesMaster.usstock_PrimaryShareSid.latest.isnull()

pipeline = Pipeline(
    initial_universe=(common_stock & is_primary_share)
)
In addition to accepting filters created from the SecuritiesMaster Dataset, initial_universe also accepts the four filters imported below, which reference static lists of assets.
To see the docstrings for these filters, click on the filter name and press Control in JupyterLab, or consult the API Reference.
from zipline.pipeline.filters import (
    SingleAsset,
    StaticAssets,
    StaticSids,
    StaticUniverse
)
The initial_universe parameter does not accept any other filters besides the ones listed above, because these are the only filters that represent static lists of assets or (in the case of the SecuritiesMaster Dataset) static characteristics of assets. Other filters represent dynamic characteristics of assets that change over time (such as price, volume, or fundamentals) and require loading the asset's data to see if it passes the filter. If you try to use an unsupported filter with initial_universe, you will receive an error message.
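The distinction between static and dynamic filters can be sketched in plain Python. The example below is a toy model, not Zipline's implementation: the `securities_master` and `prices` dictionaries and the `load_prices` helper are hypothetical. It shows that a static characteristic can be checked from metadata alone, whereas a price-based condition cannot be evaluated until the asset's data has been loaded.

```python
# Toy model (not Zipline): static filters need only metadata; dynamic ones need loaded data.

# Hypothetical static metadata, one record per asset
securities_master = {
    "AAA": {"Etf": False},
    "BBB": {"Etf": True},
    "CCC": {"Etf": False},
}

# Hypothetical price history, standing in for data that is costly to load
prices = {"AAA": [10.0, 11.0], "BBB": [50.0, 51.0], "CCC": [3.0, 2.5]}
loaded = []  # records which assets had their prices loaded

def load_prices(asset):
    """Stand-in for an expensive data load."""
    loaded.append(asset)
    return prices[asset]

# Static check: decided from metadata before any prices are loaded
initial_universe = [a for a, meta in securities_master.items() if not meta["Etf"]]

# Dynamic check: requires loading each remaining asset's prices
above_five = [a for a in initial_universe if load_prices(a)[-1] > 5.0]

print(initial_universe)  # assets that are not ETFs
print(loaded)            # assets whose prices were actually loaded
print(above_five)        # assets whose latest price exceeds 5.0
```

In this sketch the ETF is excluded before any price data is touched, which is the property that makes a filter suitable for narrowing the universe up front.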
initial_universe
The main reason that Pipeline supports an initial_universe argument is to speed up computation. Since screen supports any filter while initial_universe only supports a limited set of filters, we could rely entirely on screen if we didn't care about speed. But since we care about speed, a general rule of thumb is to use initial_universe when possible and use screen for filters that initial_universe doesn't support.
Let's run a similar pipeline with screen and then with initial_universe to demonstrate the speed benefit of using initial_universe. The fewer assets we are interested in, the greater the speed benefit. Suppose we want to compute the rolling linear regression of two assets, Apple and Microsoft, versus SPY, using the built-in RollingLinearRegressionOfReturns factor. (This factor is computationally expensive and thus a good choice for this demonstration, but we will omit discussion of its parameters and multiple outputs; see the factor's docstring if you want to learn more about it.)
from zipline.pipeline.factors import RollingLinearRegressionOfReturns
from zipline.research import symbol

spy = symbol('SPY')
aapl = symbol('AAPL')
msft = symbol('MSFT')

regression_factor = RollingLinearRegressionOfReturns(
    target=spy,
    returns_length=2,
    regression_length=10,
)
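For intuition about what a rolling regression of returns produces, here is a minimal pure-Python sketch. It is not the implementation of RollingLinearRegressionOfReturns: the `ols_alpha_beta` and `rolling_alpha_beta` helpers and the return series are hypothetical, and the sketch simply fits an ordinary least squares line of an asset's returns against a target's returns over each trailing window.

```python
# Hypothetical helpers for illustration only; not Zipline's implementation.

def ols_alpha_beta(x, y):
    """Ordinary least squares fit of y = alpha + beta * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    beta = cov_xy / var_x
    alpha = mean_y - beta * mean_x
    return alpha, beta

def rolling_alpha_beta(target_returns, asset_returns, regression_length):
    """Fit one regression per trailing window of `regression_length` observations."""
    results = []
    for end in range(regression_length, len(target_returns) + 1):
        window = slice(end - regression_length, end)
        results.append(ols_alpha_beta(target_returns[window], asset_returns[window]))
    return results

# Made-up daily returns: the asset moves exactly 2x the target, plus 0.001
target = [0.01, -0.02, 0.015, 0.005, -0.01, 0.02]
asset = [0.001 + 2 * r for r in target]

for alpha, beta in rolling_alpha_beta(target, asset, regression_length=4):
    print(round(alpha, 6), round(beta, 6))
```

Because the made-up asset series is an exact linear function of the target, every window recovers the same intercept and slope. With real data the values would drift from window to window. Repeating a fit like this for every asset on every date is what makes such a factor expensive.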
First, let's see how long it takes to run this pipeline using screen:
%%time
from zipline.pipeline.filters import StaticAssets
from zipline.research import run_pipeline

pipeline = Pipeline(
    columns={
        'alpha': regression_factor.alpha,
        'beta': regression_factor.beta,
    },
    screen=StaticAssets([aapl, msft])  # limit output to Apple and Microsoft
)

results = run_pipeline(pipeline, start_date='2010-01-05', end_date='2010-06-05')
results.head()
CPU times: user 36.3 s, sys: 71.9 ms, total: 36.4 s
Wall time: 36.4 s

date | asset | alpha | beta |
---|---|---|---|
2010-01-05 | Equity(FIBBG000B9XRY4 [AAPL]) | 0.007732 | 0.982085 |
2010-01-05 | Equity(FIBBG000BPH459 [MSFT]) | 0.000622 | 1.157044 |
2010-01-06 | Equity(FIBBG000B9XRY4 [AAPL]) | 0.006409 | 0.959515 |
2010-01-06 | Equity(FIBBG000BPH459 [MSFT]) | -0.001242 | 1.052271 |
2010-01-07 | Equity(FIBBG000B9XRY4 [AAPL]) | 0.004031 | 1.082494 |
Despite only returning data for two assets, this pipeline had to compute regression factors for every asset in the bundle, causing a longer runtime. Now let's see how long it takes to run the same pipeline using initial_universe:
%%time
pipeline = Pipeline(
    columns={
        'alpha': regression_factor.alpha,
        'beta': regression_factor.beta,
    },
    initial_universe=StaticAssets([aapl, msft, spy]),  # limit universe to Apple, Microsoft, and SPY
    screen=StaticAssets([aapl, msft]),  # limit output to Apple and Microsoft
)

results = run_pipeline(pipeline, start_date='2010-01-05', end_date='2010-06-05')
results.head()
CPU times: user 75.2 ms, sys: 4.02 ms, total: 79.2 ms
Wall time: 83.9 ms

date | asset | alpha | beta |
---|---|---|---|
2010-01-05 | Equity(FIBBG000B9XRY4 [AAPL]) | 0.007732 | 0.982085 |
2010-01-05 | Equity(FIBBG000BPH459 [MSFT]) | 0.000622 | 1.157044 |
2010-01-06 | Equity(FIBBG000B9XRY4 [AAPL]) | 0.006409 | 0.959515 |
2010-01-06 | Equity(FIBBG000BPH459 [MSFT]) | -0.001242 | 1.052271 |
2010-01-07 | Equity(FIBBG000B9XRY4 [AAPL]) | 0.004031 | 1.082494 |
Runtimes will vary based on your hardware, but this pipeline should run much faster because it ignores all but the few assets we are interested in.
Note in the last example that we must include SPY in our initial_universe so we can regress AAPL and MSFT against it, but we then use screen to limit the output to AAPL and MSFT (not SPY). This illustrates another point about initial_universe and screen: they can be used together, with initial_universe limiting the size of the computational universe and screen further filtering the results.
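The interplay between the two arguments can be mimicked with a toy model. This sketch is plain Python rather than Zipline, and the return figures, the extra tickers, and the `excess_return` computation are invented for illustration. It shows that the benchmark must be present in the computational universe for the calculation to work, yet it can still be dropped from the output.

```python
# Toy model (not Zipline): the universe limits what is computed; the screen limits what is returned.

# Hypothetical one-day returns for a small bundle
returns = {"AAPL": 0.012, "MSFT": 0.008, "SPY": 0.005, "XYZ": -0.003, "QQQ": 0.007}

initial_universe = {"AAPL", "MSFT", "SPY"}   # includes the benchmark
screen = {"AAPL", "MSFT"}                    # excludes the benchmark from the output

# Only assets in the initial universe are "loaded" into the workspace
workspace = {a: r for a, r in returns.items() if a in initial_universe}

# The computation needs the benchmark to be in the workspace
benchmark = workspace["SPY"]
excess_return = {a: round(r - benchmark, 6) for a, r in workspace.items()}

# The screen is applied last, removing SPY from the results
results = {a: v for a, v in excess_return.items() if a in screen}

print(results)
```

Dropping SPY from `initial_universe` in this sketch would raise a `KeyError` when the benchmark is looked up, which mirrors why SPY has to stay in the universe even though it is screened out of the results.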