Initial Universe

By default, a pipeline performs computations on every asset in the bundle. As we learned in an earlier lesson, an optional screen argument (consisting of a Filter) can be applied that limits the pipeline output to a subset of assets. Under the hood, screens are applied as the last step of a pipeline computation. If we screen for assets with dollar volume above $1MM, the pipeline will compute dollar volume for every asset in the bundle, then filter out any asset/date combinations where dollar volume falls below the threshold.

This means that we often perform computations on assets that don't ultimately interest us. Much of the time, this extra computational work is necessary because we don't know in advance which assets will pass the screen: we can't tell which assets have high dollar volume until we compute dollar volume for all assets. Sometimes, however, we do know in advance that certain assets can be excluded. If we are screening for certain kinds of stocks but our bundle includes both stocks and ETFs, does it make sense to perform computations on the ETFs? Wouldn't it be better to exclude the ETFs entirely?
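To make the order of operations concrete, here is a toy model in plain Python. It is not Zipline code, and the prices, volumes, and asset names are made up. It computes dollar volume for every asset/date combination first and only then drops the combinations that fall below the $1MM threshold:

```python
# Made-up (price, volume) inputs keyed by (date, asset); illustrative only.
prices_and_volumes = {
    ("2010-01-05", "AAA"): (50.0, 30_000),
    ("2010-01-05", "BBB"): (2.0, 10_000),
    ("2010-01-06", "AAA"): (48.0, 20_000),
    ("2010-01-06", "BBB"): (2.5, 500_000),
}

# Step 1: the factor is computed for every asset/date combination.
dollar_volume = {
    key: price * volume
    for key, (price, volume) in prices_and_volumes.items()
}

# Step 2 (last): the screen drops combinations below the threshold.
passed_screen = {
    key: value for key, value in dollar_volume.items() if value > 1_000_000
}

print(f"computed {len(dollar_volume)} values, kept {len(passed_screen)}")
```

All four values are computed even though only two survive the screen, and the same asset can pass on one date and fail on another.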

One way to exclude ETFs would be to use masking. We could create a filter that returns False for ETFs and pass that filter as the mask argument to any factors we want to use. However, a better approach in this case is to exclude ETFs from the initial universe that our pipeline considers. This can be done with the initial_universe parameter to the Pipeline class:

In [1]:
from zipline.pipeline import Pipeline, master

# SecuritiesMaster.Etf is a boolean column, and the unary operator (~)
# negates it
are_not_etfs = ~master.SecuritiesMaster.Etf.latest

pipeline = Pipeline(
    initial_universe=are_not_etfs
)

In this example, we import the SecuritiesMaster Dataset (which points to QuantRocket's securities master database), create a filter that negates the Etf column (a boolean column indicating whether the asset is an ETF), and pass the filter to our Pipeline as initial_universe. Any columns we add to the above Pipeline will only be computed on assets that are not ETFs. ETFs will not even be loaded into the Pipeline workspace, resulting in a speed improvement compared to using screen or mask.

The filter passed to initial_universe can derive from any column of the SecuritiesMaster Dataset. The filter can combine multiple columns as long as they are ANDed together using & (filters ORed together with | are not supported for initial_universe). In the following example, we limit the initial universe to common stocks (thus excluding not only ETFs but also REITs, ADRs, preferred stocks, LPs, etc.) and, for stocks that have multiple share classes, we limit the universe to the primary share class:

In [2]:
# Equities listed as common stock (not preferred stock, ETF, ADR, LP, etc)
common_stock = master.SecuritiesMaster.usstock_SecurityType2.latest.eq('Common Stock')

# Filter for primary share equities; primary shares can be identified by a
# null usstock_PrimaryShareSid field (i.e. no pointer to a primary share)
is_primary_share = master.SecuritiesMaster.usstock_PrimaryShareSid.latest.isnull()

pipeline = Pipeline(
    initial_universe=(common_stock & is_primary_share)
)
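The logic of the combined filter above can be mimicked in plain Python. The sketch below is only an analogy, assuming made-up securities master records and hypothetical helper names (`all_of`, `common_stock`, `is_primary_share`). It shows static, per-asset predicates being ANDed together, with a null primary share pointer marking a primary share:

```python
# Hypothetical securities master records (made-up values, not real data).
securities = [
    {"symbol": "AAA", "security_type": "Common Stock", "primary_share_sid": None},
    {"symbol": "AAA.B", "security_type": "Common Stock", "primary_share_sid": "AAA"},
    {"symbol": "ETF1", "security_type": "ETF", "primary_share_sid": None},
    {"symbol": "PFD1", "security_type": "Preferred Stock", "primary_share_sid": None},
]


def common_stock(record):
    return record["security_type"] == "Common Stock"


def is_primary_share(record):
    # A null pointer means the record does not point to some other primary share.
    return record["primary_share_sid"] is None


def all_of(*predicates):
    """AND several per-asset predicates together."""
    def combined(record):
        return all(predicate(record) for predicate in predicates)
    return combined


universe_filter = all_of(common_stock, is_primary_share)
initial_universe = [r["symbol"] for r in securities if universe_filter(r)]
print(initial_universe)
```

Only `AAA` passes: `AAA.B` is a secondary share class, and the other two records are not common stock.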

In addition to accepting filters created from the SecuritiesMaster Dataset, initial_universe also accepts the four filters imported below, which reference static lists of assets.

To see the docstrings for these filters, click on the filter name and press Control in JupyterLab, or consult the API Reference.

In [3]:
from zipline.pipeline.filters import (
    SingleAsset,
    StaticAssets,
    StaticSids,
    StaticUniverse
)

The initial_universe parameter does not accept any other filters besides the ones listed above, because these are the only filters that represent static lists of assets or (in the case of the SecuritiesMaster Dataset) static characteristics of assets. Other filters represent dynamic characteristics of assets that change over time (such as price, volume, or fundamentals) and require loading the asset's data to see if it passes the filter. If you try to use an unsupported filter with initial_universe, you will receive an error message.
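The distinction between static and dynamic filters can be sketched as a simple validation rule. This is a toy model, not how Zipline implements the check: `ToyFilter`, `needs_time_series`, and `check_initial_universe` are hypothetical names, and the error text is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyFilter:
    """Hypothetical stand-in for a pipeline filter."""
    name: str
    needs_time_series: bool  # True if price, volume, or fundamental data must be loaded


def check_initial_universe(toy_filter):
    """Reject filters that cannot be evaluated before any asset data is loaded."""
    if toy_filter.needs_time_series:
        raise ValueError(
            f"{toy_filter.name} depends on data that changes over time; "
            "use it as a screen instead"
        )
    return toy_filter


static_list = ToyFilter("static list of assets", needs_time_series=False)
high_dollar_volume = ToyFilter("dollar volume above $1MM", needs_time_series=True)

accepted = check_initial_universe(static_list)

try:
    check_initial_universe(high_dollar_volume)
    rejected = False
except ValueError as error:
    rejected = True
    message = str(error)
    print(message)
```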

Speed Benefit of initial_universe

The main reason that Pipeline supports an initial_universe argument is to speed up computation. Since screen supports any filter while initial_universe only supports a limited set of filters, we could rely entirely on screen if we didn't care about speed. But since we care about speed, a general rule of thumb is to use initial_universe when possible and use screen for filters that initial_universe doesn't support.
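As a rough illustration of this rule of thumb, the plain-Python sketch below applies a static pre-filter before computing a factor and a dynamic screen afterwards, then counts how many factor computations each variant performs. It is a toy model rather than Zipline internals; `run_toy_pipeline` and the asset records are hypothetical.

```python
def run_toy_pipeline(assets, factor, initial_universe=None, screen=None):
    """Toy stand-in for a pipeline run.

    Returns (results, number_of_factor_computations).
    """
    # 1. Static pre-filter: uses only asset metadata, so no factor work happens here.
    if initial_universe is not None:
        assets = [asset for asset in assets if initial_universe(asset)]
    # 2. Compute the factor for every asset still in the universe.
    values = {asset["symbol"]: factor(asset) for asset in assets}
    n_computed = len(values)
    # 3. Apply the screen last, to the computed values.
    if screen is not None:
        values = {symbol: v for symbol, v in values.items() if screen(v)}
    return values, n_computed


# Made-up asset records for illustration.
assets = [
    {"symbol": "AAA", "is_etf": False, "price": 50.0, "volume": 100_000},
    {"symbol": "BBB", "is_etf": False, "price": 2.0, "volume": 10_000},
    {"symbol": "ETF1", "is_etf": True, "price": 300.0, "volume": 900_000},
    {"symbol": "ETF2", "is_etf": True, "price": 40.0, "volume": 5_000},
]


def dollar_volume(asset):
    return asset["price"] * asset["volume"]


def not_etf(asset):
    return not asset["is_etf"]


def liquid(value):
    return value > 1_000_000


screen_only, n_screen_only = run_toy_pipeline(assets, dollar_volume, screen=liquid)
both, n_both = run_toy_pipeline(
    assets, dollar_volume, initial_universe=not_etf, screen=liquid
)
print(n_screen_only, sorted(screen_only))
print(n_both, sorted(both))
```

With the screen alone, all four assets are computed and a liquid ETF slips through. With the pre-filter, only the two stocks are computed and the ETFs never enter the computation.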

Let's run a similar pipeline first with screen and then with initial_universe to demonstrate the speed benefit of initial_universe. The fewer assets we are interested in, the greater the speed benefit. Suppose we want to compute a rolling linear regression of two assets, Apple and Microsoft, against SPY, using the built-in RollingLinearRegressionOfReturns factor. (This factor is computationally expensive and thus a good choice for this demonstration, but we will omit discussion of its parameters and multiple outputs; see the factor's docstring if you want to learn more about it.)

In [4]:
from zipline.pipeline.factors import RollingLinearRegressionOfReturns
from zipline.research import symbol

spy = symbol('SPY')
aapl = symbol('AAPL')
msft = symbol('MSFT')

regression_factor = RollingLinearRegressionOfReturns(
    target=spy,
    returns_length=2,
    regression_length=10,
)

First, let's see how long it takes to run this pipeline using screen:

In [5]:
%%time

from zipline.pipeline.filters import StaticAssets
from zipline.research import run_pipeline

pipeline = Pipeline(
    columns={
        'alpha': regression_factor.alpha,
        'beta': regression_factor.beta,
    },
    screen=StaticAssets([aapl, msft]) # limit output to Apple and Microsoft
)
results = run_pipeline(pipeline, start_date='2010-01-05', end_date='2010-06-05')
results.head()
CPU times: user 36.3 s, sys: 71.9 ms, total: 36.4 s
Wall time: 36.4 s
Out[5]:
date        asset                              alpha      beta
2010-01-05  Equity(FIBBG000B9XRY4 [AAPL])   0.007732  0.982085
2010-01-05  Equity(FIBBG000BPH459 [MSFT])   0.000622  1.157044
2010-01-06  Equity(FIBBG000B9XRY4 [AAPL])   0.006409  0.959515
2010-01-06  Equity(FIBBG000BPH459 [MSFT])  -0.001242  1.052271
2010-01-07  Equity(FIBBG000B9XRY4 [AAPL])   0.004031  1.082494

Despite returning data for only two assets, this pipeline had to compute the regression for every asset in the bundle, which is what makes it slow. Now let's see how long it takes to run the same pipeline using initial_universe:

In [6]:
%%time

pipeline = Pipeline(
    columns={
        'alpha': regression_factor.alpha,
        'beta': regression_factor.beta,
    },
    initial_universe=StaticAssets([aapl, msft, spy]), # limit universe to Apple, Microsoft, and SPY
    screen=StaticAssets([aapl, msft]), # limit output to Apple and Microsoft
)
results = run_pipeline(pipeline, start_date='2010-01-05', end_date='2010-06-05')
results.head()
CPU times: user 75.2 ms, sys: 4.02 ms, total: 79.2 ms
Wall time: 83.9 ms
Out[6]:
date        asset                              alpha      beta
2010-01-05  Equity(FIBBG000B9XRY4 [AAPL])   0.007732  0.982085
2010-01-05  Equity(FIBBG000BPH459 [MSFT])   0.000622  1.157044
2010-01-06  Equity(FIBBG000B9XRY4 [AAPL])   0.006409  0.959515
2010-01-06  Equity(FIBBG000BPH459 [MSFT])  -0.001242  1.052271
2010-01-07  Equity(FIBBG000B9XRY4 [AAPL])   0.004031  1.082494

Runtimes will vary based on your hardware, but this pipeline should run much faster because it ignores all but the few assets we are interested in.

Note in the last example that we must include SPY in our initial_universe so we can regress AAPL and MSFT against it, but we then use screen to limit the output to AAPL and MSFT (not SPY). This illustrates another point about initial_universe and screen: they can be used together, with initial_universe limiting the size of the computational universe and screen further filtering the results.
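The reason the regression target has to be part of the computational universe can be seen in a small standalone sketch. This is an assumption-laden toy, not the implementation of RollingLinearRegressionOfReturns: it fits a single ordinary least squares window on made-up returns, and `regress_universe` is a hypothetical helper that raises an error when the target is missing (how Zipline itself reacts in that case is not modeled here).

```python
def ols_alpha_beta(asset_returns, benchmark_returns):
    """Ordinary least squares fit of asset returns on benchmark returns."""
    n = len(benchmark_returns)
    mean_x = sum(benchmark_returns) / n
    mean_y = sum(asset_returns) / n
    cov_xy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(benchmark_returns, asset_returns)
    )
    var_x = sum((x - mean_x) ** 2 for x in benchmark_returns)
    beta = cov_xy / var_x
    alpha = mean_y - beta * mean_x
    return alpha, beta


def regress_universe(returns_by_asset, universe, target, output_assets):
    """Regress each output asset on the target, using only data in the universe."""
    loaded = {asset: returns_by_asset[asset] for asset in universe}
    if target not in loaded:
        raise KeyError(f"target {target!r} is not in the computational universe")
    return {a: ols_alpha_beta(loaded[a], loaded[target]) for a in output_assets}


# Made-up returns: AAPL = 0.001 + 2 * SPY and MSFT = 0.5 * SPY.
returns = {
    "SPY": [0.01, -0.02, 0.015, 0.005],
    "AAPL": [0.021, -0.039, 0.031, 0.011],
    "MSFT": [0.005, -0.01, 0.0075, 0.0025],
}

# The universe includes the target; the output is limited to AAPL and MSFT.
results = regress_universe(returns, ["AAPL", "MSFT", "SPY"], "SPY", ["AAPL", "MSFT"])

# Leaving the target out of the universe makes the regression impossible.
try:
    regress_universe(returns, ["AAPL", "MSFT"], "SPY", ["AAPL", "MSFT"])
    missing_target_error = False
except KeyError:
    missing_target_error = True
```

Here the universe argument plays the role of initial_universe and the list of output assets plays the role of screen.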

