
Datasets and BoundColumns

DataSets are simply collections of objects that tell the Pipeline API where and how to find the inputs to computations. An example of a DataSet that we have already seen is EquityPricing.

A BoundColumn is a column of data that is concretely bound to a DataSet. Instances of BoundColumn are dynamically created upon access to attributes of a DataSet. Inputs to pipeline computations must be of type BoundColumn. An example of a BoundColumn that we have already seen is EquityPricing.close.

It is important to understand that DataSets and BoundColumns do not hold actual data. Remember that when computations are created and added to a pipeline, they don't actually perform the computation until the pipeline is run. DataSet and BoundColumn can be thought of in a similar way: they are simply used to identify the inputs of a computation. The data is populated later, when the pipeline is run.
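To make the idea of "identifying inputs without holding data" concrete, here is a minimal toy model in plain Python. It is not the Pipeline API's implementation: the names Column, ToyPricing, run, and store are hypothetical stand-ins chosen for illustration, and the toy BoundColumn only mimics the behavior described above.

```python
class BoundColumn:
    """Toy stand-in: identifies a column of a dataset but stores no data."""

    def __init__(self, dataset_name, name, dtype):
        self.dataset_name = dataset_name
        self.name = name
        self.dtype = dtype

    def __repr__(self):
        return f"{self.dataset_name}.{self.name}<{self.dtype.__name__}>"


class Column:
    """Hypothetical column declaration; becomes a BoundColumn on access."""

    def __init__(self, dtype):
        self.dtype = dtype

    def __set_name__(self, owner, name):
        # Remember the attribute name the column was declared under.
        self.name = name

    def __get__(self, instance, owner):
        # A BoundColumn is created dynamically when the attribute is read.
        return BoundColumn(owner.__name__, self.name, self.dtype)


class ToyPricing:
    """Hypothetical dataset: a collection of column declarations."""

    close = Column(float)
    volume = Column(float)


def run(column, store):
    """Look up the actual values only at run time (toy 'pipeline run')."""
    return store[(column.dataset_name, column.name)]


close = ToyPricing.close  # only a reference; no data is loaded here
store = {("ToyPricing", "close"): [10.0, 10.5, 11.0]}  # made-up values
values = run(close, store)
print(close, values)
```

In this sketch the column reference exists before any data does, and the values appear only when run is called, which mirrors how a pipeline defers loading until it is run.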

dtypes

When defining pipeline computations, we need to know the types of our inputs in order to know which operations and functions we can use. The dtype of a BoundColumn tells a computation what the type of the data will be when the pipeline is run. For example, EquityPricing.close has a float dtype, so a factor may perform arithmetic operations on it (e.g. compute the 5-day mean). The importance of this will become clearer in the next lesson. The dtype of a BoundColumn can also determine the type of a computation. In the case of the latest computation, the dtype determines whether the computation is a factor (float), a filter (bool), or a classifier (string, int).
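The two roles of a dtype described above can be sketched in plain Python. This is an illustrative toy, not the Pipeline API: latest_kind and five_day_mean are hypothetical helper names, Python's built-in types stand in for column dtypes, and the prices are made up.

```python
def latest_kind(dtype):
    """Return the kind of term that 'latest' would be for a given dtype."""
    if dtype is float:
        return "factor"
    if dtype is bool:
        return "filter"
    if dtype in (str, int):
        return "classifier"
    raise TypeError(f"unsupported dtype: {dtype!r}")


def five_day_mean(closes):
    """Arithmetic is valid because close prices are floats."""
    window = closes[-5:]  # the five most recent values
    return sum(window) / len(window)


closes = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]  # made-up close prices
mean_close = five_day_mean(closes)
kinds = [latest_kind(t) for t in (float, bool, str, int)]
print(mean_close, kinds)
```

Averaging makes sense for the float column, whereas the same arithmetic on a string-typed column would be meaningless, which is why the dtype constrains which computations apply.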

Pricing Data

Equity pricing data is stored in the EquityPricing dataset. EquityPricing provides five columns:

EquityPricing.open
EquityPricing.high
EquityPricing.low
EquityPricing.close
EquityPricing.volume

Each of these columns has a float dtype. The EquityPricing dataset is bound to the particular bundle specified in the call to run_pipeline (or the default bundle if no bundle is specified).
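The bundle binding can be pictured with a small toy in plain Python. This is only a sketch under stated assumptions: run_toy_pipeline, DEFAULT_BUNDLE, BUNDLES, and the bundle names are hypothetical, and the real run_pipeline takes different arguments. The point is that the same column reference resolves to different data depending on which bundle is in effect.

```python
DEFAULT_BUNDLE = "default-bundle"  # hypothetical default bundle name

# Made-up data: two bundles that both provide a "close" column.
BUNDLES = {
    "default-bundle": {"close": [10.0, 10.5]},
    "other-bundle": {"close": [99.0, 98.5]},
}


def run_toy_pipeline(column, bundle=None):
    """Resolve a column name against the given bundle, else the default."""
    bundle_name = bundle if bundle is not None else DEFAULT_BUNDLE
    return BUNDLES[bundle_name][column]


from_default = run_toy_pipeline("close")
from_other = run_toy_pipeline("close", bundle="other-bundle")
print(from_default, from_other)
```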

Securities Master Data, Fundamental Data, Short Sale Data, etc.

In addition to pricing data, you can access a variety of other datasets in Pipeline, including securities master data, fundamental data, short sale data, and more. For a full list of available datasets, see the usage guide.

Next Lesson: Custom Factors