DataSets are simply collections of objects that tell the Pipeline API where and how to find the inputs to computations. An example of a DataSet that we have already seen is EquityPricing.
A BoundColumn is a column of data that is concretely bound to a DataSet. Instances of BoundColumn are dynamically created upon access to attributes of a DataSet. Inputs to pipeline computations must be of type BoundColumn. An example of a BoundColumn that we have already seen is EquityPricing.close.
It is important to understand that DataSets and BoundColumns do not hold actual data. Remember that when computations are created and added to a pipeline, they don't actually perform the computation until the pipeline is run. DataSet and BoundColumn can be thought of in a similar way; they are simply used to identify the inputs of a computation. The data is populated later when the pipeline is run.
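The idea that a BoundColumn is created on attribute access and only identifies an input, without holding data, can be illustrated with a small toy model. This is a sketch in plain Python, not the Pipeline API's implementation: the Column descriptor, the BoundColumn class, and ToyEquityPricing below are hypothetical stand-ins invented for illustration.

```python
# Toy model only: every class here is a hypothetical stand-in,
# not the real Pipeline API.

class BoundColumn:
    """A column tied to one specific dataset. It stores no price data."""

    def __init__(self, dataset, name, dtype):
        self.dataset = dataset
        self.name = name
        self.dtype = dtype

    def __repr__(self):
        return f"{self.dataset.__name__}.{self.name}"


class Column:
    """Declares a column; accessing it on a dataset yields a BoundColumn."""

    def __init__(self, dtype):
        self.dtype = dtype

    def __set_name__(self, owner, name):
        # Python calls this when the dataset class body is executed.
        self.name = name

    def __get__(self, instance, owner):
        # Called on attribute access; binds the column to the dataset class.
        return BoundColumn(owner, self.name, self.dtype)


class ToyEquityPricing:
    close = Column(float)
    volume = Column(float)


close = ToyEquityPricing.close
print(close)  # ToyEquityPricing.close
```

In this sketch, close knows which dataset it belongs to, its name, and its dtype, and nothing else. Any actual values would have to be supplied by whatever runs the computation later.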
When defining pipeline computations, we need to know the types of our inputs in order to know which operations and functions we can use. The dtype of a BoundColumn tells a computation what the type of the data will be when the pipeline is run. For example, EquityPricing.close has a float dtype, so a factor may perform arithmetic operations on EquityPricing.close (e.g. compute the 5-day mean). The importance of this will become clearer in the next lesson. The dtype of a BoundColumn can also determine the type of a computation. In the case of the latest computation, the dtype determines whether the computation is a factor (float), a filter (bool), or a classifier (string, int).
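The dtype rules described above can be written out as a short sketch. The function names kind_of_latest and five_day_mean are hypothetical and exist only to mirror the prose; they are not part of the Pipeline API, and Python's built-in types stand in for the real dtypes.

```python
from statistics import fmean

# Hypothetical helpers that mirror the prose; not the real Pipeline API.


def kind_of_latest(dtype):
    """Return which kind of computation `latest` would be for a dtype."""
    if dtype is float:
        return "factor"
    if dtype is bool:
        return "filter"
    if dtype in (str, int):
        return "classifier"
    raise TypeError(f"unsupported dtype: {dtype!r}")


def five_day_mean(dtype, values):
    """Average the last five values, allowed only for float columns."""
    if dtype is not float:
        raise TypeError("arithmetic requires a float dtype")
    return fmean(values[-5:])


closes = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]  # made-up sample values
print(kind_of_latest(float))         # factor
print(five_day_mean(float, closes))  # 13.0
```

The point of the sketch is that the dtype is known before any data exists, so an invalid operation (such as averaging a string column) can be rejected when the computation is defined rather than when the pipeline runs.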
Equity pricing data is stored in the EquityPricing dataset. EquityPricing provides five columns:
EquityPricing.open
EquityPricing.high
EquityPricing.low
EquityPricing.close
EquityPricing.volume
Each of these columns has a float dtype. The EquityPricing dataset is bound to the particular bundle specified in the call to run_pipeline (or the default bundle if no bundle is specified).
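The late binding of a dataset to a bundle can be pictured with a toy example. Everything below is an assumption made for illustration: the bundle names, the sample values, and toy_run_pipeline are invented, and the real run_pipeline takes different arguments. The sketch only shows that the same column name resolves to different data depending on which bundle is chosen at run time.

```python
# Hypothetical stand-ins: a "bundle" here is just a dict mapping
# column names to made-up values.
BUNDLES = {
    "bundle-a": {"close": [10.0, 10.5, 11.0]},
    "bundle-b": {"close": [99.0, 98.0, 97.5]},
}
DEFAULT_BUNDLE = "bundle-a"


def toy_run_pipeline(column_names, bundle=None):
    """Resolve column names against the chosen bundle only at run time."""
    source = BUNDLES[bundle or DEFAULT_BUNDLE]
    return {name: source[name] for name in column_names}


# No bundle given, so the default bundle supplies the data.
print(toy_run_pipeline(["close"]))
# Same column name, different bundle, different data.
print(toy_run_pipeline(["close"], bundle="bundle-b"))
```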
In addition to pricing data, you can access a variety of other datasets in Pipeline, including securities master data, fundamental data, short sale data, and more. For a full list of available datasets, see the usage guide.