© Copyright Quantopian Inc.
© Modifications Copyright QuantRocket LLC
Licensed under the Creative Commons Attribution 4.0.
Jupyter notebooks allow one to perform a great deal of data analysis and statistical validation. We'll demonstrate a few simple techniques here.
As you can see, each cell can be either code or text. To switch between the two, use the cell-type dropdown menu (showing 'Markdown' or 'Code') at the top of the notebook.
A code cell will be evaluated when you press play, or when you press the shortcut, shift-enter. Evaluating a cell evaluates each line of code in sequence, and prints the results of the last line below the cell.
2 + 2
4
Sometimes there is no result to be printed, as is the case with assignment.
X = 2
Remember that only the result from the last line is printed.
2 + 2
3 + 3
6
However, you can print whichever lines you want using the print function.
print(2 + 2)
3 + 3
4
6
While a cell is running, [*] is displayed to its left. Before a cell has been executed, [ ] is displayed. Once it has run, a number such as [5] is displayed, indicating the order in which the cell was run during the execution of the notebook. Try running this cell and watch it happen.
#Take some time to run something
c = 0
for i in range(10000000):
    c = c + i
c
49999995000000
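The loop above is deliberately slow so that the [*] indicator stays visible for a moment. As a sanity check on its result, here is a minimal sketch that compares Python's built-in sum with the closed-form formula n(n - 1)/2. The variable names are illustrative and not part of the original notebook.

```python
# Number of loop iterations used in the cell above.
n = 10_000_000

# Built-in sum over the same range; typically faster than an explicit loop.
total_builtin = sum(range(n))

# Closed form for 0 + 1 + ... + (n - 1).
total_formula = n * (n - 1) // 2

print(total_builtin == total_formula)
print(total_formula)
```

Both approaches should agree with the value printed by the loop.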
The vast majority of the time, you'll want to use functions from pre-built libraries. Here I import numpy and pandas, the two most common and useful libraries in quant finance. I recommend copying this import statement to every new notebook.
Notice that you can rename libraries to whatever you want when importing them, using the as keyword. Here we use np and pd as aliases for numpy and pandas. This aliasing is very common and appears in most code snippets around the web. It lets you type fewer characters when you access these libraries frequently.
import numpy as np
import pandas as pd
# This is a plotting library for pretty pictures.
import matplotlib.pyplot as plt
Pressing tab will give you a list of Jupyter's best guesses for what you might want to type next. This is incredibly valuable and will save you a lot of time. If there is only one possible completion, it will be filled in for you. Press tab often: it will seldom fill in anything you don't want, because a list is shown whenever there is ambiguity. This is a great way to see what functions are available in a library.
Try placing your cursor after the . and pressing tab.
np.random.
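Tab completion is interactive, so it cannot be shown on the page. As a rough programmatic equivalent, the sketch below uses Python's built-in dir to list the public names in np.random. The variable name public_names is illustrative, and the exact list depends on the installed numpy version.

```python
import numpy as np

# List the public names in np.random, similar to what tab completion offers.
public_names = [name for name in dir(np.random) if not name.startswith("_")]

print(len(public_names))
print(public_names[:10])
```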
Placing a question mark after a function and executing that line of code will give you the documentation Python has for that function. It's often best to do this in a new cell, as you avoid re-executing other code and running into bugs.
np.random.normal?
Docstring:
normal(loc=0.0, scale=1.0, size=None)

Draw random samples from a normal (Gaussian) distribution.

The probability density function of the normal distribution, first
derived by De Moivre and 200 years later by both Gauss and Laplace
independently [2]_, is often called the bell curve because of
its characteristic shape (see the example below).

The normal distributions occurs often in nature. For example, it
describes the commonly occurring distribution of samples influenced
by a large number of tiny, random disturbances, each with its own
unique distribution [2]_.

.. note::
    New code should use the `~numpy.random.Generator.normal`
    method of a `~numpy.random.Generator` instance instead;
    please see the :ref:`random-quick-start`.

Parameters
----------
loc : float or array_like of floats
    Mean ("centre") of the distribution.
scale : float or array_like of floats
    Standard deviation (spread or "width") of the distribution. Must be
    non-negative.
size : int or tuple of ints, optional
    Output shape. If the given shape is, e.g., ``(m, n, k)``, then
    ``m * n * k`` samples are drawn. If size is ``None`` (default),
    a single value is returned if ``loc`` and ``scale`` are both scalars.
    Otherwise, ``np.broadcast(loc, scale).size`` samples are drawn.

Returns
-------
out : ndarray or scalar
    Drawn samples from the parameterized normal distribution.

See Also
--------
scipy.stats.norm : probability density function, distribution or
    cumulative density function, etc.
random.Generator.normal: which should be used for new code.

Notes
-----
The probability density for the Gaussian distribution is

.. math:: p(x) = \frac{1}{\sqrt{ 2 \pi \sigma^2 }}
                 e^{ - \frac{ (x - \mu)^2 } {2 \sigma^2} },

where :math:`\mu` is the mean and :math:`\sigma` the standard
deviation. The square of the standard deviation, :math:`\sigma^2`,
is called the variance.

The function has its peak at the mean, and its "spread" increases with
the standard deviation (the function reaches 0.607 times its maximum at
:math:`x + \sigma` and :math:`x - \sigma` [2]_). This implies that
normal is more likely to return samples lying close to the mean, rather
than those far away.

References
----------
.. [1] Wikipedia, "Normal distribution",
       https://en.wikipedia.org/wiki/Normal_distribution
.. [2] P. R. Peebles Jr., "Central Limit Theorem" in "Probability,
       Random Variables and Random Signal Principles", 4th ed., 2001,
       pp. 51, 51, 125.

Examples
--------
Draw samples from the distribution:

>>> mu, sigma = 0, 0.1 # mean and standard deviation
>>> s = np.random.normal(mu, sigma, 1000)

Verify the mean and the variance:

>>> abs(mu - np.mean(s))
0.0 # may vary

>>> abs(sigma - np.std(s, ddof=1))
0.1 # may vary

Display the histogram of the samples, along with
the probability density function:

>>> import matplotlib.pyplot as plt
>>> count, bins, ignored = plt.hist(s, 30, density=True)
>>> plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
...                np.exp( - (bins - mu)**2 / (2 * sigma**2) ),
...          linewidth=2, color='r')
>>> plt.show()

Two-by-four array of samples from the normal distribution with
mean 3 and standard deviation 2.5:

>>> np.random.normal(3, 2.5, size=(2, 4))
array([[-4.49401501, 4.00950034, -1.81814867, 7.29718677], # random
       [ 0.39924804, 4.68456316, 4.99394529, 4.84057254]]) # random
Type: builtin_function_or_method
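The trailing ? syntax is a feature of IPython and Jupyter rather than of Python itself. Outside a notebook, the same documentation can be reached with the built-in help function or the __doc__ attribute. The following is a minimal sketch; the variable names are illustrative.

```python
import numpy as np

# In plain Python, read the docstring directly (or call help(np.random.normal)).
doc = np.random.normal.__doc__

# Show just the first few lines: the signature and the one-line summary.
first_lines = doc.strip().splitlines()[:3]
print("\n".join(first_lines))
```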
We'll sample some random data using a function from numpy.
# Sample 100 points with a mean of 0 and an std of 1. This is a standard normal distribution.
X = np.random.normal(0, 1, 100)
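The docstring shown earlier notes that new code should use the normal method of a numpy.random.Generator instance. A minimal sketch of that alternative follows, with an arbitrary seed (an assumption, chosen only so the draw is reproducible) and illustrative variable names.

```python
import numpy as np

# Seeded generator so the same samples are drawn on every run (seed is arbitrary).
rng = np.random.default_rng(seed=42)

# 100 draws from a standard normal distribution: mean 0, standard deviation 1.
X_seeded = rng.normal(loc=0, scale=1, size=100)

# Re-creating the generator with the same seed reproduces the samples exactly.
X_again = np.random.default_rng(seed=42).normal(loc=0, scale=1, size=100)
print(np.array_equal(X_seeded, X_again))
```

Seeding is optional, but it makes a notebook's results repeatable when the cells are re-run.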
We can use the plotting library we imported as follows.
plt.plot(X)
[<matplotlib.lines.Line2D at 0xffff351565d0>]
You might have noticed the annoying line of the form [<matplotlib.lines.Line2D at 0x7f72fdbc1710>] above the plot. This is because the .plot function returns a value, which the notebook displays as the cell's result. When we don't want that output displayed, we can suppress it by ending the line with a semicolon, as follows.
plt.plot(X);
No self-respecting quant leaves a graph without labeled axes. Here are some commands to help with that.
X = np.random.normal(0, 1, 100)
X2 = np.random.normal(0, 1, 100)
plt.plot(X);
plt.plot(X2);
plt.xlabel('Time') # The data we generated is unitless, but don't forget units in general.
plt.ylabel('Returns')
plt.legend(['X', 'X2']);
Let's use numpy to compute some simple statistics.
np.mean(X)
0.07532704941937889
np.std(X)
0.9703956594408856
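One detail worth knowing when comparing numbers across libraries: np.std computes the population standard deviation by default (ddof=0, dividing by N), whereas the pandas Series.std method defaults to the sample standard deviation (ddof=1, dividing by N - 1). The sketch below illustrates the difference on synthetic data; the seed and variable names are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility
sample = rng.normal(0, 1, 100)

pop_std = np.std(sample)               # ddof=0: divides by N
sample_std = np.std(sample, ddof=1)    # ddof=1: divides by N - 1
pandas_std = pd.Series(sample).std()   # pandas defaults to ddof=1

print(pop_std, sample_std, pandas_std)
```

With 100 observations the two estimates differ only slightly, but the gap grows as the sample gets smaller.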
Randomly sampled data can be great for testing ideas, but let's get some real data. In QuantRocket, all securities are referenced by sid (short for "security ID") rather than by symbol, since symbols can change. So, first, we'll use the get_securities function to look up the sid for MSFT.
(Notice the use of vendors='usstock' in the get_securities function call. This limits the query to securities from the US Stock dataset. This filter isn't necessary if you've only collected US Stock data, but it is a best practice when looking up securities by symbol in case you've also collected data from other global exchanges where the same ticker symbols are re-used.)
from quantrocket.master import get_securities
securities = get_securities(symbols='MSFT', fields=['Sid','Symbol','Exchange'], vendors='usstock')
securities
Sid | Symbol | Exchange |
---|---|---|
FIBBG000BPH459 | MSFT | XNAS |
This returns a pandas dataframe, where sids are stored in the dataframe's index.
Then we use get_prices to query our data bundle. Although the bundle contains minute data, here we use the data_frequency parameter to request the data at daily frequency:
MSFT = securities.index[0]
from quantrocket import get_prices
data = get_prices("usstock-free-1min", data_frequency='daily', sids=MSFT, start_date='2012-01-01', end_date='2015-06-01', fields="Close")
Our data is now a dataframe. You can see the index, which holds the field and the date, and a column of pricing data for each sid.
data.head()
Field | Date | FIBBG000BPH459 (Sid) |
---|---|---|
Close | 2012-01-03 | 24.260 |
Close | 2012-01-04 | 24.835 |
Close | 2012-01-05 | 25.089 |
Close | 2012-01-06 | 25.474 |
Close | 2012-01-09 | 25.143 |
This is a pandas dataframe, so we can index into it to get just the closing price for MSFT like this. For more information, see the pandas documentation.
X = data.loc['Close'][MSFT]
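To see what this indexing does without access to a data bundle, the sketch below builds a small stand-in dataframe with the same layout as the output above: a (Field, Date) MultiIndex on the rows and one column per sid. The prices are synthetic, and the names toy and close_series are illustrative assumptions, not part of the QuantRocket API.

```python
import numpy as np
import pandas as pd

sid = "FIBBG000BPH459"  # MSFT's sid, as shown in the lookup above
dates = pd.date_range("2012-01-03", periods=5, freq="B", name="Date")

# Synthetic prices only; these are NOT real MSFT data.
rng = np.random.default_rng(seed=1)
closes = 25 + rng.normal(0, 0.3, len(dates)).cumsum()
opens = closes - 0.1

# Rows: (Field, Date) MultiIndex. Columns: one per sid.
index = pd.MultiIndex.from_product(
    [["Close", "Open"], dates], names=["Field", "Date"]
)
toy = pd.DataFrame({sid: np.concatenate([closes, opens])}, index=index)
toy.columns.name = "Sid"

# .loc["Close"] selects the Close rows and drops the Field level;
# [sid] then picks one security's column, leaving a Series indexed by Date.
close_series = toy.loc["Close"][sid]
print(close_series)
```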
Because there is now also date information in our data, we provide two series to .plot. X.index gives us the datetime index, and X.values gives us the pricing values. These are used as the X and Y coordinates to make a graph.
plt.plot(X.index, X.values)
plt.ylabel('Price')
plt.legend(['MSFT']);
We can get statistics again on real data.
np.mean(X)
34.492057176196035
np.std(X)
7.310119153924389
We can use the pct_change function to get returns. Notice how we drop the first element after doing this, as it will be NaN (there is no previous price from which to compute a percent change).
R = X.pct_change()[1:]
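As a small worked example of what pct_change computes, the sketch below applies it to three made-up prices and compares the result with the formula written out by hand, r_t = p_t / p_(t-1) - 1. All names and values are illustrative.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0])  # made-up prices for illustration

returns = prices.pct_change()          # first element is NaN
manual = prices / prices.shift(1) - 1  # the same calculation written out

# Two equivalent ways of discarding the leading NaN.
sliced = returns.iloc[1:]
dropped = returns.dropna()

print(returns.tolist())
print(sliced.equals(dropped))
```

Here the price rises 10% and then falls 10%, so the two remaining returns are approximately 0.10 and -0.10.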
We can plot the returns distribution as a histogram.
plt.hist(R, bins=20)
plt.xlabel('Return')
plt.ylabel('Frequency')
plt.legend(['MSFT Returns']);
Get statistics again.
np.mean(R)
0.0008821403974814055
np.std(R)
0.014380720812394637
Now let's go backwards and generate data out of a normal distribution using the statistics we estimated from Microsoft's returns. We'll see that we have good reason to suspect Microsoft's returns may not be normal, as the histogram of the simulated normal data looks quite different from the histogram of the actual returns.
plt.hist(np.random.normal(np.mean(R), np.std(R), 10000), bins=20)
plt.xlabel('Return')
plt.ylabel('Frequency')
plt.legend(['Normally Distributed Returns']);
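Comparing histograms by eye is a useful first check. One simple way to put a number on the difference is excess kurtosis, which is roughly 0 for normal data and positive when the tails are heavier. The sketch below uses the pandas Series.kurt method on two synthetic samples (a normal sample and a fat-tailed Student's t sample); these are stand-ins for illustration, not the MSFT returns, and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)  # arbitrary seed

# Synthetic stand-ins: one normal sample, one fat-tailed (Student's t, 3 d.o.f.).
normal_sample = pd.Series(rng.normal(0, 1, 10_000))
fat_tailed_sample = pd.Series(rng.standard_t(df=3, size=10_000))

# Series.kurt() reports excess kurtosis: near 0 for normal data,
# noticeably positive when the tails are heavier than normal.
normal_kurt = normal_sample.kurt()
fat_tailed_kurt = fat_tailed_sample.kurt()

print("normal:    ", normal_kurt)
print("fat-tailed:", fat_tailed_kurt)
```

The same call could be applied to a returns series such as R to see which of the two cases it resembles more closely.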
pandas has some nice tools to allow us to generate rolling statistics. Here's an example. Notice how there's no moving average for the first 59 days, as a full 60 days of data are needed to generate the statistic.
# Take the average of the last 60 days at each timepoint.
MAVG = X.rolling(window=60).mean()
plt.plot(X.index, X.values)
plt.plot(MAVG.index, MAVG.values)
plt.ylabel('Price')
plt.legend(['MSFT', '60-day MAVG']);
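To make the leading gap concrete, here is a minimal sketch on five made-up values with a window of 3. With the default settings the first window - 1 entries are NaN; passing min_periods=1 instead averages whatever data is available so far. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # made-up data

# Default: a value appears only once the window is full.
full_window = values.rolling(window=3).mean()

# min_periods=1 averages the observations available so far instead of returning NaN.
partial_window = values.rolling(window=3, min_periods=1).mean()

print(full_window.tolist())
print(partial_window.tolist())
```

Whether partial windows are appropriate depends on the use case: they fill the gap, but the early values are averages over fewer observations and are therefore noisier.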
This presentation is for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation for any security; nor does it constitute an offer to provide investment advisory or other services by QuantRocket LLC ("QuantRocket"). Nothing contained herein constitutes investment advice or offers any opinion with respect to the suitability of any security, and any views expressed herein should not be taken as advice to buy, sell, or hold any security or as an endorsement of any security or company. In preparing the information contained herein, the authors have not taken into account the investment needs, objectives, and financial circumstances of any particular investor. Any views expressed and data illustrated herein were prepared based upon information believed to be reliable at the time of publication. QuantRocket makes no guarantees as to their accuracy or completeness. All information is subject to change and may quickly become unreliable for various reasons, including changes in market conditions or economic circumstances.