$$\rho_k = \frac{\operatorname{cov}(x_t, x_{t-k})}{\operatorname{var}(x_t)} = \frac{\gamma(k)}{\gamma(0)}$$
To see how this arises, we need to examine the sample time series data. Suppose we have a sample set $\{(x_i, y_i)\}$ of n pairs of real-valued data. The correlation between them is then given by the ratio of their covariance to the product of the square roots of the variances of the two variables (that is, the product of their standard deviations).
In effect, this standardizes the covariance by the dispersion of each variable, which ensures that the sample correlation coefficient, r, falls in the range [-1, 1]. The standard formula used for this is:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}} \qquad (1)$$
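As a minimal sketch of the correlation coefficient r described here, the calculation can be written in plain Python. The function name `pearson_r` and the sample numbers are made up for illustration; they are not taken from any dataset in this post.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient r for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations (the covariance part)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (the variance parts)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant series has zero variance, so this division would fail
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: y is exactly twice x, so r should be 1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```

The 1/n factors of the covariance and the variances cancel, which is why the sketch works with plain sums.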
Now suppose that instead of a paired dataset $\{(x_i, y_i)\}$ we have a single set of n values, $\{x_t\}$, representing measurements taken at successive time periods t = 1, 2, 3, ..., n: for example, the daily arrival times of flights at an airport, or the closing price of a stock on a daily basis.
A typical stock price time series is shown in the figures below. The red line is the closing stock price for Apple (AAPL) on each trading day with 0 lag; the blue and green looped lines highlight the time series for 7 and 14 day intervals or lags, in other words the sets $\{x_t, x_{t+7}, x_{t+14}, x_{t+21}, \ldots\}$ and $\{x_t, x_{t+14}, x_{t+28}, x_{t+42}, \ldots\}$.
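To make the 7 and 14 day sets concrete, here is a small sketch that picks every k-th value out of a series. The helper name `every_kth` and the stand-in values are assumptions for illustration only; they are not real AAPL prices.

```python
def every_kth(series, k, start=0):
    """Return the values x_t, x_(t+k), x_(t+2k), ... beginning at index start."""
    if k < 1:
        raise ValueError("k must be a positive number of steps")
    return list(series[start::k])


# Stand-in for 30 trading days; each value is simply its day index
days = list(range(30))
print(every_kth(days, 7))   # [0, 7, 14, 21, 28]
print(every_kth(days, 14))  # [0, 14, 28]
```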
Figure 1: AAPL daily closing prices
Figure 2: AAPL 7 day lagged closing prices
Figure 3: AAPL 14 day lagged closing prices
As we can see, the lines become gradually smoother as we move from a lag of 0 to 7 to 14 days, which essentially means there are fewer fluctuations in the values at each successive lag level. This helps us identify broad trends in the data over a given period of time. Values such as rainfall or stock prices, recorded and graphed over a period of time, often show a regular pattern. In such a case there would be a strong correlation between values on successive days, that is, between values that are one step or lag apart.
We could in fact take the set of 'day 1' values as one series, $\{x_{t,1}\}$, t = 1, 2, 3, ..., n-1, and the set of 'day 2' values as a second series, $\{x_{t,2}\}$, t = 2, 3, 4, ..., n, and compute the correlation coefficient for these two series in the same way as for the expression r. Each of these series has a mean value, which is simply:

$$\bar{x}_{\cdot 1} = \frac{1}{n-1}\sum_{t=1}^{n-1} x_t$$

and

$$\bar{x}_{\cdot 2} = \frac{1}{n-1}\sum_{t=2}^{n} x_t$$
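The numeric subscript indicates the lag, and the dot in the subscript indicates that the mean is computed across all usable values of t. Using these two mean values, $\bar{x}_{\cdot 1}$ for the first series and $\bar{x}_{\cdot 2}$ for the second, we can calculate a correlation coefficient at lag 1 between the two successive series. This is the same formula as for r:

$$r_{\cdot 1} = \frac{\sum_{t=1}^{n-1}(x_t - \bar{x}_{\cdot 1})(x_{t+1} - \bar{x}_{\cdot 2})}{\sqrt{\sum_{t=1}^{n-1}(x_t - \bar{x}_{\cdot 1})^2 \, \sum_{t=1}^{n-1}(x_{t+1} - \bar{x}_{\cdot 2})^2}} \qquad (2)$$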
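The lag 1 calculation can be sketched by splitting one series into its first n-1 values and its last n-1 values and correlating the two, each with its own mean. The function name `lag1_correlation` and the test series are hypothetical; this is an illustration under those assumptions, not the only way to do it.

```python
import math


def lag1_correlation(x):
    """Correlation between x_1..x_(n-1) and x_2..x_n, using separate means."""
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 values")
    first = x[:-1]   # x_1 ... x_(n-1)
    second = x[1:]   # x_2 ... x_n
    m = n - 1        # number of usable pairs
    mean1 = sum(first) / m
    mean2 = sum(second) / m
    num = sum((a - mean1) * (b - mean2) for a, b in zip(first, second))
    den = math.sqrt(
        sum((a - mean1) ** 2 for a in first)
        * sum((b - mean2) ** 2 for b in second)
    )
    return num / den


# A steadily rising series: each value predicts the next perfectly
print(lag1_correlation([1, 2, 3, 4, 5]))
# A series that flips sign every step: successive values move oppositely
print(lag1_correlation([1, -1, 1, -1, 1]))
```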
If n is reasonably large, the value 1/(n-1) will be very close to 1/n, and the values of the two means and standard deviations will be almost the same. To see this, consider the following table, in which 1/n, 1/(n-1) and their difference are rounded to 3 decimal places. At this level of rounding the difference is 0 by n = 50 (in fact from n = 46 onward).
| n | 1/n | 1/(n-1) | difference |
|---|---|---|---|
| 1 | 1.000 | undefined | undefined |
| 2 | 0.500 | 1.000 | -0.500 |
| 3 | 0.333 | 0.500 | -0.167 |
| 4 | 0.250 | 0.333 | -0.083 |
| 5 | 0.200 | 0.250 | -0.050 |
| 6 | 0.167 | 0.200 | -0.033 |
| 7 | 0.143 | 0.167 | -0.024 |
| 8 | 0.125 | 0.143 | -0.018 |
| 9 | 0.111 | 0.125 | -0.014 |
| 10 | 0.100 | 0.111 | -0.011 |
| 11 | 0.091 | 0.100 | -0.009 |
| 12 | 0.083 | 0.091 | -0.008 |
| 13 | 0.077 | 0.083 | -0.006 |
| 14 | 0.071 | 0.077 | -0.005 |
| 15 | 0.067 | 0.071 | -0.005 |
| 16 | 0.063 | 0.067 | -0.004 |
| 17 | 0.059 | 0.063 | -0.004 |
| 18 | 0.056 | 0.059 | -0.003 |
| 19 | 0.053 | 0.056 | -0.003 |
| 20 | 0.050 | 0.053 | -0.003 |
| 21 | 0.048 | 0.050 | -0.002 |
| 22 | 0.045 | 0.048 | -0.002 |
| 23 | 0.043 | 0.045 | -0.002 |
| 24 | 0.042 | 0.043 | -0.002 |
| 25 | 0.040 | 0.042 | -0.002 |
| 26 | 0.038 | 0.040 | -0.002 |
| 27 | 0.037 | 0.038 | -0.001 |
| 28 | 0.036 | 0.037 | -0.001 |
| 29 | 0.034 | 0.036 | -0.001 |
| 30 | 0.033 | 0.034 | -0.001 |
| 31 | 0.032 | 0.033 | -0.001 |
| 32 | 0.031 | 0.032 | -0.001 |
| 33 | 0.030 | 0.031 | -0.001 |
| 34 | 0.029 | 0.030 | -0.001 |
| 35 | 0.029 | 0.029 | -0.001 |
| 36 | 0.028 | 0.029 | -0.001 |
| 37 | 0.027 | 0.028 | -0.001 |
| 38 | 0.026 | 0.027 | -0.001 |
| 39 | 0.026 | 0.026 | -0.001 |
| 40 | 0.025 | 0.026 | -0.001 |
| 41 | 0.024 | 0.025 | -0.001 |
| 42 | 0.024 | 0.024 | -0.001 |
| 43 | 0.023 | 0.024 | -0.001 |
| 44 | 0.023 | 0.023 | -0.001 |
| 45 | 0.022 | 0.023 | -0.001 |
| 46 | 0.022 | 0.022 | 0.000 |
| 47 | 0.021 | 0.022 | 0.000 |
| 48 | 0.021 | 0.021 | 0.000 |
| 49 | 0.020 | 0.021 | 0.000 |
| 50 | 0.020 | 0.020 | 0.000 |
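The figures in the table can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the difference is computed from the unrounded values before rounding to 3 decimal places, which appears to be how the table was built.

```python
def reciprocal_gap(n):
    """Return 1/n - 1/(n-1); undefined for n = 1, so require n >= 2."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return 1 / n - 1 / (n - 1)


def first_n_rounding_to_zero(places=3, limit=10_000):
    """Smallest n (below limit) whose gap rounds to zero at the given precision."""
    for n in range(2, limit):
        if round(abs(reciprocal_gap(n)), places) == 0:
            return n
    return None


# Print a few rows in the same layout as the table
for n in (2, 10, 45, 46, 50):
    print(n, f"{1 / n:.3f}", f"{1 / (n - 1):.3f}", f"{reciprocal_gap(n):.3f}")

print(first_n_rounding_to_zero())  # 46
```

Since the gap equals -1/(n(n-1)), it shrinks roughly as 1/n², which is why it vanishes so quickly at 3 decimal places.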
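Similarly, if the two samples are large enough, of the same size and drawn from the same population, they are likely to have nearly the same mean and standard deviation. Under these circumstances the above expression can be simplified to:

$$r_{\cdot 1} = \frac{\sum_{t=1}^{n-1}(x_t - \bar{x})(x_{t+1} - \bar{x})}{\sum_{t=1}^{n}(x_t - \bar{x})^2}$$

where $\bar{x}$ is the mean of all n values. This expression is termed the sample serial correlation, or autocorrelation, coefficient for a lag of 1 time period. It may also be generalized for lags of 2, 3, ..., k steps as follows:

$$r_{\cdot k} = \frac{\sum_{t=1}^{n-k}(x_t - \bar{x})(x_{t+k} - \bar{x})}{\sum_{t=1}^{n}(x_t - \bar{x})^2}$$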
The term autocorrelation coefficient has been used since the 1950s to describe this expression. The numerator of this expression resembles a covariance, though at a lag of k, while the denominator resembles a covariance at lag 0, that is, a variance. These two components of the expression are at times also referred to as the autocovariances at lags k and 0.
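Putting the pieces together, the autocorrelation at lag k is the autocovariance at lag k divided by the autocovariance at lag 0. The sketch below assumes a single overall mean and a 1/n factor in both autocovariances (the factor cancels in the ratio); the function names and the sample series are illustrative only.

```python
def autocovariance(x, k):
    """Sample autocovariance at lag k, using the overall mean and a 1/n factor."""
    n = len(x)
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < len(x)")
    mean = sum(x) / n
    # Only the n - k pairs (x_t, x_(t+k)) that exist in the series are used
    return sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k)) / n


def autocorrelation(x, k):
    """Sample autocorrelation coefficient at lag k."""
    return autocovariance(x, k) / autocovariance(x, 0)


series = [1, 2, 3, 4, 5]  # made-up values
print(autocorrelation(series, 0))  # 1.0, a series is perfectly correlated with itself
print(autocorrelation(series, 1))
print(autocorrelation(series, 2))
```

For this short series the lag 1 value works out to 4/10 and the lag 2 value to -1/10, which is easy to verify by hand from the deviations -2, -1, 0, 1, 2.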
Thus, we have established the concept of autocorrelation and arrived at its mathematical formula.
In the next post we will learn how to plot the autocorrelation of a series using graphs termed correlograms, which further help us identify patterns in a time series.
Till then, happy stat-ing :)





