Sunday, 19 August 2018

Autocorrelation 3 - Temporal Autocorrelation - Autocorrelation in temporal (time-based) datasets


     
As we saw in the previous autocorrelation post, the population autocorrelation coefficient at lag k, ρ_k, is the ratio of the autocovariance at lag k to the variance (the autocovariance at lag 0):
$$\rho_k = \frac{\operatorname{cov}(x_t, x_{t-k})}{\operatorname{var}(x_t)} = \frac{\gamma(k)}{\gamma(0)}$$
To see how this arises, we need to examine sample time series data. Suppose we have a sample set {x_i, y_i} of n pairs of real-valued data. The correlation between them is given by the ratio of their covariance to the product of their standard deviations, that is, the square root of the product of the two variances.
In effect, this standardizes the covariance by the dispersion of each variable, which ensures that the sample correlation coefficient, r, falls in the range [-1, 1]. The standard formula used for this is:
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}} \qquad (1)$$
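To make expression (1) concrete, here is a minimal Python sketch of the sample correlation coefficient, assuming (1) is the usual Pearson formula. The function name `pearson_r` and the small data lists are our own illustrative choices, not something from the earlier posts.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient r for paired data {x_i, y_i}."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of cross-products of deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy data: y is an exact linear function of x
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]
print(pearson_r(x, y))
```

Because the toy y values are an exact multiple of x, the result sits at the upper end of the [-1, 1] range; reversing y would push it to the lower end.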
Now suppose that, instead of a set of pairs {x_i, y_i}, we have a set of n values, {x_t}, representing measurements taken at different time periods t = 1, 2, 3, ..., n, for example the daily arrival times of flights at an airport or the daily closing price of a stock.
A typical stock price time series is shown in the figures below. The red line is the closing stock price for Apple (AAPL) on each trading day, with 0 lag; the blue and green looped lines highlight the time series at 7 and 14 day intervals or lags, in other words the sets {x_t, x_{t+7}, x_{t+14}, x_{t+21}, ...} and {x_t, x_{t+14}, x_{t+28}, x_{t+42}, ...}.

Figure 1: AAPL daily closing prices

Figure 2: AAPL 7 day lagged closing prices
Figure 3: AAPL 14 day lagged closing prices
 


As we can see, the lines, and hence the data, become gradually smoother as we move from a lag of 0 to 7 to 14 days, which essentially means there are fewer fluctuations in the values at each successive lag level. This helps us identify broad trends in the data over a given period of time. Values recorded and graphed over a period of time, such as rainfall or stock prices, often show a regular pattern. In such cases there would be a strong correlation between values on successive days, that is, between values that are one step or lag apart.
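As a rough sketch of how the 7 and 14 day sets described above could be picked out of a daily series, the snippet below uses Python list slicing. The prices here are a synthetic random walk that we generate purely for illustration; they are not real AAPL data, and the variable names are our own.

```python
import random

random.seed(0)  # reproducible synthetic data, NOT real AAPL prices

# Hypothetical daily closing prices: a random walk starting at 100
prices = [100.0]
for _ in range(69):
    prices.append(prices[-1] + random.gauss(0, 1))

# Every 7th and every 14th observation, i.e. the sets
# {x_t, x_{t+7}, x_{t+14}, ...} and {x_t, x_{t+14}, x_{t+28}, ...}
every_7 = prices[::7]
every_14 = prices[::14]

print(len(prices), len(every_7), len(every_14))
```

Plotting `every_7` and `every_14` against their day indices would give series analogous to the 7 and 14 day figures.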

We could in fact take the set of 'day 1' values as one series, {x_{t,1}}, t = 1, 2, 3, ..., n-1, and the set of 'day 2' values as a second series, {x_{t,2}}, t = 2, 3, 4, ..., n, and compute the correlation coefficient for these two series in the same way as for the expression r. Each of these series has a mean value, which is simply:

$$\bar{x}_{\cdot 1} = \frac{1}{n-1}\sum_{t=1}^{n-1} x_t$$

and

$$\bar{x}_{\cdot 2} = \frac{1}{n-1}\sum_{t=2}^{n} x_t$$
The numeric subscript indicates the lag, and the dot in the subscript indicates a mean computed across all usable values of t. Using these two mean values we can calculate a correlation coefficient at lag 1 between the two successive series. This is the same formula as for r:
$$r_1 = \frac{\sum_{t=1}^{n-1}(x_t-\bar{x}_{\cdot 1})(x_{t+1}-\bar{x}_{\cdot 2})}{\sqrt{\sum_{t=1}^{n-1}(x_t-\bar{x}_{\cdot 1})^2\,\sum_{t=2}^{n}(x_t-\bar{x}_{\cdot 2})^2}} \qquad (2)$$
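The lag 1 calculation in expression (2) can be sketched in Python as follows. We assume here that the two series are the first n-1 values and the last n-1 values of the same data, each with its own mean; the function name `lag1_correlation` and the toy series are hypothetical.

```python
import math

def lag1_correlation(x):
    """Correlation between {x_1..x_{n-1}} and {x_2..x_n}, each with its own mean."""
    n = len(x)
    if n < 3:
        raise ValueError("need at least 3 observations")
    first = x[:-1]   # 'day 1' series: t = 1 .. n-1
    second = x[1:]   # 'day 2' series: t = 2 .. n
    mean_1 = sum(first) / (n - 1)
    mean_2 = sum(second) / (n - 1)
    # Numerator: cross-products of deviations, pairing x_t with x_{t+1}
    num = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    # Denominator: square root of the product of the two sums of squares
    den = math.sqrt(
        sum((a - mean_1) ** 2 for a in first)
        * sum((b - mean_2) ** 2 for b in second)
    )
    return num / den

# Hypothetical toy series with a steady upward trend
series = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
print(lag1_correlation(series))
```

A steadily trending series like this one gives a strongly positive value, while a series that flips sign at every step would give a strongly negative one.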
If n is reasonably large, 1/(n-1) will be very close to 1/n, and the two means and the two standard deviations will be almost the same. To see this, consider the following table, in which 1/n, 1/(n-1) and their difference are rounded to 3 decimal places. At this precision the difference rounds to 0.000 from n = 46 onwards.



| n | 1/n | 1/(n-1) | difference |
|---|---|---|---|
| 1 | 1.000 | undefined | undefined |
| 2 | 0.500 | 1.000 | -0.500 |
| 3 | 0.333 | 0.500 | -0.167 |
| 4 | 0.250 | 0.333 | -0.083 |
| 5 | 0.200 | 0.250 | -0.050 |
| 6 | 0.167 | 0.200 | -0.033 |
| 7 | 0.143 | 0.167 | -0.024 |
| 8 | 0.125 | 0.143 | -0.018 |
| 9 | 0.111 | 0.125 | -0.014 |
| 10 | 0.100 | 0.111 | -0.011 |
| 11 | 0.091 | 0.100 | -0.009 |
| 12 | 0.083 | 0.091 | -0.008 |
| 13 | 0.077 | 0.083 | -0.006 |
| 14 | 0.071 | 0.077 | -0.005 |
| 15 | 0.067 | 0.071 | -0.005 |
| 16 | 0.063 | 0.067 | -0.004 |
| 17 | 0.059 | 0.063 | -0.004 |
| 18 | 0.056 | 0.059 | -0.003 |
| 19 | 0.053 | 0.056 | -0.003 |
| 20 | 0.050 | 0.053 | -0.003 |
| 21 | 0.048 | 0.050 | -0.002 |
| 22 | 0.045 | 0.048 | -0.002 |
| 23 | 0.043 | 0.045 | -0.002 |
| 24 | 0.042 | 0.043 | -0.002 |
| 25 | 0.040 | 0.042 | -0.002 |
| 26 | 0.038 | 0.040 | -0.002 |
| 27 | 0.037 | 0.038 | -0.001 |
| 28 | 0.036 | 0.037 | -0.001 |
| 29 | 0.034 | 0.036 | -0.001 |
| 30 | 0.033 | 0.034 | -0.001 |
| 31 | 0.032 | 0.033 | -0.001 |
| 32 | 0.031 | 0.032 | -0.001 |
| 33 | 0.030 | 0.031 | -0.001 |
| 34 | 0.029 | 0.030 | -0.001 |
| 35 | 0.029 | 0.029 | -0.001 |
| 36 | 0.028 | 0.029 | -0.001 |
| 37 | 0.027 | 0.028 | -0.001 |
| 38 | 0.026 | 0.027 | -0.001 |
| 39 | 0.026 | 0.026 | -0.001 |
| 40 | 0.025 | 0.026 | -0.001 |
| 41 | 0.024 | 0.025 | -0.001 |
| 42 | 0.024 | 0.024 | -0.001 |
| 43 | 0.023 | 0.024 | -0.001 |
| 44 | 0.023 | 0.023 | -0.001 |
| 45 | 0.022 | 0.023 | -0.001 |
| 46 | 0.022 | 0.022 | 0.000 |
| 47 | 0.021 | 0.022 | 0.000 |
| 48 | 0.021 | 0.021 | 0.000 |
| 49 | 0.020 | 0.021 | 0.000 |
| 50 | 0.020 | 0.020 | 0.000 |
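The table above can be regenerated with a few lines of Python. This is only a sketch: the helper name `reciprocal_table` is our own, and for n = 1 we store `None` because 1/(n-1) is undefined there.

```python
def reciprocal_table(max_n):
    """Rows of (n, 1/n, 1/(n-1), difference); n = 1 has no 1/(n-1)."""
    rows = []
    for n in range(1, max_n + 1):
        inv_n = 1 / n
        if n == 1:
            # Division by zero: 1/(n-1) does not exist for n = 1
            rows.append((n, inv_n, None, None))
        else:
            inv_n_minus_1 = 1 / (n - 1)
            rows.append((n, inv_n, inv_n_minus_1, inv_n - inv_n_minus_1))
    return rows

rows = reciprocal_table(50)
for n, inv_n, inv_prev, diff in rows:
    if inv_prev is None:
        print(f"{n:>3}  {inv_n:.3f}  undefined  undefined")
    else:
        print(f"{n:>3}  {inv_n:.3f}  {inv_prev:.3f}  {diff:.3f}")
```

The difference equals -1/(n(n-1)), so it shrinks roughly as 1/n², which is why it vanishes so quickly at 3 decimal places.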
 
Similarly, if the two series are large enough, of the same size and drawn from the same population, they are likely to have nearly the same mean and standard deviation. Under these circumstances the expression above can be simplified to:
$$r_1 = \frac{\sum_{t=1}^{n-1}(x_t-\bar{x})(x_{t+1}-\bar{x})}{\sum_{t=1}^{n}(x_t-\bar{x})^2} \qquad (3)$$
where $\bar{x}$ is the mean of all n values.


This expression is called the sample serial correlation, or autocorrelation coefficient, for a lag of 1 time period. It can also be generalized to lags of 2, 3, ..., k steps as follows:
$$r_k = \frac{\sum_{t=1}^{n-k}(x_t-\bar{x})(x_{t+k}-\bar{x})}{\sum_{t=1}^{n}(x_t-\bar{x})^2} \qquad (4)$$
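Here is a minimal Python sketch of the lag k autocorrelation coefficient in expression (4), assuming the simplified form with a single overall mean in both numerator and denominator. The function name `autocorrelation` and the toy series are our own illustrative choices.

```python
def autocorrelation(x, k):
    """Sample autocorrelation at lag k, using one overall mean for the series."""
    n = len(x)
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < len(x)")
    mean = sum(x) / n
    # Numerator: resembles a covariance between x_t and x_{t+k}
    num = sum((x[t] - mean) * (x[t + k] - mean) for t in range(n - k))
    # Denominator: resembles a covariance at lag 0 (sum of squared deviations)
    den = sum((v - mean) ** 2 for v in x)
    if den == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return num / den

# Hypothetical toy series
series = [1.0, 2.0, 3.0, 4.0, 5.0]
for lag in range(3):
    print(lag, autocorrelation(series, lag))
```

At lag 0 the numerator and denominator coincide, so the coefficient is exactly 1; for this toy series the lag 1 and lag 2 values work out by hand to 4/10 and -1/10.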

The term autocorrelation coefficient has been used since the 1950s to describe this expression. The numerator resembles a covariance, though at a lag of k, while the denominator resembles a covariance at lag 0. These two components are therefore sometimes referred to as the autocovariances at lags k and 0.


Thus, we have established the concept of autocorrelation and arrived at its mathematical formula.
In the next post we will learn how to plot the autocorrelation of a series using graphs called correlograms, which further help us identify patterns in a time series.

Till then happy stat-ing :)  
