
Time Series Analysis: Basics in Stata

This guide provides code for conducting basic time series analysis in Stata.

Basics in Stata

1. Formatting Date Variables

Time series data refers to data collected over time for specific variables. To work with this data, first ensure that the date or time variable is in the desired frequency: yearly, monthly, quarterly, daily, etc. Next, verify that it is in the correct format and, if necessary, convert it into the required form. In this section, we will use an example dataset to demonstrate techniques for converting date/time variables into the expected formats.

First, get an example dataset with "raw" date variables. Use the following code:

use "https://dss.princeton.edu/training/time_series_1.dta", clear

Here is a snapshot of the dataset:

The three columns represent the same dates in different formats: the first two are strings written in different ways, and the third is numeric (integers).

Now, we will show how to convert these three columns into date variables. 

Step 1: convert the string 'date1' into a date variable called 'new_date1'. The year here is a two-digit number.

gen new_date1 = date(date1,"DMY", 2099)
format new_date1 %td

Note: (i) the third argument, 2099, tells Stata to interpret two-digit years as falling in the 100-year window ending in 2099, that is, 2000-2099. (ii) We use %td to format the variable as a daily date.
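To see how the pivot year resolves a two-digit year, you can check a single value directly (the date string here is illustrative):

*two-digit year 08 falls in the window ending 2099, so it is read as 2008
display %td date("15-3-08", "DMY", 2099)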

Step 2: convert the string 'date2' into a date variable called 'new_date2'. The year here is a four-digit number.

gen new_date2 = date(date2,"MDY")
format new_date2 %td

Step 3: convert the numeric/integer 'date3' into a date variable called 'date3_new'. The process involves several steps, each described in the corresponding starred comment line.

*convert numeric to string
tostring date3, gen(date3_str)
*create character length, year, month, and day columns
gen leng = length(date3_str)
gen year = substr(date3_str,1,4)
*when length=6, month is in 5th position and day in 6th
gen month = substr(date3_str,5,1) if leng == 6
gen day = substr(date3_str,6,1) if leng == 6
*when length=7, it is hard to distinguish month/day, so we skip
*when length=8, month is in 5th/6th position and day in 7th/8th
replace month = substr(date3_str,5,2) if leng == 8
replace day = substr(date3_str,7,2) if leng == 8
destring month day year, replace
*creating date3_new
gen date3_new = mdy(month,day,year)
format date3_new %td
*filling in the missing dates
replace date3_new = date3_new[_n-1] + 1 if date3_new == .

We can see the newly created date variable in the following snapshot of the resulting dataset.
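As a quick sanity check (a sketch; the variable name follows the steps above), count any rows where the conversion still failed:

*should report 0 if every date was recovered or filled in
count if missing(date3_new)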

From daily/monthly date variable to quarterly

First, get an example dataset with "raw" date variables. Use the following code:

use "https://dss.princeton.edu/training/time_series_1.dta", clear

To get the quarterly date from the daily date, type:

gen datevar = date(date2, "MDY")
format datevar %td
gen quarterly = qofd(datevar)
format quarterly %tq

Stata will create the following quarterly variable in the dataset.

To get the monthly date from the daily date, and the quarterly date from the monthly date, type:

gen month = month(datevar)
gen day=day(datevar)
gen year=year(datevar)
gen monthly = ym(year,month)
format monthly %tm
gen quarterly_new = qofd(dofm(monthly))
format quarterly_new %tq

Stata will create the following quarterly_new variable in the dataset.

From daily to weekly and getting yearly

First, get an example dataset with "raw" date variables. Use the following code:

use "https://dss.princeton.edu/training/time_series_1.dta", clear

From daily to weekly 

gen datevar = date(date2, "MDY", 2099)
format datevar %td
gen year= year(datevar)
gen w = week(datevar)
gen weekly = yw(year,w)
format weekly %tw

Stata will create the following weekly variable in the dataset.

Getting yearly variable

* From daily to yearly

gen year1 = year(datevar)

* From quarterly to yearly

gen quarterly = qofd(datevar)
format quarterly %tq
gen year2 = yofd(dofq(quarterly))

* From weekly to yearly

gen year3 = yofd(dofw(weekly))

Browse the selected variables and see the resultant year variables.

browse datevar weekly quarterly year1 year2 year3

2. Setting Up Time Series Data

Setting as time series: tsset

Once you have the date variable in a ‘date format’, you need to declare your data as time series in order to use the time series operators. In Stata type:

*get the data
use https://dss.princeton.edu/training/tsdata.dta, clear
*format the date variable
gen date1=substr(date,1,7) 
gen datevar=quarterly(date1,"yq")
format datevar %tq
*set data as a time series
tsset datevar

We will get the following output in the Stata output window

Time variable: datevar, 1957q1 to 2005q1
        Delta: 1 quarter

Your time series may have gaps; for example, data may not be available for weekends. Gaps complicate any analysis that uses lags, since lagged values for the missing dates are undefined. In this case, you may want to create a continuous time index as follows:

gen time = _n

Then use the newly created "time" variable to set the time series:

tsset time

In the case of cross-sectional time series (panel data), type:

sort panelID date
by panelID: gen time = _n
xtset panelID time

Subsetting tin/twithin

With the data tsset (time series set), you can subset observations using two time series functions: tin ('times in', from a to b, inclusive) and twithin ('times within', strictly between a and b, excluding both endpoints). If you have yearly data, just supply the years.

Let's run regressions using tin and twithin options

First, set data as time series

tsset datevar

Run a regression using the tin option

regress unemp interest cpi if tin(1957q1, 1999q4)

Run a regression using the twithin option

regress unemp interest cpi if twithin(2000q1, 2005q1)
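For reference, these subsets can also be written as explicit range filters: tin(a,b) includes both endpoints, while twithin(a,b) excludes them. The following is equivalent to the twithin regression above:

regress unemp interest cpi if datevar > tq(2000q1) & datevar < tq(2005q1)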

3. Constructing Lag, Lead, Difference, and Seasonal Variables

In time series analysis, we often examine how past values influence current ones. We use tools such as lags, leads, differences, and seasonal operators to achieve this. In this section, we will show how to construct these operators.

Step 1: get the data:

use https://dss.princeton.edu/training/tsdata.dta, clear

Step 2: format the date variable
gen date1=substr(date,1,7) 
gen datevar=quarterly(date1,"yq")
format datevar %tq

Step 3: set data as time series
tsset datevar

Lag operators (lag)

To generate lagged (past) values, use the “L.” operator. Let's create lag1, lag2, and lag3 of the unemp variable. Type:

generate unempL1=L1.unemp 
generate unempL2=L2.unemp
generate unempL3=L3.unemp

To check the newly created lag variables, list the first 10 rows of the selected variables.

list datevar unemp unempL1 unempL2 unempL3 in 1/10

Stata will give us the following table:


     +-----------------------------------------------------+
     | datevar      unemp    unempL1    unempL2    unempL3 |
     |-----------------------------------------------------|
  1. |  1957q1   3.933333          .          .          . |
  2. |  1957q2        4.1   3.933333          .          . |
  3. |  1957q3   4.233333        4.1   3.933333          . |
  4. |  1957q4   4.933333   4.233333        4.1   3.933333 |
  5. |  1958q1        6.3   4.933333   4.233333        4.1 |
     |-----------------------------------------------------|
  6. |  1958q2   7.366667        6.3   4.933333   4.233333 |
  7. |  1958q3   7.333333   7.366667        6.3   4.933333 |
  8. |  1958q4   6.366667   7.333333   7.366667        6.3 |
  9. |  1959q1   5.833333   6.366667   7.333333   7.366667 |
 10. |  1959q2        5.1   5.833333   6.366667   7.333333 |
     +-----------------------------------------------------+

 

When you run regressions with lag variables, you do not need to construct the lag variables separately. Instead, you can use lag operators within the regressors. For instance, to add lag1-lag3 of the unemp variable in a regression, run the following code:

regress gdp unemp L1.unemp L2.unemp L3.unemp

To add lag1-lag5 of the unemp variable in a regression, run the following code:

regress gdp unemp L(1/5).unemp

Lag operators (forward)

To generate forward or lead values, use the “F” operator. Let's create lead1, lead2, and lead3 of the unemp variable. Type:

generate unempF1=F1.unemp 
generate unempF2=F2.unemp
generate unempF3=F3.unemp

To check the newly created forward variables, list the first 10 rows of the selected variables.
list datevar unemp unempF1 unempF2 unempF3 in 1/10

Stata will give us the following table:

     +-----------------------------------------------------+
     | datevar      unemp    unempF1    unempF2    unempF3 |
     |-----------------------------------------------------|
  1. |  1957q1   3.933333        4.1   4.233333   4.933333 |
  2. |  1957q2        4.1   4.233333   4.933333        6.3 |
  3. |  1957q3   4.233333   4.933333        6.3   7.366667 |
  4. |  1957q4   4.933333        6.3   7.366667   7.333333 |
  5. |  1958q1        6.3   7.366667   7.333333   6.366667 |
     |-----------------------------------------------------|
  6. |  1958q2   7.366667   7.333333   6.366667   5.833333 |
  7. |  1958q3   7.333333   6.366667   5.833333        5.1 |
  8. |  1958q4   6.366667   5.833333        5.1   5.266667 |
  9. |  1959q1   5.833333        5.1   5.266667        5.6 |
 10. |  1959q2        5.1   5.266667        5.6   5.133333 |
     +-----------------------------------------------------+

 

When you run regressions with forward variables, you do not need to construct the forward variables separately. Rather, you can use lead operators within the regressors. For instance, to add lead1-lead3 of the unemp variable in a regression, run the following code:

regress gdp unemp F1.unemp F2.unemp F3.unemp

To add lead1-lead5 of the unemp variable in a regression, run the following code:

regress gdp unemp F(1/5).unemp

Lag operators (difference)

To generate the difference between current and previous values, use the “D” operator. Let's create difference1 and difference2 of the unemp variable. Type:

generate unempD1=D1.unemp 

generate unempD2=D2.unemp

Note 1: D1 = y_t − y_{t−1}

Note 2: D2 = (y_t − y_{t−1}) − (y_{t−1} − y_{t−2}) = y_t − 2y_{t−1} + y_{t−2}
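The two notes can be verified by constructing the differences manually (a sketch; the *_manual names are illustrative):

*first difference by hand
gen d1_manual = unemp - L1.unemp
*second difference = difference of the first difference
gen d2_manual = d1_manual - L1.d1_manual
list datevar unempD1 d1_manual unempD2 d2_manual in 1/5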

To check the newly created difference variables, list the first 10 rows of the selected variables.

list datevar unemp unempD1 unempD2 in 1/10
Stata will give us the following table:

     +--------------------------------------------+
     | datevar      unemp     unempD1     unempD2 |
     |--------------------------------------------|
  1. |  1957q1   3.933333           .           . |
  2. |  1957q2        4.1    .1666665           . |
  3. |  1957q3   4.233333    .1333332   -.0333333 |
  4. |  1957q4   4.933333    .7000003    .5666671 |
  5. |  1958q1        6.3    1.366667    .6666665 |
     |--------------------------------------------|
  6. |  1958q2   7.366667    1.066667   -.3000002 |
  7. |  1958q3   7.333333   -.0333333        -1.1 |
  8. |  1958q4   6.366667   -.9666667   -.9333334 |
  9. |  1959q1   5.833333   -.5333333    .4333334 |
 10. |  1959q2        5.1   -.7333336   -.2000003 |
     +--------------------------------------------+

 

When you run regressions with difference variables, you do not need to construct the difference variables separately. Instead, you can use difference operators within the regressors. For instance, to add difference1-difference2 of the unemp variable in a regression, run the following code:

regress gdp unemp D1.unemp D2.unemp 

Creating a seasonal variable based on quarterly data

Step 1: generate a variable for the quarter by extracting quarter (1, 2, 3, or 4) from the datevar.

generate quarter = quarter(dofq(datevar))

Step 2: create a season variable and assign labels
generate season = ""
replace season = "Winter" if quarter == 1  // Q1
replace season = "Spring" if quarter == 2  // Q2
replace season = "Summer" if quarter == 3  // Q3
replace season = "Fall"   if quarter == 4  // Q4

Step 3: check the season variable
list datevar quarter season

Stata will give us the following table (part of the table is presented here)

     +----------------------------+
     | datevar   quarter   season |
     |----------------------------|
  1. |  1957q1         1   Winter |
  2. |  1957q2         2   Spring |
  3. |  1957q3         3   Summer |
  4. |  1957q4         4     Fall |
  5. |  1958q1         1   Winter |
     |----------------------------|
  6. |  1958q2         2   Spring |
  7. |  1958q3         3   Summer |
  8. |  1958q4         4     Fall |
  9. |  1959q1         1   Winter |
 10. |  1959q2         2   Spring |
     |----------------------------|
 11. |  1959q3         3   Summer |
 12. |  1959q4         4     Fall |
 13. |  1960q1         1   Winter |
 14. |  1960q2         2   Spring |
 15. |  1960q3         3   Summer |
     |----------------------------|
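If you prefer a numeric season variable (handy with factor-variable notation such as i.season_num in regressions), a labeled-integer version can be created instead. This is a sketch with illustrative names:

label define seas 1 "Winter" 2 "Spring" 3 "Summer" 4 "Fall"
gen season_num = quarter
label values season_num seas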

4. Correlograms: Autocorrelation and Cross-correlations

Step 1: get the data:

use https://dss.princeton.edu/training/tsdata.dta, clear

Step 2: format the date variable
gen date1=substr(date,1,7) 
gen datevar=quarterly(date1,"yq")
format datevar %tq

Step 3: set data as time series
tsset datevar

Correlograms: autocorrelation

To explore autocorrelation, that is, the correlation between a variable and its past values, use the command corrgram. The number of lags to examine depends on theory, an AIC/BIC selection procedure, or experience. The output includes the autocorrelation (AC) and partial autocorrelation (PAC) coefficients used to specify an ARIMA model.

corrgram unemp, lags(12)

Stata will give us the following output:

                                          -1       0       1 -1       0       1
 LAG       AC       PAC      Q     Prob>Q  [Autocorrelation]  [Partial autocor]
-------------------------------------------------------------------------------
1        0.9641   0.9650    182.2  0.0000          |-------           |------- 
2        0.8921  -0.6305   339.02  0.0000          |-------      -----|        
3        0.8045   0.1091   467.21  0.0000          |------            |        
4        0.7184   0.0424   569.99  0.0000          |-----             |        
5        0.6473   0.0836   653.86  0.0000          |-----             |        
6        0.5892  -0.0989   723.72  0.0000          |----              |        
7        0.5356  -0.0384   781.77  0.0000          |----              |        
8        0.4827   0.0744   829.17  0.0000          |---               |        
9        0.4385   0.1879    868.5  0.0000          |---               |-       
10       0.3984  -0.1832   901.14  0.0000          |---              -|        
11       0.3594  -0.1396   927.85  0.0000          |--               -|        
12       0.3219   0.0745    949.4  0.0000          |--                |        

Interpretation

AC column: at lag 3, the AC shows that the correlation between the current value of unemp and its value three quarters ago is 0.8045. The AC can be used to choose q in an MA(q) model, but only for stationary series.

PAC column: at lag 3, the PAC shows that the correlation between the current value of unemp and its value three quarters ago, after removing the effects of the two intervening lags, is 0.1091. The PAC can be used to choose p in an AR(p) model, but only for stationary series.

Q and Prob>Q columns: the portmanteau (Ljung-Box) Q statistic tests the null hypothesis that all autocorrelations up to lag k are jointly equal to 0. This series shows significant autocorrelation: Prob>Q is below 0.05 at every k, so the null of no autocorrelation is rejected at every lag length.

The graph of the AC shows a slow decay, suggesting non-stationarity. See the ac command.

The graph of the PAC shows no significant spikes after the second lag, suggesting an AR(2) structure. See the pac command.
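The correlogram graphs themselves are drawn with the ac and pac commands (the lag count here mirrors the corrgram call above):

ac unemp, lags(12)
pac unemp, lags(12)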

Correlograms: cross-correlation

To explore the relationship between two time series, use the command xcorr. The graph below shows the correlation between the quarterly GDP growth rate and unemployment. When using xcorr, list the independent variable first and the dependent variable second. Type:

xcorr gdp unemp, lags(10) xlabel(-10(1)10,grid)

Stata will give us the following graph:

 

To get the cross-correlation table, type:

xcorr gdp unemp, lags(10) table

We will get the following table:

                 -1       0       1
 LAG      CORR   [Cross-correlation]
------------------------------------
-10     -0.1080           |        
-9      -0.1052           |        
-8      -0.1075           |        
-7      -0.1144           |        
-6      -0.1283          -|        
-5      -0.1412          -|        
-4      -0.1501          -|        
-3      -0.1578          -|        
-2      -0.1425          -|        
-1      -0.1437          -|        
0       -0.1853          -|        
1       -0.1828          -|        
2       -0.1685          -|        
3       -0.1177           |        
4       -0.0716           |        
5       -0.0325           |        
6       -0.0111           |        
7       -0.0038           |        
8        0.0168           |        
9        0.0393           |        
10       0.0419           |        

At lag 0 there is an immediate negative correlation between the GDP growth rate and unemployment, meaning that a drop in GDP is associated with an immediate increase in unemployment.

Let's check the cross-correlation between another pair of variables: interest and unemp.

For the graph, type:

xcorr interest unemp, lags(10) xlabel(-10(1)10,grid)

We will get the following graph:

For the table, type:

xcorr interest unemp, lags(10) table

We will get the following table:

                 -1       0       1
 LAG      CORR   [Cross-correlation]
------------------------------------
-10      0.3297           |--      
-9       0.3150           |--      
-8       0.2997           |--      
-7       0.2846           |--      
-6       0.2685           |--      
-5       0.2585           |--      
-4       0.2496           |-       
-3       0.2349           |-       
-2       0.2323           |-       
-1       0.2373           |-       
0        0.2575           |--      
1        0.3095           |--      
2        0.3845           |---     
3        0.4576           |---     
4        0.5273           |----    
5        0.5850           |----    
6        0.6278           |-----   
7        0.6548           |-----   
8        0.6663           |-----   
9        0.6522           |-----   
10       0.6237           |----    

Interest rates are positively correlated with future unemployment, with the correlation peaking at lag 8 (eight quarters, or two years). That is, interest rates are most strongly associated with unemployment rates eight quarters later.

5. Lag Selection

Too many lags can increase forecast error, and too few can leave out relevant information (see Stock & Watson for details on estimating the BIC and SIC). Experience, knowledge, and theory are usually the best guides to the number of lags needed. There are, however, information criteria to help choose a proper number. Three commonly used ones are Schwarz's Bayesian information criterion (SBIC), Akaike's information criterion (AIC), and the Hannan-Quinn information criterion (HQIC). All three are reported by the command 'varsoc' in Stata.

Step 1: get the data:

use https://dss.princeton.edu/training/tsdata.dta, clear

Step 2: format the date variable
gen date1=substr(date,1,7) 
gen datevar=quarterly(date1,"yq")
format datevar %tq

Step 3: set data as time series
tsset datevar

Step 4: for lag selection type:

varsoc gdp cpi, maxlag(10)

Stata will give us the following table:

Lag-order selection criteria
   Sample: 1959q4 thru 2005q1                              Number of obs = 182
  +---------------------------------------------------------------------------+
  | Lag |    LL      LR      df    p     FPE       AIC      HQIC      SBIC    |
  |-----+---------------------------------------------------------------------|
  |   0 | -1294.75                     5293.32     14.25   14.2642   14.2852  |
  |   1 | -467.289  1654.9    4  0.000 .622031   5.20098    5.2438   5.30661  |
  |   2 | -401.381  131.82    4  0.000 .315041   4.52067   4.59204   4.69672* |
  |   3 | -396.232  10.299    4  0.036 .311102   4.50804   4.60796   4.75451  |
  |   4 | -385.514  21.435*   4  0.000 .288988*  4.43422*  4.56268*   4.7511  |
  |   5 |  -383.92  3.1886    4  0.527 .296769   4.46066   4.61766   4.84796  |
  |   6 | -381.135  5.5701    4  0.234 .300816   4.47401   4.65956   4.93173  |
  |   7 | -379.062  4.1456    4  0.387 .307335   4.49519   4.70929   5.02332  |
  |   8 | -375.483  7.1585    4  0.128 .308865   4.49981   4.74246   5.09836  |
  |   9 | -370.817  9.3311    4  0.053 .306748    4.4925   4.76369   5.16147  |
  |  10 | -370.585  .46392    4  0.977 .319888   4.53391   4.83364   5.27329  |
  +---------------------------------------------------------------------------+
   * optimal lag
   Endogenous: gdp cpi
    Exogenous: _cons

 

When all three criteria agree, the selection is clear; but what happens when they conflict? Ivanov and Kilian (2001) suggest, in the context of VAR models, that AIC tends to be more accurate with monthly data, HQIC works better for quarterly data with samples over 120, and SBIC works fine with any sample size for quarterly data (in VEC models). In our example above we have quarterly data with 182 observations, so HQIC suggests a lag of 4 (as does AIC).

6. Unit Roots

A series with a unit root contains a stochastic trend and is therefore non-stationary. In this section, we will check whether the variable "unemp" has a unit root.

Step 1: get the data:
use https://dss.princeton.edu/training/tsdata.dta, clear

Step 2: format the date variable
gen date1=substr(date,1,7) 
gen datevar=quarterly(date1,"yq")
format datevar %tq

Step 3: set data as time series
tsset datevar

Step 4: draw a line graph of the unemp variable
line unemp datevar

Stata will give us the following graph

Step 5: conduct the unit root test

The Dickey-Fuller test is one of the most commonly used tests for stationarity. The null hypothesis is that the series has a unit root. Type:

dfuller unemp, lags(5)

Stata will give us the following output table:

Augmented Dickey–Fuller test for unit root
Variable: unemp                           Number of obs  = 187
                                          Number of lags =   5
H0: Random walk without drift, d = 0
                                       Dickey–Fuller
                   Test      -------- critical value ---------
              statistic           1%           5%          10%
--------------------------------------------------------------
 Z(t)            -2.597       -3.481       -2.884       -2.574
--------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.0936.

 

The test statistic (-2.597) is greater than the 5% critical value (-2.884), so we fail to reject the null: the unemployment series has a unit root.
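If a deterministic trend seems plausible for the series, the test can also be run with a trend term. This variant is a sketch, not part of the original walkthrough:

dfuller unemp, trend lags(5)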

One way to deal with stochastic trends (unit root) is by taking the first difference of the variable. Type:

dfuller D1.unemp, lags(5)

Stata will give us the following output table:

Augmented Dickey–Fuller test for unit root
Variable: D.unemp                         Number of obs  = 186
                                          Number of lags =   5
H0: Random walk without drift, d = 0
                                       Dickey–Fuller
                   Test      -------- critical value ---------
              statistic           1%           5%          10%
--------------------------------------------------------------
 Z(t)            -5.303       -3.481       -2.884       -2.574
--------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.0000.

The test statistic indicates that the difference of the unemployment series does not have a unit root, as it lies in the rejection region.

7. Testing for Cointegration

Cointegration refers to the fact that two or more series share a stochastic trend (Stock & Watson, 2020). Engle and Granger (1987) suggested a two-step process to test for cointegration (an OLS regression and a unit root test), the EG-ADF test. This section provides a step-by-step guide to conducting the EG-ADF test for cointegration. 

Step 1: get the data:
use https://dss.princeton.edu/training/tsdata.dta, clear

Step 2: format the date variable
gen date1=substr(date,1,7) 
gen datevar=quarterly(date1,"yq")
format datevar %tq

Step 3: set data as time series
tsset datevar

Step 4: run an OLS regression
regress unemp gdp

Step 5: get the residuals
predict e, resid

Step 6: run a unit root test on the residuals
dfuller e, lags(10)

Stata will give the following output:

Augmented Dickey–Fuller test for unit root
Variable: e                               Number of obs  = 181
                                          Number of lags =  10
H0: Random walk without drift, d = 0
                                       Dickey–Fuller
                   Test      -------- critical value ---------
              statistic           1%           5%          10%
--------------------------------------------------------------
 Z(t)            -2.535       -3.483       -2.885       -2.575
--------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.1071.

The test statistic indicates that the residuals have a unit root, which implies the two variables are not cointegrated.

Note: the 5% critical value for this test with one independent variable in the OLS regression is -3.41 (Stock & Watson, 2020). See Stock & Watson (2020) for the full table of critical values for the unit root test and the underlying theory.

8. Granger Causality

Granger causality: using OLS

If you regress ‘y’ on lagged values of ‘y’ and ‘x’, and the coefficients on the lags of ‘x’ are statistically significantly different from 0, then you can argue that ‘x’ Granger-causes ‘y’; that is, ‘x’ can be used to predict ‘y’ (see Stock & Watson, 2020). Here are the steps involved in conducting the Granger causality test using OLS.

Step 1: get the data
use https://dss.princeton.edu/training/tsdata.dta, clear

Step 2: format the date variable
gen date1=substr(date,1,7) 
gen datevar=quarterly(date1,"yq")
format datevar %tq

Step 3: set data as time series
tsset datevar

Step 4: run an OLS regression
regress unemp L(1/4).unemp L(1/4).gdp

Stata will give us the following output table:

      Source |       SS           df       MS      Number of obs   =       188
-------------+----------------------------------   F(8, 179)       =    668.37
       Model |  373.501653         8  46.6877066   Prob > F        =    0.0000
    Residual |  12.5037411       179  .069853302   R-squared       =    0.9676
-------------+----------------------------------   Adj R-squared   =    0.9662
       Total |  386.005394       187  2.06419997   Root MSE        =     .2643
------------------------------------------------------------------------------
       unemp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       unemp |
         L1. |   1.625708   .0763035    21.31   0.000     1.475138    1.776279
         L2. |  -.7695503   .1445769    -5.32   0.000    -1.054845    -.484256
         L3. |   .0868131   .1417562     0.61   0.541    -.1929152    .3665415
         L4. |   .0217041   .0726137     0.30   0.765    -.1215849    .1649931
             |
         gdp |
         L1. |   .0060996   .0136043     0.45   0.654    -.0207458    .0329451
         L2. |  -.0189398   .0128618    -1.47   0.143    -.0443201    .0064405
         L3. |   .0247494   .0130617     1.89   0.060    -.0010253    .0505241
         L4. |    .003637   .0129079     0.28   0.778    -.0218343    .0291083
             |
       _cons |   .1702419    .096857     1.76   0.081    -.0208865    .3613704
------------------------------------------------------------------------------

Conduct the F-test
test L1.gdp L2.gdp L3.gdp L4.gdp

Stata will give us the following output:
. test L1.gdp L2.gdp L3.gdp L4.gdp
 ( 1)  L.gdp = 0
 ( 2)  L2.gdp = 0
 ( 3)  L3.gdp = 0
 ( 4)  L4.gdp = 0
       F(  4,   179) =    1.67
            Prob > F =    0.1601

As Prob > F = 0.1601, we cannot reject the null hypothesis that all coefficients on the lags of ‘gdp’ are equal to 0. Therefore ‘gdp’ does not Granger-cause ‘unemp’.

Granger causality: using VAR

The following procedure uses VAR models to estimate Granger causality using the command ‘vargranger’. 

quietly var unemp gdp, lags(1/4)
vargranger

Stata will give us the following output:

Granger causality Wald tests
  +------------------------------------------------------------------+
  |          Equation           Excluded |   chi2     df Prob > chi2 |
  |--------------------------------------+---------------------------|
  |             unemp                gdp |  6.9953     4    0.136    |
  |             unemp                ALL |  6.9953     4    0.136    |
  |--------------------------------------+---------------------------|
  |               gdp              unemp |  6.8658     4    0.143    |
  |               gdp                ALL |  6.8658     4    0.143    |
  +------------------------------------------------------------------+

The null hypothesis is ‘var1 does not Granger-cause var2’. In both cases, we cannot reject the null that each variable does not Granger-cause the other.

9. Chow Test for Structural Break

The Chow test allows us to test whether a particular date causes a break in the regression coefficients. It is named after Gregory Chow (1960). See Stock & Watson (2020) for more details. 

Using the Conventional Method

Step 1: get the data:
use https://dss.princeton.edu/training/tsdata.dta, clear

Step 2: format the date variable
gen date1=substr(date,1,7) 
gen datevar=quarterly(date1,"yq")
format datevar %tq

Step 3: set data as time series
tsset datevar

Step 4: create a dummy variable equal to 1 if the date is after the break date and 0 otherwise. Here, we’ll test whether the first quarter of 1982 causes a break in the regression coefficients.

gen break = (datevar>tq(1981q4))

Note: Replace “tq” with the function matching your data’s frequency: tw (weekly), tm (monthly), tq (quarterly), th (half-yearly), ty (yearly), and put the date in the corresponding format inside the parentheses.

Step 5: create interaction terms between the break dummy and the lagged dependent and independent variables. We will use lag 1 for this example (the number of lags depends on your theory/data).

generate break_unemp = break*l1.unemp
generate break_gdp = break*l1.gdp

Step 6: run a regression of the outcome variable (in this case ‘unemp’) on the lagged variables, the interaction terms, and the break dummy.

reg unemp l1.unemp l1.gdp break break_unemp break_gdp

Step 7: run an F-test on the coefficients of the interaction terms and the break dummy

test break break_unemp break_gdp

Stata will give us the following test results:

. test break break_unemp break_gdp
 ( 1)  break = 0
 ( 2)  break_unemp = 0
 ( 3)  break_gdp = 0
       F(  3,   185) =    1.14
            Prob > F =    0.3351

Interpretation: the null hypothesis is no break. If the p-value is < 0.05 reject the null in favor of the alternative that there is a break. In this example, we fail to reject the null and conclude that the first quarter of 1982 does not cause a break in the regression coefficients.
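The same F statistic can be recovered by hand from the restricted and unrestricted sums of squared residuals, which is the textbook form of the Chow test (a sketch; the scalar names are illustrative):

*restricted model (no break terms)
reg unemp l1.unemp l1.gdp
scalar ssr_r = e(rss)
*unrestricted model (with break dummy and interactions)
reg unemp l1.unemp l1.gdp break break_unemp break_gdp
scalar ssr_u = e(rss)
*Chow F with q = 3 restrictions
scalar F_chow = ((ssr_r - ssr_u)/3) / (ssr_u/e(df_r))
display F_chow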

10. White Noise

White noise means that a variable has no autocorrelation. In Stata, we use wntestq (the white noise Q, or portmanteau, test) to check for autocorrelation. The null hypothesis is that the series is white noise (no autocorrelation up to the specified number of lags).

Here are the steps to conduct the white noise test in Stata:

Step 1: get the data:
use https://dss.princeton.edu/training/tsdata.dta, clear

Step 2: format the date variable
gen date1=substr(date,1,7) 
gen datevar=quarterly(date1,"yq")
format datevar %tq

Step 3: set data as time series
tsset datevar

Step 4: first, test for white noise without specifying the number of lags:
wntestq unemp

Stata will give us the following outputs:

. wntestq unemp
Portmanteau test for white noise
---------------------------------------
 Portmanteau (Q) statistic =  1044.6341
 Prob > chi2(40)           =     0.0000

As the p-value is less than 0.05, we reject the null hypothesis, implying that the variable is autocorrelated.

Let's now test for autocorrelation, specifying the number of lags (here, 10):

wntestq unemp, lags(10)

Stata will give us the following outputs:

. wntestq unemp, lags(10)
Portmanteau test for white noise
---------------------------------------
 Portmanteau (Q) statistic =   901.1399
 Prob > chi2(10)           =     0.0000

Again, as the p-value is less than 0.05, we reject the null hypothesis, implying that the variable is autocorrelated.

For more help, type help wntestq 

11. Testing for Serial Correlation

The Breusch-Godfrey and Durbin-Watson tests are used to test for serial correlation. The null hypothesis in both tests is that there is no serial correlation. Here are the steps for conducting both tests:

Step 1: get the data:
use https://dss.princeton.edu/training/tsdata.dta, clear

Step 2: format the date variable
gen date1=substr(date,1,7) 
gen datevar=quarterly(date1,"yq")
format datevar %tq

Step 3: set data as time series
tsset datevar

Step 4: run a regression
regress D1.unemp gdp

Stata will give us the following output table:

. regress D1.unemp gdp
      Source |       SS           df       MS      Number of obs   =       192
-------------+----------------------------------   F(1, 190)       =      1.62
       Model |  .205043471         1  .205043471   Prob > F        =    0.2048
    Residual |  24.0656991       190  .126661574   R-squared       =    0.0084
-------------+----------------------------------   Adj R-squared   =    0.0032
       Total |  24.2707425       191   .12707195   Root MSE        =     .3559
------------------------------------------------------------------------------
     D.unemp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         gdp |   -.015947   .0125337    -1.27   0.205    -.0406701    .0087761
       _cons |   .0401123    .036596     1.10   0.274    -.0320743    .1122989
------------------------------------------------------------------------------

Step 5: run the Durbin-Watson test
estat dwatson

Stata will give us the following result:

. estat dwatson
Durbin–Watson d-statistic(  2,   192) =  .7562744

A d-statistic near 2 indicates no serial correlation; the value here (0.756) is well below 2, suggesting positive serial correlation.

Step 6: run Durbin's alternative test
estat durbinalt

Stata will give us the following result:

. estat durbinalt
Durbin's alternative test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
       1     |        118.790               1                   0.0000
---------------------------------------------------------------------------
                        H0: no serial correlation

As the p-value is less than 0.05, we reject the null hypothesis, suggesting the presence of serial correlation.

Step 7: run the Breusch-Godfrey test
estat bgodfrey

Stata will give us the following result:

. estat bgodfrey
Breusch–Godfrey LM test for autocorrelation
---------------------------------------------------------------------------
    lags(p)  |          chi2               df                 Prob > chi2
-------------+-------------------------------------------------------------
       1     |         74.102               1                   0.0000
---------------------------------------------------------------------------
                        H0: no serial correlation

Again, as the p-value is less than 0.05, we reject the null hypothesis, suggesting the presence of serial correlation.

For more help, type help estat dwatson, help estat durbinalt, or help estat bgodfrey

Correcting for serial correlation

To correct for serial correlation, run a Cochrane-Orcutt regression using the prais command:

prais unemp gdp, corc

Stata will give us the following table:

. prais unemp gdp, corc
Iteration 0:  rho = 0.0000
Iteration 1:  rho = 0.9556
Iteration 2:  rho = 0.9660
Iteration 3:  rho = 0.9661
Iteration 4:  rho = 0.9661
Cochrane–Orcutt AR(1) regression with iterated estimates
      Source |       SS           df       MS      Number of obs   =       191
-------------+----------------------------------   F(1, 189)       =      2.48
       Model |  .308369041         1  .308369041   Prob > F        =    0.1167
    Residual |  23.4694088       189  .124176766   R-squared       =    0.0130
-------------+----------------------------------   Adj R-squared   =    0.0077
       Total |  23.7777778       190  .125146199   Root MSE        =    .35239
------------------------------------------------------------------------------
       unemp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         gdp |   -.020264   .0128591    -1.58   0.117    -.0456298    .0051018
       _cons |   6.105931   .7526023     8.11   0.000     4.621351    7.590511
-------------+----------------------------------------------------------------
         rho |    .966115
------------------------------------------------------------------------------
Durbin–Watson statistic (original)    = 0.087210
Durbin–Watson statistic (transformed) = 0.758116

For more details, type help prais

12. Useful Resources

Becketti, S. (2020). Introduction to Time Series Using Stata (rev. ed.). College Station, TX: Stata Press. Available at: https://www.stata.com/bookstore/introduction-to-time-series-using-stata/

Engle, R. F., & Granger, C. W. J. (1987). Co-integration and error correction: Representation, estimation, and testing. Econometrica, 55(2), 251-276.

Hamilton, J. D. (2020). Time Series Analysis. Princeton University Press.

Ivanov, V., & Kilian, L. (2001). A Practitioner's Guide to Lag-Order Selection for Vector Autoregressions (Vol. 2685). London: Centre for Economic Policy Research.

Princeton DSS Libguides. Available at: https://libguides.princeton.edu/c.php?g=1415215

Stock, J. H., & Watson, M. W. (2020). Introduction to Econometrics. Pearson.

Comments or Questions?

If you have questions or comments about this guide or method, please email data@Princeton.edu.