*==============================================================*
* Ex3.do                                                       *
*==============================================================*
****************************************************************
* This do file includes a number of lines that say:            *
*    *STOP HERE AND THINK!!                                    *
* When Stata reaches the first of these, it will not           *
* recognise it as a valid command and will come to a juddering *
* halt. At this point, there will be a few questions for you   *
* to think about. When you have answered them, delete the      *
* offending line and re-run the do-file. It will crash at the  *
* next point where you have some questions to think about      *
****************************************************************
****************************************************************
* In this exercise:                                            *
*  (1) Calculation of autocorrelation structures for           *
*      HILDA variables                                         *
*  (2) Comparison of dynamic impacts for autoregressive and    *
*      distributed lag regression models                       *
*  (3) Simulation of a dynamic binary response model           *
*  (4) GMM estimation of dynamic regression models             *
*  (5) Dynamic random effects ordered probit/logit models      *
****************************************************************
version 13
clear all
set more off
set matsize 800
clear mata
set maxvar 20000
capture log close

local working "P:\working"
log using "`working'\ex3.log", replace

***************************************************************
* Load the panel data and tell Stata it is panel data         *
***************************************************************
use "`working'\longperson_unbal_3.dta"
xtset id wave
xtdes

// Look at the data
foreach var in hgage sex jbmsall jbempt mrcurr esbrd {
    tab `var'
}
summ wsce, de
summ jbhruc, de

// Create required variables
gen jobsat=jbmsall if jbmsall>=0
gen hours=jbhruc if jbhruc>0

// CONSTRUCT THE ACFs
preserve        // we'll be messing about with the data here, so
                // preserve the existing data set
                // so we can revert back to it later

// Loop over variables of interest
foreach var in female jobsat degree {
    matrix acf=J(13,1,1)        // ACF to be stored in matrix
    capture drop x
    gen x=`var' if wave==13     // All correlations will be with the wave 13 obs
    forvalues j=1/12 {          // Loop over lag length
        quietly {               // use quietly to reduce volume of output
            capture drop l`j'x
            gen l`j'x=l`j'.`var'
            corr x l`j'x
            matrix acf[`j'+1,1]=r(rho)
        }
    }
    svmat acf                   // Convert matrix to variable (called acf1 by default)
    rename acf1 corr_`var'
    capture drop sep
    gen sep=_n-1 if _n<=13
    // Plot the ACF and export to file
    twoway scatter corr_`var' sep if _n<=13, lpattern(dash) connect(direct) ///
        graphregion(fcolor(white) ilcolor(white) icolor(white) lcolor(white) ifcolor(white)) ///
        ytitle("ACF") xtitle("Lag in `var'") yscale(range(0 1)) ylabel(0(0.1)1) xlabel(0(1)12)
}

*****************************************************************
* Question to think about:                                      *
*                                                               *
* (1) Re-run for other important HILDA variables. Do you see    *
*     what you expect?                                          *
* (2) Simulate an AR(1) or MA(1) process and plot its ACF.      *
*     What are the implications for model specification?        *
*****************************************************************
STOP HERE AND THINK!!
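*****************************************************************
* One way to start on question (2) above (a sketch only): draw  *
* an AR(1) series with rho=0.8 and look at its ACF. Note that   *
* this clears the data in memory; the -restore- just below      *
* brings the HILDA data back, so nothing is lost.               *
*****************************************************************
clear
set obs 500
set seed 12345
gen t=_n
tsset t
gen e=rnormal()
gen y=e in 1
replace y=0.8*l.y+e in 2/l    // AR(1) with rho=0.8 - try other values, or an MA(1)
ac y, lags(12) title("ACF of a simulated AR(1) with rho=0.8")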
restore    // go back to the earlier dataset

********************************************************************************
*                                                                              *
*   NOW LOOK AT THE TIME PATTERN OF IMPACTS OF COVARIATES IN DYNAMIC           *
*   REGRESSION MODELS: DISTRIBUTED LAG AND AUTOREGRESSIVE SPECIFICATIONS       *
*                                                                              *
********************************************************************************

// Estimate simple model with lagged covariates but no lag in dependent variable
// use random-effects regression
xtreg lwage_hr female age degree l.jbempt l.degree year, re
estimates store distlag

// Estimate autoregressive model with lagged dependent variable
// use Anderson-Hsiao IV
ivregress 2sls d.lwage_hr d.degree ///
    (ld.lwage_hr=l2.lwage_hr year) , first vce(cluster id) noconstant
estimates store armodel

*****************************************************************
* Question to think about:                                      *
*                                                               *
* What is the long-run return to a degree qualification?        *
* Note: you can access specific coefficients by restoring the   *
* relevant estimates then referring to them as _b[varname]      *
* e.g.                                                          *
*    estimates restore distlag                                  *
*    display "LR impact = " _b[degree]+_b[l.degree]             *
*****************************************************************
STOP HERE AND THINK!

*****************************************************************
*                                                               *
* Now let's compute the interim multipliers (impulse responses) *
* and the cumulated impacts                                     *
*                                                               *
*****************************************************************
gen lag=_n-1 if _n<=15    // this will record the order of lag

// construct the impulse-responses
estimates restore distlag
gen distlag=0
replace distlag=_b[degree] if lag==0
replace distlag=_b[l.degree] if lag==1
estimates restore armodel
gen autoreg=_b[d.degree]*(_b[ld.lwage_hr]^lag)

// Now plot them
twoway spike distlag lag if lag<=15, ///
    graphregion(fcolor(white) ilcolor(white) icolor(white) lcolor(white) ifcolor(white)) ///
    ytitle("Response coefficient") xtitle("Lag in x") ///
    title("Distributed lag model",position(6)) saving(distlag, replace) legend(off)
twoway spike autoreg lag if lag<=15, ///
    graphregion(fcolor(white) ilcolor(white) icolor(white) lcolor(white) ifcolor(white)) ///
    ytitle("Response coefficient") xtitle("Lag in x") ///
    title("Autoregressive model",position(6)) saving(autoreg,replace) legend(off)
graph combine distlag.gph autoreg.gph, rows(1) cols(2) ///
    graphregion(fcolor(white) ilcolor(white) icolor(white) lcolor(white) ifcolor(white)) ///
    saving(multipliers,replace)

// Now construct & plot the cumulative impacts
sort lag
gen cumdlag=sum(distlag) if lag<=15
gen cumauto=sum(autoreg) if lag<=15
twoway line cumdlag cumauto lag if lag<=15, ///
    graphregion(fcolor(white) ilcolor(white) icolor(white) lcolor(white) ifcolor(white)) ///
    ytitle("Cumulative multiplier") lpattern(dash solid ) xtitle("Elapsed time") ///
    legend(rows(2) label(1 "Distributed lag model") label(2 "Autoregressive model")) ///
    saving(cumulatedmultipliers,replace)

*****************************************************************
* Question to think about:                                      *
*                                                               *
* How do these estimated responses change as you change the     *
* model (add additional covariates, try higher-order lags in    *
* the distributed lag model, etc...)                            *
*****************************************************************
STOP HERE AND THINK!
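*****************************************************************
* A possible way to pin down the long-run effects implied by    *
* each model, i.e. the limits of the cumulated multipliers      *
* plotted above (a sketch - it assumes the stored estimates     *
* -distlag- and -armodel- from above are still in memory)       *
*****************************************************************
estimates restore distlag
di "Distributed lag model: LR impact of degree = " _b[degree]+_b[l.degree]
estimates restore armodel
// the long-run multiplier of the autoregressive model is beta/(1-rho);
// nlcom also gives a delta-method standard error
nlcom LRdegree: _b[d.degree]/(1-_b[ld.lwage_hr])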
*****************************************************************************
// Saving computer time....
// This do file does a lot of things in one programme. You may want to run
// some of the later things without re-running all the earlier stuff
// so just save a temporary version of the data. In later runs you can
// read the data back in without re-running the whole do file. For example:
//    save "`working'\temp.dta",replace    // include this command the first time
// then include these two lines and run the code from this point on:
//    local working "P:\working"
//    use "`working'\temp.dta",clear
*****************************************************************************

*****************************************************************
*                                                               *
* In the lectures, we went through a simple example,            *
* estimating and simulating a dynamic probit model of marriage, *
* involving both unobserved persistent heterogeneity (i.e. a    *
* random effect) and state dependence (i.e. a lagged dependent  *
* variable)                                                     *
* That example used BHPS data. Now let's re-do it using HILDA   *
* data. Feel free to vary the model specification or simulation *
* details.                                                      *
*                                                               *
*****************************************************************

gen emp=rwscei>0 if rwscei!=.    // employment indicator
tab emp

preserve        // we'll be dropping some cases to keep things
                // simple - so preserve the existing data set
                // so we can revert back to it later

replace age=hgage    // revert to measuring age in years not decades
replace agesq=age^2
keep if age>=16      // NB HILDA interviews 15-year-olds (who can't be married)
// keep males only to avoid conflicting predictions for two marital partners
keep if female==0

// Do some basic data description
summ rwscei,de
summ rwscei if emp,de
xtsum married age further degree emp rwscei
xttab degree
xttab emp
xttab married
fillin id wave
xttrans married
drop if _fillin

// Estimate simple age-earnings profiles for use later in simulations
// do this now, before dropping observations that aren't used in estimation
// of the marriage model
regr rwscei age agesq further degree if age>=16&age<60&emp
estimates store earnsreg

*------------------------------------------------------------*
* We are going to estimate a simple dynamic RE probit model  *
* to explain marital status as a function of marital status  *
* in the previous year, age, whether have a degree, whether  *
* employed and earnings. For simplicity, use a compact panel *
* of individuals who were present in wave 1 and aged 16-30.  *
*------------------------------------------------------------*
egen inw1 = total(wave==1), by(id)    // =1 if present in wave 1
keep if inw1
drop inw1                             // don't need any more
egen agew1 = total(age * (wave==1)), by(id)
drop if agew1<16 | agew1>30
drop if married==.|degree==.|further==.|rwscei==.
sort id wave
by id: gen gap = wave>(wave[_n-1]+1) & _n>1    // detect gap
by id: gen gaps = sum(gap)                     // running count of gaps
keep if gaps==0    // keep all observations before first gap after wave 1
sort id wave
by id: assert wave==(wave[_n-1]+1) if _n>1     // check no gaps!
xtdes,patterns(13)

*****************************************************************
* Question to think about:                                      *
*                                                               *
* How has the sample selection and removal of observations      *
* following gaps altered the characteristics of the sample?     *
* Repeat the data summaries we used earlier and check           *
*****************************************************************
STOP HERE AND THINK!
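*****************************************************************
* A quick way to tackle the question above (a sketch): repeat   *
* the earlier summaries on the selected estimation sample and   *
* compare them with the full-sample versions produced before    *
* the selections                                                *
*****************************************************************
xtsum married age further degree emp rwscei
xttab degree
xttab emp
xttab married
xttrans married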
*------------------------------------------------------------*
* We need to control for initial conditions, i.e. whether    *
* someone was married or not in wave 1. Implement the        *
* Wooldridge-style method. This involves approximating the   *
* individual effect by a linear function of marital status   *
* at the start of the panel and the individual means of the  *
* exogenous time-varying regressors.                         *
*------------------------------------------------------------*
foreach x in age further degree emp rwscei {
    capture drop tmp
    capture drop m`x'
    sort id wave
    by id: gen tmp=`x' if _n>1    // construct the mean using all but the 1st obs
                                  // see Wooldridge
    egen m`x' = mean(tmp), by(id)
}
by id: gen married0 = married[1]    // marital status at start
gen lmarried = l.married            // lagged marital status

xtprobit married lmarried age further degree emp rwscei ///
    mage mfurther mdegree memp mrwscei married0, re
scalar sigma_u=e(sigma_u)
matrix b=e(b)
di "Stored coefficients "
matrix list b
estimates store marriage

**************************************************************
* Question to think about:                                   *
*                                                            *
* Is there (genuine) state dependence in marriage (as        *
* expected)? Is there also evidence that unobserved          *
* characteristics affect the propensity to get married? Are  *
* individual effects related to the initial condition and    *
* explanatory variables?                                     *
**************************************************************
STOP HERE AND THINK!

// Calculate probit intercept under assumption E(u)=0
local pinter=_b[_cons]
foreach v in mage mfurther mdegree memp mrwscei married0 {
    quietly summ `v', meanonly
    local pinter=`pinter'+_b[`v']*r(mean)
}
di " "
di "Probit intercept = " `pinter'

*------------------------------------------------------------*
* Can use -margins- to calculate probability. But need to    *
* be very careful to specify which X are held constant.      *
* We'll first re-run the estimation using factor variables   *
* for the discrete covariates                                *
* Note that we want to hold ui at its mean value of zero.    *
*------------------------------------------------------------*
xtprobit married i.lmarried age i.further i.degree i.emp rwscei ///
    mage mfurther mdegree memp mrwscei i.married0, re
margins, predict(pu0) at((mean) _all lmarried=1 age=40 ///
    further=0 degree=0 emp=1 rwscei=1000)

*------------------------------------------------------------*
*                                                            *
* Calculate the probability of marriage for another          *
* hypothetical person with identical characteristics except  *
* that he was not married last year. Also calculate the      *
* difference, which shows the amount of state dependence.    *
* How does this compare to the difference in raw transition  *
* probabilities? What do you conclude?                       *
*                                                            *
*------------------------------------------------------------*
margins, predict(pu0) at((mean) _all lmarried=0 age=40 ///
    further=0 degree=0 emp=1 rwscei=1000)
margins, predict(pu0) dydx(lmarried) at((mean) _all ///
    lmarried=0 age=40 further=0 degree=0 emp=1 rwscei=1000)

*------------------------------------------------------------*
*                                                            *
* Question to think about                                    *
*                                                            *
* How much difference would an extra $500 per month make to  *
* the unmarried reference person?                            *
* How much difference would it make if they gained a degree? *
*                                                            *
*------------------------------------------------------------*
STOP HERE AND THINK!
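*------------------------------------------------------------*
* A possible starting point for the question above (a sketch *
* only - the covariate values are illustrative): re-run the  *
* margins for the unmarried reference person with rwscei     *
* raised by 500, and then with a degree. Check how rwscei is *
* measured and adjust the increment so that it really does   *
* correspond to $500 per month.                              *
*------------------------------------------------------------*
margins, predict(pu0) at((mean) _all lmarried=0 age=40 ///
    further=0 degree=0 emp=1 rwscei=1500)
margins, predict(pu0) at((mean) _all lmarried=0 age=40 ///
    further=0 degree=1 emp=1 rwscei=1000)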
*------------------------------------------------------------*
* Stochastic simulation                                      *
*------------------------------------------------------------*
drop age agesq
gen age=_n+15 if (_n+15)<60
drop if age==.
gen agesq=age^2

*------------------------------------------------------------*
* Baseline individual: leaves school at 16                   *
*   & is continuously employed at income predicted           *
*   by age-earnings profile                                  *
*------------------------------------------------------------*
replace further=0
replace degree=0
replace emp=1
estimates restore earnsreg    // Bring back the earnings regression
predict base_rwscei
gen base_degree=0
gen base_further=0
gen base_emp=1
gen base_age=age
gen base_u=0                  // fix u_i at zero

*------------------------------------------------------------*
* Alternate individual: gains a degree at age 21             *
*   & is continuously employed after 21 at income            *
*   predicted by age-earnings profile                        *
*------------------------------------------------------------*
replace degree=0 if age<=21
replace degree=1 if age>21
replace further=0
predict alt_rwscei
gen alt_degree=degree
gen alt_further=0
gen alt_emp=1
gen alt_age=age
gen alt_u=0

*------------------------------------------------------------*
* Loop over Monte Carlo replications                         *
*------------------------------------------------------------*
estimates restore marriage
expand 500                    // make room for simulated outcomes
bysort age: gen repno=_n      // identifier for each replication
set seed 1492                 // initialise the random number generator
gen probitresid=rnormal()     // random term in probit
gen base_married=0            // initial unmarried state
gen alt_married=0
gen base_xb=0
gen alt_xb=0
xtset repno age    // important - order the data by age within replications

forvalues t=17/59 {    // loop over ages from 17 onwards
    quietly {
        // first generate the sequences of marriage states for the base individual
        replace base_xb=_b[lmarried]*base_married[_n-1] if age==`t'
        foreach v in age further degree emp rwscei {
            replace base_xb=base_xb+_b[`v']*base_`v' if age==`t'
        }
        summ base_xb if age==`t'
        replace base_xb=base_xb+`pinter'+probitresid if age==`t'
        summ base_xb if age==`t'
        replace base_married=(base_xb)>0 if age==`t'
        // now generate the sequences of marriage states for the alternate individual
        replace alt_xb=_b[lmarried]*alt_married[_n-1] if age==`t'
        foreach v in age further degree emp rwscei {
            replace alt_xb=alt_xb+_b[`v']*alt_`v' if age==`t'
        }
        replace alt_xb=alt_xb+`pinter'+probitresid if age==`t'
        replace alt_married=(alt_xb)>0 if age==`t'
    }
}
list repno age base_married base_xb alt_xb base_degree alt_degree probitresid if repno<5

*------------------------------------------------------------*
* Now analyse the simulated paths however you like           *
*------------------------------------------------------------*
// Example: plot the first 4 simulated marital histories
forvalues plotnumber=1/4 {
    twoway line base_married alt_married age if age<60&repno==`plotnumber', ///
        graphregion(fcolor(white) ilcolor(white) icolor(white) lcolor(white) ///
        ifcolor(white)) ytitle("Marital status") lpattern(dash solid ) ///
        xtitle("Age") legend(rows(1) label(1 "Baseline") ///
        label(2 "With degree")) name(rep`plotnumber',replace)
}
graph combine rep1 rep2 rep3 rep4, rows(2) cols(2) ///
    graphregion(fcolor(white) ilcolor(white) icolor(white) lcolor(white) ifcolor(white))
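*------------------------------------------------------------*
* Another way to summarise the simulations (a sketch;        *
* base_pmar and alt_pmar are new illustrative variable       *
* names): average the simulated marriage indicators across   *
* the 500 replications at each age and plot the two profiles *
*------------------------------------------------------------*
capture drop base_pmar
capture drop alt_pmar
egen base_pmar=mean(base_married), by(age)
egen alt_pmar=mean(alt_married), by(age)
twoway line base_pmar alt_pmar age if repno==1, sort ///
    ytitle("Proportion married") xtitle("Age") lpattern(dash solid) ///
    legend(rows(1) label(1 "Baseline") label(2 "With degree"))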
// Summarise number of marriages & divorces and proportion of time married
// (across replicated histories)
gen base_wedding=base_married>l.base_married
gen alt_wedding=alt_married>l.alt_married
gen base_divorce=base_married==0&l.base_married==1    // the model assumes no lightning remarriages!
gen alt_divorce=alt_married==0&l.alt_married==1
bys repno: egen base_nmar=sum(base_wedding)    // number of marriages
bys repno: egen base_ndiv=sum(base_divorce)    // number of divorces
bys repno: egen alt_nmar=sum(alt_wedding)
bys repno: egen alt_ndiv=sum(alt_divorce)
gen base_nomar=base_nmar==0                    // identify those who never marry
gen alt_nomar=alt_nmar==0
summ base_nomar alt_nomar if age==16    // summarise (only need 1 obs per person)
summ base_nmar alt_nmar if age==16
summ base_ndiv alt_ndiv if age==16
bys repno: egen base_timemarried=sum(base_married)
bys repno: egen alt_timemarried=sum(alt_married)
summ base_timemarried alt_timemarried if age==16

*------------------------------------------------------------*
*                                                            *
* Question to think about                                    *
*                                                            *
* Repeat the simulation varying other characteristics. What  *
* do you think of this way of summarising the properties of  *
* an estimated model compared to (say) presenting a table of *
* odds ratios for each covariate?                            *
*                                                            *
*------------------------------------------------------------*
STOP HERE AND THINK!

restore    // go back to the original HILDA dataset

********************************************************************************
*                                                                              *
*   NOW WE'LL LOOK AT A DYNAMIC LINEAR REGRESSION MODEL FOR THE (LOG)          *
*   HOURLY WAGE, USING A LAGGED DEPENDENT VARIABLE TO CAPTURE LONG-RUN         *
*   DYNAMIC ADJUSTMENT                                                         *
*                                                                              *
********************************************************************************
sort id wave
keep if sex==1    // we're using males here - but feel free to use the female sample (or both)
                  // - but consider the different nature of many women's career histories
                  // and the possible structural differences in model specification for males
                  // and females

*------------------------------------------------------------*
* Before beginning estimation, to get a feel for lags and    *
* differences, and how they change the number of useable     *
* periods in estimation, we will look at some lags and       *
* differences. Lags and first-differences can be created     *
* directly within the syntax of a command in Stata, using    *
* the "l." and the "d." operators, without the need          *
* to create them in advance using the "generate" command.    *
* Considering the log wage variable, we look at the          *
* first difference, the lagged first difference and the      *
* second lag in levels. We choose these as examples because  *
* in the simple Anderson-Hsiao method, the 1st diff is the   *
* dependent var, its lagged first difference is an           *
* explanatory variable and the second lag in levels is       *
* used as an instrument.                                     *
*                                                            *
* Examine the patterns in the data and observe which waves   *
* have to be dropped in estimation - for simplicity, focus   *
* on individuals without any gaps in their waves.            *
* We use both ld. and dl. to show they give identical        *
* results!                                                   *
*------------------------------------------------------------*
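*------------------------------------------------------------*
* A quick tabulation (a sketch) showing in which waves each  *
* construction is defined, to complement the -list- of a few *
* individuals below                                          *
*------------------------------------------------------------*
gen byte has_d =!missing(d.lwage_hr)
gen byte has_ld=!missing(ld.lwage_hr)
gen byte has_l2=!missing(l2.lwage_hr)
tab wave has_d
tab wave has_ld
tab wave has_l2
drop has_d has_ld has_l2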
list id wave lwage_hr d.lwage_hr ld.lwage_hr dl.lwage_hr l2.lwage_hr ///
    in 1/20, sepby(id)

*------------------------------------------------------------*
* We will first estimate dynamic models with one lag using   *
* the pooled OLS estimator, the fixed effect estimator and   *
* the random effect estimator                                *
* Here, we use a quadratic in age/10                         *
*------------------------------------------------------------*
* Dynamic pooled OLS
reg lwage_hr l.lwage_hr age agesq jbempt tucov further degree permanent ///
    nsw , vce(cluster id)
// NB we used the cluster option for computing std errors, since there's
// correlation within individuals which isn't being allowed for by
// including individual effects

* Dynamic RE
xtreg lwage_hr l.lwage_hr age agesq jbempt tucov further degree permanent ///
    nsw , re

* Dynamic FE
xtreg lwage_hr l.lwage_hr age agesq jbempt tucov further degree permanent ///
    nsw , fe

*****************************************************************
* Question to think about:                                      *
*                                                               *
* Given the results for pooled OLS, RE regression and FE        *
* regression, what range of values would you expect to see for  *
* the coefficient on the lagged dependent variable when the     *
* model is "correctly" estimated?                               *
*****************************************************************
STOP HERE AND THINK!

*------------------------------------------------------------*
* Now we estimate the dynamic wage equation using the simple *
* Anderson-Hsiao [1982] estimator. This involves first time- *
* differencing the data, and then using the observations on  *
* wages lagged two periods (as one is used up in the         *
* differencing process) [or more] as an instrument for the   *
* first difference. This estimator is consistent, but        *
* inefficient, as it does not make use of all the available  *
* information.                                               *
*------------------------------------------------------------*
ivregress 2sls d.lwage_hr d.age d.agesq d.jbempt d.tucov d.further d.degree ///
    d.permanent d.nsw d.wave (ld.lwage_hr = l2.lwage_hr), ///
    first vce(cluster id) noconstant

*****************************************************************
* Question to think about:                                      *
*                                                               *
* Are the Anderson-Hsiao estimates consistent with your         *
* expectations?                                                 *
*****************************************************************
STOP HERE AND THINK!
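*****************************************************************
* A possible way to line up the lagged-wage coefficients from   *
* the estimators used so far (a sketch; dynols, dynre and dynfe *
* are illustrative names, and the models are re-run quietly     *
* here just so the results can be stored and tabulated)         *
*****************************************************************
quietly reg lwage_hr l.lwage_hr age agesq jbempt tucov further degree permanent ///
    nsw , vce(cluster id)
estimates store dynols
quietly xtreg lwage_hr l.lwage_hr age agesq jbempt tucov further degree permanent ///
    nsw , re
estimates store dynre
quietly xtreg lwage_hr l.lwage_hr age agesq jbempt tucov further degree permanent ///
    nsw , fe
estimates store dynfe
estimates table dynols dynre dynfe, keep(L.lwage_hr) b se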
*------------------------------------------------------------*
* Now we estimate the model using the Arellano-Bond [1991]   *
* estimator.                                                 *
* First estimate the dynamic wage equation with only one lag *
* using the A-B estimator, and compare the estimate of the   *
* coefficient on the wage in the previous period to those    *
* from the OLS, RE and FE estimators used above: is the sign *
* of the bias as predicted by the theory?                    *
* Note that we suppress the intercept. An intercept is       *
* normally included automatically and would then be dropped  *
* since it's collinear with the difference of age, but       *
* Stata will not compute diagnostic tests if there are       *
* dropped covariates. For the same reason, we do not use the *
* preferable vce(robust) option to give robust standard      *
* errors                                                     *
*------------------------------------------------------------*
* A-B estimator with one lag
xtabond lwage_hr age agesq jbempt tucov further degree permanent ///
    nsw , lags(1) twostep noconst

*------------------------------------------------------------*
* Now check the diagnostics:                                 *
* - what does the Sargan test of overidentifying             *
*   restrictions suggest?                                    *
* - is there any evidence of first- or second-order          *
*   autocorrelation? Is this problematic?                    *
*------------------------------------------------------------*
estat sargan
estat abond

*------------------------------------------------------------*
* There seems to be second-order serial correlation in the   *
* differenced errors, suggesting we have omitted some        *
* dynamics. Try adding a second lag of the dep variable.     *
*------------------------------------------------------------*
STOP HERE AND THINK!

*------------------------------------------------------------*
* By explicitly modelling the dynamics of earnings with      *
* the addition of a second lag we can no longer reject the   *
* null hypothesis of no second-order serial correlation in   *
* the differenced residuals.                                 *
* But the Sargan test still indicates there is a problem     *
* with the instruments. We were concerned that job tenure    *
* might be endogenous (correlated with the transitory error  *
* epsilon), so we endogenise it (using the 2nd lag in levels *
* as an instrument).                                         *
* Notice that variables are only listed once in the command  *
* syntax.                                                    *
*------------------------------------------------------------*
* A-B estimator with two lags, endogenise job tenure
xtabond lwage_hr age agesq tucov further degree permanent ///
    nsw , lags(2) twostep endog(jbempt) noconst
estat sargan
estat abond

*------------------------------------------------------------*
* The Sargan test still indicates a problem. Allow for more  *
* of the explanatory variables to be endogenous. Assume perm *
* job status is endogenous, and that trade union coverage    *
* is predetermined. This is a weaker condition meaning union *
* status isn't correlated with the current transitory error  *
* but is correlated with previous errors. In other words,    *
* random shocks to wages might affect future union or public *
* sector status.                                             *
* What do you find?                                          *
*------------------------------------------------------------*
STOP HERE AND THINK!

*------------------------------------------------------------*
* We now estimate the model using the Blundell-Bond [1998]   *
* estimator. This is done using the xtdpdsys command         *
* introduced in version 10 (previously, a user-written       *
* command, xtabond2, was widely used for B-B). B-B is more   *
* efficient than A-B but only under more stringent           *
* assumptions. Note we don't have to drop the intercept here *
*------------------------------------------------------------*
xtdpdsys lwage_hr age agesq further degree ///
    nsw , lags(2) twostep endog(jbempt permanent) pre(tucov)
estat sargan
estat abond

*****************************************************************
* Question to think about:                                      *
*                                                               *
* Any other bright ideas for improving the model? Try some of   *
* them                                                          *
*****************************************************************
STOP HERE AND THINK!

set more on
log close
exit