*==============================================================*
* Ex3.do                                                       *
*==============================================================*
****************************************************************
* This do file includes a number of lines that say:            *
*    *STOP HERE AND THINK!!                                    *
* When Stata reaches the first of these, it will not           *
* recognise it as a valid command and will come to a juddering *
* halt. At this point, there will be a few questions for you   *
* to think about. When you have answered them, delete the      *
* offending line and re-run the do-file. It will crash at the  *
* next point where you have some questions to think about      *
****************************************************************
****************************************************************
* In this exercise:                                            *
*  (1) Calculation of autocorrelation structures for           *
*      HILDA variables                                         *
*  (2) Comparison of dynamic impacts for autoregressive and    *
*      distributed lag regression models                       *
*  (3) Simulation of a dynamic binary response model           *
*  (4) GMM estimation of dynamic regression models             *
*  (5) Dynamic random effects ordered probit/logit models      *
****************************************************************
version 13
clear all
set more off
set matsize 800
clear mata
set maxvar 20000
capture log close

local working "P:\working"
log using "`working'\ex3.log", replace

***************************************************************
* Load the panel data and tell Stata it is panel data         *
***************************************************************
use "`working'\longperson_unbal_3.dta"
xtset id wave
xtdes

// Look at the data
foreach var in hgage sex jbmsall jbempt mrcurr esbrd {
    tab `var'
}
summ wsce, de
summ jbhruc, de

// Create required variables
gen jobsat=jbmsall if jbmsall>=0
gen hours=jbhruc if jbhruc>0

// CONSTRUCT THE ACFs
preserve        // we'll be messing about with the data here, so
                // preserve the existing data set
                // so we can revert back to it later

// Loop over variables of interest
foreach var in female jobsat degree {
    matrix acf=J(13,1,1)        // ACF to be stored in matrix
    capture drop x
    gen x=`var' if wave==13     // All correlations will be with the wave 13 obs
    forvalues j=1/12 {          // Loop over lag length
        quietly {               // use quietly to reduce volume of output
            capture drop l`j'x
            gen l`j'x=l`j'.`var'
            corr x l`j'x
            matrix acf[`j'+1,1]=r(rho)
        }
    }
    svmat acf                   // Convert matrix to variable (called acf1 by default)
    rename acf1 corr_`var'
    capture drop sep
    gen sep=_n-1 if _n<=13
    // Plot the ACF and export to file
    twoway scatter corr_`var' sep if _n<=13, lpattern(dash) connect(direct) ///
        graphregion(fcolor(white) ilcolor(white) icolor(white) lcolor(white) ifcolor(white)) ///
        ytitle("ACF") xtitle("Lag in `var'") yscale(range(0 1)) ylabel(0(0.1)1) xlabel(0(1)12)
}

*****************************************************************
* Question to think about:                                      *
*                                                               *
* (1) Re-run for other important HILDA variables. Do you see    *
*     what you expect?                                          *
* (2) Simulate an AR(1) or MA(1) process and plot its ACF.      *
*     What are the implications for model specification?        *
*****************************************************************
STOP HERE AND THINK!!
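*****************************************************************
* One way to start on question (2) above (a sketch only): draw  *
* an AR(1) series with rho=0.8 and look at its ACF. Note that   *
* this clears the data in memory; the -restore- just below      *
* brings the HILDA data back, so nothing is lost.               *
*****************************************************************
clear
set obs 500
set seed 12345
gen t=_n
tsset t
gen e=rnormal()
gen y=e in 1
replace y=0.8*l.y+e in 2/l    // AR(1) with rho=0.8 - try other values, or an MA(1)
ac y, lags(12) title("ACF of a simulated AR(1) with rho=0.8")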
restore    // go back to the earlier dataset

********************************************************************************
*                                                                              *
*   NOW LOOK AT THE TIME PATTERN OF IMPACTS OF COVARIATES IN DYNAMIC           *
*   REGRESSION MODELS: DISTRIBUTED LAG AND AUTOREGRESSIVE SPECIFICATIONS       *
*                                                                              *
********************************************************************************

// Estimate simple model with lagged covariates but no lag in dependent variable
// use random-effects regression
xtreg lwage_hr female age degree l.jbempt l.degree year, re
estimates store distlag

// Estimate autoregressive model with lagged dependent variable
// use Anderson-Hsiao IV
ivregress 2sls d.lwage_hr d.degree ///
    (ld.lwage_hr=l2.lwage_hr year) , first vce(cluster id) noconstant
estimates store armodel

*****************************************************************
* Question to think about:                                      *
*                                                               *
* What is the long-run return to a degree qualification?        *
* Note: you can access specific coefficients by restoring the   *
* relevant estimates then referring to them as _b[varname]      *
* e.g.                                                          *
*    estimates restore distlag                                  *
*    display "LR impact = " _b[degree]+_b[l.degree]             *
*****************************************************************
STOP HERE AND THINK!

*****************************************************************
*                                                               *
* Now let's compute the interim multipliers (impulse responses) *
* and the cumulated impacts                                     *
*                                                               *
*****************************************************************
gen lag=_n-1 if _n<=15    // this will record the order of lag

// construct the impulse-responses
estimates restore distlag
gen distlag=0
replace distlag=_b[degree] if lag==0
replace distlag=_b[l.degree] if lag==1
estimates restore armodel
gen autoreg=_b[d.degree]*(_b[ld.lwage_hr]^lag)

// Now plot them
twoway spike distlag lag if lag<=15, ///
    graphregion(fcolor(white) ilcolor(white) icolor(white) lcolor(white) ifcolor(white)) ///
    ytitle("Response coefficient") xtitle("Lag in x") ///
    title("Distributed lag model",position(6)) saving(distlag, replace) legend(off)
twoway spike autoreg lag if lag<=15, ///
    graphregion(fcolor(white) ilcolor(white) icolor(white) lcolor(white) ifcolor(white)) ///
    ytitle("Response coefficient") xtitle("Lag in x") ///
    title("Autoregressive model",position(6)) saving(autoreg,replace) legend(off)
graph combine distlag.gph autoreg.gph, rows(1) cols(2) ///
    graphregion(fcolor(white) ilcolor(white) icolor(white) lcolor(white) ifcolor(white)) ///
    saving(multipliers,replace)

// Now construct & plot the cumulative impacts
sort lag
gen cumdlag=sum(distlag) if lag<=15
gen cumauto=sum(autoreg) if lag<=15
twoway line cumdlag cumauto lag if lag<=15, ///
    graphregion(fcolor(white) ilcolor(white) icolor(white) lcolor(white) ifcolor(white)) ///
    ytitle("Cumulative multiplier") lpattern(dash solid ) xtitle("Elapsed time") ///
    legend(rows(2) label(1 "Distributed lag model") label(2 "Autoregressive model")) ///
    saving(cumulatedmultipliers,replace)

*****************************************************************
* Question to think about:                                      *
*                                                               *
* How do these estimated responses change as you change the     *
* model (add additional covariates, try higher-order lags in    *
* the distributed lag model, etc...)                            *
*****************************************************************
STOP HERE AND THINK!
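*****************************************************************
* A possible way to pin down the long-run effects implied by    *
* each model, i.e. the limits of the cumulated multipliers      *
* plotted above (a sketch - it assumes the stored estimates     *
* -distlag- and -armodel- from above are still in memory)       *
*****************************************************************
estimates restore distlag
di "Distributed lag model: LR impact of degree = " _b[degree]+_b[l.degree]
estimates restore armodel
// the long-run multiplier of the autoregressive model is beta/(1-rho);
// nlcom also gives a delta-method standard error
nlcom LRdegree: _b[d.degree]/(1-_b[ld.lwage_hr])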
*****************************************************************************
// Saving computer time....
// This do file does a lot of things in one programme. You may want to run
// some of the later things without re-running all the earlier stuff
// so just save a temporary version of the data. In later runs you can
// read the data back in without re-running the whole do file. For example:
//    save "`working'\temp.dta",replace    // include this command the first time
// then include these two lines and run the code from this point on:
//    local working "P:\working"
//    use "`working'\temp.dta",clear
*****************************************************************************

*****************************************************************
*                                                               *
* In the lectures, we went through a simple example,            *
* estimating and simulating a dynamic probit model of marriage, *
* involving both unobserved persistent heterogeneity (i.e. a    *
* random effect) and state dependence (i.e. a lagged dependent  *
* variable)                                                     *
* That example used BHPS data. Now let's re-do it using HILDA   *
* data. Feel free to vary the model specification or simulation *
* details.                                                      *
*                                                               *
*****************************************************************

gen emp=rwscei>0 if rwscei!=.    // employment indicator
tab emp

preserve        // we'll be dropping some cases to keep things
                // simple - so preserve the existing data set
                // so we can revert back to it later

replace age=hgage    // revert to measuring age in years not decades
replace agesq=age^2
keep if age>=16      // NB HILDA interviews 15-year-olds (who can't be married)
// keep males only to avoid conflicting predictions for two marital partners
keep if female==0

// Do some basic data description
summ rwscei,de
summ rwscei if emp,de
xtsum married age further degree emp rwscei
xttab degree
xttab emp
xttab married
fillin id wave
xttrans married
drop if _fillin

// Estimate simple age-earnings profiles for use later in simulations
// do this now, before dropping observations that aren't used in estimation
// of the marriage model
regr rwscei age agesq further degree if age>=16&age<60&emp
estimates store earnsreg

*------------------------------------------------------------*
* We are going to estimate a simple dynamic RE probit model  *
* to explain marital status as a function of marital status  *
* in the previous year, age, whether have a degree, whether  *
* employed and earnings. For simplicity, use a compact panel *
* of individuals who were present in wave 1 and aged 16-30.  *
*------------------------------------------------------------*
egen inw1 = total(wave==1), by(id)    // =1 if present in wave 1
keep if inw1
drop inw1                             // don't need any more
egen agew1 = total(age * (wave==1)), by(id)
drop if agew1<16 | agew1>30
drop if married==.|degree==.|further==.|rwscei==.
sort id wave
by id: gen gap = wave>(wave[_n-1]+1) & _n>1    // detect gap
by id: gen gaps = sum(gap)                     // running count of gaps
keep if gaps==0    // keep all observations before first gap after wave 1
sort id wave
by id: assert wave==(wave[_n-1]+1) if _n>1     // check no gaps!
xtdes,patterns(13)

*****************************************************************
* Question to think about:                                      *
*                                                               *
* How has the sample selection and removal of observations      *
* following gaps altered the characteristics of the sample?     *
* Repeat the data summaries we used earlier and check           *
*****************************************************************
STOP HERE AND THINK!
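*****************************************************************
* A quick way to tackle the question above (a sketch): repeat   *
* the earlier summaries on the selected estimation sample and   *
* compare them with the full-sample versions produced before    *
* the selections                                                *
*****************************************************************
xtsum married age further degree emp rwscei
xttab degree
xttab emp
xttab married
xttrans married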
*------------------------------------------------------------*
* We need to control for initial conditions, i.e. whether    *
* someone was married or not in wave 1. Implement the        *
* Wooldridge-style method. This involves approximating the   *
* individual effect by a linear function of marital status   *
* at the start of the panel and the individual means of the  *
* exogenous time-varying regressors.                         *
*------------------------------------------------------------*
foreach x in age further degree emp rwscei {
    capture drop tmp
    capture drop m`x'
    sort id wave
    by id: gen tmp=`x' if _n>1    // construct the mean using all but the 1st obs
                                  // see Wooldridge
    egen m`x' = mean(tmp), by(id)
}
by id: gen married0 = married[1]    // marital status at start
gen lmarried = l.married            // lagged marital status

xtprobit married lmarried age further degree emp rwscei ///
    mage mfurther mdegree memp mrwscei married0, re
scalar sigma_u=e(sigma_u)
matrix b=e(b)
di "Stored coefficients "
matrix list b
estimates store marriage

**************************************************************
* Question to think about:                                   *
*                                                            *
* Is there (genuine) state dependence in marriage (as        *
* expected)? Is there also evidence that unobserved          *
* characteristics affect the propensity to get married? Are  *
* individual effects related to the initial condition and    *
* explanatory variables?                                     *
**************************************************************
STOP HERE AND THINK!

// Calculate probit intercept under assumption E(u)=0
local pinter=_b[_cons]
foreach v in mage mfurther mdegree memp mrwscei married0 {
    quietly summ `v', meanonly
    local pinter=`pinter'+_b[`v']*r(mean)
}
di " "
di "Probit intercept = " `pinter'

*------------------------------------------------------------*
* Can use -margins- to calculate probability. But need to    *
* be very careful to specify which X are held constant.      *
* We'll first re-run the estimation using factor variables   *
* for the discrete covariates                                *
* Note that we want to hold ui at its mean value of zero.    *
*------------------------------------------------------------*
xtprobit married i.lmarried age i.further i.degree i.emp rwscei ///
    mage mfurther mdegree memp mrwscei i.married0, re
margins, predict(pu0) at((mean) _all lmarried=1 age=40 ///
    further=0 degree=0 emp=1 rwscei=1000)

*------------------------------------------------------------*
*                                                            *
* Calculate the probability of marriage for another          *
* hypothetical person with identical characteristics except  *
* that he was not married last year. Also calculate the      *
* difference, which shows the amount of state dependence.    *
* How does this compare to the difference in raw transition  *
* probabilities? What do you conclude?                       *
*                                                            *
*------------------------------------------------------------*
margins, predict(pu0) at((mean) _all lmarried=0 age=40 ///
    further=0 degree=0 emp=1 rwscei=1000)
margins, predict(pu0) dydx(lmarried) at((mean) _all ///
    lmarried=0 age=40 further=0 degree=0 emp=1 rwscei=1000)

*------------------------------------------------------------*
*                                                            *
* Question to think about                                    *
*                                                            *
* How much difference would an extra $500 per month make to  *
* the unmarried reference person?                            *
* How much difference would it make if they gained a degree? *
*                                                            *
*------------------------------------------------------------*
STOP HERE AND THINK!
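*------------------------------------------------------------*
* A possible starting point for the question above (a sketch *
* only - the covariate values are illustrative): re-run the  *
* margins for the unmarried reference person with rwscei     *
* raised by 500, and then with a degree. Check how rwscei is *
* measured and adjust the increment so that it really does   *
* correspond to $500 per month.                              *
*------------------------------------------------------------*
margins, predict(pu0) at((mean) _all lmarried=0 age=40 ///
    further=0 degree=0 emp=1 rwscei=1500)
margins, predict(pu0) at((mean) _all lmarried=0 age=40 ///
    further=0 degree=1 emp=1 rwscei=1000)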
*------------------------------------------------------------*
* Stochastic simulation                                      *
*------------------------------------------------------------*
drop age agesq
gen age=_n+15 if (_n+15)<60
drop if age==.
gen agesq=age^2

*------------------------------------------------------------*
* Baseline individual: leaves school at 16                   *
*   & is continuously employed at income predicted           *
*   by age-earnings profile                                  *
*------------------------------------------------------------*
replace further=0
replace degree=0
replace emp=1
estimates restore earnsreg    // Bring back the earnings regression
predict base_rwscei
gen base_degree=0
gen base_further=0
gen base_emp=1
gen base_age=age
gen base_u=0                  // fix u_i at zero

*------------------------------------------------------------*
* Alternate individual: gains a degree at age 21             *
*   & is continuously employed after 21 at income            *
*   predicted by age-earnings profile                        *
*------------------------------------------------------------*
replace degree=0 if age<=21
replace degree=1 if age>21
replace further=0
predict alt_rwscei
gen alt_degree=degree
gen alt_further=0
gen alt_emp=1
gen alt_age=age
gen alt_u=0

*------------------------------------------------------------*
* Loop over Monte Carlo replications                         *
*------------------------------------------------------------*
estimates restore marriage
expand 500                    // make room for simulated outcomes
bysort age: gen repno=_n      // identifier for each replication
set seed 1492                 // initialise the random number generator
gen probitresid=rnormal()     // random term in probit
gen base_married=0            // initial unmarried state
gen alt_married=0
gen base_xb=0
gen alt_xb=0
xtset repno age    // important - order the data by age within replications

forvalues t=17/59 {    // loop over ages from 17 onwards
    quietly {
        // first generate the sequences of marriage states for the base individual
        replace base_xb=_b[lmarried]*base_married[_n-1] if age==`t'
        foreach v in age further degree emp rwscei {
            replace base_xb=base_xb+_b[`v']*base_`v' if age==`t'
        }
        summ base_xb if age==`t'
        replace base_xb=base_xb+`pinter'+probitresid if age==`t'
        summ base_xb if age==`t'
        replace base_married=(base_xb)>0 if age==`t'
        // now generate the sequences of marriage states for the alternate individual
        replace alt_xb=_b[lmarried]*alt_married[_n-1] if age==`t'
        foreach v in age further degree emp rwscei {
            replace alt_xb=alt_xb+_b[`v']*alt_`v' if age==`t'
        }
        replace alt_xb=alt_xb+`pinter'+probitresid if age==`t'
        replace alt_married=(alt_xb)>0 if age==`t'
    }
}
list repno age base_married base_xb alt_xb base_degree alt_degree probitresid if repno<5

*------------------------------------------------------------*
* Now analyse the simulated paths however you like           *
*------------------------------------------------------------*
// Example: plot the first 4 simulated marital histories
forvalues plotnumber=1/4 {
    twoway line base_married alt_married age if age<60&repno==`plotnumber', ///
        graphregion(fcolor(white) ilcolor(white) icolor(white) lcolor(white) ///
        ifcolor(white)) ytitle("Marital status") lpattern(dash solid ) ///
        xtitle("Age") legend(rows(1) label(1 "Baseline") ///
        label(2 "With degree")) name(rep`plotnumber',replace)
}
graph combine rep1 rep2 rep3 rep4, rows(2) cols(2) ///
    graphregion(fcolor(white) ilcolor(white) icolor(white) lcolor(white) ifcolor(white))
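*------------------------------------------------------------*
* Another way to summarise the simulations (a sketch;        *
* base_pmar and alt_pmar are new illustrative variable       *
* names): average the simulated marriage indicators across   *
* the 500 replications at each age and plot the two profiles *
*------------------------------------------------------------*
capture drop base_pmar
capture drop alt_pmar
egen base_pmar=mean(base_married), by(age)
egen alt_pmar=mean(alt_married), by(age)
twoway line base_pmar alt_pmar age if repno==1, sort ///
    ytitle("Proportion married") xtitle("Age") lpattern(dash solid) ///
    legend(rows(1) label(1 "Baseline") label(2 "With degree"))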
// Summarise number of marriages & divorces and proportion of time married
// (across replicated histories)
gen base_wedding=base_married>l.base_married
gen alt_wedding=alt_married>l.alt_married
gen base_divorce=base_married==0&l.base_married==1    // the model assumes no lightning remarriages!
gen alt_divorce=alt_married==0&l.alt_married==1
bys repno: egen base_nmar=sum(base_wedding)    // number of marriages
bys repno: egen base_ndiv=sum(base_divorce)    // number of divorces
bys repno: egen alt_nmar=sum(alt_wedding)
bys repno: egen alt_ndiv=sum(alt_divorce)
gen base_nomar=base_nmar==0                    // identify those who never marry
gen alt_nomar=alt_nmar==0
summ base_nomar alt_nomar if age==16    // summarise (only need 1 obs per person)
summ base_nmar alt_nmar if age==16
summ base_ndiv alt_ndiv if age==16
bys repno: egen base_timemarried=sum(base_married)
bys repno: egen alt_timemarried=sum(alt_married)
summ base_timemarried alt_timemarried if age==16

*------------------------------------------------------------*
*                                                            *
* Question to think about                                    *
*                                                            *
* Repeat the simulation varying other characteristics. What  *
* do you think of this way of summarising the properties of  *
* an estimated model compared to (say) presenting a table of *
* odds ratios for each covariate?                            *
*                                                            *
*------------------------------------------------------------*
STOP HERE AND THINK!

restore    // go back to the original HILDA dataset

********************************************************************************
*                                                                              *
*   NOW WE'LL LOOK AT A DYNAMIC LINEAR REGRESSION MODEL FOR THE (LOG)          *
*   HOURLY WAGE, USING A LAGGED DEPENDENT VARIABLE TO CAPTURE LONG-RUN         *
*   DYNAMIC ADJUSTMENT                                                         *
*                                                                              *
********************************************************************************
sort id wave
keep if sex==1    // we're using males here - but feel free to use the female sample (or both)
                  // - but consider the different nature of many women's career histories
                  // and the possible structural differences in model specification for males
                  // and females

*------------------------------------------------------------*
* Before beginning estimation, to get a feel for lags and    *
* differences, and how they change the number of useable     *
* periods in estimation, we will look at some lags and       *
* differences. Lags and first-differences can be created     *
* directly within the syntax of a command in Stata, using    *
* the "l." and the "d." operators, without the need          *
* to create them in advance using the "generate" command.    *
* Considering the log wage variable, we look at the          *
* first difference, the lagged first difference and the      *
* second lag in levels. We choose these as examples because  *
* in the simple Anderson-Hsiao method, the 1st diff is the   *
* dependent var, its lagged first difference is an           *
* explanatory variable and the second lag in levels is       *
* used as an instrument.                                     *
*                                                            *
* Examine the patterns in the data and observe which waves   *
* have to be dropped in estimation - for simplicity, focus   *
* on individuals without any gaps in their waves.            *
* We use both ld. and dl. to show they give identical        *
* results!                                                   *
*------------------------------------------------------------*
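*------------------------------------------------------------*
* A quick tabulation (a sketch) showing in which waves each  *
* construction is defined, to complement the -list- of a few *
* individuals below                                          *
*------------------------------------------------------------*
gen byte has_d =!missing(d.lwage_hr)
gen byte has_ld=!missing(ld.lwage_hr)
gen byte has_l2=!missing(l2.lwage_hr)
tab wave has_d
tab wave has_ld
tab wave has_l2
drop has_d has_ld has_l2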
list id wave lwage_hr d.lwage_hr ld.lwage_hr dl.lwage_hr l2.lwage_hr ///
    in 1/20, sepby(id)

*------------------------------------------------------------*
* We will first estimate dynamic models with one lag using   *
* the pooled OLS estimator, the fixed effect estimator and   *
* the random effect estimator                                *
* Here, we use a quadratic in age/10                         *
*------------------------------------------------------------*
* Dynamic pooled OLS
reg lwage_hr l.lwage_hr age agesq jbempt tucov further degree permanent ///
    nsw , vce(cluster id)
// NB we used the cluster option for computing std errors, since there's
// correlation within individuals which isn't being allowed for by
// including individual effects

* Dynamic RE
xtreg lwage_hr l.lwage_hr age agesq jbempt tucov further degree permanent ///
    nsw , re

* Dynamic FE
xtreg lwage_hr l.lwage_hr age agesq jbempt tucov further degree permanent ///
    nsw , fe

*****************************************************************
* Question to think about:                                      *
*                                                               *
* Given the results for pooled OLS, RE regression and FE        *
* regression, what range of values would you expect to see for  *
* the coefficient on the lagged dependent variable when the     *
* model is "correctly" estimated?                               *
*****************************************************************
STOP HERE AND THINK!

*------------------------------------------------------------*
* Now we estimate the dynamic wage equation using the simple *
* Anderson-Hsiao [1982] estimator. This involves first time- *
* differencing the data, and then using the observations on  *
* wages lagged two periods (as one is used up in the         *
* differencing process) [or more] as an instrument for the   *
* first difference. This estimator is consistent, but        *
* inefficient, as it does not make use of all the available  *
* information.                                               *
*------------------------------------------------------------*
ivregress 2sls d.lwage_hr d.age d.agesq d.jbempt d.tucov d.further d.degree ///
    d.permanent d.nsw d.wave (ld.lwage_hr = l2.lwage_hr), ///
    first vce(cluster id) noconstant

*****************************************************************
* Question to think about:                                      *
*                                                               *
* Are the Anderson-Hsiao estimates consistent with your         *
* expectations?                                                 *
*****************************************************************
STOP HERE AND THINK!
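*****************************************************************
* A possible way to line up the lagged-wage coefficients from   *
* the estimators used so far (a sketch; dynols, dynre and dynfe *
* are illustrative names, and the models are re-run quietly     *
* here just so the results can be stored and tabulated)         *
*****************************************************************
quietly reg lwage_hr l.lwage_hr age agesq jbempt tucov further degree permanent ///
    nsw , vce(cluster id)
estimates store dynols
quietly xtreg lwage_hr l.lwage_hr age agesq jbempt tucov further degree permanent ///
    nsw , re
estimates store dynre
quietly xtreg lwage_hr l.lwage_hr age agesq jbempt tucov further degree permanent ///
    nsw , fe
estimates store dynfe
estimates table dynols dynre dynfe, keep(L.lwage_hr) b se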
*------------------------------------------------------------*
* Now we estimate the model using the Arellano-Bond [1991]   *
* estimator.                                                 *
* First estimate the dynamic wage equation with only one lag *
* using the A-B estimator, and compare the estimate of the   *
* coefficient on the wage in the previous period to those    *
* from the OLS, RE and FE estimators used above: is the sign *
* of the bias as predicted by the theory?                    *
* Note that we suppress the intercept. An intercept is       *
* normally included automatically and would then be dropped  *
* since it's collinear with the difference of age, but       *
* Stata will not compute diagnostic tests if there are       *
* dropped covariates. For the same reason, we do not use the *
* preferable vce(robust) option to give robust standard      *
* errors                                                     *
*------------------------------------------------------------*
* A-B estimator with one lag
xtabond lwage_hr age agesq jbempt tucov further degree permanent ///
    nsw , lags(1) twostep noconst

*------------------------------------------------------------*
* Now check the diagnostics:                                 *
* - what does the Sargan test of overidentifying             *
*   restrictions suggest?                                    *
* - is there any evidence of first- or second-order          *
*   autocorrelation? Is this problematic?                    *
*------------------------------------------------------------*
estat sargan
estat abond

*------------------------------------------------------------*
* There seems to be second-order serial correlation in the   *
* differenced errors, suggesting we have omitted some        *
* dynamics. Try adding a second lag of the dep variable.     *
*------------------------------------------------------------*
STOP HERE AND THINK!

*------------------------------------------------------------*
* By explicitly modelling the dynamics of earnings with      *
* the addition of a second lag we can no longer reject the   *
* null hypothesis of no second-order serial correlation in   *
* the differenced residuals.                                 *
* But the Sargan test still indicates there is a problem     *
* with the instruments. We were concerned that job tenure    *
* might be endogenous (correlated with the transitory error  *
* epsilon), so we endogenise it (using the 2nd lag in levels *
* as an instrument).                                         *
* Notice that variables are only listed once in the command  *
* syntax.                                                    *
*------------------------------------------------------------*
* A-B estimator with two lags, endogenise job tenure
xtabond lwage_hr age agesq tucov further degree permanent ///
    nsw , lags(2) twostep endog(jbempt) noconst
estat sargan
estat abond

*------------------------------------------------------------*
* The Sargan test still indicates a problem. Allow for more  *
* of the explanatory variables to be endogenous. Assume perm *
* job status is endogenous, and that trade union coverage    *
* is predetermined. This is a weaker condition meaning union *
* status isn't correlated with the current transitory error  *
* but is correlated with previous errors. In other words,    *
* random shocks to wages might affect future union or public *
* sector status.                                             *
* What do you find?                                          *
*------------------------------------------------------------*
STOP HERE AND THINK!

*------------------------------------------------------------*
* We now estimate the model using the Blundell-Bond [1998]   *
* estimator. This is done using the xtdpdsys command         *
* introduced in version 10 (previously, a user-written       *
* command, xtabond2, was widely used for B-B). B-B is more   *
* efficient than A-B but only under more stringent           *
* assumptions. Note we don't have to drop the intercept here *
*------------------------------------------------------------*
xtdpdsys lwage_hr age agesq further degree ///
    nsw , lags(2) twostep endog(jbempt permanent) pre(tucov)
estat sargan
estat abond

*****************************************************************
* Question to think about:                                      *
*                                                               *
* Any other bright ideas for improving the model? Try some of   *
* them                                                          *
*****************************************************************
STOP HERE AND THINK!

set more on
log close
exit