*==============================================================*
*                            Ex2.do                            *
*==============================================================*
****************************************************************
* This do file includes a number of lines that say:            *
*              *STOP HERE AND THINK!!                           *
* Each marks a point where there are a few questions for you   *
* to think about. Because the line starts with *, Stata reads  *
* it as a comment and carries on. To make Stata halt there,    *
* delete the leading *: Stata will then not recognise the line *
* as a valid command and will come to a juddering halt. When   *
* you have answered the questions, delete the offending line   *
* and re-run the do-file. It will stop at the next such line.  *
****************************************************************
****************************************************************
* In this exercise:                                            *
* Hausman test of the random-effects model                     *
* Examples of the linear probability model                     *
* Random effects probit, logit and conditional (fixed effects) *
* logit models for a binary dependent variable.                *
* Calculation of predicted probabilities                       *
* Instrumental variable estimation (using xtivreg, xtivreg2    *
* and xthtaylor) for models involving endogenous explanatory   *
* variables                                                    *
****************************************************************
version 13
clear all
set more off
set matsize 800
clear mata
set maxvar 20000
capture log close
local working "P:\working"
log using "`working'\ex2.log", replace
***************************************************************
* Load the panel data and tell Stata it is panel data         *
***************************************************************
use "`working'\longperson_unbal_2.dta" 
xtset id wave
***************************************************************
* The RE estimates are only consistent if the regressors are  *
* uncorrelated with the individual effects. Test this 	      *
* assumption using a Hausman test.			                  *
***************************************************************
xtreg lwage_hr age agesq cohort married female jbemp degree further ///
	 tucov permanent nsw if keeper==1, fe
estimates store fixed //store results under name fixed
xtreg lwage_hr age agesq cohort married female jbemp degree further ///
	 tucov permanent nsw if keeper==1, re
estimates store random //store results under name random
hausman fixed random
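***************************************************************
* Optional sketch (not part of the original exercise): if the
* test above reports a negative chi2, or a covariance matrix
* that is not positive definite, hausman's sigmamore option
* bases both covariance matrices on the disturbance variance
* estimated from the efficient (RE) model, which often gives a
* usable statistic. Treat it as a robustness check only.
***************************************************************
hausman fixed random, sigmamore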
***************************************************************
* Now test the zero correlation using the Mundlak formulation.*
* Recall that this involves augmenting the RE model with the  *
* individual means of the time varying characteristics.       *
***************************************************************
* Create individual means for time-varying covariates
foreach x in age agesq married jbemp ///
	degree further tucov permanent nsw {
	capture drop m`x'
	egen m`x'=mean(`x'), by(id) //individual means
} 
*****************************************************************
* Questions to think about:                                     *
*                                                               *
* (1) Re-run the random effects regression adding the covariate *
*     means as additional RHS variables. How do the coefficients*
*     compare with the ones you produced earlier?               *
* (2) Test the RE assumption using the Mundlak approach, using  *
*     the test command to test the significance of the added    *
*     variables                                                 *
* (3) Based on the results of the two tests, what do you        *
*     conclude about the appropriateness of the RE model?       *
*     What do the differences between FE and RE estimates say   *
*     about possible omitted variable biases?                   *
*****************************************************************
*STOP HERE AND THINK!!
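*****************************************************************
* One possible way to approach questions (1) and (2); a sketch,
* not the official solution. It assumes the individual means
* mage ... mnsw created by the foreach loop above exist. Note
* that those means were computed over all observations, not only
* those with keeper==1, which is an approximation.
*****************************************************************
xtreg lwage_hr age agesq cohort married female jbemp degree further ///
	tucov permanent nsw mage magesq mmarried mjbemp mdegree mfurther ///
	mtucov mpermanent mnsw if keeper==1, re
* Joint test of the added means: rejection suggests that the
* individual effects are correlated with the regressors
test mage magesq mmarried mjbemp mdegree mfurther mtucov mpermanent mnsw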
**************************************************************
* Continuing with the wage example, suppose we specifically  *
* want to examine the determinants of low pay. Create a dummy*
* variable to indicate an hourly wage of less than $14/hour. * 
* Examine the proportion of low paid workers, how it has     *
* changed over time and low pay transitions.                 *
**************************************************************
gen lopay=w_hr<14 if !missing(w_hr) //missing wages stay missing, not 0
xttab lopay
xttrans lopay
bysort wave: su lopay
**************************************************************
* Estimate a linear probability model (LPM) to explain low   *
* pay, using the same specification as the wage equation.    *
* Estimate a RE model.                                       *
**************************************************************
xtreg lopay age agesq cohort married female further degree ///
	jbemp  permanent nsw  if keeper==1, re
**************************************************************
* Predict the probability of being low paid for each person. *
* Type help xtreg postestimation for details of the predict  *
* command.                                                   *
**************************************************************
predict reprob if keeper==1, xb
predict reprob1 if keeper==1, xbu //including predicted individual effect
summ lopay reprob reprob1, de 
**************************************************************
* Estimate an FE LPM and examine the predicted probabilities.*
**************************************************************
xtreg lopay age agesq cohort married female further degree ///
	jbemp  permanent nsw if keeper==1, fe
predict feprob if keeper==1, xb
predict feprob1 if keeper==1, xbu //including predicted individual effect
summ lopay feprob feprob1, de
**************************************************************
* Questions to think about:                                  *
* (1) This is a very crude model. Do you think the           *
*     covariates are likely to be adequate to explain        *
*     movements in the prevalence and pattern of low pay     *
*     over time? How would you interpret the coefficients    *
*     in these linear probability models? [Note they can be  *
*     interpreted directly, unlike in the probit and logit   *
*     models below.]                                         *
* (2) Examine the predictions. What do you notice?           *
*     Compare the covariate means for cases where the        *
*     predictions are clearly implausible with those for     *
*     cases where the predictions are a priori plausible.    *
*     Why might there be a problem for policy analysis?      *
**************************************************************
*STOP HERE AND THINK!!
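**************************************************************
* A sketch for question (2), under the assumption that
* "implausible" means a predicted probability outside [0,1].
* lpm_out is a hypothetical variable name introduced here; it
* flags RE LPM predictions (xb only) outside the unit interval.
**************************************************************
capture drop lpm_out
gen byte lpm_out = (reprob<0 | reprob>1) if !missing(reprob)
tab lpm_out
* compare covariate means for plausible vs implausible cases
tabstat age married female degree further jbemp permanent nsw ///
	if keeper==1, by(lpm_out) statistics(mean)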
**************************************************************
* Estimate the model as a RE probit and examine the          *
* predicted probabilities (type help xtprobit postestimation *
* for details of the predict command).                       *
* What do you notice? How have the probabilities been        *
* calculated?                                                *
**************************************************************
xtprobit lopay age agesq cohort married female further degree ///
	jbemp  permanent nsw if keeper==1, re
predict repprob if keeper==1, pu0
summ lopay repprob, de
**************************************************************
* Now estimate the model as a RE logit and examine the       *
* predicted probabilities (type help xtlogit postestimation  *
* for details of the predict command).                       *
* What do you notice? How have the probabilities been        *
* calculated?                                                *
**************************************************************
xtlogit lopay age agesq cohort married female further degree ///
	jbemp  permanent nsw if keeper==1, re
predict relprob if keeper==1, pu0
summ lopay relprob, de
**************************************************************
* Do the linear probability model, logit model and probit    *
* model produce similar predictions of Pr(y=1) for different *
* types of individuals? Use scatter plots of predicted       *
* probabilities from the different (random effects) models.  *
**************************************************************
*First plot logit and probit predicted probabilities for first 2000 obs
scatter relprob repprob in 1/2000, ///
title(Logit and Probit predicted probabilities) ///
xtitle(Predicted probabilities from the probit) ///
ytitle(Predicted probabilities from the logit) ///
scheme(s1color) legend(off)
graph save "`working'\Logit vs. Probit.gph", replace
* Then plot LPM and logit predicted probabilities *
scatter reprob relprob in 1/2000, msize(small) ///
title(LPM and Logit predicted probabilities) ///
xtitle(Predicted probabilities from the logit) ///
ytitle(Predicted probabilities from the LPM) scheme(s1color)
graph save "`working'\Logit vs. LPM.gph", replace
* now combine the graphs and export them  to a Windows 
* MetaFile for inclusion in a Word document
graph combine  "`working'\Logit vs. Probit.gph"  ///
        "`working'\Logit vs. LPM.gph", cols(2)
graph export "`working'\Logit vs. Probit vs. LPM.wmf", ///
        as(wmf) replace	
**************************************************************
* Questions to think about:                                  *
*                                                            *
* (1) What do these plots tell you about the choice between  *
*     logit, probit and linear models?                       *
* (2) Produce a plot of the logit vs. linear predicted       *
*     probabilities, restricting attention to cases where    *
*     the linear model predicts a probability below 0.1      *
**************************************************************
*STOP HERE AND THINK!!
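**************************************************************
* A possible plot for question (2); a sketch only. It uses the
* RE LPM prediction reprob and the RE logit prediction relprob
* created above, with the LPM on the horizontal axis. The
* reference lines at zero are an added convenience.
**************************************************************
scatter relprob reprob if reprob<0.1 & keeper==1, msize(small) ///
	title("Logit vs. LPM where the LPM prediction is below 0.1") ///
	xtitle("Predicted probabilities from the LPM") ///
	ytitle("Predicted probabilities from the logit") ///
	xline(0) yline(0) scheme(s1color)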
**************************************************************
* Estimate the FE logit (recall there is no FE probit).      *
* Why does Stata report that it has dropped (a lot of!)      *
* observations? Why is there no constant reported?           *
**************************************************************
xtlogit lopay age agesq cohort married female further degree ///
	jbemp  permanent nsw if keeper==1, fe
**************************************************************
* Predict the probabilities implied by the FE logit. What do *
* you notice? How are the probabilities calculated?          *
**************************************************************
predict felprob, pu0
summ felprob, de
**************************************************************
* We wish to test the RE assumption that the individual      *
* effects are uncorrelated with the regressors. We compare   *
* the RE and FE logit models. Re-estimate both models and    *
* store the results. Carry out a Hausman test                *
**************************************************************
xtlogit lopay age agesq cohort married further degree jbemp /// 
	permanent nsw if keeper==1, fe
estimates store fixed
xtlogit lopay age agesq cohort married further degree jbemp /// 
	permanent nsw if keeper==1, re
estimates store random
hausman fixed random
**************************************************************
* Questions to think about:                                  *
*                                                            *
* (1) What's the interpretation of the Hausman test?         *
* (2) How do the FE coefficients differ from the RE model?   *
* (3) Which model do you prefer?                             *
**************************************************************
*STOP HERE AND THINK!!
**************************************************************
* Predicting the probability for a person with selected      *
* characteristics.                                           *
* The predict command predicts the probabilities             *
* for each observation separately (given their               *
* characteristics).                                          *
* To make predictions for a specific type of person we need  *
* to use the model formulae directly and plug in the chosen  *
* values. Use command nlcom (which also gives standard       *
* errors). Note this is a rather clumsy manual approach:     *
* the more complicated margins command could be used instead *
*                                                            *
* Calculate the probability of being low paid for a reference*
* male aged 40 and born in 1967 (i.e. current yr is 2007),   *
* who is not married, has no further or higher education,    *
* non-unionised permanent job, has been on his job for 1 year*
* and does not live in nsw.                                  * 
* Calculate the probability for                              *
* 3 values of individual effect, low (-1 sd), medium (0) and *
* high (+1 sd). What impact do unobserved individual level   *
* factors have on the probability of being low paid?         *
**************************************************************
* For convenience, use model just estimated (RE logit)
* "replay" previous results for reference
xtlogit
**************************************************************
* write down xb for ref person using stored coefficients,
* accessed using _b notation
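* use a local "macro" to store the text for future use
* NOTE: the multipliers 4 and 16 only match "aged 40" if age is
* measured in decades; if age is in years, use 40 and 1600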
**************************************************************
local refxb _b[_cons] + _b[age]*4 + _b[agesq]*16 + /// 
     _b[cohort]*1967 + _b[jbemp]*1 + _b[permanent]
* use a macro to store the standard deviation of u(i) *
local ui=e(sigma_u) //estimate of sd from model
display e(sigma_u)
**************************************************************
* NOTE: macro subtlety                                       *
* A local statement without the = sign stores the            *
* following text directly in the macro                       *
* A local statement with the = sign evaluates the expression *
* and stores the resulting number in the macro               *
* You will see difference below when the macro contents are  *
* displayed.                                                 *
**************************************************************
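* For instance, to see the two kinds of macro side by side
* (refxb and ui are the macros defined just above):
display "`refxb'"   //text: the formula itself, not yet evaluated
display `ui'        //number: the evaluated estimate of sd of u(i)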
* calculate probabilities using logit formula*
*low individual effect -use -1 st dev of u(i)*
* Note single quotes used to access macro contents
nlcom prlo:exp(`refxb'-`ui')/(1+exp(`refxb'-`ui'))
* medium individual effect, u(i)=0
nlcom prmed: exp(`refxb') / (1+exp(`refxb'))
* High individual effects - use +1 st dev of u(i)
nlcom prhi: exp(`refxb'+`ui') / (1+exp(`refxb'+`ui'))
* calculate xb for ref person who has a degree *
local refxb1 _b[_cons] + _b[age]*4 + _b[agesq]*16 + /// 
_b[cohort]*1967 + _b[jbemp]*1 + _b[permanent]+_b[degree]
local ui=e(sigma_u)
* calculate probabilities *
nlcom prlo: exp(`refxb1'-`ui') / (1+exp(`refxb1'-`ui')) 
nlcom prmed: exp(`refxb1') / (1+exp(`refxb1'))
nlcom prhi: exp(`refxb1'+`ui') / (1+exp(`refxb1'+`ui'))
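**************************************************************
* Sketch of the margins alternative mentioned above, for the
* reference person with u(i)=0. It assumes the RE logit is
* still the active estimation result and reuses the covariate
* values plugged into refxb (including age=4 and agesq=16 as
* coded there). The result should match prmed for the
* reference person without a degree.
**************************************************************
margins, predict(pu0) ///
	at(age=4 agesq=16 cohort=1967 married=0 further=0 degree=0 ///
	jbemp=1 permanent=1 nsw=0)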
**************************************************************
* Questions to think about:                                  *
*                                                            *
* (1) Consider the impact of the individual effect (u_i) vs. *
*     education (degree). How successful is the model in     *
*     "explaining" low pay?                                  *
* (2) Using the same method, try exploring the impact of     *
*     other covariates on the probability of low pay.        *
* (3) How would the impact of education have differed, if we *
*     had chosen a baseline individual with different        *
*     characteristics?                                       * 
**************************************************************
*STOP HERE AND THINK!!
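**************************************************************
* A sketch for questions (1) and (3): the change in the
* probability of low pay associated with a degree, with a
* standard error, at two values of the individual effect.
* It assumes the macros refxb, refxb1 and ui defined above are
* still in memory (i.e. the do-file is run in one go) and that
* the RE logit is still the active estimation result.
* degree_med and degree_hi are hypothetical labels.
**************************************************************
nlcom degree_med: invlogit(`refxb1') - invlogit(`refxb')
nlcom degree_hi: invlogit(`refxb1'+`ui') - invlogit(`refxb'+`ui')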
**************************************************************
* We continue with the wage equation. We suspect that the    *
* error terms may be correlated with job tenure.             *
* Various sources of bias are possible:                      *
* - people with higher earning ability tend to hold jobs     *
*   for longer(?). Implies positive correlation between u(i) *
*   and jbemp. Note this is controlled for in the FE model.  *
* - people stay longer in high-paid jobs (high lwage_hr).    *
*   Implies positive correlation between high lwage_hr and   *
*   jbemp.                                                   *
* - people move to good (high paid) jobs. Implies negative   *
*   correlation between high lwage_hr and jbemp.             *
* - Any others?                                              *
* To control for a possible correlation of jbemp with        *
* lwage_hr, we will use instrumental variables. We need      *
* variables which help explain jbemp, but are uncorrelated   *
* with wages (conditional on regressors). Suggested          *
* instruments are hours worked by spouse (spousehr) [zero if *
* not employed or no spouse], preference to stay in          *
* current home (wantstay), and whether renting home (tenant).*
* To allow for the effects to vary over men and women,       *
* interact the instruments with a gender dummy variable.     *
* Based on Ex3, the appropriate model is FE. Before          *
* estimating the model it is necessary to check that the     *
* instruments help explain jbemp (using FE (why?) equation). *
**************************************************************
* Derive extra variables for the model *
* preference to stay in current home *
sort id wave
tab losathl 
replace losathl =. if losathl<0 
drop if losathl==.
capture drop wantstay
recode losathl (6/10=1) (else=0), gen(wantstay)
tab wantstay
*whether renting home*
capture drop tenant
recode hstenur(2=1) (else=0), gen(tenant)
drop if hstenur==.
save "`working'\temp.dta",replace
**************************************************************
* Create a file of spouse information                        *
**************************************************************
use "`working'\longperson_unbal.dta"
keep hhpxid wave hgage hgsex jbhruc edhigh1 esdtl 
tab wave if hhpxid==""
drop if hhpxid==""
rename hhpxid xwaveid
rename hgage spouseage
rename hgsex spousesex
rename jbhruc spousehr
rename edhigh1 spouseedhigh1
rename esdtl spouseesdtl
sort xwaveid wave 
save "`working'\temp2.dta", replace
clear
**************************************************************
* Merge with original dataset to add spouse characteristics  *
* as additional variables                                    *
**************************************************************
use "`working'\temp.dta"
sort xwaveid wave
merge 1:1 xwaveid wave using "`working'\temp2.dta"
tab _merge 
drop if _merge==2
* if no spouse assign value zero *
replace spouseage=0 if _merge==1
replace spousesex=0 if _merge==1
replace spouseesdtl=0 if _merge==1
replace spousehr=0 if _merge==1
replace spouseedhigh1=0 if _merge==1
drop _merge 
* set spousehr=0 if not employed *
replace spousehr=0 if spouseesdtl!=1 & spouseesdtl!=2 & spouseesdtl!=7   
replace spousehr=. if spousehr<0 
drop if spousehr==. //drop the non-responding records from the analysis
* Possible interactions with gender
gen femten=female*tenant
gen femstay=female*wantstay
gen femsphr=female*spousehr
**************************************************************
* Estimate a FE reduced form model for job tenure, including *
* the instrumental variables as covariates. This is the      *
* "1st stage" regression of 2SLS/IV                          *
**************************************************************
xtreg jbemp age agesq cohort married female degree further ///
	permanent nsw tenant femten wantstay femstay spousehr femsphr ///
	 if keeper==1, fe
test tenant femten wantstay femstay spousehr femsphr
test femten femstay femsphr
* Note that xtivreg can display this first-stage regression via the first option:
xtivreg lwage_hr age agesq cohort married female degree further ///
	permanent nsw  ///
	(jbemp=tenant femten wantstay femstay spousehr femsphr) ///
	 if keeper==1, first fe
**************************************************************
* Questions to think about:                                  *
* (1) Good instruments must satisfy a VALIDITY condition -   *
*     i.e. they must have no direct causal impact on the     *
*     dependent variable. Is this plausible for our chosen   *
*     IVs: tenant femten wantstay femstay spousehr femsphr?  *
* (2) Good instruments must also satisfy the RELEVANCE       *
*     condition - they must be strongly correlated with the  *
*     endogenous covariate (jbemp), after allowing for the   *
*     other exogenous covariates. Is this so here?           *
* (3) Note that using lots of weak instruments doesn't help. *
*     We can probably strengthen the 1st-stage regression by *
*     dropping the interaction variables femten, femstay and *
*     femsphr from the instrument list. Re-run the estimates *
*     Does the first stage regression look stronger now?     *
* (4) The validity of the spousehr instrument is especially  *
*     questionable. What happens when you drop it from the   *
*     instrument list?                                       *
**************************************************************
*STOP HERE AND THINK!!	
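**************************************************************
* A sketch for question (3): drop the gender interactions from
* the instrument list and look at the first stage again. The
* full-instrument IV results are stored first and restored at
* the end, so that any later command relying on the active
* estimation results still sees the original model. iv_all is
* a hypothetical name introduced for this sketch.
**************************************************************
estimates store iv_all
xtreg jbemp age agesq cohort married female degree further ///
	permanent nsw tenant wantstay spousehr if keeper==1, fe
test tenant wantstay spousehr
xtivreg lwage_hr age agesq cohort married female degree further ///
	permanent nsw (jbemp=tenant wantstay spousehr) ///
	if keeper==1, first fe
estimates restore iv_all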
**************************************************************
* Now test whether job tenure is, in fact, exogenous in the  *
* FE model. We can use a Hausman test for this, assuming our *
* instruments are valid. Recall we need an estimator that is *
* consistent under both H0 (tenure is exogenous) and H1      *
* (tenure is endogenous); and we need a second estimator     *
* that is efficient under H0, but inconsistent under H1.     *
**************************************************************
* consistent under H0 and H1
estimates store ivfe
* efficient under H0; inconsistent under H1
xtreg lwage_hr age agesq cohort married female degree further ///
	permanent nsw jbemp if keeper==1, fe
estimates store fe
hausman ivfe fe
**************************************************************
* What do you conclude? What is your best estimate of the    *
* tenure effect (with confidence interval)?                  *
**************************************************************
**************************************************************
* Now we are going to illustrate the Hausman Taylor Method.  *
* Its main attraction is to allow some time-invariant        *
* characteristics to be correlated with u(i). To identify    *
* their coefs, we require at least as many time-varying      *
* characteristics which are uncorrelated with u(i). We are   *
* going to modify the wage equation to include the highest   *
* education someone ever attained, and try to estimate the   *
* returns to education based on this variation across        *
* individuals. Estimate returns using RE model for comparison*
**************************************************************
egen everdeg=max(degree), by(id) //ever got degree
egen everfur=max(further), by(id) //ever got further edu
replace everfur=0 if everdeg==1   //replace with highest (missing everdeg left alone)
assert everfur+everdeg<2   if everdeg != . //check only got the highest
xtreg lwage_hr age agesq cohort married female everdeg everfur ///
	permanent nsw jbemp if keeper==1, re
**************************************************************
* Assume that age is uncorrelated with u(i), but that all    *
* other time-varying characteristics are correlated with     *
* u(i). Does this model satisfy the identification condition?*
* Check how strongly age and age squared are correlated      *
* with education.                                            *
**************************************************************
correlate age agesq further degree
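* Optional extra check (a sketch): the HT model below uses the
* time-invariant measures everdeg and everfur rather than degree
* and further, so the correlations with those variables, on the
* estimation sample, may be the more relevant ones.
correlate age agesq everdeg everfur if keeper==1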
**************************************************************
* Do these correlations suggest that age and age squared are *
* good instruments?                                          *
* Estimate HT model                                          *
**************************************************************
xthtaylor lwage_hr age agesq cohort married female everdeg everfur ///
	permanent nsw jbemp if keeper==1, ///
endog(married everdeg everfur jbemp permanent nsw)
**************************************************************
* What do you conclude?                                      *
* Other time-varying characteristics can be included in the  *
* instrument set, but only if we are satisfied they are not  *
* correlated with the individual effect. Otherwise estimates *
* are biased. For example, assume that living in NSW is      *
* not correlated with u(i) (is this plausible?):             *
**************************************************************
correlate age agesq nsw further degree
xthtaylor lwage_hr age agesq cohort married female ///
	    everdeg everfur jbemp permanent nsw if keeper==1, ///
	    endog(married everdeg everfur jbemp  permanent)
**************************************************************
* A difficulty with HT is finding good instruments from      *
* within the model which are also uncorrelated with u(i).    *
* You might like to consider other instruments.              *
**************************************************************
**************************************************************
* An alternative strategy would be use external instruments  *
* and a conventional RE model. But all regressors that are   *
* not instrumented must be uncorrelated with u(i).           *
* Parental background measures are sometimes used to         *
* instrument educational attainment (assumed correlated with *
* children's education but not with their wage). Examine the *
* dummy variables for father's one digit occupation. What do *
* you notice?                                                *
**************************************************************
* derive the dummy variables for father's occupation *
replace fmfo62 =. if fmfo62<0 
recode fmfo62(10/19=1) (else=0), gen(pamanager)
recode fmfo62(20/29=1) (else=0), gen(paprof)
recode fmfo62(30/39=1) (else=0), gen(patechtrade)
recode fmfo62(40/49=1) (else=0), gen(pacomserv)
recode fmfo62(50/59=1) (else=0), gen(paclerical)
recode fmfo62(60/69=1) (else=0), gen(pasales)
recode fmfo62(70/79=1) (else=0), gen(pamachop)
recode fmfo62(80/89=1) (else=0), gen(palabour)
**************************************************************
* To deal with missing values, create a dummy variable to    *
* indicate father's occupation missing.                      *
**************************************************************
gen pamiss=1 if pamanager==0 & paprof==0 & patechtrade==0 & pacomserv==0 ///
	 & paclerical==0 & pasales==0 & pamachop==0 ///
	 & palabour==0
replace pamiss=0 if pamiss==.
xtsum pamanager paprof patechtrade ///
	pacomserv paclerical pasales pamachop palabour
* Estimate by RE IV:
xtivreg lwage_hr age agesq cohort married female jbemp permanent ///
    (everdeg everfur=pamanager paprof patechtrade ///
	pacomserv paclerical pasales pamachop ///
	palabour pamiss) if keeper==1,  re
est store ivre
**************************************************************
* Questions to think about:                                  *
* (1) re-run xtivreg using the "first" option. Look at the   *
*     first stage regressions. Are these instruments better  *
*     than the internal ones used by HT?                     *
* (2) Assuming the instruments are OK, is education          *
*     exogenous here? Use a Hausman test to compare the IV RE*
*     and plain RE regressions. What is your conclusion?     *
**************************************************************
*STOP HERE AND THINK!!
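**************************************************************
* A sketch for questions (1) and (2), not the official
* solution. It relies on the results stored as ivre above;
* re_noiv is a hypothetical name for the RE model that treats
* education as exogenous. hausman expects the consistent
* estimator first and the efficient one second. The statistic
* may be negative or undefined in finite samples.
**************************************************************
* (1) same IV RE model, showing the first-stage regressions
xtivreg lwage_hr age agesq cohort married female jbemp permanent ///
	(everdeg everfur=pamanager paprof patechtrade ///
	pacomserv paclerical pasales pamachop ///
	palabour pamiss) if keeper==1, re first
* (2) RE without instrumenting, then the Hausman comparison
xtreg lwage_hr age agesq cohort married female jbemp permanent ///
	everdeg everfur if keeper==1, re
estimates store re_noiv
hausman ivre re_noiv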
****************************************************************
* Save the new data set for day 3                              *
****************************************************************
save "`working'\longperson_unbal_3.dta", replace
set more on 
log close
exit
**************************************************************
* FURTHER MATERIAL - A MORE SOPHISTICATED IV COMMAND         *
*                                                            *
* The "canned" Stata routine for panel IV is xtivreg. We     *
* are now going to install a more advanced command which     *
* provides more diagnostic statistics (but is for FE models  *
* only).                                                     *
**************************************************************
ssc install ranktest, replace                           
ssc install ivreg2, replace                             
ssc install xtivreg2, replace                           
ssc describe ivreg2                                     
**************************************************************
* Questions to think about again:                            *
* (1) Good instruments must satisfy a VALIDITY condition -   *
*     i.e. they must have no direct causal impact on the     *
*     dependent variable. Is this plausible for our chosen   *
*     IVs: tenant femten wantstay femstay spousehr femsphr?  *
*     What does the Sargan test say about this?              *
* (2) Good instruments must also satisfy the RELEVANCE       *
*     condition - they must be strongly correlated with the  *
*     endogenous covariate (jbemp), after allowing for the   *
*     other exogenous covariates. Is this so here?           *
* (3) Note that using lots of weak instruments doesn't help. *
*     We can probably strengthen the 1st-stage regression by *
*     dropping the interaction variables femten, femstay and *
*     femsphr from the instrument list. Re-run the estimates *
*     Does the first stage regression look stronger now?     *
*     What is the outcome of the Sargan test for instrument  *
*     validity?                                              *
* (4) The validity of the spousehr instrument is especially  *
*     questionable. What happens when you drop it from the   *
*     instrument list?                                       *
**************************************************************
xtivreg2 lwage_hr age agesq cohort married female degree ///
	further  permanent nsw ///
	(jbemp=tenant femten wantstay femstay spousehr femsphr) ///
	 if keeper==1, fe
xtivreg2 lwage_hr age agesq cohort married female degree ///
	further  permanent nsw ///
	(jbemp= tenant wantstay spousehr) if keeper==1, fe first
xtivreg2 lwage_hr age agesq cohort married female degree ///
	further  permanent nsw ///
	(jbemp= tenant wantstay) if keeper==1, fe first
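**************************************************************
* Optional sketch: xtivreg2 can also report an endogeneity
* test for jbemp and cluster-robust standard errors. This
* assumes the installed version of xtivreg2 accepts the
* endog() and cluster() options; check help xtivreg2 first.
* With clustering, the overidentification test reported is
* Hansen's J rather than the Sargan statistic.
**************************************************************
xtivreg2 lwage_hr age agesq cohort married female degree ///
	further  permanent nsw ///
	(jbemp= tenant wantstay) if keeper==1, fe endog(jbemp) cluster(id)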