Weights
Cross–Sectional Weights
Wave 1
In wave 1, we essentially had a complex cross–sectional survey. The initial (or design) weights are derived from the probability of selecting the households into the sample. These household weights are initially adjusted according to information collected about all selected households (both responding and non–responding) and further adjusted so that weighted household estimates from the HILDA Survey match several known household–level benchmarks.
The person–level weights are based on the household–level weights, with adjustments made based on information collected about all the people listed in the responding households. These weights are also adjusted to ensure that the weighted person estimates match several known person–level benchmarks.
More information about the weighting procedure can be found in Watson and Fry (2002). See the section below for a description of the benchmarks as these have been modified after Release 1.
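As a rough illustration of the two ideas above (a design weight derived from the selection probability, then an adjustment so that weighted totals match known benchmarks), the following Python sketch uses simple post-stratification. This is an assumption made for illustration only: it is not the procedure documented in Watson and Fry (2002), and the cell names, probabilities and benchmark totals are invented.

```python
from collections import defaultdict


def design_weight(selection_probability):
    """Design weight: the inverse of the probability of selection."""
    if not 0 < selection_probability <= 1:
        raise ValueError("selection probability must be in (0, 1]")
    return 1.0 / selection_probability


def poststratify(records, benchmarks):
    """Scale weights within each cell so weighted totals match the benchmarks.

    records    : list of (cell, weight) pairs, one per responding unit
    benchmarks : dict mapping cell -> known population total (hypothetical)
    """
    weighted_totals = defaultdict(float)
    for cell, weight in records:
        weighted_totals[cell] += weight
    return [
        (cell, weight * benchmarks[cell] / weighted_totals[cell])
        for cell, weight in records
    ]


# Toy data: three households in two benchmark cells (all values invented).
households = [
    ("metro", design_weight(0.001)),
    ("metro", design_weight(0.002)),
    ("regional", design_weight(0.0005)),
]
adjusted = poststratify(households, {"metro": 1800.0, "regional": 2500.0})
```

After the adjustment, the weights in each cell sum to that cell's benchmark, which is the sense in which weighted estimates "match" the benchmarks.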
Wave 2 Onwards
From wave 2 onwards, the ‘selection’ of the sample is dependent on the wave 1 responding sample and the household and individual attrition after wave 1. The cross-sectional weights for wave 2 onwards opportunistically include temporary members in the sample (i.e., those people who are part of the sample only because they currently live with a continuing sample member). The underlying probability of selection for these households is amended to account for the various pathways from wave 1 into the relevant wave household. Following this, non-response adjustments are made which require within-sample modelling of non-response probabilities and benchmarking to known population estimates at both the household and person level.
The weighting process for wave 2 onwards is detailed in Watson (2004b).[29] See the section below for a description of the benchmarks, as these have been modified after Release 2.
Longitudinal Weights
By comparison, the construction of the longitudinal weights is more straightforward and only includes an adjustment for attrition and benchmarking back to the initial wave characteristics. The longitudinal weights are described in Watson (2004b) but see the following section for a description of the benchmarks used.
We have provided longitudinal weights for the balanced panel of responding persons or enumerated persons from every wave to every other wave and for the balanced panel of any combination of a pair of waves. These weights adjust for attrition from the initial wave and are benchmarked back to the key characteristics of the initial wave. For instance, if you were interested in a panel of respondents from waves 2 through 6, the weight provided for this panel would adjust for attrition from the balanced panel from wave 2 to 6 and would ensure key characteristics of the wave 2 population are matched.
Benchmarks
The benchmarks used in the weighting process are listed in Table 4.24.[30] The changes made to the benchmarking process originally documented in Watson (2004b) include:
- The household and enumerated person weights are determined at the same time. This is known as integrated weighting. The weights are adjusted to the household benchmarks at the same time as they are adjusted to the enumerated person benchmarks. The household weight will be the same as the enumerated weight for each person in the household, resulting in identical estimates where the same concept can be determined from the two files.[31]
- Due to the demands placed on the weights through the integrated weighting process, some of the benchmarks used have been simplified.
- Following some concerns about the representativeness of the sample, additional benchmarks on marital status and household composition have been included.[32]
- The person benchmarks for State, part of State, sex and age are from the Estimated Resident Population figures produced by the ABS based on the 2001 Census and the 2006 Census, updated for births, deaths, immigration, emigration and interstate migration. The household benchmarks are derived from these person benchmarks by the ABS.[33] The person benchmarks for household composition are derived from the household benchmarks.
- The person benchmarks for labour force status and marital status come from the ABS Labour Force Survey.
- The very remote parts of New South Wales, Queensland, South Australia, Western Australia and the Northern Territory have been excluded from the benchmarks, which is in line with the practice adopted in similar large-scale surveys run by the ABS. As a result, a small number of cases may have zero weights.[34]
Note also that the benchmarks exclude people living in non-private dwellings, so people who move into these dwellings after wave 1 are given zero cross-sectional weights.
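The consistency property of integrated weighting described above can be checked on toy data. The sketch below uses invented household identifiers, sizes and weights. It estimates the number of people living in two-person households in two ways, once from a household-level view and once from a person-level view, and the two agree because every person carries their household's weight.

```python
# Toy household file: household id -> weight and size (all values invented).
household_weights = {"h1": 950.0, "h2": 1100.0, "h3": 1020.0}
household_sizes = {"h1": 2, "h2": 3, "h3": 2}

# Toy enumerated person file: one (household id, household size, weight)
# record per person. Under integrated weighting each person's weight
# equals the weight of the household they live in.
persons = [
    (hh, household_sizes[hh], household_weights[hh])
    for hh in household_weights
    for _ in range(household_sizes[hh])
]

# Method 1 (household file): estimated two-person households, times two.
from_households = 2 * sum(
    w for hh, w in household_weights.items() if household_sizes[hh] == 2
)

# Method 2 (enumerated person file): sum of person weights in such households.
from_persons = sum(w for _, size, w in persons if size == 2)
```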
Table 4.24: Benchmarks used in weighting
Weight type | Household weights | Enumerated person weights | Responding person weights |
Cross–sectional weights | Determined jointly with enumerated person weights | Determined jointly with household weights | |
Longitudinal weights | Not applicable | | |
Replicate Weights
Replicate weights have been provided for users to calculate standard errors that take into account the complex sample design of the HILDA Survey. These weights can be used by the SAS GREGWT macro, the STATA ‘svy jackknife’ commands (more detail is provided below in the section on calculating standard errors), or you can write your own routine to use these weights. Weights for 45 replicate groups are provided.
Weights Provided on the Data Files
Table 4.25 provides a list of the weights provided on the data files together with a description of those weights. The longitudinal weights provided on the enumerated and responding person files are the ones you are most likely to use, though other longitudinal weights are provided on the Longitudinal Weights File.
Irrespective of the modifications made in how the weights are constructed, some changes are expected to the weights with each new release. There are three reasons for this. Firstly, corrections may be made to age and sex variables when these are confirmed with individuals in subsequent wave interviews. Secondly, the benchmarks are updated from time to time. Thirdly, duplicate or excluded people in the sample may be identified after the release (very occasionally).
Table 4.25: Weights
File | Weights | Description |
Household File | _hhwth | The household weight is the cross–section population weight for all households responding in the relevant wave. Note the sum of these household weights for wave 1 is approximately 7.4 million. |
Household File | _hhwths | This is the cross–section household population weight rescaled to the sum of the sample size for the relevant wave (i.e. 7682 responding households in wave 1). Use this weight when the statistical package requires the weights to sum to the sample size. |
Household File | _hhwte01 to _hhwte16 | The enumerated person weights are provided on both the household file and the enumerated person file. See description below. |
Household File | _rwh1 to _rwh45 | Cross–section household population replicate weights. |
Enumerated Person File | _hhwte | The enumerated person weight is the cross–section population weight for all people who are usual residents of the responding households in the relevant wave (this includes children, non–respondents and respondents). The sum of these enumerated person weights for wave 1 is 19.0 million. |
Enumerated Person File | _hhwtes | This is the cross–section enumerated person population weight rescaled to the sum of the sample size for the relevant wave (i.e. for wave 1, 19,914 enumerated persons). Use this weight when the statistical package requires the weights to sum to the sample size. |
Enumerated Person File | _lnwte | This longitudinal enumerated person weight is the longitudinal population weight for all people who were enumerated (i.e. in responding households) each wave from wave 1 to the wave where this variable resides. This weight applies to the following people in responding households: children, non-respondents, intermittent respondents, and full respondents. For example, blnwte is for the balanced panel of enumerated persons from wave 1 to 2. These variables are also on the Longitudinal Weights File, but are named differently: wlea_b, wlea_c, wlea_d, etc. |
Enumerated Person File | _rwe1 to _rwe45 | Cross–section enumerated person population replicate weights. |
Enumerated Person File | _rwlne1 to _rwlne45 | Longitudinal enumerated person population replicate weights. |
Responding Person File | _hhwtrp | The responding person weight is the cross–section population weight for all people who responded in the relevant wave (i.e. they provided a personal interview). The sum of these responding person weights for wave 1 is 15.0 million. |
Responding Person File | _hhwtrps | This is the cross–section responding person population weight rescaled to sum to the number of responding persons in the relevant wave (i.e. 13,969 in wave 1). Use this weight when the statistical package requires the sum of the weights to be the sample size. |
Responding Person File | _lnwtrp | This longitudinal responding person weight is the longitudinal population weight for all people responding (i.e. provided an interview) each wave from wave 1 to the wave where this variable resides. For example, blnwtrp is for the balanced panel of respondents from wave 1 to 2. These variables are also on the Longitudinal Weights File, but are named differently: wlra_b, wlra_c, wlra_d, etc. |
Responding Person File | _rwrp1 to _rwrp45 | Cross–section responding person population replicate weights. |
Responding Person File | _rwlnr1 to _rwlnr45 | Longitudinal responding person population replicate weights. |
Longitudinal Weights File | wlet1_tn | Longitudinal enumerated person weight for the balanced panel of all people who were enumerated (i.e. part of a responding household) each wave from wave t1 to tn. Wave letters are used in place of t1 and tn. For example, wlec_f is the longitudinal enumerated person weight for the balanced panel from wave 3 to 6. |
Longitudinal Weights File | wlet1tn | Longitudinal enumerated person weight for the balanced panel of all people who were enumerated (i.e. part of a responding household) in wave t1 and tn. Wave letters are used in place of t1 and tn. The paired longitudinal weights do not restrict individuals in any way based on their response status in waves between t1 and tn. For example, wlecf is the longitudinal enumerated person weight for the balanced panel of enumerated people in wave 3 and 6 (they may or may not have been enumerated in other waves). |
Longitudinal Weights File | wlrt1_tn | Longitudinal responding person weight for the balanced panel of all people who were interviewed each wave from wave t1 to tn. Wave letters are used in place of t1 and tn. For example, wlrc_f is the longitudinal responding person weight for the balanced panel of respondents from wave 3 to 6. |
Longitudinal Weights File | wlrt1tn | Longitudinal responding person weight for the balanced panel of all people who were interviewed in wave t1 and tn. Wave letters are used in place of t1 and tn. The paired longitudinal weights do not restrict individuals in any way based on their response status in waves between t1 and tn. For example, wlrcf is the longitudinal responding person weight for the balanced panel of respondents in wave 3 and 6 (they may or may not have been responding in other waves). |
Longitudinal Replicate Weights File [1] | wlet1_tn1 to wlet1_tn45 | Longitudinal enumerated person replicate weights for the balanced panel from t1 to tn. |
Longitudinal Replicate Weights File [1] | wlet1tn1 to wlet1tn45 | Longitudinal enumerated person replicate weights for the balanced panel for t1 and tn. |
Longitudinal Replicate Weights File [1] | wlrt1_tn1 to wlrt1_tn45 | Longitudinal responding person replicate weights for the balanced panel from t1 to tn. |
Longitudinal Replicate Weights File [1] | wlrt1tn1 to wlrt1tn45 | Longitudinal responding person replicate weights for the balanced panel for t1 and tn. |
[1] The Longitudinal Replicate Weights File is available on request. Please email us.
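The naming pattern for the Longitudinal Weights File in Table 4.25 can be expressed as a small helper. The Python sketch below is illustrative only: the function names are hypothetical, and the mapping of wave numbers to letters (1 to a, 2 to b, and so on, capped here at 26 waves) is inferred from the examples given in the table.

```python
import string


def wave_letter(wave):
    """Map a wave number to its letter: 1 -> 'a', 2 -> 'b', and so on."""
    if not 1 <= wave <= 26:
        raise ValueError("wave must be between 1 and 26")
    return string.ascii_lowercase[wave - 1]


def longitudinal_weight_name(first_wave, last_wave, responding, paired=False):
    """Build a Longitudinal Weights File variable name.

    responding : True for responding person weights (wlr...),
                 False for enumerated person weights (wle...)
    paired     : True for the pair-of-waves weight (no underscore),
                 False for the every-wave balanced panel (underscore)
    """
    if first_wave >= last_wave:
        raise ValueError("first_wave must be earlier than last_wave")
    prefix = "wlr" if responding else "wle"
    separator = "" if paired else "_"
    return prefix + wave_letter(first_wave) + separator + wave_letter(last_wave)
```

For example, `longitudinal_weight_name(3, 6, responding=False)` gives `wlec_f`, the balanced panel of enumerated persons from wave 3 to 6, while `longitudinal_weight_name(3, 6, responding=True, paired=True)` gives `wlrcf`.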
Advice on Using Weights
Which Weight to Use
For some users, the array of weights on the dataset may seem confusing. This section provides examples of when it would be appropriate to use the different types of weights.
If you want to make inferences about the Australian population from frequencies or cross–tabulations of the HILDA sample then you will need to use weights. If you are only using information collected during the wave 4 interviews (either at the household level or person level) then you would use the wave 4 cross–section weights. Similarly, if you are only using wave 3 information, then you would use the wave 3 cross–section weights, and so on. If you want to infer how people have changed across the five years between waves 1 and 6, then you would use the longitudinal weights for the balanced panel from waves 1 to 6.
The following five examples show how the various weights may be used to answer questions about the population:
- What proportion of households rent in 2007? We would use the cross–section household weight for wave 7 and obtain a weighted estimate of proportion of households that were renting as at the time of interview.
- How many people live in poor households in 2002? We are interested in the number of individuals with a certain household characteristic, such as having low equivalised disposable household incomes. We would use the cross-section enumerated person weight for wave 2 and count the number of enumerated people in households with the poorest 10 per cent of equivalised household incomes. (We do not need to restrict our attention to responding persons only as total household incomes are available for all households after the imputation process. We also want to include children in this analysis and not just limit our analysis to those aged 15 years or older.)
- What is the average salary of professionals in 2003? This is a question that can only be answered from the responding person file using the cross–section responding person weight for wave 3. We would identify those reportedly working in professional occupations and take the weighted average of their wages and salaries.
- For how many years have people been poor between 2001 and 2006? We might define the ‘poorest’ 10 per cent of households as having the lowest equivalised household incomes in each wave. We could then calculate how many years people were poor between wave 1 and wave 6, and apply the longitudinal enumerated person weight (flnwte or equivalently wlea_f) for those people enumerated every wave between wave 1 and 6.
- What proportion of people have changed their employment status between 2002 and 2007? This question can only be answered by considering the responding persons in both waves. We would use the longitudinal responding person weight for the pair of waves extracted from the Longitudinal Weights File (wlrbg) and construct a weighted cross–tabulation of the employment status of respondents in wave 2 against the employment status of respondents in wave 7.
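The first two examples above reduce to two basic operations: a weighted proportion and a weighted count. The Python sketch below shows both on invented records. The variable name `ghhwth` is inferred from the `_hhwth` pattern with the wave 7 letter, the `renting` flag is hypothetical, and the helper function names are illustrative.

```python
def weighted_proportion(records, weight_key, flag_key):
    """Share of the weighted population for which the flag is true."""
    total = sum(r[weight_key] for r in records)
    flagged = sum(r[weight_key] for r in records if r[flag_key])
    return flagged / total


def weighted_count(records, weight_key, flag_key):
    """Estimated population count: the sum of weights where the flag is true."""
    return sum(r[weight_key] for r in records if r[flag_key])


# Invented wave 7 household records (weights and tenure are made up).
households = [
    {"ghhwth": 1000.0, "renting": True},
    {"ghhwth": 1500.0, "renting": False},
    {"ghhwth": 500.0, "renting": True},
    {"ghhwth": 1000.0, "renting": False},
]

share_renting = weighted_proportion(households, "ghhwth", "renting")
households_renting = weighted_count(households, "ghhwth", "renting")
```

The same two functions apply to person-level questions by passing enumerated or responding person records together with the matching person weight.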
When constructing regression models, the researcher needs to be aware of the sample design and non–response issues underlying the data and will need to take account of this in some way.
Calculating Standard Errors
The HILDA Survey has a complex survey design that needs to be taken into account when calculating standard errors. It is:
- clustered – 488 areas were originally selected from which households were chosen and people are clustered within households;
- stratified – the 488 areas were selected from a frame of areas stratified by State and part of State; and
- unequally weighted – the households and individuals have unequal weights due to some irregularities in the selection of the sample in wave 1 and the non–random non–response in wave 1 and the non–random attrition in waves 2 to 4.
Some options available for the calculation of appropriate standard errors and confidence intervals include:
- Standard Error Tables – Based on the wave 1 data, approximate standard errors have been constructed for a range of estimates (see Horn (2004)). Similar tables for waves 2 to 4 have not been produced.
- Use of the SPSS add-on module "SPSS Complex Samples" (available from SPSS Release 12). The add-on module produces standard errors via the Taylor Series approximation. SPSS does not have a built-in feature to handle replicate weights.
- Use of SAS procedures SURVEYMEANS, SURVEYREG, SURVEYFREQ and SURVEYLOGISTIC (the last two only in version 9 onwards). The SAS procedures produce standard errors via the Taylor Series approximation. SAS does not have a built-in feature to handle replicate weights; however, a SAS macro has been provided by one of our users in the program library.
- Use of GREGWT macro in SAS – Some users within FaHCSIA, ABS and other organisations may have access to the GREGWT macro that can be used to construct various population estimates. The macro uses the jackknife method to estimate standard errors using the replicate weights.
- Use of ‘svy’ commands in STATA – Stata has a set of survey commands that deal with complex survey designs. Using the ‘svyset’ commands, the clustering, stratification and weights can be assigned. You can request the standard errors be calculated using the Jackknife method using ‘svy jackknife’ and the replicate weights. Various statistical procedures are available within the suite of ‘svy’ commands including means, proportions, tabulations, linear regression, logistic regression, probit models and a number of other commands.
A User Guide for calculating the standard errors in HILDA is provided as part of our technical paper series, see Hayes (2008). Example code is provided in SAS, SPSS and STATA.
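For readers who want to write their own routine, the following Python sketch shows one way to compute a jackknife standard error for a weighted mean from a set of replicate weights. It assumes the common delete-a-group jackknife factor of (G − 1)/G, where G is the number of replicate groups (45 in the HILDA files). That factor is an assumption made here, so check Hayes (2008) for the exact specification before relying on it. The function names and toy data are illustrative.

```python
import math


def weighted_mean(values, weights):
    """Weighted mean of the values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def jackknife_se(values, full_weights, replicate_weights):
    """Jackknife standard error of a weighted mean.

    replicate_weights : list of G weight lists, one per replicate group
    Assumes the (G - 1) / G delete-a-group factor (see the lead-in).
    """
    g = len(replicate_weights)
    full_estimate = weighted_mean(values, full_weights)
    squared_deviations = sum(
        (weighted_mean(values, rep) - full_estimate) ** 2
        for rep in replicate_weights
    )
    return math.sqrt((g - 1) / g * squared_deviations)


# Toy data: three units, each forming its own replicate group. Each set of
# replicate weights zeroes one unit and scales up the remaining two.
values = [10.0, 20.0, 30.0]
full_weights = [1.0, 1.0, 1.0]
replicate_weights = [
    [0.0, 1.5, 1.5],
    [1.5, 0.0, 1.5],
    [1.5, 1.5, 0.0],
]
se = jackknife_se(values, full_weights, replicate_weights)
```

With real data, `full_weights` would be the main weight (for example a responding person weight) and `replicate_weights` the 45 matching replicate weight columns.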
To assist you in the calculation of appropriate standard errors, the wave 1 area (cluster) and proxy stratification variables have been included on the master file. These are listed in Table 4.26 and need to be specified for standard error calculations that use the Taylor Series approximation method, as suggested above. Any new entrants to the household are assigned the same sample design information as the permanent sample member they join. As of Release 6, the proxy stratification variable (ahhstrat) has replaced major statistical region (ahhmsr) on the master file as the variable to be used in the Taylor Series approximation method. The new stratification variable is essentially a collapsed area unit variable that approximates the effect of both the systematic selection and stratification of the survey selection better than only using the variable for the major statistical region.
Table 4.26: Sample design variables
Variable | Description | Design element |
ahhraid | DV: randomised area id | Cluster |
ahhstrat | DV: Wave 1 Strata | Proxy stratification |
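To show how the cluster and stratum variables enter a Taylor Series (linearisation) calculation, the Python sketch below computes the standard error of a weighted total using the standard with-replacement approximation: within each stratum, the variability of the weighted cluster totals is scaled by n/(n − 1), where n is the number of clusters in the stratum. This is a simplified sketch, not the routine used by SAS, SPSS or Stata. It ignores finite population corrections, and the function name and toy values are invented.

```python
import math
from collections import defaultdict


def linearized_se_of_total(records):
    """Standard error of a weighted total under a stratified cluster design.

    records : iterable of (stratum, cluster, weight, value) tuples, where
              stratum and cluster play the roles of ahhstrat and ahhraid
    """
    # Weighted total of the analysis variable within each cluster.
    cluster_totals = defaultdict(lambda: defaultdict(float))
    for stratum, cluster, weight, value in records:
        cluster_totals[stratum][cluster] += weight * value

    variance = 0.0
    for clusters in cluster_totals.values():
        n = len(clusters)
        if n < 2:
            raise ValueError("each stratum needs at least two clusters")
        mean_total = sum(clusters.values()) / n
        # Between-cluster variability within the stratum.
        variance += n / (n - 1) * sum(
            (total - mean_total) ** 2 for total in clusters.values()
        )
    return math.sqrt(variance)


# Toy data: two strata with two clusters each (all values invented).
records = [
    (1, "A", 10.0, 1.0),
    (1, "A", 10.0, 2.0),
    (1, "B", 10.0, 5.0),
    (2, "C", 20.0, 1.0),
    (2, "D", 20.0, 2.0),
]
se_total = linearized_se_of_total(records)
```

Means, proportions and regression coefficients need an extra linearisation step on top of this, which is what the survey procedures listed above handle.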
Also, a few users may be interested in the sample design weight in wave 1 before any benchmark or non-response adjustments have been made. This is available on the household file as ahhwtdsn.
Endnotes:
29 | While this paper is written in relation to the wave 2 weighting, the process in later waves follows the same methodology. |
30 | We thank the Demography Section and the Labour Force Estimates team from the Australian Bureau of Statistics for the provision of the benchmarks used in the weighting process. |
31 | For example, the number of people living in a household with two people can be derived by two methods. Firstly, this can be calculated from the household file by estimating the number of two person households and multiplying by two. Secondly, it can be estimated from the enumerated file by summing the weights of people living in two person households. |
32 | An occupation benchmark was included from Release 4 to 6, but this was later removed following concerns about the occupation coding as outlined by Watson and Summerfield (2009). |
33 | Due to updates to the household propensities used by the ABS to create the household benchmarks, the total number of households based on the 2006 Census is quite different from that based on the 2001 Census. For example, the number of households in Australia in September 2001 based on the 2001 Census was 7.43 million, whereas the corresponding number based on the 2006 Census was 7.32 million. In order to minimise the impact on our estimates caused by changes to the benchmarks, an incremental combination of the two sets of household benchmarks was taken. |
34 | This stemmed from a change in the benchmarks available from the ABS to align with the remoteness area classification rather than a ‘sparsely settled’ definition. |