Missing data

Missing data (MD) are data that were planned to be collected but could not be obtained in the end. Their causes are multiple, they concern all research fields, and their presence in a database can be considered the rule, not the exception. The first consequence of MD is a reduced sample size, implying a loss of statistical power. Moreover, since MD rarely occur in a completely random fashion, they generally also imply biased coefficient estimates and reduced variance, which in turn leads to confidence intervals that are too narrow and to an inflated probability of rejecting the null hypothesis in statistical tests (Molenberghs et al., 2014).
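
The consequences listed above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions: the variable, its distribution, and the missingness probabilities (0.6 above the mean, 0.1 below) are hypothetical choices made for illustration, and the function name simulate_complete_case_bias is not taken from any library.

```python
import random
import statistics


def simulate_complete_case_bias(n=10_000, seed=42):
    """Draw n values, then delete large values more often than small ones."""
    rng = random.Random(seed)
    full = [rng.gauss(50.0, 10.0) for _ in range(n)]
    # Hypothetical non-random mechanism: a value above 50 goes missing with
    # probability 0.6, a value at or below 50 with probability 0.1.
    observed = [x for x in full if rng.random() > (0.6 if x > 50.0 else 0.1)]
    return full, observed


full, observed = simulate_complete_case_bias()

# Sample size before and after the missingness mechanism is applied.
print("n:", len(full), len(observed))
# Mean and standard deviation on all data versus on the observed data only.
print("mean:", round(statistics.mean(full), 2), round(statistics.mean(observed), 2))
print("sd:", round(statistics.stdev(full), 2), round(statistics.stdev(observed), 2))
```

Under these assumptions, an analysis restricted to the observed values works with fewer cases, underestimates the mean, and shows less spread than the full data, which is the pattern described in the paragraph above.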

In a survey, missing data can be partial (only part of an individual's answers is missing) or complete (all information is missing for this individual). In longitudinal designs especially, a respondent who drops out of the study after a given wave will have only complete MD in all subsequent waves. This is a very important problem for the analysis of vulnerability, since research shows that vulnerable people tend to drop out of longitudinal surveys more frequently than other people do, leading to an under-representation of such people and to an underestimation of their proportion in the general population (Rothenbühler & Voorpostel, 2016).

In many situations, MD are apparent (we see in the database that the answer to a question is missing), but sometimes we cannot know whether an answer is missing or not. A good example is provided by retrospective data collected through a life history calendar. In such a survey design, every effort is made to support participants' recall, but we can never be certain that all relevant information was collected (Morselli et al., 2016). Therefore, MD can be very complicated to identify, and hence to treat. Finally, MD are sometimes planned in advance. For example, to reduce the burden on participants in a longitudinal design, only a subset of participants may be surveyed in each wave, with the remainder considered missing. In this case, the MD process can be, under certain conditions, considered random and explicitly taken into account during the statistical analysis phase (Brandmaier et al., 2020; Rhemtulla & Little, 2012).
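
A planned missing data design of the kind just described can be sketched as a random assignment of participants to waves. This is an illustrative assumption, not the procedure of the cited authors: the function name planned_missing_schedule, the number of waves, and the number of waves per person are all hypothetical. The point is only that the researcher's randomization, rather than any characteristic of the respondent, decides which observations are missing.

```python
import random


def planned_missing_schedule(participant_ids, n_waves, waves_per_person, seed=0):
    """Randomly choose, for each participant, the waves in which they are surveyed."""
    rng = random.Random(seed)
    all_waves = range(1, n_waves + 1)
    return {
        pid: sorted(rng.sample(all_waves, waves_per_person))
        for pid in participant_ids
    }


# Hypothetical design: 6 participants, 4 waves, each person surveyed in 3 waves.
schedule = planned_missing_schedule(range(1, 7), n_waves=4, waves_per_person=3)
for pid, waves in schedule.items():
    skipped = sorted(set(range(1, 5)) - set(waves))
    print(f"participant {pid}: surveyed in waves {waves}, planned missing in {skipped}")
```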

With the exception of cases where they have been planned in advance, every effort must be made to prevent missing data, and if missing data do appear, they must be treated with an appropriate method. Avoiding missing data includes working on the quality and accuracy of the data collection tool, repeatedly contacting non-respondents, using incentives, and mixing data collection modes, i.e., allowing survey participants to choose to respond by, for example, mail, telephone, portable devices, or the Internet (Stähli et al., 2016). A well-thought-out research design and well-designed data collection tools can drastically reduce the number of MD. However, it is difficult to completely avoid MD when collecting survey data, so they have to be treated a posteriori. Two approaches are currently considered state of the art: likelihood-based methods and multiple imputation. Note that, unfortunately, it is still common practice not to treat MD at all and to simply perform statistical computations either on the available data or only on observations with complete data. However, this approach should be avoided, since it generally produces inaccurate and non-representative results (Berchtold, 2019).

Likelihood-based methods (Little & Rubin, 2019) allow for unbiased estimates of parameters of interest, but they rely on strong hypotheses regarding data distribution that cannot be met by all data types and all statistical models. Such methods are standard tools in, e.g., structural equation modelling, but they are barely applicable in sequence analysis, where the variables of interest are often measured on a nominal scale. Moreover, these methods rely on the appropriateness of the underlying model of analysis, which is a difficult assumption to test. Multiple imputation (Rubin, 1987) consists of replacing each MD with k different plausible values, thereby creating k complete datasets, called replications. Statistical models are estimated independently on each replication, and the results are then aggregated, with appropriate formulae, into a single final solution. When both likelihood-based methods and multiple imputation are possible, the former is generally preferred, given its easier implementation. However, multiple imputation is applicable to a wider range of situations since it is based on weaker assumptions (van Buuren, 2018). Moreover, multiple imputation may produce smaller estimation variance, given that the analyses are carried out on complete datasets.
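
The multiple imputation workflow (impute k times, analyse each replication, aggregate) can be sketched as follows. This is a deliberately simplified illustration and not a recommended imputation model: the quantity of interest is just a mean, the imputation step is a simple bootstrap-then-draw hot-deck scheme chosen for brevity, and the toy data and all function names are hypothetical. The aggregation step uses the standard combining rules usually called Rubin's rules: the pooled estimate is the average of the k estimates, and the total variance is the average within-replication variance plus (1 + 1/k) times the between-replication variance.

```python
import random
import statistics


def pool_rubin(estimates, variances):
    """Combine k per-replication estimates and variances with Rubin's rules."""
    k = len(estimates)
    pooled = statistics.mean(estimates)        # pooled point estimate
    within = statistics.mean(variances)        # average within-replication variance
    between = statistics.variance(estimates)   # variance between replications
    total = within + (1 + 1 / k) * between     # total variance of the pooled estimate
    return pooled, total


def impute_once(values, rng):
    """Replace each None by a draw from a bootstrap resample of the observed values."""
    observed = [v for v in values if v is not None]
    donors = rng.choices(observed, k=len(observed))  # bootstrap step
    return [v if v is not None else rng.choice(donors) for v in values]


def multiple_imputation_mean(values, k=20, seed=1):
    """Estimate a mean from k imputed replications and pool the results."""
    rng = random.Random(seed)
    estimates, variances = [], []
    for _ in range(k):
        completed = impute_once(values, rng)
        estimates.append(statistics.mean(completed))
        # Squared standard error of the mean computed on this replication.
        variances.append(statistics.variance(completed) / len(completed))
    return pool_rubin(estimates, variances)


# Toy data (hypothetical): 200 values, about 30% set to missing at random.
toy_rng = random.Random(0)
data = [toy_rng.gauss(10.0, 2.0) for _ in range(200)]
data_with_md = [None if toy_rng.random() < 0.3 else x for x in data]

pooled_mean, pooled_var = multiple_imputation_mean(data_with_md)
print("pooled mean:", round(pooled_mean, 2))
print("pooled standard error:", round(pooled_var ** 0.5, 3))
```

In practice the imputation model would use the other variables in the dataset, and dedicated software would normally be preferred over a hand-written routine; the sketch only makes the three steps of the procedure concrete.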

In sum, missing data are a ubiquitous reality of empirical research. They can lead to serious estimation bias if left untreated, can be at least partially avoided with careful data collection methodologies, and can be handled by appropriate analyses so as to provide unbiased parameter estimates.

Author: André Berchtold

References

Berchtold, A. (2019). Treatment and reporting of item-level missing data in social science research. International Journal of Social Research Methodology, 22(5), 431–439. https://doi.org/10.1080/13645579.2018.1563978
Brandmaier, A. M., Ghisletta, P., & Oertzen, T. von. (2020). Optimal planned missing data design for linear latent growth curve models. Behavior Research Methods, 52(4), 1445–1458. https://doi.org/10.3758/s13428-019-01325-y
Little, R. J. A., & Rubin, D. B. (2019). Statistical Analysis with Missing Data (3rd ed.). Wiley.
Molenberghs, G., Fitzmaurice, G., Kenward, M. G., Tsiatis, A., & Verbeke, G. (Eds.). (2014). Handbook of Missing Data Methodology (1st ed.). Chapman and Hall/CRC.
Morselli, D., Berchtold, A., Suris, J.-C., & Berchtold, A. (2016). On-line life history calendar and sensitive topics: A pilot study. Computers in Human Behavior, 58, 141–149. https://doi.org/10.1016/j.chb.2015.12.068
Rhemtulla, M., & Little, T. D. (2012). Planned Missing Data Designs for Research in Cognitive Development. Journal of Cognition and Development, 13(4), 425–438. https://doi.org/10.1080/15248372.2012.717340
Rothenbühler, M., & Voorpostel, M. (2016). Attrition in the Swiss Household Panel: Are Vulnerable Groups more Affected than Others? In M. Oris, C. Roberts, D. Joye, & M. Ernst Stähli (Eds.), Surveying Human Vulnerabilities across the Life Course (pp. 223–244). Springer International Publishing. https://doi.org/10.1007/978-3-319-24157-9_10
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.
Stähli, M. E., Joye, D., & Roberts, C. (2016). Mixing modes of data collection in Swiss social surveys: Methodological Report of the LIVES-FORS Mixed Mode Experiment. LIVES Working Paper, 48, 1–42. https://doi.org/10.12682/LIVES.2296-1658.2016.48
van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). CRC Press.
