Martin Palm - Measuring statistical disclosure risks in synthetic data

  • Presenting author: Martin Palm (Federal Statistical Office, Germany (Destatis))

  • Authors: Martin Palm (presenter), Hanna Brenzel, Ralf Münnich, Jan Weymeirsch

  • Session: C02D - Synthetic Data - Wednesday 11:00-12:30 - Erika-Weinzierl Hall

According to Li and O’Donoghue (2013), microsimulations consist of two central parts: the actual simulation, in which if-then questions are answered by simulation, and the data generation. The latter is needed because complete data sets with all required variables and sufficiently large numbers of individuals are rarely available. Therefore, within the research group “MikroSim” (DFG FOR 2559), the total population of Germany is represented synthetically, but close to reality, at the level of individuals and households. Correlations and marginal distributions are estimated predominantly from the German Microcensus, from known totals from official sources, and from the 2011 Census. The basic dataset, consisting largely of demographic variables, has been extended to cover additional topics such as care needs and housing, among others. Variables in MikroSim are generated synthetically using statistical prediction methods, corrected where necessary with alignment methods so that known marginal values are statistically respected. The question arises whether further confidentiality measures are necessary to avoid statistical disclosure risks despite the synthetic data generation, because the simulation may be so close to reality that, for example, original input records are replicated exactly or rare events are simulated. The study will investigate whether and to what extent disclosure risks exist at all, and how they can be avoided a priori. Furthermore, it will be shown that the usability of the resulting data for microsimulation is only slightly limited.
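The two risks named above, exact replication of input records and simulated rare cases, can be made concrete with a small check. The following is a minimal sketch and not the measure used in the study: the function names, the choice of key variables, and the toy records are all assumptions made for illustration. It reports the share of synthetic records whose key combination also occurs in the original data, and the number of key combinations that are unique in both files.

```python
from collections import Counter


def key_of(record, keys):
    """Project a record (dict) onto the chosen key variables."""
    return tuple(record[k] for k in keys)


def disclosure_indicators(original, synthetic, keys):
    """Two simple, illustrative indicators (hypothetical, not from the study).

    replicated_share: share of synthetic records whose key combination
                      also occurs in the original data.
    unique_matches:   number of key combinations that occur exactly once
                      in the synthetic data and exactly once in the original.
    """
    orig_counts = Counter(key_of(r, keys) for r in original)
    syn_counts = Counter(key_of(r, keys) for r in synthetic)

    replicated = sum(1 for r in synthetic if key_of(r, keys) in orig_counts)
    unique_matches = sum(
        1 for k, n in syn_counts.items() if n == 1 and orig_counts.get(k) == 1
    )
    share = replicated / len(synthetic) if synthetic else 0.0
    return {"replicated_share": share, "unique_matches": unique_matches}


# Toy records, invented purely for illustration.
original = [
    {"age": 34, "sex": "f", "region": "A"},
    {"age": 34, "sex": "f", "region": "A"},
    {"age": 91, "sex": "m", "region": "B"},  # rare combination
    {"age": 50, "sex": "m", "region": "A"},
]
synthetic = [
    {"age": 34, "sex": "f", "region": "A"},
    {"age": 91, "sex": "m", "region": "B"},  # replicates the rare case
    {"age": 47, "sex": "f", "region": "B"},
    {"age": 47, "sex": "f", "region": "B"},
]

result = disclosure_indicators(original, synthetic, keys=("age", "sex", "region"))
print(result)
```

A match on a handful of key variables is only a rough indicator: in a synthetic population, common combinations will match by construction, so the unique matches are the more informative of the two figures.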