Johannes Gussenbauer - How to generate a synthetic population as input for microsimulation

  • Presenting author: Johannes Gussenbauer (Statistics Austria)

  • Authors: Alexander Kowarik; Johannes Gussenbauer

  • Session: C02D - Synthetic Data - Wednesday 11:00-12:30 - Erika-Weinzierl Hall

  • Slides: PDF

Synthetic data generation methods transform original data into privacy-compliant synthetic copies (twin data) that can serve as training data, open-access data, internal datasets that speed up analyses, and much more. With the R package simPop, synthetic population data can be simulated at the same size as the input data or at any other size; for finite populations, even the entire population can be generated. This work showcases the R package simPop with a focus on producing a synthetic population as input for microsimulation models. The package offers several methods for synthesizing variables from population data, for instance random forest and XGBoost. After data generation, it is recommended to adjust the synthetic data to known population margins. For this purpose, we implemented a simulated annealing algorithm that can use multiple population margins at once.
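To make the calibration step concrete, the following is a minimal, self-contained Python sketch of simulated annealing against several margins at once. It is a toy illustration only, not simPop's implementation or API: the function names (`margin_error`, `anneal`), the variables `sex` and `region`, the target counts, and the cooling schedule are all assumptions chosen for the example. The idea is to pick a fixed number of units from a synthetic pool, repeatedly swap one selected unit for an unselected one, and accept the swap if it reduces the total deviation from the target margins, or occasionally if it does not, with a probability that shrinks as the temperature falls.

```python
import math
import random
from collections import Counter


def margin_error(selected, pool, targets):
    """Sum of absolute deviations from every target margin (all variables at once)."""
    error = 0
    for var, target in targets.items():
        counts = Counter(pool[i][var] for i in selected)
        error += sum(abs(counts.get(cat, 0) - n) for cat, n in target.items())
    return error


def anneal(pool, targets, n_select, temp=10.0, cooling=0.95, max_iter=5000, seed=1):
    """Toy simulated annealing: choose n_select units whose counts match the targets."""
    rng = random.Random(seed)
    indices = list(range(len(pool)))
    selected = set(rng.sample(indices, n_select))
    current = margin_error(selected, pool, targets)
    start_err = current
    best, best_err = set(selected), current
    for _ in range(max_iter):
        if best_err == 0:
            break
        # Propose swapping one selected unit for one unselected unit.
        out_i = rng.choice(sorted(selected))
        in_i = rng.choice([i for i in indices if i not in selected])
        candidate = (selected - {out_i}) | {in_i}
        cand_err = margin_error(candidate, pool, targets)
        delta = cand_err - current
        # Always accept improvements; accept worse states with decaying probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            selected, current = candidate, cand_err
            if current < best_err:
                best, best_err = set(selected), current
        temp = max(temp * cooling, 1e-3)  # geometric cooling (assumed schedule)
    return best, best_err, start_err


# Hypothetical synthetic pool of 200 persons with two categorical variables.
pool_rng = random.Random(0)
pool = [
    {"sex": pool_rng.choice(["f", "m"]), "region": pool_rng.choice(["A", "B"])}
    for _ in range(200)
]

# Hypothetical known margins; both are enforced simultaneously.
targets = {"sex": {"f": 30, "m": 30}, "region": {"A": 20, "B": 40}}

best, best_err, start_err = anneal(pool, targets, n_select=60)
print("error before:", start_err, "error after:", best_err)
```

The objective here is a plain sum of absolute deviations over all margins, which is what lets one run respect several margins together. A production implementation would typically operate on households rather than individual persons and work within strata.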