Pierre Poulon - A statistical matching method based on the Random Forest algorithm – an application to the micro-simulation of health expenditures
Presenting author: Pierre Poulon (French Ministry of Health, Directorate for Research, Studies, Evaluation and Statistics)
Authors: Mathieu Fouquet, Pierre Poulon (corresponding author)
Session: C01B - Health [2] - Wednesday 9:00-10:30 - Marietta-Blau Hall
Slides: PDF
In France, compulsory health insurance reimburses 78% of households’ health expenditures, while complementary health insurance covers another 13%. Exhaustive individual-level administrative records exist for health expenses and for reimbursements by the compulsory health insurance (National Health Data System, SNDS), but there are no population-wide records of reimbursements by complementary health insurance. The French Ministry of Health has therefore developed a micro-simulation model, “Ines-Omar”. The model simulates, for each care item, how health expenditures are split between households, the compulsory health insurance and the complementary health insurance. It is a static micro-simulation model, used to simulate the distributional impact of reforms on households’ out-of-pocket expenses. This paper presents a novel statistical matching method used to build the model, based on the Random Forest machine-learning algorithm. The core of the 2019 version of “Ines-Omar” is the European Health Interview Survey (EHIS). In France, EHIS 2019 includes questions about complementary health insurance coverage: using the answers, we assign to each subscriber a complementary health insurance contract from a pool of contracts collected in a survey of health insurance providers. In addition, the EHIS 2019 sample has been partially linked to the SNDS. However, we do not have access to the health expenditures of individuals aged 14 and below in EHIS 2019, so we need to match them with individuals under the age of 14 from EHIS 2014 (which has been fully linked to the SNDS). The matching procedure runs as follows: as in a Random Forest, we train a collection of regression trees on EHIS 2014, the donor sample, to predict health expenditure. Each regression tree stratifies EHIS 2014 along the explanatory variables, such as health habits or health status.
For each regression tree, we apply the same strata to the recipient sample: each recipient thus falls into one stratum and is linked to the EHIS 2014 individuals in that same stratum. Once the whole collection of regression trees has been trained, each recipient has been associated with each potential donor a certain number of times. Finally, for each recipient we draw a donor among the EHIS 2014 individuals most frequently associated with it, and assign the donor’s health expenditures to the recipient for each care item. This method is more appropriate than hot-deck imputation when the variables to be transmitted to the recipients are strongly correlated (such as health expenditures by care item), and when the matching variables are numerous and their order of relevance is not clear beforehand. We will conclude the paper with an overview of the main insights obtained from the “Ines-Omar” model, on the weight of health expenditures in households’ income, and on the redistributive effect of the compulsory health insurance.
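The matching procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes scikit-learn's `RandomForestRegressor` as the tree collection, treats each leaf as a stratum, and draws uniformly among the `top_k` most frequently associated donors. The function name `forest_match`, the parameters `top_k` and `min_samples_leaf`, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def forest_match(X_donors, y_donors, X_recipients,
                 n_trees=100, top_k=5, min_samples_leaf=5, seed=0):
    """Draw one donor row index per recipient (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    forest = RandomForestRegressor(
        n_estimators=n_trees,
        min_samples_leaf=min_samples_leaf,
        random_state=seed,
    ).fit(X_donors, y_donors)

    # Leaf (stratum) reached in every tree: shape (n_samples, n_trees).
    donor_leaves = forest.apply(X_donors)
    recip_leaves = forest.apply(X_recipients)

    # counts[i, j] = number of trees in which recipient i and donor j
    # fall into the same leaf.
    counts = np.zeros((len(X_recipients), len(X_donors)), dtype=int)
    for t in range(n_trees):
        counts += recip_leaves[:, [t]] == donor_leaves[:, t][None, :]

    # Assumption: uniform draw among the top_k most associated donors.
    chosen = np.empty(len(X_recipients), dtype=int)
    for i in range(len(X_recipients)):
        candidates = np.argsort(counts[i])[::-1][:top_k]
        chosen[i] = rng.choice(candidates)
    return chosen, counts


# Synthetic stand-ins for the donor and recipient samples (not real data).
rng = np.random.default_rng(42)
X_don = rng.normal(size=(200, 4))                 # matching variables
items_don = np.abs(rng.normal(size=(200, 3)))     # expenditure per care item
items_don[:, 0] += np.abs(X_don[:, 0])            # make one item depend on X
X_rec = rng.normal(size=(50, 4))

# Trees are trained on total expenditure; all care items are then
# copied jointly from the drawn donor.
chosen, counts = forest_match(X_don, items_don.sum(axis=1), X_rec)
imputed = items_don[chosen]
```

Because every care item is copied from a single donor, the correlations between items observed in the donor sample carry over to the recipients, which is the property the abstract contrasts with hot-deck imputation.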