To create synthetic datasets for training and testing purposes.
33 variables selected from a dataset made of the 2006 long-form Census linked to the 2015 Canadian Mortality Registry
47 variables selected from a dataset made of the 2006 long-form Census linked to the 2015 Canadian Cancer Registry and the 2014 Canadian Vital Statistics Death Database
Generating high-quality data for training and hackathons
Details of computation
A Fully Conditional Specification approach with CART and regression methods was used to create a synthetic dataset, which gave students (non-trusted analysts) access to detailed information in a non-secure environment.
Parties and trust relationship
Statistics Canada acts as input party; multiple students/researchers act as output parties. There is no assumption of trust between the parties. However, when the synthetic data was used during a hackathon, participants were asked to agree not to share any data outside the hackathon environment.
Pilot (synthetic datasets were successfully used as a training aid and to support a hackathon)
Sallier, Kenza. ‘Toward More User-centric Data Access Solutions: Producing Synthetic Data of High Analytical Value by Data Synthesis’. IOS Press, 1 Jan. 2020: 1059–1066.
Statistics Canada’s recent experience with synthetic data has been tied to specific uses, such as providing datasets of high analytical value to hackathon participants. The participants were allowed to access the data in the hackathon setting under an agreement not to copy or share the data further. The analytical value of the synthetic data was comparable to that of the original datasets. The disclosure risk was evaluated as if the produced synthetic datasets were Public-Use Microdata Files (PUMFs) with real respondents. The original microdata, e.g. census data, health variables and mortality indicators, contains sensitive information, and so could not be made available to outside researchers in an uncontrolled environment.
Case Study description
In the two most recent instances, in 2018 and 2019, synthetic datasets were created using a mass imputation method (the fully conditional specification approach with the Classification and Regression Tree (CART) method) in order to preserve the analytical value of the original files. The files have been shared in the hackathons and can be offered to other trusted institutions looking for open data. They can also support access via remote desktop connection.
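To make the idea of sequential, tree-based synthesis more concrete, the following is a minimal sketch only, not the procedure Statistics Canada actually ran (the source does not describe its implementation). It assumes all variables are numeric (categories coded as numbers), uses a deliberately tiny hand-written regression tree instead of a production CART library, and every name in it (`grow_tree`, `synthesize`, the toy columns) is hypothetical. The first variable is resampled from its observed values; each later variable is drawn from the observed values in the tree leaf that matches the record's already-synthesized variables.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared errors around the mean (regression-tree impurity)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(X, y, min_leaf=5, depth=0, max_depth=4):
    """Grow a small regression tree; every leaf keeps its observed y values."""
    best = None
    if depth < max_depth and len(y) >= 2 * min_leaf:
        for j in range(len(X[0])):
            # Try each observed value (except the largest) as a cut point.
            for cut in sorted(set(row[j] for row in X))[:-1]:
                left = [i for i, row in enumerate(X) if row[j] <= cut]
                right = [i for i, row in enumerate(X) if row[j] > cut]
                if len(left) < min_leaf or len(right) < min_leaf:
                    continue
                cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
                if best is None or cost < best[0]:
                    best = (cost, j, cut, left, right)
    # Stop when no admissible split reduces the impurity.
    if best is None or best[0] >= sse(y):
        return {"leaf": list(y)}
    _, j, cut, left, right = best
    return {
        "var": j,
        "cut": cut,
        "lo": grow_tree([X[i] for i in left], [y[i] for i in left],
                        min_leaf, depth + 1, max_depth),
        "hi": grow_tree([X[i] for i in right], [y[i] for i in right],
                        min_leaf, depth + 1, max_depth),
    }


def leaf_values(tree, x):
    """Follow the splits for record x and return the donor values in its leaf."""
    while "leaf" not in tree:
        tree = tree["lo"] if x[tree["var"]] <= tree["cut"] else tree["hi"]
    return tree["leaf"]


def synthesize(data, n_synth, rng):
    """Sequentially synthesize n_synth rows from a list of numeric rows."""
    n_vars = len(data[0])
    # Variable 0: draw from the observed marginal distribution.
    first_column = [row[0] for row in data]
    synth = [[rng.choice(first_column)] for _ in range(n_synth)]
    # Variable k: fit a tree on the original data using variables 0..k-1,
    # then sample a donor value from the leaf each synthetic row falls into.
    for k in range(1, n_vars):
        tree = grow_tree([row[:k] for row in data], [row[k] for row in data])
        for row in synth:
            row.append(rng.choice(leaf_values(tree, row)))
    return synth


# Toy, made-up data: age, an arbitrary binary flag, and a death indicator.
rng = random.Random(2019)
original = [[age, age % 2, 1 if age > 60 else 0] for age in range(20, 80)]
synthetic = synthesize(original, n_synth=25, rng=rng)
print(synthetic[:3])
```

Because values are drawn from leaves rather than copied record by record, no synthetic row is a direct copy of a respondent by construction, although individual combinations may still coincide with real ones; this is one reason a separate disclosure-risk evaluation remains necessary. A real application would also handle categorical targets, missing values and variable ordering, which this sketch ignores.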
Outcomes and lessons learned
The two hackathon exercises have helped to develop Statistics Canada’s experience with synthetic data with high analytical value. The projects have illustrated that synthetic data can be created that preserves the analytic utility of the original data while effectively reducing the risk of disclosure. Both hackathons met the goal of increasing the knowledge and experience of the attendees. Both instances have illustrated the challenges in understanding the risks (real or perceived) associated with creating these files. Finally, in terms of the data’s utility, there are challenges associated with developing synthetic datasets that meet a generic analytic goal, i.e. without any prior assumptions on the types of analyses to be performed by the users.