Abstract

Synthetic data are data generated by an algorithm to have properties similar to those of real data. They may be useful whenever real data are too sensitive, too valuable, or too limited to meet research needs. Synthetic data may be used, for example, to teach population health science without releasing real patient data to students, or to develop statistical analyses while avoiding Gelman and Loken’s “garden of forking paths”. But the more closely synthetic data replicate the properties and patterns of real data, i.e. the more realistic they are, the greater the risk that they fail to achieve some of these objectives. Information contained in synthetic data could be used to learn about the real data on which they are based, potentially risking participant privacy or exhausting the utility of the real data for hypothesis testing. Conversely, attempts to reduce bias in certain machine learning models by augmenting real data with synthetic data may be defeated by a lack of realism. The ethical and epistemic problems that might motivate the use of synthetic data, and the issues that may affect how they are generated and used, will be presented for discussion.