The Defence Science and Technology Laboratory (Dstl) has produced a framework for assessing the use of different types of synthetic data.
Its principal data scientist, Karen Walker, has outlined the framework in a blog post that emphasises the need to match the method to the specific purpose.
Synthetic data is artificially generated to mimic the characteristics and structure of real-world data without exposing sensitive information. The Ministry of Defence, the parent department of Dstl, sometimes uses it when sharing sensitive data with external experts.
The main elements of the framework are how versatile a method is in handling different types of data, how well the synthetic data mimics the statistical properties of the original, and how well the method preserves the utility of the original data while maintaining strong privacy levels.
Other factors are how easy it is to explain and influence the method's output, how long it takes to run, and whether it has specific computing requirements.
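As a rough illustration of two of these criteria, the sketch below compares a single numeric column of "real" data with a synthetic version. It is not Dstl's framework or tooling: the function names `fidelity_gap` and `exact_match_rate`, the made-up data and the simple fitted-normal generator are all assumptions for the example. Statistical fidelity is approximated by the gap in mean and standard deviation, and privacy by a crude check of how many synthetic values exactly repeat a real one.

```python
import random
import statistics


def fidelity_gap(real, synthetic):
    """Absolute difference in mean and standard deviation between two columns."""
    return {
        "mean_gap": abs(statistics.mean(real) - statistics.mean(synthetic)),
        "stdev_gap": abs(statistics.stdev(real) - statistics.stdev(synthetic)),
    }


def exact_match_rate(real, synthetic):
    """Fraction of synthetic values that also occur in the real data.

    A crude privacy proxy only: a high rate suggests real records may be leaking.
    """
    real_values = set(real)
    return sum(1 for value in synthetic if value in real_values) / len(synthetic)


if __name__ == "__main__":
    rng = random.Random(0)

    # Made-up "real" column, standing in for a sensitive numeric field.
    real = [round(rng.gauss(50, 10), 2) for _ in range(1000)]

    # Hypothetical generator: sample from a normal distribution fitted to the real column.
    mu, sigma = statistics.mean(real), statistics.stdev(real)
    synthetic = [round(rng.gauss(mu, sigma), 2) for _ in range(1000)]

    print(fidelity_gap(real, synthetic))
    print(exact_match_rate(real, synthetic))
```

A real assessment would need checks suited to each data type, such as relational, free text or GPS data, and a far stronger privacy test than exact matching.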
Different uses, different methods
“Overall, the particular synthetic data generation method chosen needs to be specific to the particular use of the data once synthesised,” Walker said. “Given the maturity of the research in this area, it is not currently realistic to use one method for all purposes.”
Dstl developed the framework with the Applied Intelligence Laboratories of defence technology company BAE, working with a number of open datasets similar in nature to those held in defence. These include tabular datasets with numeric and categorical data, relational datasets, free text and GPS location data.
The team identified 16 techniques and trialled three in detail. The work is ongoing and Dstl said it is looking to talk to other government bodies, having set up a #syntheticdata channel on the Slack platform for cross-government data science.