Synthetic Data: To fear or embrace?
‘Synthetic data’. There’s something about this seemingly oxymoronic term that’s bound to strike unease into the hearts of researchers and statisticians across the world — and perhaps for good reason.
After decades of grounding our findings and decisions in real-world information, the idea that we may be starting to place our decision-making in synthetic hands is an unnerving one. But, whether we like it or not, the use of artificially generated data that mirrors the statistical properties of real-world data is on the rise, and it may be too early to tell quite what this means for us all.
As researchers whose aim is to close the say-do gap and get closer to real human behaviours, the instinctive reaction to this is one of horror. But evidence from early trials suggests that we should keep an open mind and explore where the opportunities might lie for ourselves and our clients.
As with most things, the lure of synthetic data lies in its cost. It offers rapid, inexpensive datasets that can be used either in isolation or to augment an existing dataset that has gaps to fill. These benefits are clear to those of us who have agonised over slow fieldwork and hard-to-reach samples: synthetic data would likely be far quicker and cheaper than online surveying, let alone telephone or face-to-face interviews.
It means we could get to the insights faster, avoid data privacy concerns, and work with more flexibility than traditional fieldwork allows.
However, synthetic data should be applied pragmatically, and only in situations where it makes sense. For example, it’s likely less suitable for time-sensitive trackers or for capturing emerging behaviours, but it may work well for filling the gaps in an existing dataset or for an early-stage test of potential concepts.
Among the obvious downsides are the lack of authenticity or connection to real-world context, and our inability to know how far synthetic data differs from the real thing. We aim to capture genuine human behaviours and emotions. Can an algorithm really replicate that? Some of the best insights can lie in the edge cases, and synthetic data may smooth out those edges. There is also an ethical question around ‘replacing’ minority audiences with algorithms, which should be thoroughly explored.
We don’t have all the answers yet, so neither naively embracing nor prematurely rejecting synthetic data is an appropriate stance. We remain cautiously curious, and see our role as advisors who guide clients through this development in the market, helping them to reap the rewards while avoiding the pitfalls. We must also set the standard for using synthetic data responsibly. Our plans to run a trial comparing real and synthetic data directly will kickstart a test-and-learn programme from which we can build our understanding of its strengths and weaknesses.
The first trial will test synthetic data as ‘data augmentation’, working with a company that uses machine learning to complete challenging quotas or fill in missing data. This has the potential to address some of the challenges and unexpected issues that can hinder quant research projects, allowing us to be more agile and reactive.
We will be carrying out further trials, testing the uses and efficacy of synthetic data in different research contexts – and we’d love to hear from you.
Want to take part in our trial? If you’re curious about synthetic data and want to know more, get in touch.