Dr James Poulten has a PhD in experimental quantum matter physics and a Master's degree in physics from Royal Holloway, University of London. In his time as a Data Science Consultant, he's worked across multiple government departments, on a range of projects with the Met Office, the Home Office, Number 10 and the Cabinet Office.
As Lead Data Scientist at Made Tech, he discusses the potential power of synthetic data in government and how we can use it to supercharge the analytics process.
In government there’s a conscious effort to shift toward backing up policies and important decisions with rigorous analytics. Data is the lifeblood of this analysis, but how can the sensitive data the government collects be used safely and without breaching the strict GDPR laws we have in place to protect it?
Synthetic data is an artificial data set, created digitally rather than gathered from the real world. It keeps the mathematical look and feel of the original data without containing any direct link to real people, so analysts don't have to worry about exposing personally identifiable information. It can be generated using a range of machine learning techniques, each with its own strengths and weaknesses.
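One of the simplest generation techniques can be sketched in a few lines. Everything below, the columns, the figures and the distributions, is invented for illustration: fit the marginal distribution of each column in a real table, then sample brand-new records from those fits.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for a sensitive "real" table: ages and incomes of 1,000 people.
real_age = rng.normal(40, 12, size=1000).clip(18, 90)
real_income = rng.lognormal(mean=10.2, sigma=0.4, size=1000)

# Fit simple marginal distributions to the real columns...
age_mu, age_sigma = real_age.mean(), real_age.std()
inc_mu, inc_sigma = np.log(real_income).mean(), np.log(real_income).std()

# ...then sample brand-new records from the fitted distributions.
# No synthetic row corresponds to any real person.
synth_age = rng.normal(age_mu, age_sigma, size=1000).clip(18, 90)
synth_income = rng.lognormal(inc_mu, inc_sigma, size=1000)

print(f"real mean age {real_age.mean():.1f}, synthetic mean age {synth_age.mean():.1f}")
```

This preserves each column's distribution but loses the correlations between columns; capturing that joint structure is where more sophisticated techniques, such as GANs, come in.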
Synthetic data generation is not a new field, but until recently, the time and cost of actually developing a solution has been prohibitive. But with recent advances in artificial intelligence (AI), that might be about to change, unlocking the power of synthetic data for everyone.
Benefits of synthetic data
So what do those benefits look like in practice? Imagine you want to analyse the population of a town or some other group of users. Before you begin, you're already entering a minefield of regulation and GDPR, which could take months to navigate.
Now imagine simulating that user group instead, as you might populate a town in a computer game. The occupants of that town can have all the same information as a real-world group. Done well, this generated group could be a pretty accurate approximation of the real-life one. The main difference is that they're entirely made up.
Analysts onboarded to a project can have near-immediate access to this simulated data, without any of the privacy concerns that accompany real-life data. This means, while the appropriate permissions and regulations safeguarding the real data are being worked through, analysts can already begin their work. This can save weeks of time. Then, once access has been granted, any analysis developed on your simulated data can be transferred directly to your real data.
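That handover can be sketched as follows. The variables, numbers and relationship here are all invented for illustration: the analysis, a simple least-squares fit, is developed entirely against synthetic data, then run unchanged on the real data once access is granted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "real" data (held behind access controls):
# income vs. years of education.
real_edu = rng.uniform(9, 21, size=500)
real_income = 2000 * real_edu + rng.normal(0, 4000, size=500)

# Synthetic stand-in with the same shape and a similar rough relationship.
synth_edu = rng.uniform(9, 21, size=500)
synth_income = 2000 * synth_edu + rng.normal(0, 5000, size=500)

# Develop the analysis entirely on the synthetic data.
slope, intercept = np.polyfit(synth_edu, synth_income, deg=1)

# Once access is granted, the same pipeline runs unchanged on the real data.
pred = slope * real_edu + intercept
rmse = np.sqrt(np.mean((pred - real_income) ** 2))
print(f"fitted slope {slope:.0f}, RMSE on real data {rmse:.0f}")
```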
The key point here is that synthetic data maps to simulated people, not real ones. So there’s no need to worry about privacy issues. It becomes a lot easier to share data between departments and build collaborative working environments. It’s also not subject to the same data-storage requirements and limitations.
Another benefit is the potential cost saving. Once the initial development costs are covered, synthetic data is cheaper to store, process and generate. As the data doesn't contain personally identifiable information (PII), analysts and data scientists can access it as soon as they join a project, rather than waiting for clearance or for access to more secure environments to be granted. The gains in efficiency and momentum can be massive.
Synthetic data and GANs
When it comes to building a synthetic data set, there are levels of complexity. As complexity increases, so does the utility – but this comes with a cost. Structural synthetic data can be used to verify code. It’s easy to generate and simple to deploy, but it holds little analytical information.
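A minimal sketch of structural synthetic data, where the column names and the pipeline function are hypothetical examples: rows with the right shape and types but meaningless values, enough to verify that code runs end to end.

```python
import random
import string

# Structural synthetic data: the right columns and types, but purely random
# values. Enough to check that a pipeline runs, not to draw insight from.
def make_structural_rows(n, seed=0):
    rnd = random.Random(seed)
    rows = []
    for i in range(n):
        rows.append({
            "id": i,
            "postcode": "".join(rnd.choices(string.ascii_uppercase, k=2))
                        + str(rnd.randint(1, 99)),
            "age": rnd.randint(0, 110),
            "claims": rnd.randint(0, 5),
        })
    return rows

# The pipeline under test (a made-up example): average claims by age band.
def claims_by_age_band(rows):
    bands = {}
    for r in rows:
        band = (r["age"] // 10) * 10
        bands.setdefault(band, []).append(r["claims"])
    return {b: sum(v) / len(v) for b, v in bands.items()}

result = claims_by_age_band(make_structural_rows(200))
print(result)
```

The averages it produces are meaningless, which is exactly the point: the code path is exercised without any analytical signal in the data.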
Building a robust synthetic data set is no easy task. The level of synthetic data that would allow analysts and data scientists to develop complex analytic models and playbooks can, in itself, take a team of data scientists to develop. Depending on your requirements, there’s likely a level of synthetic data that suits your needs, without going further.
At the more complex end of the scale, there are a number of different techniques that can be deployed. The method of the moment is to use generative adversarial networks (GANs). These pit two neural network models against each other, one generative and one discriminative.
These models have received widespread media attention in recent years due to their effectiveness at producing images and text in the style of a chosen artist, genre or medium. A prominent example is generative AI art: if you've seen or read about AI art, it's more than likely been produced by a GAN. We can turn that same process towards generating tabular data.
The appeal of using a GAN, complex though it may be to build, is that the generator never sees the real data. Instead, it starts from a random place and learns to generate plausible data. As it does this, its output is continually checked by a second network, the discriminator, which judges how realistic the synthetic data looks. The two improve together: the generator gradually learns to produce data the discriminator can no longer tell apart from the real thing.
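As an illustration only, not a production recipe, here is a toy GAN in plain NumPy: a one-parameter-pair generator and a logistic discriminator, with the "real" data being samples from an invented distribution. The generator never touches the real samples; it only receives the discriminator's verdicts, yet it learns to place its output where the real data lives.

```python
import numpy as np

rng = np.random.default_rng(7)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# "Real" data the generator never sees directly: samples from N(4, 1).
def real_batch(n):
    return rng.normal(4.0, 1.0, size=n)

a, b = 1.0, 0.0   # generator parameters: x = a*z + b, with noise z ~ N(0, 1)
w, c = 0.1, 0.0   # discriminator parameters: D(x) = sigmoid(w*x + c)

lr, batch = 0.01, 64
for _ in range(4000):
    # Discriminator step: push D(real) towards 1 and D(fake) towards 0.
    xr = real_batch(batch)
    xf = a * rng.normal(size=batch) + b
    dr, df = sigmoid(w * xr + c), sigmoid(w * xf + c)
    w += lr * np.mean((1 - dr) * xr - df * xf)
    c += lr * np.mean((1 - dr) - df)

    # Generator step: adjust (a, b) to push D(fake) towards 1,
    # using only the discriminator's verdict, never the real data.
    z = rng.normal(size=batch)
    df = sigmoid(w * (a * z + b) + c)
    a += lr * np.mean((1 - df) * w * z)
    b += lr * np.mean((1 - df) * w)

fake = a * rng.normal(size=2000) + b
print(f"generated mean: {fake.mean():.2f} (real mean is 4.0)")
```

Even this toy version shows the characteristic behaviour: the generated mean drifts towards the real one, while real GANs on tabular data do the same with neural networks over many columns at once.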
Real-world examples
Synthetic data is rapidly gaining traction in the public sector. The ONS Data Science Campus has already published several reports, such as 2019's Generative adversarial networks (GANs) for synthetic dataset generation with binary classes. According to the ONS, machine learning models trained on synthetic data generated from US census data were around 80% effective. This means the process for generating a model can be rapidly iterated once the real data is introduced, and used to develop solutions almost immediately.
There are also big advancements in healthcare. Medical start-up Curai is using this type of AI to develop their app, using a combination of synthetic data and natural data to train their neural networks. This has helped develop features in the app such as dermatology image classification and diagnostic models. Synthetic data has been used to supplement and enhance existing datasets to supercharge the performance of medical AI models.
Synthetic data in government
In government, the benefits of synthetic data are wide-reaching, from encouraging knowledge sharing within groups to collaboration across departments. Using synthetic data can help us work with industry and academic institutions when crowd-sourcing innovation and similar activities. Right now, anonymising data for collaboration is an expensive and time-consuming activity. With synthetic data, much of that burden disappears.
Synthetic data can also be used internally for better analysis to improve services, from research to finding new ways to innovate. Using this type of data means you're no longer constrained by GDPR or other handling guidelines, opening up space for exploratory analysis.
Finally, synthetic data can be used in training material, developing and enhancing data literacy as a core skill across the public sector. Using robust synthetic data, we can develop training material without the usual PII concerns, closing the data literacy gap with examples that are as true to life as possible.
Limitations
Like any tool, synthetic data has its limitations. Though it does a great job of mimicking real data, it doesn't, and shouldn't, replace it entirely. Synthetic data gives us a good place to develop analysis processes, build models and train those models. But synthetic data always needs to be validated against real data.
Synthetic data is also unlikely to capture edge cases or anomalies. Think of the process as adding a level of fuzziness to your data set. You can see patterns, analyse relationships and draw insight, but you might struggle to identify outliers or deeper nuance. It's also worth noting that GANs can be difficult to build and train, and generating synthetic data is considerable work in itself, so there will be an upfront cost to consider.
Use in the public sector
Using synthetic data to enhance the analytics process could be just what the public sector needs. Along with the added possibilities that come with closing the data literacy gap, adopting this type of technology can bring huge benefits to those working in and using government services.
As with all machine learning techniques, it’s important to remember its risks and limitations and to understand the tool fully before you commit to using it. However, if used correctly, synthetic data has the potential to unlock sensitive data for analysis, speed up delivery and supercharge collaboration opportunities between departments.