Synthetic data for AI/ML development - MOSTLY AI

AI/ML development challenges

AI and machine learning are hungry for data. However, data for training models is frequently hard to come by: often, only 15-20% of customers consent to having their data used for analytics. The rest of the data, and the insights it contains, stays locked away. For privacy reasons, sensitive data is often off-limits both to in-house data science teams and to external AI or analytics vendors.

Even when some data is available, quality is often an issue. Missing or incomplete data complicates AI/ML development, and model accuracy suffers when the quality of the training data is insufficient.

The status quo in training data for machine learning and AI

The majority of AI/ML projects never make it into production due to a lack of high-quality training data. Most organizations do have the data, but it is locked up: data owners are unwilling, or simply unable, to provide the necessary training data for privacy and compliance reasons. Even when data owners are on board, the legacy anonymization applied during data preparation often reduces data utility. Traditional anonymization techniques strip away the granularity of the data, ultimately leading to low-quality models and suboptimal business decisions.

Synthetic data for AI/ML development

For AI/ML development, synthetic training data is a great alternative to real data. Not only is it privacy-compliant, but the AI-powered synthesization process also allows the original data to be modified in targeted ways.

For example, rare patterns and events can be upsampled in the synthetic training data, which can significantly improve ML performance. Synthesization can also be used to generate additional training data when the original volume is limited.
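The upsampling idea can be sketched in a few lines. This is a deliberately naive illustration with made-up column names, not MOSTLY AI's actual synthesizer: it resamples the rare class with replacement, whereas a real synthesization process would generate new, statistically similar records rather than copies.

```python
import random

random.seed(0)

# Hypothetical transaction data: 95% "normal" rows, only 5% rare "fraud" rows.
data = [{"amount": random.uniform(10, 100), "label": "normal"} for _ in range(95)]
data += [{"amount": random.uniform(500, 900), "label": "fraud"} for _ in range(5)]

def upsample(rows, label, target_count):
    """Naive upsampling: resample the rare class with replacement until it
    reaches target_count rows. A synthesizer would instead generate new,
    statistically similar records, avoiding exact duplicates."""
    rare = [r for r in rows if r["label"] == label]
    rest = [r for r in rows if r["label"] != label]
    boosted = [random.choice(rare) for _ in range(target_count)]
    return rest + boosted

balanced = upsample(data, "fraud", 50)
fraud_share = sum(r["label"] == "fraud" for r in balanced) / len(balanced)
print(f"fraud share after upsampling: {fraud_share:.2f}")  # 0.34
```

A downstream classifier trained on the balanced set sees the rare event often enough to learn its signature, which is the practical payoff of upsampling.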

Another area where synthetic data can be of great benefit is fairness and explainability of AI models.

Fair AI and explainability challenges in AI and ML development

The status quo in fair AI and AI explainability

There are millions of AI algorithms already in production, but only a small portion of them has been audited for fairness. Fair AI is still talked about mostly in the future tense by AI engineers. Companies putting untested, biased algorithms into production risk serious trouble, not only from a PR perspective but through bad business decisions. After all, biased data leads to biased business decisions, underserved minority groups, and inexplicable results. From faulty pricing models in insurance to suboptimal prediction outcomes in healthcare, algorithmic fairness is still a long way from reality.

The current landscape of fair AI and AI explainability is marked by a stark discrepancy between the growing recognition of their importance and the actual efforts undertaken to address them. While academic conferences, think tanks, and even some regulatory bodies are putting an increasing focus on the need for AI to be both fair and explainable, these discussions often don't translate into actionable steps within organizations.

Many companies are still in the early stages of understanding what it means to implement fair and explainable AI systems. The common practice of simply deleting sensitive attributes like race, ethnicity, or religion from datasets is a glaring example of the superficial approaches that fail to address the root cause of the problem. This not only perpetuates biases through proxy variables but also obfuscates the decision-making process, making it even harder to audit and explain the AI model's behavior.
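The proxy-variable problem described above is easy to demonstrate. The sketch below uses an invented "district" column that correlates with an invented sensitive "group" attribute (all names and numbers are illustrative, not from any real dataset): after the sensitive column is deleted, a trivial rule still recovers it from the proxy.

```python
import random

random.seed(42)

# Hypothetical records where "district" acts as a proxy for the sensitive
# attribute "group": districts are strongly, though not perfectly, segregated.
rows = []
for _ in range(1000):
    group = random.choice(["A", "B"])
    if group == "A":
        district = random.choices(["north", "south"], weights=[9, 1])[0]
    else:
        district = random.choices(["north", "south"], weights=[1, 9])[0]
    rows.append({"group": group, "district": district})

# "Fairness through unawareness": drop the sensitive column...
redacted = [{"district": r["district"]} for r in rows]

# ...yet a one-line rule reconstructs it from the proxy.
guess = ["A" if r["district"] == "north" else "B" for r in redacted]
accuracy = sum(g == r["group"] for g, r in zip(guess, rows)) / len(rows)
print(f"group recovered from proxy: {accuracy:.0%}")
```

With the segregation weights used here, roughly nine in ten records are reconstructed correctly, so deleting the sensitive column removed nothing but the ability to audit for the bias.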

The result is a landscape where algorithmic decisions, although increasingly critical in everything from loan approvals to medical diagnoses, lack both fairness and transparency. This undermines public trust in AI systems and exposes organizations to both ethical scrutiny and legal repercussions. And while there are tools and methods available for auditing algorithms, their adoption remains woefully limited, often considered as an afterthought rather than a fundamental part of AI development. Consequently, the industry is caught in a cycle of deploying algorithms that neither the creators nor the end-users fully understand or trust, perpetuating a status quo that is increasingly at odds with societal demands for fairness, accountability, and transparency.

Synthetic data for fair AI and AI explainability

Good quality AI-generated synthetic data can reduce bias in training datasets and can thus help to create fair AI systems.

For example, synthetic data generated by MOSTLY AI's synthetic data platform corrected a racial bias in crime prediction from 24% to just 1% and narrowed the gap between high-earning men and women from 20% to 2% in the US census dataset. Read the Fairness Series to learn more about how fair synthetic data can help reduce bias!
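Gap figures like the ones above are typically measured with a parity metric. The sketch below shows one common choice, the demographic parity gap (the difference in positive-prediction rates between groups), on toy data; it is a generic illustration, not the specific metric or data behind the MOSTLY AI results.

```python
def parity_gap(predictions, groups, positive=1):
    """Demographic parity gap: the absolute difference in the rate of
    positive predictions between the best- and worst-treated group."""
    rates = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        rates[g] = sum(predictions[i] == positive for i in idx) / len(idx)
    vals = sorted(rates.values())
    return vals[-1] - vals[0]

# Toy example: a model approves 8/10 applicants in group "m"
# but only 5/10 in group "f".
preds = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 5
groups = ["m"] * 10 + ["f"] * 10
print(f"parity gap: {parity_gap(preds, groups):.2f}")  # 0.30
```

A debiased training set, synthetic or otherwise, should drive this number toward zero while leaving overall accuracy largely intact.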

As for explainable AI, synthetic data can play a critical role in the auditing process. Auditors and regulators often require access to the data that trained a given model to validate its performance and ethical considerations. Sharing the original, sensitive data might not be feasible due to privacy and regulatory constraints. Synthetic data, however, can be freely shared, as it encapsulates the statistical properties of the original data without the sensitive details. Thus, auditors and teams evaluating trained models can work with synthetic data, enabling more transparent and fair AI systems.
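Whether a synthetic dataset really "encapsulates the statistical properties" of the original can itself be checked. One simple check, sketched below on invented data, compares the categorical marginals of a real and a synthetic column via total variation distance (0 means identical distributions); production-grade quality assurance would compare many more statistics than this.

```python
def marginal_distance(real, synthetic):
    """Total variation distance between the categorical marginal
    distributions of a real and a synthetic column."""
    categories = set(real) | set(synthetic)
    dist = 0.0
    for c in categories:
        p = real.count(c) / len(real)       # frequency in the real column
        q = synthetic.count(c) / len(synthetic)  # frequency in the synthetic column
        dist += abs(p - q) / 2
    return dist

# Hypothetical column: 70/30 split in the real data, 68/32 in the synthetic.
real = ["yes"] * 70 + ["no"] * 30
synth = ["yes"] * 68 + ["no"] * 32
print(f"TV distance: {marginal_distance(real, synth):.2f}")  # 0.02
```

An auditor receiving the synthetic set alongside such similarity reports can validate a model's behavior without ever touching the sensitive originals.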

Using synthetic data for such audits provides an effective and privacy-compliant way to document, validate, and certify AI models, which is vital in gaining public trust and meeting regulatory standards.