9 types of bias in data analysis and how to avoid them

There are various ways bias can show up in analytics, ranging from how a question is hypothesized and explored to how the data is sampled and organized. Addressing bias should be the top priority for anyone who works with data. Without care, bias can be introduced at any stage, from defining and capturing the data set to running the analytics or AI and machine learning system.

Although data scientists can never completely eliminate bias in data analysis, they can take countermeasures to look for it and mitigate issues in practice. Avoiding bias starts with recognizing that it can exist in the data itself, in the people analyzing or using it and in the analytics process. The adverse impacts range from bad decisions that directly affect the bottom line to unfair outcomes for certain groups of people.

What is bias in data analysis?

Bias is a statistical distortion that can occur at any stage in the data analytics lifecycle, including the measurement, aggregation, processing or analysis of data. Often, bias goes unnoticed until you've made some decision based on your data, such as building a predictive model that turns out to be wrong. Generative AI (GenAI) models and the processes of using them for analytics are also starting to introduce new types of bias.

These kinds of systemic problems can occur in a wide variety of ways, according to Bharath Thota, a partner in the Digital and Analytics practice of Kearney, a global strategy and management consulting firm. They include how teams measure, sample, observe and focus their attention throughout the data analytics process.

In general, avoiding bias starts with identifying its source and putting countermeasures in place. Cross-validating models can also improve their robustness, and conducting exploratory data analysis can help detect potential biases early in the process.
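As a rough illustration of those two checks, here is a minimal Python sketch, assuming a pandas DataFrame with numeric features, a binary target column and a group column; the column names and the random forest model are hypothetical placeholders, not a prescribed approach.

```python
# A minimal sketch of two early bias checks: cross-validation and a simple
# exploratory comparison across groups. The DataFrame and column names
# ("target", "region") are hypothetical placeholders for your own data.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def early_bias_checks(df: pd.DataFrame, target: str, group_col: str) -> None:
    X = df.drop(columns=[target, group_col])  # assumes numeric features
    y = df[target]

    # Cross-validation: a large spread between folds can signal that the
    # model only works well on some slices of the data.
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
    print(f"Fold accuracies: {scores.round(3)} (std={scores.std():.3f})")

    # Exploratory check: compare base rates of the target across groups.
    # Large gaps are worth investigating before any modeling starts.
    print(df.groupby(group_col)[target].mean())
```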

Statistics textbooks are filled with basic types of bias. However, business analytics teams are increasingly running into new kinds of bias owing to changing business practices and the use of new technologies, such as generative AI. Here are nine types of bias in data analysis that are increasingly showing up and ways to address each of them.

1. Trained on the wrong thing

Data analytics teams sometimes go for big data instead of granular data. For example, a team might gather sales data for every store in a retail chain, aggregated by week, for a particular analysis. Inna Kuznetsova, CEO of ToolsGroup, a supply chain planning and optimization firm, said this can take more time and expense yet be far less useful in planning promotions than a smaller set of much more granular data. In a small cluster of stores sharing similar demographics, for example, tracking sales by the hour the stores are open would let them plan promotions targeted to the needs of a particular customer set.

How to avoid

Start with the type of analysis and consider the best way to identify patterns within related data sets. Also, identify when certain data sets might not be relevant to a given analysis. For example, a standalone store of an upscale brand on a summer vacation island might not follow the regular pattern of large sales at Christmas. It makes most of its sales during summer and barely sells anything once the big city crowd leaves at the end of the season. "Bigger data is not useful for that store, but more granular data is," said Kuznetsova.

2. Confirmation bias

Confirmation bias occurs when researchers choose only the data that supports their own hypothesis. It is found most often when evaluating results.

"If the results tend to confirm our hypotheses, we don't question them any further," said Theresa Kushner, partner at Business Data Leadership, a data consulting company. "However, if the results don't confirm our hypotheses, we go out of our way to reevaluate the process, the data or the algorithms, thinking we must have made a mistake."

How to avoid

Develop a process to test for bias before sending a model off to users. Ideally, a different team should run the testing, looking at the data, model and results with a fresh set of eyes to identify problems the original team might have missed.
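One simple form such a pre-release test could take is checking whether a trained model performs equally well across subgroups of a held-out set. The sketch below assumes a fitted scikit-learn-style classifier and a hypothetical group column; it is an illustration of the idea, not a complete fairness audit.

```python
# A minimal sketch of a pre-release bias test: compute a trained model's
# accuracy separately for each subgroup in held-out data. The model,
# DataFrame and column names are hypothetical placeholders.
import pandas as pd
from sklearn.metrics import accuracy_score

def accuracy_by_group(model, X_test: pd.DataFrame, y_test, group_col: str) -> pd.Series:
    preds = model.predict(X_test.drop(columns=[group_col]))
    results = pd.DataFrame({"group": X_test[group_col], "y": y_test, "pred": preds})
    # A model that looks fine overall can still perform poorly for one group,
    # which is exactly the kind of problem a second team should look for.
    return results.groupby("group").apply(
        lambda g: accuracy_score(g["y"], g["pred"])
    )
```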

3. Availability bias

Matt McGivern, managing director and enterprise data governance lead at Protiviti, said he is increasingly seeing a new kind of bias in which high-value data sets previously in the public domain are being locked behind paywalls or are no longer available. Depending upon the financial backing of the modelers and the types of data, future model results might be biased toward data sets that are still available for free within the public domain.

How to avoid

Depending upon the modeling use cases, creating high-quality synthetic data sets can help address the availability gap. Additionally, there might be some advantage in the future as data sets previously available only to individual organizations are opened up publicly, even if charges accompany them.
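To make the synthetic-data idea concrete, here is a deliberately naive sketch that samples each numeric column from a fitted normal distribution. Real synthetic-data work needs to preserve correlations and distributions far more carefully, often with dedicated tools, so treat this only as an illustration under that assumption.

```python
# A naive sketch of generating synthetic tabular data by sampling each
# numeric column from a fitted normal distribution. Correlations between
# columns are NOT preserved here; this is illustration only.
import numpy as np
import pandas as pd

def naive_synthetic(df: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    synthetic = {
        col: rng.normal(df[col].mean(), df[col].std(), size=n_rows)
        for col in df.select_dtypes("number").columns
    }
    return pd.DataFrame(synthetic)
```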

4. Temporal bias

Temporal bias can arise when data from specific times is used to make predictions or draw conclusions without accounting for potential changes or seasonality. It's important to consider how a specific prediction might change over different time windows, such as weekdays versus weekends, end of month, seasons or holidays.

How to avoid

Patrick Vientos, principal advisory at Consilio, an eDiscovery company, said possible mitigations include using time series analysis techniques, rolling windows for model training and evaluation, accounting for seasonality and cyclical patterns, and regularly updating models with new data.
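As a minimal sketch of the rolling-window idea, the snippet below evaluates a model with time-ordered splits so that training data always precedes test data. It assumes X and y are NumPy arrays sorted by time; the Ridge model is a placeholder, not a recommendation.

```python
# A minimal sketch of rolling-window evaluation: time-ordered splits instead
# of random splits, so the model is always tested on data that comes after
# its training window. X and y are assumed to be NumPy arrays sorted by time.
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

def rolling_window_scores(X, y, n_splits: int = 5) -> list[float]:
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = Ridge().fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    # A sharp drop in later windows can indicate seasonality or drift the
    # model isn't capturing, which calls for retraining with newer data.
    return scores
```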

5. AI infallibility bias

Generative AI models can craft authoritative-sounding prose in their responses, yet headlines about lawyers citing hallucinated cases have recently grabbed attention. Nick Kramer, vice president of applied solutions at SSA & Company, a global consulting firm, said he has also seen the same problem in business analytics, in which users rely on GenAI to do the math and trust the numbers, or rush out emails with incorrect facts.

How to avoid

Kramer recommended approaching AI as you would approach new hires with no experience. Analytics users who adopt generative AI tools to help interpret analytics need thorough training on the strengths and weaknesses of GenAI and large language models (LLMs). It's also important to retain healthy skepticism toward the results models produce.

6. Optimist bias

Analysts or data scientists sometimes generate analyses or insights that are positive, hopeful and supportive of enterprise objectives. This can come at the expense of revealing the whole truth, representing what is most likely and appropriately identifying and mitigating risk.

How to avoid

Donncha Carroll, partner and chief data scientist at corporate advisory and business transformation firm Lotis Blue Consulting, recommended teams normalize, recognize and reward accuracy and early identification of risks that the business needs to manage. This requires asking the right questions to draw out the right information and understand the value of a balanced perspective. It's also important to take the time to review the underpinnings of past business decisions to determine which insights and methodologies delivered the best results.

7. Ghost in the machine bias

Carroll is also starting to see cases where new AI tools are integrated into traditional analytics in a way that obfuscates how the insights are created. These sophisticated models can provide important, high-value insights, but they also introduce complexity under the hood. For example, each answer could be cobbled together from information in different sources, which makes it harder to tell whether each component thread or source is represented accurately and weighted appropriately in arriving at the answer.

How to avoid

Carroll recommended starting by openly and honestly determining the level of impact associated with making bad decisions based on the answers provided by the system. Determine where the information creation process or pipeline is more machine-driven for the most important insights. Then, build in one or more human-in-the-loop steps in the process to audit the information and the method to avoid making dangerous mistakes.

8. Preprocessing bias

The staging and preparation of data can sometimes introduce preprocessing bias. Allie DeLonay, a senior data scientist for the data ethics practice at SAS, said decisions on variable transformations, how to handle missing values, categorization, sampling and other processes can introduce bias.

For example, when telehealth exploded during the pandemic, it introduced systemic changes in the data available to healthcare professionals. As a result, data scientists had to consider how to process the different data sets. Data from health monitoring devices used by patients at home, for instance, might need different processing steps than similar data collected by nurses at a hospital.

How to avoid

DeLonay said data scientists need to decide what to do when data is missing or might need to be processed differently. They need to be particularly careful in domains such as healthcare, because these types of decisions have been shown in some studies to increase unfairness.

Suppose a data scientist uses primary care visit data to evaluate how the pandemic affected blood pressure values for patients with hypertension. They need to decide what to do with records where vital signs are missing. Should they impute them?
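One practical way to make that decision visible is to check how an imputation choice shifts group-level statistics. The sketch below assumes hypothetical column names (a blood pressure value column and a group column) and uses simple mean imputation purely to show the comparison, not as a recommended strategy.

```python
# A minimal sketch of checking whether an imputation choice affects groups
# differently. Column names are hypothetical; the point is to compare
# group-level statistics before and after imputation.
import pandas as pd

def imputation_impact(df: pd.DataFrame, value_col: str, group_col: str) -> pd.DataFrame:
    overall_mean = df[value_col].mean()
    imputed = df[value_col].fillna(overall_mean)

    return pd.DataFrame({
        "observed_mean": df.groupby(group_col)[value_col].mean(),
        "imputed_mean": imputed.groupby(df[group_col]).mean(),
        "pct_missing": df[value_col].isna().groupby(df[group_col]).mean() * 100,
    })
    # Groups with more missing values are pulled harder toward the overall
    # mean -- one way a seemingly neutral choice can introduce bias.
```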

9. Terminology bias

Generative AI models can also introduce bias in analytics when they are trained on public data that uses terminology that differs from an organization's own. This can lead to problems when running analytics against unique enterprise data. "What ends up happening is that generative AI does not understand company-specific terminology," said Arijit Sengupta, founder and CEO of the AI platform Aible. For example, one company might refer to a "sales zone," but the AI model might not interpret that as "sales territory."

How to avoid

Organizations must consider how representative their data is compared to the data an LLM was trained on. Sengupta said prompt augmentation can help in simple cases by translating company-specific words into terms the LLM is more likely to understand. More complex cases with more substantial differences might require fine-tuning the LLM.
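A minimal sketch of the prompt-augmentation idea follows: prepend a small glossary that maps internal terms to language the model is more likely to have seen. The glossary entries and the example question are hypothetical placeholders; how the augmented prompt is sent to a model is left out.

```python
# A minimal sketch of glossary-based prompt augmentation: map internal,
# company-specific terms to more common language before querying an LLM.
# The glossary entries below are hypothetical examples.
GLOSSARY = {
    "sales zone": "sales territory",   # internal term -> common term
    "GTV": "gross transaction value",
}

def augment_prompt(question: str) -> str:
    glossary_lines = "\n".join(
        f'- "{internal}" means "{common}"' for internal, common in GLOSSARY.items()
    )
    return (
        "Use the following company-specific terminology when answering:\n"
        f"{glossary_lines}\n\n"
        f"Question: {question}"
    )

# Example: augment_prompt("How did each sales zone perform last quarter?")
```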

Editor's note: This article was republished in July 2024 to improve the reader experience.

George Lawton is a journalist based in London. Over the last 30 years he has written more than 3,000 stories about computers, communications, knowledge management, business, health and other areas that interest him.