Testing generative AI apps (original) (raw)
With generative AI apps rising in popularity, QA professionals should be aware of their unique failure modes. Learn what testers need to consider when handling these tools.
Generative AI tools break in different ways than conventional apps. Problems such as hallucinations and bias create new issues for testers. In addition, these tools introduce new failure modes that testing professionals need to consider.
Generative AI application development can greatly enhance productivity by automating or streamlining daily tasks. Tools such as ChatGPT introduce new possibilities for building, maintaining and improving existing applications.
Sreekanth Menon, vice president and global leader of AI and machine learning services at Genpact, an IT services and consulting company, observed that the scope for generative AI tools in their current form includes the processes of business analysts, developers and testers. Generative AI apps have become good at understanding natural language and even code.
However, not all generative AI models offer the same capabilities. Differences in architecture, training and parameter count can lead to varying performance and risks.
What are generative AI's failure modes?
Generative AI apps don't break like normal apps. This raises new challenges for those who adopt AI into their workflow. QA professionals must extend how they think about failure to catch many of the newer problems generative AI introduces.
Timothy Martin, vice president of product at Yseop, a company that develops generative AI tools, said that closed-system large language models, such as OpenAI's GPT and Google's Gemini, are publicly available through apps and APIs but may present challenges due to limited technical details, data privacy issues, inaccurate output generation and bias. Enterprise users have little control over these models and generally must live with their failure modes and limitations. Users prompt the system in different ways to control output and optimize results.
Martin finds it helpful to use open source models, such as Google's Flan-UL2 or Databricks' Dolly, which offer more technical information and control and can be customized for specific tasks through training. "Optimizing models for narrow use cases in this way can limit failure modes and give exceptional results," Martin said.
Open source models offer transparency that helps QA pros do their jobs.
Martin also recommended adopting enterprise QA measures covering areas such as data quality, output accuracy, interpretability and continuous monitoring.
Belwadi Srikanth, vice president of product and design at Suki AI, an AI voice platform for healthcare, said some of the most common failure modes include AI responses with the following characteristics:
- They aren't formatted properly.
- They don't match the desired tone or response length.
- They return unusual inputs that may not match the desired behavior for edge cases.
- They don't fully understand the complexity of the task.
In most cases, testers can address these failure modes with effective prompting.
Teams need to choose the most suitable models based on their specific needs and implement strict QA measures to ensure the effectiveness, security, fairness and regulatory compliance of these tools.
Testing considerations for generative AI apps
While the new frontier of generative AI apps opens exciting possibilities, it also demands careful navigation. Enterprise organizations must understand the distinct capabilities and limitations of different AI models. Teams need to choose the most suitable models based on their specific needs and objectives. They should also implement strict QA measures to ensure the effectiveness, security, fairness and regulatory compliance of these tools.
Srikanth finds it helpful to assess generative AI models on these four aspects:
- Ability to meet regulatory requirements.
- Effect on user workflows.
- Amount of change management required.
- How inaccuracies can be identified and corrected.
Another emerging challenge revolves around the concept of prompt engineering. This can help teams develop an effective prompt template for how an AI system behaves around various inputs.
Srikanth recommended curating a diverse set of test inputs for evaluating AI. These include a representative sample of the diversity of common inputs and a set of important edge cases to cover, which could include adversarial user behavior.
"Having these data sets enables you to quickly iterate on a prompt template," Srikanth said.
Srikanth finds it's possible to maximize performance by filling the prompt template with clear and detailed instructions, many examples of how to respond to a range of inputs and detailed descriptions of how to handle every edge case that might show up. If the task is particularly complex, it may make sense to break it down into smaller components, each of which is handled with a different prompt template.
Does generative AI always increase efficiency?
Generative AI can often induce the redundancy that it was meant to alleviate in the first place, Menon cautioned.
Menon finds it most challenging to use ChatGPT and other generative AI tools for existing use cases because it introduces new complexities. In these scenarios, using generative AI often dramatically increases the Fibonacci scale, a measure of the number of story points used to quantify project complexity.
Generative AI can also increase the number of test cases QA pros must consider. Testers who have traditionally focused on code nuances also need to consider the memory aspects of these new generative tools.
For example, ChatGPT's GPT-4 version allows 32,000 tokens at one go. These limits include the token count from both the message array sent and the model response. The number of tokens in the message array combined with the value of the maximum token parameter must stay under these limits, or the model returns an error. As a result, QA testing teams cannot rely solely on the few-shot learning of ChatGPT.
In addition, testers need to watch out for AI hallucinations, which can lead to false positives or false negatives and require more time spent retesting.
George Lawton is a journalist based in London. Over the last 30 years, he has written more than 3,000 stories about computers, communications, knowledge management, business, health and other areas that interest him.
Next Steps
OpenAI and Apple's partnership, explained