Building Trust in AI-Driven QA: Ensuring Transparency and Explainability With GenAI

Generative AI (GenAI) is already helping many of us with everyday tasks. It is also rapidly advancing quality assurance (QA), fueling testing breakthroughs that promise to dramatically speed up delivery and achieve unprecedented scale, potentially shattering old limits on automation.
But getting testing teams to trust GenAI is proving more challenging than it looks. A big part of the problem is the feeling that AI is a black box with unclear processes and outputs. People naturally worry whether something they don’t understand can be relied on to ensure software quality. And yes, there’s real concern about what these AI tools mean for human jobs in QA.
“Building trust with AI needs a focused effort centered on being open about how the AI works, especially the technical parts impacting reliability,” says Mayank Bhola, co-founder and head of product at LambdaTest, a cross-browser testing platform. “It also means sticking to strong ethical rules, particularly with sensitive data. Establishing that trust is essential for making AI-driven QA solutions truly work and gain traction in a technical environment,” he emphasizes.
QA is a critical underpinning for reliable software delivery, but a painstaking and resource-intensive process, with costs rising linearly. So, how are companies leveraging GenAI to navigate the complex testing landscape, build trust, and optimize for costs? I gathered insights from notable industry leaders to explore how organizations can tackle these layered challenges head-on.
Can You Trust AI With Your Data?
Building trust in AI-driven QA starts with strong ethics. The most critical part is handling data right, keeping everything confidential and private. Any company faces a huge risk if sensitive information accidentally gets into AI models or shows up in test results.
Companies must ensure that the data they feed to AI doesn’t include personally identifiable information (PII). Speaking from his experience launching Kane AI (a native GenAI test agent developed by his team at LambdaTest), Bhola calls using “anonymous data” the “number one ethical guideline.” This means putting processes in place to technically “mask all the personal identities,” like customer SSNs or other sensitive IDs, before that data ever goes into an AI tool or is used to generate a test case. So, rigorous data masking isn’t just a suggestion; it’s a technical necessity for privacy compliance.
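To make that concrete, here is a minimal sketch of what a masking step might look like before any record reaches an AI tool. The patterns and the mask_pii helper are illustrative, not LambdaTest’s actual implementation; production pipelines typically pair rules like these with dictionary lookups or NER-based detectors to catch names and other free-text PII.

```python
import re

# Illustrative patterns only; real masking pipelines usually combine regex
# rules with dictionary or NER-based detectors for broader coverage.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace recognizable PII with typed placeholders before the text is
    sent to an AI tool or used to generate a test case."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}_REDACTED>", text)
    return text

record = "Customer Jane Roe, SSN 123-45-6789, email jane.roe@example.com"
print(mask_pii(record))
# -> Customer Jane Roe, SSN <SSN_REDACTED>, email <EMAIL_REDACTED>
# Note the name slips through: regex-only masking misses free-text PII,
# which is why broader detectors are usually layered on top.
```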
There’s also a big worry about confidential company data being used to train AI models themselves, especially with third-party tools. Jon Matthews, VP of Engineering at Functionize, stresses the importance of knowing exactly what information is passed back to the AI model and ensuring that any data sent is anonymized.
Hugo Farinha, co-founder at Virtuoso QA, agrees that models should “never be trained on confidential inputs unless explicitly permitted.” This means organizations need clear contracts and technical controls at the network level — like using TLS for communication and segmenting traffic — to prevent their proprietary data from improving a vendor’s AI model without permission.
For companies in sensitive and highly regulated industries like finance or healthcare, even more robust technical isolation is often necessary. “In such cases, solutions designed to operate locally or within secure private clouds are frequently preferred,” says Marcus Merrell, Principal Technical Advisor at Sauce Labs. This architectural approach helps ensure sensitive data remains contained, preventing unintended exposure or use in broader vendor model training, thereby maintaining stricter data separation.
However, ethics isn’t just about the input data; it’s also about trusting the AI’s output. GenAI models can sometimes produce incorrect or misleading results. Michael Larsen, Development Test Engineer at ModelOp, who has run many such experiments in his engineering work, warns testers not to blindly trust what the AI generates. He points out that AI is “frequently wrong,” and, problematically, is “so confident in its wrongness.” This insight highlights the absolute need for human oversight and verification of AI-generated test cases or findings. Failing to check AI output manually can lead to what Larsen terms “Automated Irresponsibility,” where potential bugs or false positives are missed because the process looks complete and automated — giving a false sense of security.
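One lightweight way to enforce that oversight is to track provenance on every test case and block anything AI-generated from entering the suite until a human has signed off. The structure below is a hypothetical sketch of such a gate, not a prescribed workflow.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    name: str
    steps: list[str]
    generated_by: str               # "human" or the AI tool that produced it
    reviewed_by: str | None = None  # set only after a human signs off

def promote_to_suite(case: TestCase, suite: list[TestCase]) -> None:
    """Only let AI-generated cases into the suite once a human has reviewed them."""
    if case.generated_by != "human" and case.reviewed_by is None:
        raise ValueError(f"{case.name}: AI-generated case still needs human review")
    suite.append(case)
```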
Avoiding bias in test generation is another ethical challenge with technical roots. Karan Ratra, Senior Engineering Leader at Walmart, stresses that AI frameworks should not “consider a specific religion, caste or any specific behavior of humans or differentiate between genders” when creating test cases, and that AI outputs shouldn’t “inadvertently expose sensitive user data or bias.” This means building or configuring AI to ensure fairness and broad coverage, potentially by refining training data to be more diverse and representative. Addressing bias is an ongoing technical and data management task requiring continual refinement of the models or test data used.
Having clear policies around the responsible use of sensitive data and specific AI tools is also crucial, says Merrell. Publishing these policies and offering regular training helps teams understand the rules and act ethically in their day-to-day work with these technical tools. Picking only approved AI tools also helps manage technical and ethical risks, ensuring alignment with company standards.
Ethics isn’t a compliance checkbox — it’s the bedrock on which trust in AI-driven QA must be built. Prioritizing data security, demanding transparency in AI behavior, and maintaining rigorous human oversight are non-negotiable steps toward responsible adoption.
What Are AI’s Strengths and Weaknesses for QA?
A lot of hype surrounds AI in QA right now. But building trust means looking past the excitement and getting realistic about what GenAI can actually do in a technical workflow, and where its current limits are. Industry leaders caution against viewing GenAI as a silver bullet or a magical replacement for human testers. In an email interview, Merrell is blunt that LLMs are “simply incapable of replacing the human mind when it comes to testing,” and labels vendors who claim otherwise as “selling snake oil.” So, you really need to be critical of sales pitches and ground your technical expectations in reality.
GenAI works far better as a helper, designed to augment human capabilities rather than automate everything outright. It’s good at specific jobs it understands well, especially tasks involving structured data or codebases, and it can be very helpful in generating test cases or synthetic test data. Merrell suggests that feeding a database “schema (metadata)” to an LLM is better practice than asking it to analyze data directly. The AI can use that structure to “form specific questions from natural language” or “generate customer data using any number of tools.”
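A rough sketch of that pattern is below: pass only the table metadata, never the rows, and ask the model for synthetic records. The call_llm function is a stand-in for whichever LLM client a team actually uses, and the schema is invented for illustration.

```python
import json

# Hypothetical schema; only metadata is shared with the model, never real rows.
CUSTOMER_SCHEMA = {
    "table": "customers",
    "columns": {
        "id": "integer, primary key",
        "full_name": "text",
        "email": "text, unique",
        "signup_date": "date",
        "plan": "enum: free | pro | enterprise",
    },
}

def build_test_data_prompt(schema: dict, rows: int = 20) -> str:
    """Build a prompt that asks the model for synthetic rows from metadata alone."""
    return (
        f"Given this table schema:\n{json.dumps(schema, indent=2)}\n"
        f"Generate {rows} rows of realistic but entirely synthetic data as a "
        "JSON array. Do not reproduce any real names, emails, or identifiers."
    )

# call_llm() is a placeholder for your model client of choice:
# synthetic_rows = json.loads(call_llm(build_test_data_prompt(CUSTOMER_SCHEMA)))
```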
Mike Finley, CTO and co-founder at AnswerRocket, notes its potential to “break through the ceiling that previously limited test automation” by helping “create hundreds of interesting test cases (and the expected results).” AI excels at creating unit tests, since unit tests validate code without touching production, which makes this one of the fastest ways enterprises can take advantage of generative coding by AI. Providing structured input helps AI perform well-defined generation tasks efficiently.
And the tech can assist with analysis and evaluation, too. Finley explains how GenAI, especially using “multimodal models,” can look at the “visual quality of a user experience,” evaluate results to see if “numbers run together or symbols are not displayed correctly,” or even “review feedback from human testers and quickly correlate key areas that are deficient.” These are tasks previously done poorly or manually. And this shows AI’s technical ability to process and find patterns in complex, multiformat data sources that traditional automation struggled with.
However, GenAI tools definitely have significant weaknesses that can undermine their perceived reliability in critical technical flows. Merrell points out that basic issues like “hallucinations and non-deterministic results” are inherent, “unsolvable problems with LLMs” for now. These issues stem from the probabilistic nature of the underlying models; the same prompt won’t always give the exact same output, which is a fundamental challenge for creating repeatable, reliable tests.
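One way to see and budget for that variability is simply to replay the same prompt several times and count the distinct responses. The sketch below assumes a generic call_llm client; where a provider exposes controls such as temperature or a seed, pinning them reduces, but does not eliminate, the variation.

```python
from collections import Counter

def measure_variability(call_llm, prompt: str, runs: int = 5) -> Counter:
    """Replay one prompt several times and tally the distinct responses.
    More than one distinct answer means the output cannot serve as a
    deterministic oracle in a repeatable test."""
    return Counter(call_llm(prompt) for _ in range(runs))

# Usage with a hypothetical client:
# results = measure_variability(my_client, "Write a test for the login flow")
# if len(results) > 1:
#     print("Non-deterministic output -- pair it with a human or rule-based check")
```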
Engineering leaders have often raised concerns that because LLMs guess based on enormous, varied data, their output can be “low quality,” may hide “security vulnerabilities,” and is inherently “non-deterministic and opaque.” These technical limitations around consistency and verifiable quality lead some to avoid LLMs for critical tasks like unit test generation, preferring deterministic AI approaches instead.
Larsen adds that machines aren’t “good at nuance or shifting context” and struggle when they lack clear examples. This happens because AI excels at identifying patterns it’s seen before but falters in truly novel or subtly different scenarios where human understanding is required.
Finley notes it’s still tough to get AI to “act like a human user,” especially with all the unpredictable ways users interact with software, like rage clicking or getting confused. Simulating these non-logical human behaviors goes beyond the AI’s current capabilities for following defined steps or patterns.
Merrell points out that nearly every tool on the market with GenAI capabilities has less than two years of exposure to LLMs, calling the products “immature.” This lack of maturity means tools haven’t been tested enough in real-world QA contexts to back up all their claims, making it hard to separate fact from fiction. But while AI testing can help, testing something probabilistic adds another “layer of probabilities,” invariably adding delays and cost compared to deterministic checks.
Finley also brought up an interesting idea for the future as more software uses AI features: “The problem of testing GenAI itself comes up.” He suggests solving AI problems often needs “more AI,” hinting that AI will end up testing other AI systems because “humans are not capable of adequately testing AI” alone, especially against complex attack vectors or prompt injections in a DevSecOps context. This implies a future where AI models are technically validated by other, potentially superior, AI models.
Understanding these technical details — what AI is actually good at (like data generation from schema or multimodal analysis), where its probabilistic nature causes problems (like hallucinations or non-determinism), and where it simply lacks human-like understanding or behavior simulation — is key. Bhola stresses that it is precisely this grasp of capabilities and limitations that allows teams to build real trust in the tech itself and apply it effectively in QA workflows.
Getting Your QA Team On Board With AI
Adopting AI tools in QA isn’t just about buying software and dropping it onto teams. It’s fundamentally about the people who need to use the tech daily, and whether they trust it. And that means tackling the big, unspoken worry upfront: is this AI going to take my job?
Leaders agree you have to address this fear head-on by showing teams that AI is there to help them, not replace them. Ratra of Walmart stresses it’s crucial to make it “very clear… that this AI is only for assisting them and improving their productivity rather than replacing them.” Larsen makes it relatable, calling AI a “precocious 10-year-old” assistant — capable but needs guidance.
John Yensen, President at Revotech Networks, sees AI best used as a “collaboration tool.” This consistent messaging, focusing on assistance and boosting productivity on technical tasks, helps calm fears about job security.
But it’s about more than just keeping jobs. Matthews suggests positioning AI tools as “power-ups” that give testers “superpowers,” enabling them to “level up at their jobs.” This includes letting them “evolve their roles to take on higher-level responsibilities,” like expanding into test strategy or project management, because AI handles the busywork. It reframes AI as a technical enabler for career growth within the QA discipline.
And you have to be honest about what the tech can and can’t do in practice. Merrell advises being “up front about the risks of GenAI QA tools, as well as the strengths and weaknesses.” He says, “Your team will trust you if they see that you understand these things and don’t just hype the productivity gains.” Acknowledging the tech’s current limitations — like hallucinations or non-determinism — is important. So is acknowledging skepticism rather than pretending it doesn’t exist, because “Testing is a skeptical process!” he reasons. Testers are naturally wired to look for flaws — in software and in new tools.
Effective enablement through training and support is also critical for building confidence in the technical tool. Bhola recommends “structured onboarding and training programs” with “hands on workshops and role based training and certifications.” This provides foundational technical knowledge about how the AI works and how to use it. He suggests giving teams “training, sandbox time, and safe spaces for exploration.” This hands-on experimentation is very useful for testers to get comfortable with the AI’s technical behavior and quirks without fear of breaking production systems.
Providing sandboxes and allowing experimentation lets teams build trust through direct interaction. Larsen emphasizes the need for transparency, advising companies to disclose “where, when, and how extensively it is being used.” Passing off AI work as human effort might offer a short-term gain, but it kills trust long-term when the unexplainable AI output causes technical issues or gaps in test coverage.
Starting with small, practical wins also helps demonstrate the tech’s worth in a tangible way within existing technical processes. Yensen suggests using “hands on trials” and showing “small wins through pilot styled projects” to integrate AI into technical workflows gradually. Farinha of Virtuoso QA agrees, recommending “low-risk, high-impact use cases” first to show value before a bigger commitment, like using AI to summarize complex technical documentation.
And giving testers the agency to work with the tools and figure out how to deal with their deficiencies helps build ownership and trust, says Merrell. They can find the specific technical scenarios where the AI helps most and where human intervention or correction is necessary, becoming experts in the AI’s practical application.
So, empowering teams with clear communication, solid technical training, space to experiment with the tech, and realistic expectations is essential. Building that human trust is actually the real engine for successful AI adoption in a technical QA environment.
Remember, Testers With AI Will Replace Testers Without AI
Forget the idea that AI means human testers are out of a job. Leaders are clear that the future of QA isn’t one without people; it needs both humans and AI collaborating closely. AI is at its best when it acts as a powerful assistant, designed to augment human technical abilities rather than replace them entirely.
Finley states, “AI won’t remove all humans from any process.” He thinks AI will help companies “reorganize most tasks so that the repetitious bulk of work can be automated.” This automation removes tedious technical chores like writing boilerplate code, generating basic test data, or performing redundant checks. And that frees up testers to do the things only they can do.
Humans bring vital cognitive and technical skills to QA that current AI models cannot replicate. Larsen stresses you need “space for human discernment, especially in edge cases and ethical concerns,” noting that machines are “not good at nuance or shifting context.” Merrell is firm that LLMs are “simply incapable of replacing the human mind when it comes to testing.” He highlights humans’ strategic thinking, like understanding complex product requirements, predicting failure modes based on experience, or understanding industry-specific technical contexts that AI might miss.
Finley points out that humans “generalize from examples faster than AI does” and “identify singular solutions faster.” AI excels at finding patterns in massive datasets. Still, human intuition, rooted in broad experience, can often leap to a unique technical solution or identify a critical edge case much faster than an algorithm. He also notes, importantly, that humans “know how to act irrationally when emotions or feelings are involved,” helping simulate complex user behavior that goes beyond predictable technical paths.
Human ethical judgment remains indispensable, especially when dealing with sensitive data or ambiguous test results generated by AI. As Finley puts it, it will ultimately be “human badges on the line that stops poor software from getting pushed out” because final accountability and the decision to release software rest with people who understand the real-world impact, not just technical output.
Because AI handles the repetitive technical tasks, testers get a chance to evolve their roles into higher-value areas. Matthews sees AI tools letting people “level up at their jobs,” perhaps taking on bigger responsibilities like defining test strategy, managing complex testing projects, or becoming subject matter experts in the AI tools themselves and their technical limitations. Using AI effectively allows testers to become more strategic technical partners in development.
But this shift needs the right mindset. Larsen advocates for a “culture of skepticism, inquiry, and curiosity — not blind adoption.” He encourages testers to treat AI like a “bright but unreliable assistant.” This technical skepticism is critical for verifying the AI’s output and understanding its limitations in specific technical scenarios, ensuring reliability.
The truth isn’t that AI is replacing testers everywhere. It’s about how testers adapt and integrate these new technical tools into their skillset. Larsen has a strong prediction that captures this technical evolution: “GenAI won’t replace testers, but testers who can use and implement GenAI effectively WILL replace those who can’t, or won’t.” So, the way forward is for humans to master using AI tools to boost what they can do and increase their technical impact in the QA process.
Don’t Forget: Trust Is Built
Building trust in AI-driven QA definitely isn’t simple. But doing it right is essential if companies want to actually use AI successfully in their technical workflows, not just try it out. Leaders across the industry make it clear that this trust doesn’t just happen on its own. You have to build it on purpose, piece by piece, deliberately addressing the technical and human aspects.
It starts with being super ethical and responsible with data, which means implementing technical safeguards like masking and anonymization to keep everything private and secure. It also needs tools that are truly transparent and explainable, offering logging and audit trails that show testers how the AI is making its technical decisions so people can verify the output and understand potential technical issues.
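As a sketch of what such an audit trail might capture (the fields here are illustrative, not a standard), each AI-assisted decision can be written out with the prompt, the model identifier, the raw output, and who approved it, so results can be traced and verified later.

```python
import json
import time
import uuid

def log_ai_decision(model: str, prompt: str, output: str,
                    approved_by: str | None = None,
                    path: str = "ai_qa_audit.log") -> None:
    """Append one traceable record per AI-assisted QA decision."""
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,              # which model or tool produced the output
        "prompt": prompt,            # exactly what it was asked
        "output": output,            # exactly what it returned
        "approved_by": approved_by,  # stays None until a human signs off
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```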
Getting people on board means showing them AI is there to help them do their jobs better, not take them away. Picking the right technical tools requires looking way past the marketing. You need to check for real transparency features, adherence to security standards, integration flexibility with your existing stack, and whether they actually deliver value in real-world technical scenarios.
Plus, companies have to be ready to handle the real problems. This includes realistic expectations about the tech’s current capabilities, the technical challenges of data governance and pipeline integration, the true ROI beyond simple time savings, and the evolving DevSecOps landscape that AI impacts. Navigating all of this requires deliberate technical and organizational strategies.
AI in QA is fundamentally about people and tech working together. AI can handle repetitive technical tasks and find patterns in data. However, we still need human brains to think critically, make judgment calls, handle tricky situations, understand complex systems, and set the overall test strategy based on business context.
When organizations build trust by setting strong ethical rules, being open about the tech’s realities, training their teams well on how to use the tools, and picking tools carefully based on technical merit and trustworthiness, they can unlock what AI can do for QA. And that means testers become more strategic technical partners, technical processes become smoother and more efficient, and everyone can feel more confident shipping reliable software systems.