Most AI automation sales processes still spend too much time on the demo and not enough time on the operating model.
That is understandable. Demos are easy to love. They move fast, they compress complexity, and they make a workflow look cleaner than it usually is in production. A vendor uploads a file, classifies a ticket, routes an approval, drafts a reply, or extracts structured data in thirty seconds. Everyone in the room sees time savings. The buying team starts thinking about rollout.
Then the real questions arrive, usually from operations, security, legal, or the person who has to own the workflow after the pilot team moves on. Who approves actions when the model is uncertain? What gets logged, and can those logs be trusted later? Where is sensitive data redacted? What happens if a prompt change or model change silently makes the system worse next month? Can we roll back without shutting the whole operation down?
Those questions are not annoying procurement theater. They are the questions that separate a promising proof of concept from a vendor relationship your team can actually live with.
The mistake many buyers make is assuming these answers will sort themselves out after vendor selection. They rarely do. If approvals, logs, redaction, and rollback are vague during diligence, they usually remain vague during rollout. That is why vendor due diligence should be less about who gives the slickest demo and more about who can explain, in plain language, how the system behaves under pressure.
The demo is not the system
A demo proves that a workflow can be shown. It does not prove that it can be governed.
This is the core mental shift buyers need to make. In a controlled demonstration, inputs are cleaner, failure paths are hidden, and the operator knows what outcome the room is hoping for. Real operations are not like that. Real workflows have messy payloads, ambiguous edge cases, unexpected exceptions, policy changes, downstream outages, and humans who need to trust the tool even when it says "I am not sure."
That gap between demo quality and production quality is why vendor evaluation should always move quickly from capability questions to control questions. A vendor can have excellent model orchestration and still expose the business to unnecessary risk if approval paths are unclear, output handling is weak, or rollback is essentially manual panic.
The NIST AI RMF Playbook is useful here because it frames AI risk work across four practical functions: govern, map, measure, and manage. That is a much better buying lens than "which provider do you use" or "what benchmark score do you quote." It forces the conversation toward ownership, context, observability, and response.
When buyers skip that shift, they often end up choosing a vendor that looks strong in workshops and fragile in operations.
Start with workflow consequence, not model brand
The best diligence conversations begin with consequence. What exactly can this automation see, decide, recommend, route, or trigger? What business process is affected if it is wrong? What data classes move through it? Which humans still have to trust the output enough to act on it?
If those questions are blurry, the rest of diligence becomes generic. The vendor answers with platform language. The buyer asks for "security details." Both sides talk past each other.
A stronger pattern is to define one target workflow in concrete terms before asking technical questions. For example: invoice approvals over a specific amount, inbound support triage containing customer data, document extraction tied to contract workflows, or onboarding routing for regulated accounts. Once the workflow is concrete, you can test whether the vendor's controls match the actual consequence.
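One lightweight way to force that concreteness is to write the workflow's consequence profile down before the first vendor call. Here is a minimal sketch of what that might look like; the field names and the example workflow are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class WorkflowConsequence:
    """Hypothetical consequence profile for one target workflow.

    Filling this in before vendor calls keeps diligence anchored to what
    the automation can actually see, decide, and trigger."""
    name: str
    data_classes: list[str]   # e.g. customer PII, payment data
    actions: list[str]        # what the system can trigger downstream
    blast_radius: str         # business process affected if it is wrong
    human_dependence: str     # who must trust the output enough to act

invoice_approvals = WorkflowConsequence(
    name="invoice approvals over a set amount",
    data_classes=["vendor banking details", "internal cost centers"],
    actions=["route to approver", "flag for hold", "post draft to ERP"],
    blast_radius="accounts payable; duplicate or fraudulent payments",
    human_dependence="AP manager acts on the routing decision",
)
```

A one-page artifact like this gives every later diligence question (approvals, logs, redaction, rollback) something specific to attach to.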
This is also where NIST's Generative AI Profile helps. It exists to extend the AI RMF into generative AI-specific risks and operating considerations. In practice, that means buyers should not ask only whether a vendor has controls in theory. They should ask whether those controls match the specific data, action authority, and human dependence inside the workflow they are buying.
A low-risk summarization tool and a workflow that can change customer state should not be evaluated with the same tolerance. Good vendors understand that immediately.
Ask where approvals live and who can change them
Approval design is usually where weak vendors start sounding vague.
A lot of platforms say they support human-in-the-loop review. That phrase is not enough. Buyers need to know where approval happens, what triggers it, who can override it, who can change the thresholds, and whether those changes are recorded. If the answer is mostly "our team can configure that for you," you have learned something important already.
Approval logic is not a cosmetic setting. It is part of the workflow's real authority model. If a model can draft, classify, route, or trigger actions, the approval surface defines how much autonomy the business is actually accepting. OWASP's guidance on LLM risk is useful here because prompt injection, insecure output handling, sensitive information disclosure, and excessive agency are not abstract vulnerabilities. They are concrete reasons to keep approval boundaries explicit.
In diligence, ask the vendor to walk one risky case end to end. Show me a low-confidence outcome. Show me a policy exception. Show me who gets notified. Show me whether a reviewer sees enough context to make a good decision. Show me whether the reviewer can approve only this action or accidentally expand permissions across the workflow.
Strong vendors answer this by describing roles, thresholds, queues, escalation states, and audit events. Weak vendors answer it with reassurance.
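The skeleton of a strong answer is small: a risk tier, a confidence threshold, and an explicit review state, all tied to a versioned policy so threshold changes are auditable. A hedged sketch of that shape, with illustrative names rather than any vendor's actual API:

```python
def route_for_approval(action, confidence, policy):
    """Decide whether an action runs, queues for review, or escalates.

    `policy` is a versioned dict so every decision records which rules
    were active; all names here are illustrative."""
    if action["risk_tier"] == "high":
        return {"state": "needs_review", "queue": "high_risk",
                "policy_version": policy["version"]}
    if confidence < policy["min_confidence"]:
        return {"state": "needs_review", "queue": "low_confidence",
                "policy_version": policy["version"]}
    return {"state": "auto_approved", "queue": None,
            "policy_version": policy["version"]}

policy = {"version": "2024-06-01", "min_confidence": 0.85}

# A low-confidence outcome routes to a human queue, with the policy
# version recorded alongside the decision.
decision = route_for_approval({"risk_tier": "low"}, 0.72, policy)
```

If a vendor cannot describe their product in roughly these terms, the "human-in-the-loop" claim is probably thinner than it sounds.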
If approval logic matters to your process, this is also where human-in-the-loop AI workflow guardrails become relevant. The right vendor should sound like they already think that way.
Ask what gets logged, and whether the logs are operationally useful
Logging is another area where buyers get pleasant-sounding answers and very little usable detail.
You do not just want to know whether the vendor has logs. You want to know what the logs make possible. Can your team reconstruct who approved or overrode a step, which prompt or policy version was active, what the model returned, what downstream action occurred, and whether the workflow later failed? Can you correlate that across the system quickly enough to investigate a live problem?
NIST SP 800-53 is helpful here because it pushes in two directions buyers should care about. It emphasizes the value of a system-wide, time-correlated audit trail, and it also recommends limiting personally identifiable information in audit records when that information is not operationally required. Those two ideas belong together. Good logs let you reconstruct the workflow without becoming a second uncontrolled data store.
That means your diligence questions should be specific. Are logs time-correlated across workflow steps? Are approval events, retries, manual overrides, and rollback actions captured in a standardized format? How long are logs retained? Can customers access relevant records, or only the vendor? How is sensitive content minimized in those records? What happens when log capture fails?
A strong answer sounds operational. A weak answer sounds infrastructural. "We use a major cloud logging stack" is not the answer. "We record actor, workflow step, policy version, decision state, downstream action, and rollback status, and we let you expose only the fields needed for your own audit posture" is much closer.
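That stronger answer can be made concrete as a single event record. The sketch below mirrors the fields named above; the schema is an assumption for illustration, not a standard:

```python
import json
from datetime import datetime, timezone

def audit_event(actor, step, policy_version, decision_state,
                downstream_action, rollback_status="none"):
    """Build one time-correlated audit record for a workflow step.

    Illustrative sketch: minimize content, keep decisions reconstructable."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,                    # human reviewer or system role
        "step": step,                      # workflow step identifier
        "policy_version": policy_version,  # which rules were active
        "decision_state": decision_state,  # approved / overridden / rolled_back
        "downstream_action": downstream_action,
        "rollback_status": rollback_status,
    }

event = audit_event("reviewer:j.doe", "invoice_route", "2024-06-01",
                    "overridden", "hold_payment")
print(json.dumps(event))  # one line per event keeps correlation simple
```

Notice what is absent: the invoice contents. The record reconstructs who decided what under which policy, without duplicating sensitive payloads into a second uncontrolled data store.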
Ask how redaction works before prompts, in logs, and in outputs
Redaction is where many vendor conversations become suspiciously hand-wavy.
Teams hear "we support redaction" and assume the problem is solved. It usually is not. Sensitive data can appear before the prompt is assembled, inside retrieval context, inside the model response, inside reviewer interfaces, and inside logs. A vendor that talks about redaction as one pre-processing filter is probably underdescribing the real risk.
OWASP explicitly calls out sensitive information disclosure as a top LLM application risk. That should shape the diligence conversation immediately. Ask the vendor where redaction runs, how configurable it is by workflow, and whether the same protections apply to stored prompts, logged events, reviewer screens, and exported records. Ask whether redaction rules can be tested before release and whether policy changes are versioned.
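The layered idea matters more than any single filter: the same versioned rules should run at every stage the data passes through. A minimal sketch, assuming simple pattern-based rules (real redaction needs a reviewed, workflow-specific ruleset):

```python
import re

# Illustrative patterns only; a production ruleset would be reviewed,
# versioned, and configured per workflow.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text, stage):
    """Apply the same rules at every stage: prompt assembly, logging,
    and reviewer display. Tagging the stage makes leaks traceable."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}:{stage}]", text)
    return text

raw = "Refund request from jane@example.com, SSN 123-45-6789"
prompt_safe = redact(raw, "prompt")  # before the model sees it
log_safe = redact(raw, "log")        # before it lands in audit records
```

A vendor whose redaction story covers only the first of those two calls is describing a pre-processing filter, not a data handling design.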
This is also a place where your own internal standards matter. If you already publish a security and data handling page, use it as a buying tool. Make vendors explain how their workflow design aligns with your data minimization and approval expectations instead of accepting their generic security summary.
A useful question is simple: show me the path sensitive data takes through one real workflow. If the vendor cannot show the path clearly, they probably cannot control it clearly either.
Ask how the vendor handles prompt injection and excessive agency
Many buyers still treat prompt injection like a niche application security issue. That is outdated.
OWASP lists prompt injection as the first major LLM application risk for a reason. It is not just about making a chatbot say something strange. In workflow systems, prompt injection can change routing, alter extracted meaning, push unsafe downstream actions, or surface sensitive data in places it should never appear. Combined with excessive agency, where the system can take action too freely, the operational consequences get serious fast.
The diligence question is not "are you aware of prompt injection." Every vendor will say yes. The real question is how they constrain it in the product they are selling you. Do they isolate tool access by workflow? Do they sanitize or filter untrusted context sources? Do they limit what the model can trigger directly? Do they monitor for anomalous input and output patterns? Do they require explicit human approval before higher-risk actions?
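One concrete containment pattern behind those questions is a per-workflow tool allowlist with an approval gate on higher-risk actions. The sketch below is an assumption about how such scoping could look, not any vendor's implementation:

```python
# Per-workflow tool scopes: the model can only invoke what the workflow
# explicitly grants, and risky tools always gate on human approval.
# All names here are illustrative.
TOOL_SCOPES = {
    "support_triage": {"classify", "draft_reply"},
    "refund_workflow": {"classify", "issue_refund"},
}
NEEDS_APPROVAL = {"issue_refund"}

def invoke_tool(workflow, tool, approved=False):
    if tool not in TOOL_SCOPES.get(workflow, set()):
        raise PermissionError(f"{tool!r} is out of scope for {workflow!r}")
    if tool in NEEDS_APPROVAL and not approved:
        return {"state": "queued_for_approval", "tool": tool}
    return {"state": "executed", "tool": tool}

# A poisoned prompt that tricks support triage into requesting a refund
# hits the scope check, not the payment system.
try:
    invoke_tool("support_triage", "issue_refund")
except PermissionError as exc:
    blocked = str(exc)
```

The point is that containment lives in the authority model, not in the prompt: even a fully compromised model output cannot reach a tool the workflow never granted.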
Ask them to show how they would contain a poisoned input in your use case. Ask what the fallback behavior is when the system detects ambiguity. Ask whether they can disable certain actions while still keeping the rest of the workflow available. This is often where the operational maturity gap becomes obvious.
Vendors who have thought deeply about agentic or action-oriented systems will usually answer with boundaries, scopes, and failure modes. Vendors who have mostly optimized the happy path will answer with confidence scores and general assurances.
Ask what rollback means in practice, not in theory
Rollback is one of the most revealing due diligence questions because weak systems rarely have a good answer.
A vendor may say they support rollback, but buyers need to unpack what that means. Can you roll back a prompt version without rolling back the entire workflow? Can you pause one risky branch while leaving low-risk tasks running? Can you revert an approval threshold change? If a model update degrades extraction quality, can you restore the previous behavior quickly enough to protect operations? If an integration breaks, does work queue safely for human review or disappear into retry loops?
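Granular rollback implies that each moving part keeps its own version history, so reverting a prompt does not force reverting thresholds or integrations. A minimal sketch of that separation, with hypothetical names:

```python
class Versioned:
    """Illustrative per-component version history with one-step rollback."""
    def __init__(self, initial):
        self.history = [initial]

    @property
    def current(self):
        return self.history[-1]

    def release(self, value):
        self.history.append(value)

    def rollback(self):
        # Keep at least the initial version; revert only this component.
        if len(self.history) > 1:
            self.history.pop()
        return self.current

prompt = Versioned("extract-v1")
threshold = Versioned(0.85)

prompt.release("extract-v2")  # new prompt degrades extraction quality
prompt.rollback()             # revert the prompt alone; threshold untouched
```

If a vendor's answer implies the only rollback unit is "the whole deployment," every small fix becomes an operational event.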
This is where NIST's govern-map-measure-manage framing becomes practical again. Rollback belongs inside manage, not as an afterthought. A good vendor should be able to explain the trigger for rollback, the owner responsible, the evidence used to make the decision, and the post-change verification steps.
Weak rollback stories usually sound like one of two things. Either the vendor says their system should not need rollback because they test thoroughly, or they imply rollback is effectively a support ticket their own team handles behind the scenes. Neither answer is good enough for an operational buyer.
You want a vendor that treats rollback as a normal production capability, not an embarrassing exception. That mindset is closely related to how mature they will be when something actually goes wrong.
Ask how releases are monitored after go-live
A surprising number of diligence conversations stop once the buyer understands configuration. They never get to the question of what happens after release.
That is risky because many AI automation failures are not obvious on day one. They show up as slow drift. Approval queues increase quietly. Manual overrides climb. Extraction accuracy falls for one document subtype. Reviewers start compensating for low-quality output without escalating because they are trying to keep work moving. By the time leadership notices, the workflow is already carrying hidden cost.
The vendor should be able to explain how they monitor post-release behavior and how customers participate in that process. What signals do they recommend watching in the first week after a prompt or model change? Can they isolate issues by workflow, customer segment, or data subtype? How do they distinguish model problems from integration problems or policy problems?
This is where strong vendors start sounding like operators instead of software sales teams. They talk about exception rates, approval volume, manual rework, rollback triggers, and release verification. Weak vendors stay at the level of usage dashboards.
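Those operator-grade signals are simple to compute once they are captured. A hedged sketch of a drift check against a pre-release baseline (metric names and the tolerance are illustrative assumptions):

```python
def drift_signals(baseline, current, tolerance=0.25):
    """Flag metrics that moved more than `tolerance` (here 25%) against
    their pre-release baseline. Metric names are illustrative."""
    flags = []
    for metric, base in baseline.items():
        now = current.get(metric, base)
        if base and abs(now - base) / base > tolerance:
            flags.append((metric, base, now))
    return flags

baseline = {"exception_rate": 0.04, "manual_override_rate": 0.02,
            "approval_queue_depth": 18}
week_one = {"exception_rate": 0.045, "manual_override_rate": 0.06,
            "approval_queue_depth": 21}

# Manual overrides tripled while headline accuracy barely moved: exactly
# the kind of quiet drift reviewers absorb without escalating.
alerts = drift_signals(baseline, week_one)
```

A vendor who can feed a check like this, per workflow and per data subtype, is operating the product. One who cannot is only hosting it.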
If your team is evaluating broader rollout, connect this to your own internal standards before the contract is signed. A vendor should be able to speak clearly to the expectations you already set in your security and data handling standards, and to the way you expect AI automation projects to behave once they are live.
What strong answers sound like
By the time you get through approvals, logs, redaction, prompt-injection handling, and rollback, you are usually hearing one of two kinds of vendor narrative.
The strong narrative is concrete. It names workflow roles, policy boundaries, event fields, release processes, and failure states. It shows where humans are expected to intervene and where they are not. It distinguishes visibility from authority. It acknowledges tradeoffs. It does not pretend the system is magically safe. It shows how the vendor manages the parts that are not.
The weak narrative is abstract. It leans on trust language, generic security posture statements, and platform diagrams that make every workflow look equally manageable. It often treats governance as documentation and risk as something to be handled through best practices rather than product design.
Buyers should trust the more concrete answer even when it sounds less polished. Mature vendors rarely sound magical. They sound like people who have already been through rollout pain and designed the product so you do not have to rediscover the same lessons.
The questions that reveal whether the vendor is real
If you only get one serious diligence session with a vendor, protect time for the questions that force operational clarity. Ask them to show where human approval lives in the workflow and who can change that rule. Ask for the event trail of one completed action, including policy version, reviewer decision, and downstream effect. Ask where sensitive data is minimized before prompts, in logs, and in reviewer views. Ask how they handle prompt injection or untrusted context when the workflow can actually do things, not just generate text. Ask what rollback looks like for a bad prompt, a bad model change, and a broken integration. Ask which parts of the workflow can be paused without taking the whole process offline. Ask what your own team should monitor in the first two weeks after launch.
Those questions work because they are hard to answer with posture alone. If the vendor can answer them clearly, you are probably having a real diligence conversation. If they cannot, you are mostly evaluating presentation quality.
Run due diligence like a workflow test, not procurement theater
The most effective buying teams do one thing differently: they turn diligence into a miniature workflow review instead of a generic vendor security exchange.
They bring one real use case. They map the consequence. They ask the vendor to show the operating controls around that use case. They involve the people who will own approvals, exceptions, support, and reporting after go-live. That makes the conversation much harder to fake and much easier to evaluate.
This approach also reduces friction between technical and non-technical stakeholders. Security sees concrete controls. Operations sees practical runbooks. Legal sees policy traceability. The business sponsor sees whether the rollout will be calm or chaotic.
If you are about to buy or pilot AI automation, resist the urge to end the evaluation once the workflow looks clever. Ask how the system stays safe, legible, and reversible when the clever part stops being enough.
That is the threshold between buying an impressive demo and buying an operational system.
If you want help pressure-testing a vendor against your actual workflow, use the project brief to outline the process, data, and approval constraints you are dealing with. If you would rather start with a shorter conversation, get in touch. The right diligence work is rarely long. It is just specific, and specificity is where weak vendors usually run out of answers.