docs: cleanlab style followon (#8436)

* update-for-style

* Apply suggestions from code review

Co-authored-by: KimberlyFields <46325568+KimberlyFields@users.noreply.github.com>

* style-followon-and-remove-component

* link

* article

---------

Co-authored-by: KimberlyFields <46325568+KimberlyFields@users.noreply.github.com>
Mendon Kissling 2025-06-23 16:08:42 -04:00 committed by GitHub
commit 33aed89f81
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
2 changed files with 89 additions and 86 deletions


@@ -7,139 +7,142 @@ Unlock trustworthy Agentic, RAG, and LLM pipelines with Cleanlab's evaluation an
[Cleanlab](https://www.cleanlab.ai/) adds automation and trust to every data point going in and every prediction coming out of AI and RAG solutions.
This integration provides three Langflow components that assess and improve the trustworthiness of any LLM or RAG pipeline output.
Use the components in this bundle to quantify the trustworthiness of any LLM response with a score between `0` and `1`, and explain why a response may be good or bad. For RAG/Agentic pipelines with context, you can evaluate context sufficiency, groundedness, helpfulness, and query clarity with quantitative scores. Additionally, you can remediate low-trust responses with warnings or fallback answers.
## Prerequisites
Before using these components, you'll need:
- A [Cleanlab API key](https://tlm.cleanlab.ai/)
## CleanlabEvaluator
This component evaluates and explains the trustworthiness of a prompt and response pair using Cleanlab. For more information on how the score works, see the [Cleanlab documentation](https://help.cleanlab.ai/tlm/).
<details>
<summary>Parameters</summary>
**Inputs**
| Name | Type | Description |
|-------------------------|------------|-------------------------------------------------------------------------|
| system_prompt | Message | The system message prepended to the prompt. Optional. |
| prompt | Message | The user-facing input to the LLM. |
| response | Message | The model's response to evaluate. |
| cleanlab_api_key | Secret | Your Cleanlab API key. |
| cleanlab_evaluation_model | Dropdown | The evaluation model used by Cleanlab, such as GPT-4 or Claude. This does not need to be the same model that generated the response. |
| quality_preset | Dropdown | The tradeoff between evaluation speed and accuracy. |
**Outputs**
| Name | Type | Description |
|-------------------------|------------|-------------------------------------------------------------------------|
| score                   | number     | Displays the trust score between `0` and `1`. |
| explanation | Message | Provides an explanation of the trust score. |
| response | Message | Returns the original response for easy chaining to the `CleanlabRemediator` component. |
</details>
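The **CleanlabEvaluator** component handles this scoring inside Langflow, but if you want to reproduce a basic trust score outside a flow, the following is a minimal Python sketch. It assumes the `cleanlab-tlm` package, its `TLM` class with a `get_trustworthiness_score()` method, and the `CLEANLAB_TLM_API_KEY` environment variable; check the [Cleanlab documentation](https://help.cleanlab.ai/tlm/) for the exact names before relying on it.

```python
# Minimal sketch: score a prompt + response pair with Cleanlab TLM in Python.
# Assumptions: the `cleanlab-tlm` package, its TLM class, the
# get_trustworthiness_score() method, and the CLEANLAB_TLM_API_KEY variable.
import os

from cleanlab_tlm import TLM

os.environ["CLEANLAB_TLM_API_KEY"] = "<your-cleanlab-api-key>"

# quality_preset mirrors the component's speed/accuracy tradeoff setting.
tlm = TLM(quality_preset="medium")

prompt = "What year was the Eiffel Tower completed?"
response = "The Eiffel Tower was completed in 1887."  # deliberately wrong answer

result = tlm.get_trustworthiness_score(prompt, response)
print(result["trustworthiness_score"])  # a value between 0 and 1; low for this wrong answer
```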
## CleanlabRemediator
This component uses the trust score from the [CleanlabEvaluator](#cleanlabevaluator) component to determine whether to show, warn about, or replace an LLM response. The score threshold, warning text, and fallback message are configurable, so you can customize them as needed.
<details>
<summary>Parameters</summary>
**Inputs**
| Name | Type | Description |
|-----------------------------|------------|-------------------------------------------------------------------------|
| response | Message | The response to potentially remediate. |
| score | number | The trust score from `CleanlabEvaluator`. |
| explanation | Message | The explanation to append if a warning is shown. Optional. |
| threshold | float | The minimum trust score to pass a response unchanged. |
| show_untrustworthy_response | bool | Whether to display the original response with a warning appended when it is deemed untrustworthy. If disabled, the fallback message is shown instead. |
| untrustworthy_warning_text | Prompt | The warning text for untrustworthy responses. |
| fallback_text | Prompt | The fallback message if the response is hidden. |
**Outputs**
| Name | Type | Description |
|-------------------------|------------|-------------------------------------------------------------------------|
| remediated_response | Message | The final message shown to the user after the remediation logic is applied. |
</details>
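The remediation behavior described above is simple enough to express in a few lines. The following Python sketch mirrors it (pass the response through, warn about it, or replace it based on the trust score); the threshold and message strings here are illustrative placeholders, not the component's actual defaults.

```python
# Simplified sketch of the remediation decision: pass, warn, or replace.
# The default threshold and message templates below are illustrative only.
def remediate(response: str, score: float, explanation: str = "",
              threshold: float = 0.7,
              show_untrustworthy_response: bool = True,
              untrustworthy_warning_text: str = "WARNING: This response may be untrustworthy.",
              fallback_text: str = "I cannot provide a confident answer based on the available information.") -> str:
    # Trustworthy enough: return the response unchanged.
    if score >= threshold:
        return response
    # Untrustworthy but still shown: append the warning and optional explanation.
    if show_untrustworthy_response:
        parts = [response, untrustworthy_warning_text]
        if explanation:
            parts.append(f"Explanation: {explanation}")
        return "\n\n".join(parts)
    # Untrustworthy and hidden: replace the response with the fallback message.
    return fallback_text

print(remediate("The Eiffel Tower was completed in 1887.", score=0.09))
```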
## CleanlabRAGEvaluator
This component evaluates RAG and LLM pipeline outputs for trustworthiness, context sufficiency, response groundedness, helpfulness, and query ease. For more information about Cleanlab's evaluation metrics, see the [Cleanlab documentation](https://help.cleanlab.ai/tlm/use-cases/tlm_rag/).
Additionally, use the [CleanlabRemediator](#cleanlabremediator) component with this component to remediate low-trust responses coming from the RAG pipeline.
<details>
<summary>Parameters</summary>
**Inputs**
| Name | Type | Description |
|-----------------------------|------------|-------------------------------------------------------------------------|
| cleanlab_api_key | Secret | Your Cleanlab API key. |
| cleanlab_evaluation_model | Dropdown | The evaluation model used by Cleanlab, such as GPT-4 or Claude. This does not need to be the same model that generated the response. |
| quality_preset | Dropdown | The tradeoff between evaluation speed and accuracy. |
| context | Message | The retrieved context from your RAG system. |
| query | Message | The original user query. |
| response | Message | The model's response based on the context and query. |
| run_context_sufficiency | bool | Evaluate whether context supports answering the query. |
| run_response_groundedness | bool | Evaluate whether the response is grounded in the context. |
| run_response_helpfulness | bool | Evaluate how helpful the response is. |
| run_query_ease | bool | Evaluate if the query is vague, complex, or adversarial. |
**Outputs**
| Name | Type | Description |
|-------------------------|------------|-------------------------------------------------------------------------|
| trust_score | number | The overall trust score. |
| trust_explanation | Message | The explanation for the trust score. |
| other_scores | dict | A dictionary containing the scores for any optional RAG evaluation metrics that are enabled. |
| evaluation_summary | Message | A Markdown summary of query, context, response, and evaluation results. |
</details>
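To reproduce a basic version of this check outside Langflow, one approach is to fold the retrieved context into the prompt and score the final answer with Cleanlab TLM. This sketch reuses the assumed `cleanlab-tlm` API from the CleanlabEvaluator example and does not reproduce the component's additional metrics such as context sufficiency, groundedness, helpfulness, or query ease.

```python
# Sketch: score a RAG triplet (context, query, response) for trustworthiness.
# Assumption: the `cleanlab-tlm` package and TLM.get_trustworthiness_score(),
# as in the CleanlabEvaluator sketch above.
from cleanlab_tlm import TLM

context = "The Eiffel Tower was completed in 1889 for the Exposition Universelle."
query = "When was the Eiffel Tower completed?"
response = "The Eiffel Tower was completed in 1889."

# Include the retrieved context in the prompt so the evaluation sees what the LLM saw.
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}"
)

tlm = TLM()
result = tlm.get_trustworthiness_score(prompt, response)
print(f"trust_score: {result['trustworthiness_score']:.3f}")
```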
## Cleanlab component example flows
The following example flows show how to use the **CleanlabEvaluator** and **CleanlabRemediator** components to evaluate and remediate responses from any LLM, and how to use the **CleanlabRAGEvaluator** component to evaluate RAG pipeline outputs.
### Evaluate and remediate responses from an LLM
:::tip
Optionally, [Download](./eval_and_remediate_cleanlab.json) the Evaluate and Remediate flow and follow along.
:::
This flow evaluates and remediates the trustworthiness of a response from any LLM using the **CleanlabEvaluator** and **CleanlabRemediator** components.
![Evaluate response trustworthiness](./eval_response.png)
Connect the `Message` output from any LLM component to the `response` input of the **CleanlabEvaluator** component, and then connect the Prompt component to its `prompt` input.
The **CleanlabEvaluator** component returns a trust score and explanation, which you can use anywhere in the flow.
The **CleanlabRemediator** component uses this trust score to determine whether to output the original response, warn about it, or replace it with a fallback answer.
This example shows a response that was determined to be untrustworthy (a score of `.09`) and flagged with a warning by the **CleanlabRemediator** component.
![CleanlabRemediator Example](./cleanlab_remediator_example.png)
To hide untrustworthy responses, configure the **CleanlabRemediator** component to replace the response with a fallback message.
![CleanlabRemediator Example](./cleanlab_remediator_example_fallback.png)
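If you want to call the finished flow from your own code rather than the Langflow UI, the following is a rough sketch using Langflow's run API from Python. The server URL, flow ID, API key, and the exact structure of the returned JSON are placeholders that depend on your installation, so adjust them to match your setup.

```python
# Sketch: trigger the Evaluate and Remediate flow through the Langflow run API.
# LANGFLOW_URL, FLOW_ID, and API_KEY are placeholders for your own deployment.
import requests

LANGFLOW_URL = "http://localhost:7860"  # assumed default local Langflow address
FLOW_ID = "eval-and-remediate"          # hypothetical flow ID or endpoint name
API_KEY = "<your-langflow-api-key>"

payload = {
    "input_value": "When was the Eiffel Tower completed?",
    "input_type": "chat",
    "output_type": "chat",
}

resp = requests.post(
    f"{LANGFLOW_URL}/api/v1/run/{FLOW_ID}",
    json=payload,
    headers={"x-api-key": API_KEY},
    timeout=120,
)
resp.raise_for_status()
# The remediated response is nested inside the run output; inspect the JSON to locate it.
print(resp.json())
```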
### Evaluate RAG pipeline
This example flow includes the [Vector Store RAG](/vector-store-rag) template with the **CleanlabRAGEvaluator** component added to evaluate the flow's context, query, and response.
To use the **CleanlabRAGEvaluator** component in a flow, connect the `context`, `query`, and `response` outputs from any RAG pipeline to the **CleanlabRAGEvaluator** component.
![Evaluate RAG pipeline](./eval_rag.png)
Here is an example of the `Evaluation Summary` output from the **CleanlabRAGEvaluator** component.
![Evaluate RAG pipeline](./eval_summary_rag.png)
The `Evaluation Summary` includes the query, context, response, and all evaluation results. In this example, the `Context Sufficiency` and `Response Groundedness` scores are low (`0.002`) because the context doesn't contain information about the query, and the response is not grounded in the context.
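Downstream of the **CleanlabRAGEvaluator** component, one simple way to act on these outputs is to gate the answer on the trust score and on whichever optional metrics you enabled. The score values and the `0.5` cutoff in this sketch are illustrative only.

```python
# Illustrative only: gate a RAG answer on the CleanlabRAGEvaluator outputs.
# The score values and the 0.5 cutoff below are made up for the example.
trust_score = 0.31
other_scores = {"context_sufficiency": 0.002, "response_groundedness": 0.002}

THRESHOLD = 0.5
failing = {name: score for name, score in other_scores.items() if score < THRESHOLD}

if trust_score < THRESHOLD or failing:
    print(f"Low-confidence answer. Failing checks: {failing or 'trust score only'}")
else:
    print("Answer passed all enabled checks.")
```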


@@ -240,13 +240,13 @@ module.exports = {
},
{
type: "doc",
id: "Integrations/Composio/integrations-composio",
label: "Composio",
id: "Integrations/Cleanlab/integrations-cleanlab",
label: "Cleanlab",
},
{
type: "doc",
id: "Integrations/Cleanlab/integrations-cleanlab",
label: "Cleanlab",
id: "Integrations/Composio/integrations-composio",
label: "Composio",
},
{
type: 'category',