> For the complete documentation index, see [llms.txt](https://whitepaper.aitech.io/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://whitepaper.aitech.io/agentforge/blocks/evaluator.md).

# evaluator

The Evaluator block uses AI to score and assess content quality based on metrics you define. Perfect for quality control, A/B testing, and ensuring your AI outputs meet specific standards.

## What You Can Evaluate

* **AI-Generated Content**: Score chatbot responses, generated articles, or marketing copy
* **User Input**: Evaluate customer feedback, survey responses, or form submissions
* **Content Quality**: Assess clarity, accuracy, relevance, and tone
* **Performance Metrics**: Track improvements over time with consistent scoring
* **A/B Testing**: Compare different approaches with objective metrics

## Configuration Options

### Evaluation Metrics

Define custom metrics to evaluate content against. Each metric includes:

* **Name**: A short identifier for the metric
* **Description**: A detailed explanation of what the metric measures
* **Range**: The numeric range for scoring (e.g., 1-5, 0-10)

Example metrics:

```
Accuracy (1-5): How factually accurate is the content?
Clarity (1-5): How clear and understandable is the content?
Relevance (1-5): How relevant is the content to the original query?
```

### Content

The content to be evaluated. This can be:

* Directly provided in the block configuration
* Connected from another block's output (typically an Agent block)
* Dynamically generated during workflow execution

### Model Selection

Choose an AI model to perform the evaluation:

* **OpenAI**: GPT-4o, o1, o3, o4-mini, gpt-4.1
* **Anthropic**: Claude 3.7 Sonnet
* **Google**: Gemini 2.5 Pro, Gemini 2.0 Flash
* **Other Providers**: Groq, Cerebras, xAI, DeepSeek
* **Local Models**: Any model running on Ollama

**Recommendation**: Use models with strong reasoning capabilities like GPT-4o or Claude 3.7 Sonnet for more accurate evaluations.

### API Key

Your API key for the selected LLM provider. This is securely stored and used for authentication.

## How It Works

{% stepper %}
{% step %}

### The Evaluator block takes the provided content and your custom metrics

{% endstep %}

{% step %}

### It generates a specialized prompt that instructs the LLM to evaluate the content

{% endstep %}

{% step %}

### The prompt includes clear guidelines on how to score each metric

{% endstep %}

{% step %}

### The LLM evaluates the content and returns numeric scores for each metric

{% endstep %}

{% step %}

### The Evaluator block formats these scores as structured output for use in your workflow

{% endstep %}
{% endstepper %}

## Inputs and Outputs

### Inputs

* **Content**: The text or structured data to evaluate
* **Metrics**: Custom evaluation criteria with scoring ranges
* **Model Settings**: LLM provider and parameters

### Outputs

* **Content**: A summary of the evaluation
* **Model**: The model used for evaluation
* **Tokens**: Usage statistics
* **Metric Scores**: Numeric scores for each defined metric

## Example Usage

Here's an example of how an Evaluator block might be configured for assessing customer service responses:

```yaml
# Example Evaluator Configuration
metrics:
  - name: Empathy
    description: How well does the response acknowledge and address the customer's emotional state?
    range:
      min: 1
      max: 5
  - name: Solution
    description: How effectively does the response solve the customer's problem?
    range:
      min: 1
      max: 5
  - name: Clarity
    description: How clear and easy to understand is the response?
    range:
      min: 1
      max: 5

model: Anthropic/claude-3-opus
```

## Best Practices

* **Use specific metric descriptions**: Clearly define what each metric measures to get more accurate evaluations
* **Choose appropriate ranges**: Select scoring ranges that provide enough granularity without being overly complex
* **Connect with Agent blocks**: Use Evaluator blocks to assess Agent block outputs and create feedback loops
* **Use consistent metrics**: For comparative analysis, maintain consistent metrics across similar evaluations
* **Combine multiple metrics**: Use several metrics to get a comprehensive evaluation


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://whitepaper.aitech.io/agentforge/blocks/evaluator.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
