evaluator

The Evaluator block uses AI to score and assess content quality based on metrics you define. Perfect for quality control, A/B testing, and ensuring your AI outputs meet specific standards.

What You Can Evaluate

  • AI-Generated Content: Score chatbot responses, generated articles, or marketing copy

  • User Input: Evaluate customer feedback, survey responses, or form submissions

  • Content Quality: Assess clarity, accuracy, relevance, and tone

  • Performance Metrics: Track improvements over time with consistent scoring

  • A/B Testing: Compare different approaches with objective metrics

Configuration Options

Evaluation Metrics

Define custom metrics to evaluate content against. Each metric includes:

  • Name: A short identifier for the metric

  • Description: A detailed explanation of what the metric measures

  • Range: The numeric range for scoring (e.g., 1-5, 0-10)

Example metrics:

Accuracy (1-5): How factually accurate is the content?
Clarity (1-5): How clear and understandable is the content?
Relevance (1-5): How relevant is the content to the original query?

Content

The content to be evaluated. This can be:

  • Directly provided in the block configuration

  • Connected from another block's output (typically an Agent block)

  • Dynamically generated during workflow execution

Model Selection

Choose an AI model to perform the evaluation:

  • OpenAI: GPT-4o, o1, o3, o4-mini, gpt-4.1

  • Anthropic: Claude 3.7 Sonnet

  • Google: Gemini 2.5 Pro, Gemini 2.0 Flash

  • Other Providers: Groq, Cerebras, xAI, DeepSeek

  • Local Models: Any model running on Ollama

Recommendation: Use models with strong reasoning capabilities like GPT-4o or Claude 3.7 Sonnet for more accurate evaluations.

API Key

Your API key for the selected LLM provider. This is securely stored and used for authentication.

How It Works

1

The Evaluator block takes the provided content and your custom metrics

2

It generates a specialized prompt that instructs the LLM to evaluate the content

3

The prompt includes clear guidelines on how to score each metric

4

The LLM evaluates the content and returns numeric scores for each metric

5

The Evaluator block formats these scores as structured output for use in your workflow

Inputs and Outputs

Inputs

  • Content: The text or structured data to evaluate

  • Metrics: Custom evaluation criteria with scoring ranges

  • Model Settings: LLM provider and parameters

Outputs

  • Content: A summary of the evaluation

  • Model: The model used for evaluation

  • Tokens: Usage statistics

  • Metric Scores: Numeric scores for each defined metric

Example Usage

Here's an example of how an Evaluator block might be configured for assessing customer service responses:

Best Practices

  • Use specific metric descriptions: Clearly define what each metric measures to get more accurate evaluations

  • Choose appropriate ranges: Select scoring ranges that provide enough granularity without being overly complex

  • Connect with Agent blocks: Use Evaluator blocks to assess Agent block outputs and create feedback loops

  • Use consistent metrics: For comparative analysis, maintain consistent metrics across similar evaluations

  • Combine multiple metrics: Use several metrics to get a comprehensive evaluation

Was this helpful?