
Model Evaluation

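Testing an AI application requires evaluating the generated content to make sure the AI model has not produced a hallucinated response.

One way to evaluate a response is to use an AI model itself as the evaluator. Choose the AI model best suited to the evaluation, which may differ from the model that generated the response.

The Spring AI interface for evaluating responses is Evaluator, defined as: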

@FunctionalInterface
public interface Evaluator {
    EvaluationResponse evaluate(EvaluationRequest evaluationRequest);
}

The input to the evaluation is an EvaluationRequest, defined as:

public class EvaluationRequest {

    private final String userText;

    private final List<Content> dataList;

    private final String responseContent;

    public EvaluationRequest(String userText, List<Content> dataList, String responseContent) {
        this.userText = userText;
        this.dataList = dataList;
        this.responseContent = responseContent;
    }

    ...
}
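  • userText: the raw input from the user, as a String

  • dataList: contextual data, such as data from Retrieval Augmented Generation (RAG), appended to the raw input

  • responseContent: the AI model's response content, as a String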
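
Because Evaluator is a functional interface, an evaluator can in principle be supplied as a lambda. The sketch below is not Spring AI code: SketchRequest, SketchResponse and SketchEvaluator are simplified stand-ins (assumptions) that mirror the shapes described above so the example compiles with the JDK alone, and the containment rule is a toy check chosen only for illustration.

```java
import java.util.List;
import java.util.Locale;

// Simplified stand-ins for the Spring AI types described above (assumptions, not the real classes).
record SketchRequest(String userText, List<String> dataList, String responseContent) {}

record SketchResponse(boolean pass, float score, String feedback) {}

@FunctionalInterface
interface SketchEvaluator {
    SketchResponse evaluate(SketchRequest request);
}

class LambdaEvaluatorSketch {

    // Toy rule: the response passes only if it appears (ignoring case) inside some context entry.
    static final SketchEvaluator CONTAINMENT = request -> {
        String needle = request.responseContent().toLowerCase(Locale.ROOT);
        boolean found = request.dataList().stream()
                .anyMatch(entry -> entry.toLowerCase(Locale.ROOT).contains(needle));
        return new SketchResponse(found, found ? 1.0f : 0.0f, "");
    };
}
```

A real evaluator would usually delegate the judgement to a model rather than to string matching; the point here is only the shape of the contract: one request in, one response out.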

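Relevancy Evaluator

RelevancyEvaluator is an implementation of the Evaluator interface designed to assess how relevant an AI-generated response is to the provided context. It helps assess the quality of a RAG flow by determining whether the AI model's response is relevant to the user's input and to the retrieved context.

The evaluation is based on the user input, the AI model's response, and the context information. It uses a prompt template to ask the AI model whether the response is relevant to the user input and the context.

This is the default prompt template used by RelevancyEvaluator: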

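Your task is to evaluate if the response for the query
is in line with the context information provided.

You have two options to answer. Either YES or NO.

Answer YES, if the response for the query
is in line with context information otherwise NO.

Query:
{query}

Response:
{response}

Context:
{context}

Answer: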

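Note: you can customize the prompt template by supplying your own PromptTemplate object through the .promptTemplate() builder method.

Usage in Integration Tests

The following example uses RelevancyEvaluator in an integration test to validate the result of a RAG flow built with RetrievalAugmentationAdvisor: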

@Test
void evaluateRelevancy() {
    String question = "Where does the adventure of Anacletus and Birba take place?";

    RetrievalAugmentationAdvisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
        .documentRetriever(VectorStoreDocumentRetriever.builder()
            .vectorStore(pgVectorStore)
            .build())
        .build();

    ChatResponse chatResponse = ChatClient.builder(chatModel).build()
        .prompt(question)
        .advisors(ragAdvisor)
        .call()
        .chatResponse();

    EvaluationRequest evaluationRequest = new EvaluationRequest(
        // The original user question
        question,
        // The context retrieved by the RAG flow
        chatResponse.getMetadata().get(RetrievalAugmentationAdvisor.DOCUMENT_CONTEXT),
        // The AI model's response
        chatResponse.getResult().getOutput().getText()
    );

    RelevancyEvaluator evaluator = new RelevancyEvaluator(ChatClient.builder(chatModel));

    EvaluationResponse evaluationResponse = evaluator.evaluate(evaluationRequest);

    assertThat(evaluationResponse.isPass()).isTrue();
}

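The Spring AI project contains several integration tests that use RelevancyEvaluator to test the functionality of QuestionAnswerAdvisor and RetrievalAugmentationAdvisor.

Custom Template

RelevancyEvaluator uses a default template to prompt the AI model for the evaluation. You can customize this behavior by supplying your own PromptTemplate object through the .promptTemplate() builder method.

A custom PromptTemplate can use any TemplateRenderer implementation (by default it uses StPromptTemplate, which is based on the StringTemplate engine). The important requirement is that the template contains the following placeholders: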

  • a query placeholder to receive the user question

  • a response placeholder to receive the AI model's response

  • a context placeholder to receive the context information
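
Before wiring a custom template into the evaluator, it can help to check that all three placeholders are present. The following is a minimal sketch using only the JDK; TemplatePlaceholderSketch and its methods are hypothetical helpers, and the render method is a naive stand-in for a real TemplateRenderer, assuming the default {name} delimiters.

```java
import java.util.List;
import java.util.Map;

class TemplatePlaceholderSketch {

    // Placeholder names the text above lists as required.
    static final List<String> REQUIRED = List.of("query", "response", "context");

    // Returns the required placeholders that the template text does not contain.
    static List<String> missingPlaceholders(String template) {
        return REQUIRED.stream()
                .filter(name -> !template.contains("{" + name + "}"))
                .toList();
    }

    // Naive substitution of {name} tokens; values are assumed not to contain placeholder tokens themselves.
    static String render(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            result = result.replace("{" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }
}
```

A test could call missingPlaceholders on the custom template text and fail fast when the returned list is not empty.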

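Fact-Checking Evaluator

FactCheckingEvaluator is another implementation of the Evaluator interface, designed to assess the factual accuracy of an AI-generated response against the provided context. It helps detect and reduce hallucinations in AI output by verifying whether a given statement (the claim) is logically supported by the provided context (the document).

The claim and the document are presented to the AI model for evaluation. Smaller, more efficient AI models dedicated to this purpose are available, such as Bespoke's Minicheck, which helps reduce the cost of these checks compared with flagship models such as GPT-4. Minicheck is also available through Ollama.

Usage

The FactCheckingEvaluator constructor takes a ChatClient.Builder as a parameter: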

public FactCheckingEvaluator(ChatClient.Builder chatClientBuilder) {
    this.chatClientBuilder = chatClientBuilder;
}

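The evaluator uses the following prompt template for fact checking:

Document: {document}
Claim: {claim}

Here {document} is the context information and {claim} is the AI model's response to be evaluated.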
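
As a rough illustration of how the two placeholders might be filled, the sketch below joins several context passages into one document string and builds the prompt text. FactCheckPromptSketch is a hypothetical helper, and joining non-blank passages with newlines is an assumption made for this example, not a statement about how the library assembles its context.

```java
import java.util.List;
import java.util.stream.Collectors;

class FactCheckPromptSketch {

    // Joins non-blank passages with newlines (an assumed joining rule) and fills in document and claim.
    static String buildPrompt(List<String> passages, String claim) {
        String document = passages.stream()
                .filter(p -> p != null && !p.isBlank())
                .collect(Collectors.joining("\n"));
        return "Document: %s\nClaim: %s".formatted(document, claim);
    }
}
```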

Example

The following example uses FactCheckingEvaluator with an Ollama-based ChatModel, specifically the Bespoke-Minicheck model:

@Test
void testFactChecking() {
    // Set up the Ollama API
    OllamaApi ollamaApi = new OllamaApi("http://localhost:11434");

    ChatModel chatModel = new OllamaChatModel(ollamaApi,
        OllamaOptions.builder().model(BESPOKE_MINICHECK).numPredict(2).temperature(0.0d).build());

    // Create the FactCheckingEvaluator
    var factCheckingEvaluator = new FactCheckingEvaluator(ChatClient.builder(chatModel));

    // Example context and claim
    String context = "The Earth is the third planet from the Sun and the only planet known to harbor life.";
    String claim = "The Earth is the fourth planet from the Sun.";

    // Create an EvaluationRequest
    EvaluationRequest evaluationRequest = new EvaluationRequest(context, Collections.emptyList(), claim);

    // Perform the evaluation
    EvaluationResponse evaluationResponse = factCheckingEvaluator.evaluate(evaluationRequest);

    assertFalse(evaluationResponse.isPass(), "The claim should not be supported by the context");
}

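Spring AI Alibaba Implementation

LaajEvaluator (an Evaluator implementation)

LaajEvaluator implements the Evaluator interface and adds the three fields below, which gives its concrete subclasses a common contract: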

public abstract class LaajEvaluator implements Evaluator {

    // Chat client builder
    private ChatClient.Builder chatClientBuilder;

    // Text of the evaluation prompt
    private String evaluationPromptText;

    // Handles serialization
    private ObjectMapper objectMapper;
}

AnswerCorrectnessEvaluator

Source Code Analysis

AnswerCorrectnessEvaluator evaluates whether the Response returned for a Query is consistent with the supplied Context.

public class AnswerCorrectnessEvaluator extends LaajEvaluator {

    // Default prompt, shown here in English translation of the Chinese text in the source
    private static final String DEFAULT_EVALUATION_PROMPT_TEXT = """
            Your task is to evaluate whether the Response returned for the Query is consistent with the provided Context.
            You have two options to answer: either "YES" or "NO".
            Answer "YES" if the response to the query is consistent with the context information, otherwise answer "NO".

            Query: {query}
            Response: {response}
            Context: {context}

            Answer: "
            """;

    @Override
    public EvaluationResponse evaluate(EvaluationRequest evaluationRequest) {
        // Validate the parameter
        if (evaluationRequest == null) {
            throw new IllegalArgumentException("EvaluationRequest must not be null");
        }

        // Extract the response and the context
        var response = doGetResponse(evaluationRequest);
        var context = doGetSupportingData(evaluationRequest);

        // Build the chat client, fill the evaluation prompt with the query, the response
        // under evaluation and the context, then run the evaluation
        String evaluationResponse = getChatClientBuilder().build()
            .prompt()
            .user(userSpec -> userSpec.text(getEvaluationPromptText())
                .param("query", evaluationRequest.getUserText())
                .param("response", response)
                .param("context", context))
            .call()
            .content();

        // Interpret the verdict
        boolean passing = false;
        float score = 0;
        if (evaluationResponse.toUpperCase().contains("YES")) {
            passing = true;
            score = 1;
        }

        return new EvaluationResponse(passing, score, "", Collections.emptyMap());
    }
}
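
The check evaluationResponse.toUpperCase().contains("YES") accepts any reply that contains the letters YES anywhere, so a reply such as "NO, this is not a YES" would count as passing. A stricter alternative, sketched here under the assumption that the model states its verdict first, reads only the leading word. VerdictParser and firstWordIsYes are hypothetical names, not part of Spring AI Alibaba.

```java
import java.util.Locale;

class VerdictParser {

    // Returns true only when the first alphabetic word of the reply is YES (case-insensitive).
    static boolean firstWordIsYes(String reply) {
        if (reply == null) {
            return false;
        }
        // Drop anything before the first ASCII letter (quotes, whitespace), then cut at the first non-letter.
        String trimmed = reply.replaceFirst("^[^A-Za-z]*", "");
        String firstWord = trimmed.split("[^A-Za-z]", 2)[0];
        return firstWord.toUpperCase(Locale.ROOT).equals("YES");
    }
}
```

Using Locale.ROOT keeps the upper-casing independent of the JVM's default locale. Whether the stricter rule is preferable depends on how the judge model tends to phrase its replies.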
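Test Code

The following tests exercise AnswerCorrectnessEvaluator against a ChatClient mocked with Mockito. The createEvaluationRequest helper they call is not shown in this excerpt.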

class AnswerCorrectnessEvaluatorTests {

    // Test constants
    // Test question
    private static final String TEST_QUERY = "What is Spring AI?";

    // LLM client's response to the test question
    private static final String TEST_RESPONSE = "Spring AI is a framework for building AI applications.";

    // Correct answer for the test question
    private static final String TEST_CONTEXT = "Spring AI is a framework for building AI applications.";

    private static final String CUSTOM_PROMPT = "Custom evaluation prompt text";

    private ChatClient chatClient;

    private ChatClient.Builder chatClientBuilder;

    private AnswerCorrectnessEvaluator evaluator;

    // Initialization that runs before each test method
    @BeforeEach
    void setUp() {
        // Initialize mocks and evaluator
        chatClient = Mockito.mock(ChatClient.class);
        chatClientBuilder = Mockito.mock(ChatClient.Builder.class);
        when(chatClientBuilder.build()).thenReturn(chatClient);
        evaluator = new AnswerCorrectnessEvaluator(chatClientBuilder);
    }

    /**
     * Helper method to mock chat client response
     */
    private void mockChatResponse(String content) {
        ChatClient.ChatClientRequestSpec requestSpec = Mockito.mock(ChatClient.ChatClientRequestSpec.class);
        ChatClient.CallResponseSpec responseSpec = Mockito.mock(ChatClient.CallResponseSpec.class);

        // Mock the chain of method calls
        when(chatClient.prompt()).thenReturn(requestSpec);
        when(requestSpec.user(any(Consumer.class))).thenReturn(requestSpec);
        when(requestSpec.call()).thenReturn(responseSpec);
        when(responseSpec.content()).thenReturn(content);
    }

    /**
     * Test evaluation when the answer is correct according to the context. Should return
     * a passing evaluation with score 1.0.
     */
    @Test
    void testEvaluateCorrectAnswer() {
        // Mock chat client to return "YES" for correct answer
        mockChatResponse("YES");

        // Create evaluation request with matching response and context
        EvaluationRequest request = createEvaluationRequest(TEST_QUERY, TEST_RESPONSE, TEST_CONTEXT);

        // Evaluate and verify
        EvaluationResponse response = evaluator.evaluate(request);
        assertThat(response.getScore()).isEqualTo(1.0f);
    }

    /**
     * Test evaluation when the answer is incorrect or inconsistent with the context.
     * Should return a failing evaluation with score 0.0.
     */
    @Test
    void testEvaluateIncorrectAnswer() {
        // Mock chat client to return "NO" for incorrect answer
        mockChatResponse("NO");

        // Create evaluation request with incorrect response
        EvaluationRequest request = createEvaluationRequest(TEST_QUERY, "Spring AI is a database management system.",
                TEST_CONTEXT);

        // Evaluate and verify
        EvaluationResponse response = evaluator.evaluate(request);
        assertThat(response.getScore()).isEqualTo(0.0f);
    }
}

AnswerFaithfulnessEvaluator

Source Code Overview

AnswerFaithfulnessEvaluator is another LaajEvaluator implementation. It scores a STUDENT ANSWER against a set of FACTS using a predefined rubric and expects the result as JSON.

public class AnswerFaithfulnessEvaluator extends LaajEvaluator {

    // Default prompt, shown here in English translation of the Chinese text in the source
    private static final String DEFAULT_EVALUATION_PROMPT_TEXT = """
            You are an evaluation expert who scores content against the scoring criteria and information provided.
            You will be given some FACTS and a STUDENT ANSWER.

            Scoring criteria:
            (1) Ensure that the STUDENT ANSWER is grounded in the FACTS and is not fabricated.
            (2) Ensure that the STUDENT ANSWER contains no false information beyond the scope of the FACTS.

            Score:
            A score of 1 means the STUDENT ANSWER meets all criteria. This is the highest (best) score.
            A score of 0 means the STUDENT ANSWER does not meet all criteria. This is the lowest score.

            Explain your reasoning step by step to make sure your reasoning and conclusion are correct; avoid simply stating the correct answer.

            Output the final answer in standard JSON format, without Markdown formatting, for example:
            \\{"score": 0.7, "feedback": "The STUDENT ANSWER goes beyond the facts given in FACTS."\\}

            FACTS: {context}
            STUDENT ANSWER: {student_answer}
            """;

    @Override
    public EvaluationResponse evaluate(EvaluationRequest evaluationRequest) {
        // Validate the parameter
        if (evaluationRequest == null) {
            throw new IllegalArgumentException("EvaluationRequest must not be null");
        }

        // Extract the response and the context
        var response = doGetResponse(evaluationRequest);
        var context = doGetSupportingData(evaluationRequest);

        // Build the chat client, fill the evaluation prompt with the student answer (response)
        // and the context, then run the evaluation
        String llmEvaluationResponse = getChatClientBuilder().build()
            .prompt()
            .user(userSpec -> userSpec.text(getEvaluationPromptText())
                .param("context", context)
                .param("student_answer", response))
            .call()
            .content();

        // Parse the evaluation result as JSON
        JsonNode evaluationResponse = null;
        try {
            evaluationResponse = getObjectMapper().readTree(llmEvaluationResponse);
        } catch (JsonProcessingException e) {
            throw new RuntimeException(e);
        }

        // Read the score and the feedback from the reply and derive the pass/fail flag
        float score = (float) evaluationResponse.get("score").asDouble();
        String feedback = evaluationResponse.get("feedback").asText();
        boolean passing = score > 0;

        // Wrap the result and return it
        return new EvaluationResponse(passing, score, feedback, Collections.emptyMap());
    }
}
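
readTree throws when the reply is not pure JSON, for instance when the model wraps the object in a Markdown code fence or writes its reasoning before it, which the prompt's request for step-by-step reasoning makes plausible. One possible pre-processing step, sketched with JDK string operations only, keeps the text between the first opening brace and the last closing brace before handing it to the ObjectMapper. JsonReplyExtractor is a hypothetical helper, not part of the library, and the approach assumes the reply contains exactly one JSON object and no other braces.

```java
import java.util.Optional;

class JsonReplyExtractor {

    // Returns the substring from the first '{' to the last '}', or empty if no such pair exists.
    static Optional<String> extractObject(String reply) {
        if (reply == null) {
            return Optional.empty();
        }
        int start = reply.indexOf('{');
        int end = reply.lastIndexOf('}');
        if (start < 0 || end < start) {
            return Optional.empty();
        }
        return Optional.of(reply.substring(start, end + 1));
    }
}
```

An empty result could then be mapped to a failing EvaluationResponse with explanatory feedback instead of an unchecked exception, if that suits the calling test better.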
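Test Code

The following tests exercise AnswerFaithfulnessEvaluator against a ChatClient mocked with Mockito. The createEvaluationRequest helper they call is not shown in this excerpt.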

class AnswerFaithfulnessEvaluatorTests {

    // Test constants
    // Reference facts (the correct answer)
    private static final String TEST_FACTS = "The Earth is the third planet from the Sun and the only astronomical object known to harbor life.";

    // LLM client's response under test
    private static final String TEST_STUDENT_ANSWER = "The Earth is the third planet from the Sun and supports life.";

    private static final String CUSTOM_PROMPT = "Custom evaluation prompt text";

    private ChatClient chatClient;

    private ChatClient.Builder chatClientBuilder;

    private AnswerFaithfulnessEvaluator evaluator;

    // Initialization that runs before each test method
    @BeforeEach
    void setUp() {
        // Initialize mocks and evaluator
        chatClient = Mockito.mock(ChatClient.class);
        chatClientBuilder = Mockito.mock(ChatClient.Builder.class);
        when(chatClientBuilder.build()).thenReturn(chatClient);

        // Initialize evaluator with ObjectMapper
        ObjectMapper objectMapper = new ObjectMapper();
        evaluator = new AnswerFaithfulnessEvaluator(chatClientBuilder, objectMapper);
    }

    /**
     * Helper method to mock chat client response
     */
    private void mockChatResponse(String content) {
        ChatClient.ChatClientRequestSpec requestSpec = Mockito.mock(ChatClient.ChatClientRequestSpec.class);
        ChatClient.CallResponseSpec responseSpec = Mockito.mock(ChatClient.CallResponseSpec.class);

        // Mock the chain of method calls
        when(chatClient.prompt()).thenReturn(requestSpec);
        when(requestSpec.user(any(Consumer.class))).thenReturn(requestSpec);
        when(requestSpec.call()).thenReturn(responseSpec);
        when(responseSpec.content()).thenReturn(content);
    }

    /**
     * Test evaluation when the student answer is faithful to the facts. Should return a
     * passing evaluation with high score.
     */
    @Test
    void testEvaluateFaithfulAnswer() {
        // Mock chat client to return a high score response
        mockChatResponse("{\"score\": 1.0, \"feedback\": \"The answer is faithful to the facts.\"}");

        // Create evaluation request with faithful answer
        EvaluationRequest request = createEvaluationRequest(TEST_STUDENT_ANSWER, TEST_FACTS);

        // Evaluate and verify
        EvaluationResponse response = evaluator.evaluate(request);
        assertThat(response.getScore()).isEqualTo(1.0f);
        assertThat(response.getFeedback()).isEqualTo("The answer is faithful to the facts.");
    }

    /**
     * Test evaluation when the student answer contains fabricated information. Should
     * return a failing evaluation with low score.
     */
    @Test
    void testEvaluateUnfaithfulAnswer() {
        // Mock chat client to return a low score response
        mockChatResponse("{\"score\": 0.0, \"feedback\": \"The answer contains fabricated information.\"}");

        // Create evaluation request with unfaithful answer
        String unfaithfulAnswer = "The Earth is the third planet and has three moons.";
        EvaluationRequest request = createEvaluationRequest(unfaithfulAnswer, TEST_FACTS);

        // Evaluate and verify
        EvaluationResponse response = evaluator.evaluate(request);
        assertThat(response.getScore()).isEqualTo(0.0f);
        assertThat(response.getFeedback()).isEqualTo("The answer contains fabricated information.");
    }
}

AnswerRelevancyEvaluator

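Source Code Overview

AnswerRelevancyEvaluator also extends LaajEvaluator. It scores the LLM client's response (STUDENT ANSWER) against a supplied ground-truth answer. The rubric requires that the STUDENT ANSWER contain no self-contradictory statements, and the judge model is asked to return its result as JSON.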

public class AnswerRelevancyEvaluator extends LaajEvaluator {

    // Default prompt of AnswerRelevancyEvaluator, shown here in English translation of the Chinese text in the source
    private static final String DEFAULT_EVALUATION_PROMPT_TEXT = """
            You are an evaluation expert who scores content against the scoring criteria and information provided.
            You will be given a QUESTION, a GROUND TRUTH (correct) ANSWER and a STUDENT ANSWER.

            Scoring criteria:
            (1) Using the GROUND TRUTH ANSWER as the correct reference, score the STUDENT ANSWER for factuality, accuracy and relevance.
            (2) Ensure that the STUDENT ANSWER contains no conflicting statements or content.
            (3) The STUDENT ANSWER may contain more information than the GROUND TRUTH ANSWER, as long as it stays factual, accurate and relevant with respect to the GROUND TRUTH ANSWER.

            Score:
            A score of 1 means the STUDENT ANSWER meets all criteria. This is the highest (best) score.
            A score of 0 means the STUDENT ANSWER does not meet all criteria. This is the lowest score.

            Explain your reasoning step by step to make sure your reasoning and conclusion are correct.
            Avoid simply stating the correct answer.

            Output the final answer in standard JSON format, for example:
            \\{"score": 0.7, "feedback": "The STUDENT ANSWER is completely unrelated to the GROUND TRUTH ANSWER."\\}

            QUESTION: {question}
            GROUND TRUTH ANSWER: {correct_answer}
            STUDENT ANSWER: {student_answer}
            """;

    @Override
    public EvaluationResponse evaluate(EvaluationRequest evaluationRequest) {
        // Validate the parameter
        if (evaluationRequest == null) {
            throw new IllegalArgumentException("EvaluationRequest must not be null");
        }

        // Extract the response and the context
        var response = doGetResponse(evaluationRequest);
        var context = doGetSupportingData(evaluationRequest);

        // Build the chat client, fill the evaluation prompt with the question, the student
        // answer (response) and the correct answer (context), then run the evaluation
        String llmEvaluationResponse = getChatClientBuilder().build()
            .prompt()
            .user(userSpec -> userSpec.text(getEvaluationPromptText())
                .param("question", evaluationRequest.getUserText())
                .param("correct_answer", context)
                .param("student_answer", response))
            .call()
            .content();

        // Parse the evaluation result as JSON
        JsonNode evaluationResponse = null;
        try {
            evaluationResponse = getObjectMapper().readTree(llmEvaluationResponse);
        } catch (JsonProcessingException e) {
            throw new RuntimeException(e);
        }

        // Read the score and the feedback from the reply and derive the pass/fail flag
        float score = (float) evaluationResponse.get("score").asDouble();
        String feedback = evaluationResponse.get("feedback").asText();
        boolean passing = score > 0;

        // Wrap the result and return it
        return new EvaluationResponse(passing, score, feedback, Collections.emptyMap());
    }
}
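
Both JSON-based evaluators treat any score above 0 as passing, so a reply scored 0.1 passes just as one scored 1.0 does. Where a stricter gate is wanted, the score could be clamped and compared with a threshold chosen by the caller. The sketch below is one way to express that; ScoreThresholdSketch is a hypothetical helper, and treating a NaN score as 0 is a design choice made for this example.

```java
class ScoreThresholdSketch {

    // Clamps a model-reported score into [0, 1]; NaN is treated as 0.
    static float clamp(float score) {
        if (Float.isNaN(score)) {
            return 0.0f;
        }
        return Math.max(0.0f, Math.min(1.0f, score));
    }

    // Passes only when the clamped score reaches the caller-chosen threshold.
    static boolean passes(float score, float threshold) {
        return clamp(score) >= threshold;
    }
}
```

With a threshold of 0.5, for example, the 0.7 shown in the prompt's sample output would pass while 0.1 would not.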
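Test Code

The following tests exercise AnswerRelevancyEvaluator against a ChatClient mocked with Mockito. The createEvaluationRequest helper they call is not shown in this excerpt.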

class AnswerRelevancyEvaluatorTests {

    // Test constants
    // Test question
    private static final String TEST_QUESTION = "What is the capital of France?";

    // Ground-truth answer used for the evaluation
    private static final String TEST_CORRECT_ANSWER = "The capital of France is Paris, which is also the largest city in the country.";

    // LLM client's response under test
    private static final String TEST_STUDENT_ANSWER = "Paris is the capital city of France.";

    private static final String CUSTOM_PROMPT = "Custom evaluation prompt text";

    private ChatClient chatClient;

    private ChatClient.Builder chatClientBuilder;

    private AnswerRelevancyEvaluator evaluator;

    // Initialization that runs before each test method
    @BeforeEach
    void setUp() {
        // Initialize mocks and evaluator
        chatClient = Mockito.mock(ChatClient.class);
        chatClientBuilder = Mockito.mock(ChatClient.Builder.class);
        when(chatClientBuilder.build()).thenReturn(chatClient);

        // Initialize evaluator with ObjectMapper to avoid NPE
        ObjectMapper objectMapper = new ObjectMapper();
        evaluator = new AnswerRelevancyEvaluator(chatClientBuilder, objectMapper);
    }

    /**
     * Helper method to mock chat client response
     */
    private void mockChatResponse(String content) {
        ChatClient.ChatClientRequestSpec requestSpec = Mockito.mock(ChatClient.ChatClientRequestSpec.class);
        ChatClient.CallResponseSpec responseSpec = Mockito.mock(ChatClient.CallResponseSpec.class);

        // Mock the chain of method calls
        when(chatClient.prompt()).thenReturn(requestSpec);
        when(requestSpec.user(any(Consumer.class))).thenReturn(requestSpec);
        when(requestSpec.call()).thenReturn(responseSpec);
        when(responseSpec.content()).thenReturn(content);
    }

    /**
     * Test evaluation when the student answer is relevant and accurate. Should return a
     * passing evaluation with high score.
     */
    @Test
    void testEvaluateRelevantAnswer() {
        // Mock chat client to return a high score response
        mockChatResponse("{\"score\": 1.0, \"feedback\": \"The answer is accurate and relevant.\"}");

        // Create evaluation request with relevant answer
        EvaluationRequest request = createEvaluationRequest(TEST_QUESTION, TEST_STUDENT_ANSWER, TEST_CORRECT_ANSWER);

        // Evaluate and verify
        EvaluationResponse response = evaluator.evaluate(request);
        assertThat(response.getScore()).isEqualTo(1.0f);
        assertThat(response.getFeedback()).isEqualTo("The answer is accurate and relevant.");
    }

    /**
     * Test evaluation when the student answer is irrelevant. Should return a failing
     * evaluation with low score.
     */
    @Test
    void testEvaluateIrrelevantAnswer() {
        // Mock chat client to return a low score response
        mockChatResponse("{\"score\": 0.0, \"feedback\": \"The answer is completely irrelevant to the question.\"}");

        // Create evaluation request with irrelevant answer
        String irrelevantAnswer = "London is the capital of England.";
        EvaluationRequest request = createEvaluationRequest(TEST_QUESTION, irrelevantAnswer, TEST_CORRECT_ANSWER);

        // Evaluate and verify
        EvaluationResponse response = evaluator.evaluate(request);
        assertThat(response.getScore()).isEqualTo(0.0f);
        assertThat(response.getFeedback()).isEqualTo("The answer is completely irrelevant to the question.");
    }
}

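The open-source Spring AI Alibaba project is built on Spring AI. It packages the recommended practice for using Alibaba Cloud's Tongyi family of models and services in Java AI application development, and provides high-level AI API abstractions together with cloud-native infrastructure integrations to help developers build AI applications quickly.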