Add eval system handoff doc by helen229 · Pull Request #15680 · Azure/azure-sdk-tools

helen229 · 2026-05-20T04:04:46Z

Adds tools/ai-evals/HANDOFF.md -- a structured overview of the AzSDK eval system (MCP server + skills + eval harness), the Gen 1 / Gen 1.5 / Gen 2 history, the Vally migration state, and the PRs that built each generation.

Target audience: someone picking up this work cold.

Draft -- feedback welcome before merge.

jeo02 · 2026-05-20T17:07:35Z

+
+## 1. The 30-second version
+
+We ship an **AI agent** that helps Azure SDK engineers do release work (create release plans, validate TypeSpec, edit APIView, etc.). The agent is **Copilot + the azsdk-cli MCP server + a set of SKILL.md instruction files**.


The agent isn't really specific to release work: the evals cover the whole set of instructions, including those outside the scope of release work.

I'd take a look at the skills to see what else we do.

jeo02 · 2026-05-20T17:14:19Z

+Azure DevOps · APIView · GitHub · TypeSpec compiler · npm · dotnet · …
+```
+
+**The mock variant — `Azure.Sdk.Tools.Mock`:** a *separate* MCP server in the same repo (`tools/azsdk-cli/Azure.Sdk.Tools.Mock/`). It exposes the same tool names but returns pre-registered canned responses keyed on input arguments, and falls through to a generic `Success` when nothing matches. Evals point Copilot at this mock instead of the real `azsdk-cli` so a test run doesn't actually create release plans, post APIView comments, or hit Azure DevOps.


I'd like to point out that we can also run the real MCP server; the evaluations are not limited to the mock. A good example is Shanghai's work, which runs without a mock.
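
To make the mock behavior described in the hunk concrete, here is a minimal sketch of canned responses keyed on input arguments with a fall-through to a generic `Success`. It is written in Python for brevity and is not the `Azure.Sdk.Tools.Mock` implementation; `MockToolRegistry`, the `create_release_plan` tool name, and the response shapes are all hypothetical.

```python
import json
from typing import Any

# Assumed shape of the generic fallback; the real mock's payload may differ.
GENERIC_SUCCESS: dict[str, Any] = {"status": "Success"}


class MockToolRegistry:
    """Hypothetical registry: canned responses keyed on (tool name, arguments)."""

    def __init__(self) -> None:
        self._responses: dict[tuple[str, str], dict[str, Any]] = {}

    @staticmethod
    def _key(tool_name: str, args: dict[str, Any]) -> tuple[str, str]:
        # Serialize with sorted keys so argument order does not affect matching.
        return (tool_name, json.dumps(args, sort_keys=True))

    def register(self, tool_name: str, args: dict[str, Any], response: dict[str, Any]) -> None:
        """Pre-register a canned response for an exact argument match."""
        self._responses[self._key(tool_name, args)] = response

    def call(self, tool_name: str, args: dict[str, Any]) -> dict[str, Any]:
        """Return the canned response, or a generic success when nothing matches."""
        return self._responses.get(self._key(tool_name, args), dict(GENERIC_SUCCESS))


registry = MockToolRegistry()
registry.register(
    "create_release_plan",  # hypothetical tool name
    {"service": "storage"},
    {"status": "Success", "plan_id": 123},
)
matched = registry.call("create_release_plan", {"service": "storage"})
unmatched = registry.call("create_release_plan", {"service": "compute"})
```

Running against the real server instead would mean skipping this lookup layer entirely, which is the trade-off raised above: real side effects in exchange for realistic responses.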

jeo02 · 2026-05-20T17:15:17Z

+name: azsdk-common-prepare-release-plan
+description: |
+  **UTILITY SKILL**. USE FOR: "create release plan", "update release plan", "link SDK PR to plan", ...
+  DO NOT USE FOR: SDK code generation, pipeline troubleshooting, API review feedback.


Not sure what this line is about, though I know it's an example.

jeo02 · 2026-05-20T17:16:08Z

+
+---
+
+## 3. Why evals are hard (the conceptual jump)


I'd also add that we can set a baseline threshold for how many tests should pass, which helps with this.
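
A rough sketch of what such a baseline threshold could look like, assuming each eval run produces one pass/fail result per test. The function name `meets_baseline` and the 0.8 cutoff are made up for illustration and are not part of the eval harness.

```python
def meets_baseline(results: list[bool], threshold: float) -> bool:
    """Return True when the fraction of passing tests is at least `threshold`.

    `results` holds one boolean per test (True = passed). The threshold is a
    fraction between 0 and 1, e.g. 0.8 for "at least 80% must pass".
    """
    if not results:
        raise ValueError("no test results to evaluate")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    pass_rate = sum(results) / len(results)
    return pass_rate >= threshold


# Hypothetical run: 8 of 10 tests pass against an 80% baseline.
run_ok = meets_baseline([True] * 8 + [False] * 2, threshold=0.8)
# Hypothetical run: 7 of 10 tests pass, which falls below the baseline.
run_bad = meets_baseline([True] * 7 + [False] * 3, threshold=0.8)
```

Gating on a pass rate rather than on every test passing tolerates some run-to-run variation in agent behavior while still flagging a real regression.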

jeo02 · 2026-05-20T17:22:45Z

+| Goal | Status |
+|---|---|
+| Vally framework chosen + adopted | ✅ Decision made |
+| `vally lint` integration | ⚠️ Possible today, not yet wired into core CI |


This currently works, since it is a GitHub Action. The Vally evals CI, however, is present but not active because of some outstanding issues.

helen229 added 2 commits May 19, 2026 15:45

Add eval system handoff doc

c94ecb2

Polish handoff doc: concrete eval locations and domain-specific MCP/S…

fc81a57

…kills primer

jeo02 reviewed May 20, 2026

helen229 wants to merge 2 commits into main from doc
