Add eval system handoff doc #15680

Draft
helen229 wants to merge 2 commits into main from doc

@helen229
Member

Adds tools/ai-evals/HANDOFF.md -- a structured overview of the AzSDK eval system (MCP server + skills + eval harness), the Gen 1 / Gen 1.5 / Gen 2 history, the Vally migration state, and the PRs that built each generation.

Target audience: someone picking up this work cold.

Draft -- feedback welcome before merge.

Comment thread tools/ai-evals/HANDOFF.md

## 1. The 30-second version

We ship an **AI agent** that helps Azure SDK engineers do release work (create release plans, validate TypeSpec, edit APIView, etc.). The agent is **Copilot + the azsdk-cli MCP server + a set of SKILL.md instruction files**.
Member

The agent isn't really specific to release work. The evals cover the whole set of instructions, including ones that fall outside the scope of release work.

I'd take a look at the skills to see what else we do.

Comment thread tools/ai-evals/HANDOFF.md
Azure DevOps · APIView · GitHub · TypeSpec compiler · npm · dotnet · …
```

**The mock variant — `Azure.Sdk.Tools.Mock`:** a *separate* MCP server in the same repo (`tools/azsdk-cli/Azure.Sdk.Tools.Mock/`). It exposes the same tool names but returns pre-registered canned responses keyed on input arguments, and falls through to a generic `Success` when nothing matches. Evals point Copilot at this mock instead of the real `azsdk-cli` so a test run doesn't actually create release plans, post APIView comments, or hit Azure DevOps.
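The lookup behavior described above (canned responses keyed on tool name plus input arguments, with a generic `Success` fallback) can be sketched roughly as follows. This is a minimal illustration in Python, not the actual `Azure.Sdk.Tools.Mock` implementation; the class, method names, and sample tool name are all hypothetical.

```python
import json
from typing import Any, Dict, Tuple


class MockToolServer:
    """Hypothetical stand-in for a mock MCP server (illustrative only)."""

    # Assumed shape of the generic fallback; the real payload may differ.
    GENERIC_SUCCESS: Dict[str, Any] = {"status": "Success"}

    def __init__(self) -> None:
        self._canned: Dict[Tuple[str, str], Dict[str, Any]] = {}

    @staticmethod
    def _key(tool: str, args: Dict[str, Any]) -> Tuple[str, str]:
        # Serialize arguments with sorted keys so argument order does not matter.
        return (tool, json.dumps(args, sort_keys=True))

    def register(self, tool: str, args: Dict[str, Any], response: Dict[str, Any]) -> None:
        """Pre-register a canned response for an exact (tool, arguments) pair."""
        self._canned[self._key(tool, args)] = response

    def call(self, tool: str, args: Dict[str, Any]) -> Dict[str, Any]:
        """Return the canned response if one matches, else the generic success."""
        return self._canned.get(self._key(tool, args), self.GENERIC_SUCCESS)


# Usage: the tool name and arguments below are made up for illustration.
server = MockToolServer()
server.register(
    "create_release_plan",
    {"service": "storage"},
    {"status": "Success", "plan_id": 123},
)
matched = server.call("create_release_plan", {"service": "storage"})
unmatched = server.call("create_release_plan", {"service": "compute"})
```

The fallthrough is what makes the mock safe to point an agent at: an unregistered call still returns something plausible instead of failing, at the cost of hiding cases where the agent passed unexpected arguments.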
Member

I'd point out that we can also run the real MCP server; the evaluations don't need to limit themselves to the mock. A good example is Shanghai's work, which runs without a mock.

Comment thread tools/ai-evals/HANDOFF.md
name: azsdk-common-prepare-release-plan
description: |
  **UTILITY SKILL**. USE FOR: "create release plan", "update release plan", "link SDK PR to plan", ...
  DO NOT USE FOR: SDK code generation, pipeline troubleshooting, API review feedback.
Member

I'm not sure what this line is about. I know it's an example, though.

Comment thread tools/ai-evals/HANDOFF.md

---

## 3. Why evals are hard (the conceptual jump)
Member

I'd also add that we can set a baseline threshold for how many tests must pass, which helps with this.
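As a rough sketch of what a baseline threshold could look like: instead of requiring every eval to pass, the run is gated on a minimum pass rate. The names below (`EvalResult`, `meets_baseline`, the sample test names, the 0.8 value) are hypothetical and are not taken from Vally or the existing harness.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalResult:
    """One eval outcome (hypothetical shape, for illustration only)."""
    name: str
    passed: bool


def pass_rate(results: List[EvalResult]) -> float:
    """Fraction of evals that passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def meets_baseline(results: List[EvalResult], min_pass_rate: float) -> bool:
    """Gate a run on a minimum pass rate rather than on all-green."""
    return bool(results) and pass_rate(results) >= min_pass_rate


# Usage: four of five made-up evals pass, checked against an assumed 0.8 baseline.
run = [
    EvalResult("release-plan-create", True),
    EvalResult("release-plan-update", True),
    EvalResult("typespec-validate", True),
    EvalResult("apiview-comment", False),
    EvalResult("link-sdk-pr", True),
]
gate_ok = meets_baseline(run, min_pass_rate=0.8)
```

An empty run is treated as a failure here so that a misconfigured job that executes no evals cannot pass the gate by accident.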

Comment thread tools/ai-evals/HANDOFF.md
| Goal | Status |
|---|---|
| Vally framework chosen + adopted | ✅ Decision made |
| `vally lint` integration | ⚠️ Possible today, not yet wired into core CI |
Member

This currently works, since it is a GitHub Action. However, the Vally evals CI is present but not active because of some outstanding issues.
