| ## 1. The 30-second version | ||
| We ship an **AI agent** that helps Azure SDK engineers do release work (create release plans, validate TypeSpec, edit APIView, etc.). The agent is **Copilot + the azsdk-cli MCP server + a set of SKILL.md instruction files**. |
It isn't really specific to release work; it evaluates the whole set of instructions, including those outside the scope of release work.
I'd take a look at the skills to see what else we do.
| Azure DevOps · APIView · GitHub · TypeSpec compiler · npm · dotnet · … | ||
| ``` | ||
| **The mock variant — `Azure.Sdk.Tools.Mock`:** a *separate* MCP server in the same repo (`tools/azsdk-cli/Azure.Sdk.Tools.Mock/`). It exposes the same tool names but returns pre-registered canned responses keyed on input arguments, and falls through to a generic `Success` when nothing matches. Evals point Copilot at this mock instead of the real `azsdk-cli` so a test run doesn't actually create release plans, post APIView comments, or hit Azure DevOps. |
I'd like to point out that we can also run the real MCP server; the evaluations don't need to limit themselves to the mock. A good example is Shanghai's work, which runs without a mock.
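For anyone reading this cold, the mock behavior described in the doc (canned responses keyed on input arguments, with a fall-through to a generic `Success`) can be sketched roughly as below. This is only an illustration of the lookup pattern, not the actual `Azure.Sdk.Tools.Mock` code: the class name, the method names, the `create_release_plan` tool name, and the response shape are all hypothetical.

```python
import json
from typing import Any


class MockToolServer:
    """Illustrative stand-in for a mock MCP server (hypothetical, not the real one)."""

    def __init__(self) -> None:
        # Canned responses keyed on (tool name, canonicalized arguments).
        self._canned: dict[tuple[str, str], dict[str, Any]] = {}

    @staticmethod
    def _key(tool: str, args: dict[str, Any]) -> tuple[str, str]:
        # Sort keys so argument order does not affect the lookup.
        return (tool, json.dumps(args, sort_keys=True))

    def register(self, tool: str, args: dict[str, Any], response: dict[str, Any]) -> None:
        """Pre-register a canned response for an exact tool + arguments pair."""
        self._canned[self._key(tool, args)] = response

    def call(self, tool: str, args: dict[str, Any]) -> dict[str, Any]:
        """Return the canned response, or a generic Success when nothing matches."""
        return self._canned.get(self._key(tool, args), {"status": "Success"})


server = MockToolServer()
# "create_release_plan" and its arguments are made-up names for illustration.
server.register("create_release_plan", {"service": "storage"}, {"status": "Success", "planId": 1})

matched = server.call("create_release_plan", {"service": "storage"})
unmatched = server.call("create_release_plan", {"service": "compute"})
```

The fall-through is what lets an eval run proceed without side effects even when a tool call was not anticipated; pointing the evals at the real server instead simply replaces this lookup with live calls.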
| name: azsdk-common-prepare-release-plan | ||
| description: | | ||
| **UTILITY SKILL**. USE FOR: "create release plan", "update release plan", "link SDK PR to plan", ... | ||
| DO NOT USE FOR: SDK code generation, pipeline troubleshooting, API review feedback. |
Not sure what this line is about. I know it's an example, though.
| --- | ||
| ## 3. Why evals are hard (the conceptual jump) |
I'd also add that we can set a baseline threshold for how many tests should pass, which helps with this.
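To make the threshold idea concrete, here is a minimal sketch of a pass-rate gate, assuming each eval run yields a simple pass/fail result. The function name and the 0.75 threshold are hypothetical and are not taken from Vally or the existing harness.

```python
def meets_baseline(results: list[bool], threshold: float) -> bool:
    """Return True when the fraction of passing evals is at least `threshold`.

    An empty result set is treated as a failure so a run that executed
    nothing cannot pass the gate by accident.
    """
    if not results:
        return False
    pass_rate = sum(results) / len(results)
    return pass_rate >= threshold


# Three of four evals passed: a 0.75 pass rate meets a 0.75 baseline.
ok = meets_baseline([True, True, True, False], threshold=0.75)
# One of two evals passed: a 0.5 pass rate falls below the baseline.
not_ok = meets_baseline([True, False], threshold=0.75)
```

Gating on a pass rate rather than requiring every test to pass is one way to tolerate the run-to-run variation that makes agent evals hard.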
| | Goal | Status | | ||
| |---|---| | ||
| | Vally framework chosen + adopted | ✅ Decision made | | ||
| | `vally lint` integration | ⚠️ Possible today, not yet wired into core CI | |
This currently works, since it is a GitHub Action. The Vally evals CI, however, is present but not active because of some outstanding issues.
Adds `tools/ai-evals/HANDOFF.md`: a structured overview of the AzSDK eval system (MCP server + skills + eval harness), the Gen 1 / Gen 1.5 / Gen 2 history, the Vally migration state, and the PRs that built each generation.

Target audience: someone picking up this work cold.

Draft: feedback welcome before merge.