JGC-499 - Replace 'safe to test' label with single build-gate environment approval#3535
Conversation
…ment approval Introduce build-gate.yml orchestrator: a single `build-gate` environment deployment approval (skipped on push/dispatch) releases frogbot and all integration-test suites for fork/PR runs. Convert frogbot and the 25 test workflows to reusable (workflow_call) workflows invoked behind the gate, drop the per-workflow 'safe to test' label conditions, and delete removeLabel.yml. Add a build-gate-success aggregator job that needs all suites, so branch protection can require a single stable check instead of matrix-expanded suite contexts.
barakharyati
left a comment
There was a problem hiding this comment.
Looks good 2 issues 1 security related and the second looks like a bug.
| # "Build Gate / build-gate-success" instead of the matrix-expanded suite checks. | ||
| # Recover a failed suite with "Re-run failed jobs" (re-runs the suite + this job, | ||
| # not the approval gate) — no re-approval and no new commit needed. | ||
| build-gate-success: |
There was a problem hiding this comment.
Can we add gate also here?
There was a problem hiding this comment.
Done in 27645d29. This also fixes the if: always() issue below.
| # not the approval gate) — no re-approval and no new commit needed. | ||
| build-gate-success: | ||
| name: build-gate-success | ||
| if: always() |
There was a problem hiding this comment.
The if: always () cusing an issue here
build-gate-success doesn’t directly depend on gate, only on jobs after it. If gate is rejected, those jobs may be skipped, and the final check can still pass because it allows skipped jobs.
There was a problem hiding this comment.
Good catch. Fixed in 27645d29 by adding gate to needs: a rejected gate now fails the check, and an unapproved one keeps it pending. Kept if: always() so it still reports a definitive result when a suite fails.
| cancel-in-progress: true | ||
|
|
||
| permissions: | ||
| pull-requests: write |
There was a problem hiding this comment.
It looks like there is a bug here (not security)
I remember that the caller workflow cannot assign permissions it doesn't have
In that case, the caller does not have id-token: write, and it assigns it to a different workflow like .github/workflows/helmTests.yml by triggrering it
There was a problem hiding this comment.
Correct. Fixed in 27645d29 by granting id-token: write per-job on the four OIDC suites (access, helm, huggingface, oidc).
Address PR review: - Add `gate` to build-gate-success needs so a rejected/unapproved deployment fails (or blocks) the required check instead of passing on skipped suites. - Grant id-token: write on the access, helm, huggingface and oidc caller jobs, since a reusable workflow cannot escalate beyond the caller's token scopes.
|
LGTM we can merge it |

Summary
Replaces the per-workflow
safe to testlabel gate with a singlebuild-gateGitHub Environment deployment approval. A maintainer approves once per fork/PR run, and that single approval releases frogbot and all integration-test suites. Trustedpushtomasterand manualworkflow_dispatchruns skip the approval entirely.What changed
build-gate.ymlorchestrator (name: Build Gate)pull_request_target [opened, synchronize, reopened],pushtomaster,workflow_dispatch.gatejob carriesenvironment: ${{ github.event_name == 'pull_request_target' && 'build-gate' || '' }}— only fork/PR runs hit the environment and wait for approval;push/workflow_dispatchresolve to an empty environment (no gate).frogbot+go+ 24 others), eachneeds: gate+uses: ./.github/workflows/<suite>.yml+secrets: inherit.build-gate-successaggregator job (needs:all suites,if: always()) that fails if any suite failed/was cancelled — see required-checks note below.frogbot-scan-pull-request.yml+ 25*Tests.yml)on:→workflow_call:+workflow_dispatch:(droppedpushandpull_request_target: [labeled]).safe to testjob conditions.workflow_dispatchinputs preserved (incl. poetry'spoetry_version);env:/permissions:blocks intact.frogbotno longer declares its ownenvironment: frogbot— the gate now lives in the orchestrator.evidenceTestsandghostFrogTestskeep their existingif: false(independently disabled).removeLabel.yml— the label mechanism it served is gone.Security model
gate, before any job checks out PR head code or touches secrets.pull_request_target, localuses:reusable workflows resolve from the base ref, so a PR cannot alter the gate or workflow logic — only the executed test/product code, which is exactly what the approval covers.Recovering a failed run
Use "Re-run failed jobs": it re-runs the failed suite(s) and
build-gate-success(reusing already-passed suites) while not re-runninggate, so no re-approval and no new commit are needed — the single required check flips green on the same SHA. ("Re-run all jobs" re-runsgateand re-triggers approval.)Create the
build-gateEnvironment (Settings → Environments) with Required reviewers = the maintainer team.Verify frogbot secrets are repo/org-level, not scoped to the old
frogbotenvironment. frogbot no longer declares an environment and receives secrets viasecrets: inherit(which passes repo/org secrets, not environment secrets). IfFROGBOT_URL,FROGBOT_ACCESS_TOKEN,JF_SMTP_*were stored asfrogbot-environment secrets, move them to repo/org level (or ontobuild-gate).Repoint branch-protection required status checks to the single aggregator:
Build Gate / build-gate-successRemove the stale old per-workflow contexts. The aggregator is the only check that needs to be required — it transitively gates every suite, is matrix-independent (won't break when a suite's matrix changes), and stays pending until the approval gate is released.
(Optional cleanup) Delete the
safe to testlabel and retire the now-unusedfrogbotenvironment.