Skip to content

MINOR: Reduce close delay in testFailureToFenceEpoch#22347

Open
JiayaoS wants to merge 1 commit into
apache:trunkfrom
JiayaoS:fix/unitTest-testFailureToFenceEpochWithTV1
Open

MINOR: Reduce close delay in testFailureToFenceEpoch#22347
JiayaoS wants to merge 1 commit into
apache:trunkfrom
JiayaoS:fix/unitTest-testFailureToFenceEpochWithTV1

Conversation

@JiayaoS
Copy link
Copy Markdown
Contributor

@JiayaoS JiayaoS commented May 22, 2026

Reduce the request timeout for the producer used by testFailureToFenceEpoch so graceful close does not wait on the default 30s timeout after the fencing probe leaves the producer in an empty transaction.

Tested by running testFailureToFenceEpochWithTV1 and testFailureToFenceEpochWithTV2.

@github-actions github-actions Bot added triage PRs from the community tests Test fixes (including flaky tests) storage Pull requests that target the storage module small Small PRs labels May 22, 2026
@JiayaoS
Copy link
Copy Markdown
Contributor Author

JiayaoS commented May 22, 2026

I used the following loop to re-run both affected tests 10 times and summarize their runtimes. It removes only the test result directories between runs, so Gradle can keep compilation tasks up-to-date while still re-running the tests.

OUT=/tmp/fence-epoch-runs.tsv
XML=storage/build/test-results/test/TEST-org.apache.kafka.tiered.storage.integration.TransactionsTest.xml
: > "$OUT"

for run in $(seq 1 10); do
  echo "===== run ${run}/10 ====="
  rm -rf storage/build/test-results/test storage/build/reports/tests/test storage/build/test-results/binary/test
  ./gradlew :storage:test \
    --tests org.apache.kafka.tiered.storage.integration.TransactionsTest.testFailureToFenceEpochWithTV1 \
    --tests org.apache.kafka.tiered.storage.integration.TransactionsTest.testFailureToFenceEpochWithTV2

  sed -n 's/.*<testcase name="\([^"]*testFailureToFenceEpochWithTV[12][^"]*\)".* time="\([^"]*\)".*/\1\t\2/p' "$XML" |
    awk -F "\t" -v run="$run" '{
      test = $1
      sub(/^.*testFailureToFenceEpochWithTV/, "testFailureToFenceEpochWithTV", test)
      sub(/ .*/, "", test)
      print run "\t" test "\t" $2
    }' >> "$OUT"
done

summarize() {
  local label="$1"
  awk -v label="$label" '
    { times[++count] = $1; sum += $1; if ($1 > max) max = $1 }
    END {
      if (count == 0) {
        printf "%s: count=0\n", label
        exit
      }
      if (count % 2 == 1) {
        median = times[(count + 1) / 2]
      } else {
        median = (times[count / 2] + times[count / 2 + 1]) / 2
      }
      printf "%s: count=%d, max=%.3fs, median=%.3fs, avg=%.3fs\n", label, count, max, median, sum / count
    }
  '
}

for test_name in testFailureToFenceEpochWithTV1 testFailureToFenceEpochWithTV2; do
  awk -v test="$test_name" '$2 == test { print $3 }' "$OUT" | sort -n | summarize "$test_name"
done

awk '{ print $3 }' "$OUT" | sort -n | summarize "ALL"

Results:

Test Count Max Median Average
testFailureToFenceEpochWithTV1 10 19.959s 18.305s 18.570s
testFailureToFenceEpochWithTV2 10 16.832s 16.718s 16.718s
All runs 20 19.959s 17.341s 17.644s

long producerId;
int initialProducerEpoch;

try (var producer1 = createTransactionalProducer(clusterInstance, "transactional-producer")) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about producer1.close(duration 5 secs) in finally ? This will force a hard close after the time out, instead of waiting for pending reqs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about producer1.close(duration 5 secs) in finally ? This will force a hard close after the time out, instead of waiting for pending reqs.

Thanks, that’s a fair point. I considered close(Duration.ofSeconds(5)) as well, but it changes the cleanup behavior to force-close after the timeout. The current approach keeps the try-with-resources close graceful, while bounding the pending requests with a shorter request timeout. Maybe we should decide which behavior is more appropriate for this test.

@github-actions github-actions Bot removed the triage PRs from the community label May 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-approved small Small PRs storage Pull requests that target the storage module tests Test fixes (including flaky tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants