feat(ci): add scripts to run Spark SQL test suites locally by andygrove · Pull Request #4405 · apache/datafusion-comet

andygrove · 2026-05-22T13:30:03Z

Which issue does this PR close?

N/A. This adds local developer tooling and has no associated issue.

Rationale for this change

The spark_sql_test.yml workflow runs Apache Spark's own SQL test suites with Comet enabled, but there is no convenient way to reproduce that run on a developer machine. Debugging a Spark SQL test failure currently means reconstructing the steps by hand: clone Spark at a version tag, apply the Comet diff, build Comet, and run the right build/sbt shard with the right environment.

What changes are included in this PR?

New bash scripts under dev/ci/spark-sql-tests/ that reproduce the spark_sql_test.yml workflow locally:

- config.sh: per-version configuration and the seven CI module-shard definitions, copied from spark_sql_test.yml.
- setup-spark.sh: maintains a persistent apache/spark checkout and applies the matching dev/diffs/<version>.diff, preserving Spark's build artifacts across runs.
- run.sh: builds Comet, runs the selected module shard(s) with build/sbt using the same environment as CI, and prints a PASS/FAIL summary. Supports SKIP_BUILD and SKIP_SPARK_SETUP for fast iteration.
- README.md: usage, prerequisites, and environment variables.
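
The shard loop and PASS/FAIL summary can be pictured roughly as follows. This is only an illustrative sketch, not the code in run.sh: the function names `run_shards` and `fake_runner` and the shard names are hypothetical, and a stand-in runner takes the place of the real build/sbt invocation.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: run each shard through a runner command, record the
# outcome, and print a PASS/FAIL summary. Not the actual run.sh.

run_shards() {
  local runner="$1"; shift
  local shard failed=0
  local -a results=()
  for shard in "$@"; do
    if "$runner" "$shard"; then
      results+=("PASS  $shard")
    else
      results+=("FAIL  $shard")
      failed=1
    fi
  done
  # One summary line per shard, in the order they ran.
  printf '%s\n' "${results[@]}"
  # Non-zero exit status if any shard failed.
  return "$failed"
}

# Stand-in for the real test command (assumption): fails only for "shard-b".
fake_runner() { [[ "$1" != "shard-b" ]]; }

run_shards fake_runner shard-a shard-b shard-c || echo "at least one shard failed"
```

Collecting results before printing keeps the summary together at the end of the run instead of interleaving it with test output.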

The Spark version is selected with a SPARK_VERSION env var (default 4.1.1), supporting all four versions in the CI matrix: 3.4.3, 3.5.8, 4.0.2, and 4.1.1. config.sh derives the build profile and CI JDK per version and mirrors the matrix test-group isolation: every version runs with SERIAL_SBT_TESTS=1 except Spark 4.0, which forks a dedicated JVM per leak-prone Parquet/Orc suite. The Spark checkout and logs are namespaced by version so switching versions does not discard build artifacts or overwrite logs.
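
As a rough illustration of the version handling described above, the sketch below validates SPARK_VERSION against the four supported versions, derives the short version, and builds version-namespaced paths. The helper names and the base directory are assumptions made for this example; the actual logic in config.sh may differ, and the per-version JDK and build profile are left out because their values are not stated here.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of per-version configuration; not the actual config.sh.

SPARK_VERSION="${SPARK_VERSION:-4.1.1}"

# Versions listed in the CI matrix according to the PR description.
SUPPORTED_VERSIONS=(3.4.3 3.5.8 4.0.2 4.1.1)

# Succeeds only if the argument is one of the supported versions.
is_supported_version() {
  local candidate="$1" v
  for v in "${SUPPORTED_VERSIONS[@]}"; do
    [[ "$v" == "$candidate" ]] && return 0
  done
  return 1
}

# Strip the patch component, e.g. 4.1.1 -> 4.1.
spark_short() {
  printf '%s\n' "${1%.*}"
}

# Version-namespaced locations under a caller-supplied base directory
# (the base directory is an assumption of this sketch).
spark_checkout_dir() { printf '%s/apache-spark-%s\n' "$1" "$2"; }
log_dir() { printf '%s/logs/%s\n' "$1" "$2"; }

if is_supported_version "$SPARK_VERSION"; then
  echo "Spark $SPARK_VERSION -> short version $(spark_short "$SPARK_VERSION")"
else
  echo "Unsupported SPARK_VERSION: $SPARK_VERSION" >&2
fi
```

Because the checkout and log paths include the version, two versions never share a directory, which is what keeps one version's build artifacts and logs intact when switching to another.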

The scripts also point PySpark at a nonexistent interpreter by default. Spark 4.x's DataSourceManager probes for Python data sources during query analysis by spawning a Python worker. The CI container has no python3 so the probe is skipped there, but on a developer machine that has python3 the worker can hang indefinitely (the JVM-side read has no idle timeout), stalling suites such as GlobalTempViewSuite. Skipping the probe matches CI behavior.

How are these changes tested?

These scripts orchestrate a multi-hour external test run, so they are not exercised end-to-end in CI. They were verified with bash -n and shellcheck -x (both clean), with smoke tests of run.sh argument handling (--help, unknown-module rejection, unsupported-version rejection), and by confirming config.sh derives the correct build profile, JDK, ref, and test-group settings for each of the four supported versions. Running GlobalTempViewSuite locally confirmed it passes with the Python probe skipped, where it otherwise hangs indefinitely. The module definitions and build/sbt arguments match spark_sql_test.yml exactly.

Add bash scripts under dev/ci/spark-sql-tests/ that reproduce the spark_sql_test.yml GitHub Actions workflow on a developer machine for Apache Spark 4.1. They run Spark's own SQL test suites with Comet enabled, which is useful for debugging a Spark SQL test failure locally instead of waiting on CI.

- config.sh: shared configuration and the seven CI module-shard definitions, copied from spark_sql_test.yml
- setup-spark.sh: maintains a persistent apache/spark checkout and applies dev/diffs/4.1.1.diff, preserving build artifacts across runs
- run.sh: builds Comet, runs the selected module shard(s), and prints a PASS/FAIL summary
- README.md: usage, prerequisites, and environment variables

Only Spark 4.1 is supported for now. [skip ci]

Spark 4.1's DataSourceManager probes for Python data sources during query analysis by spawning a python3 worker. The CI amd64/rust container has no python3, so the probe is skipped there. On a developer machine that has python3 the worker can hang indefinitely, since the JVM-side read has no idle timeout by default, stalling suites such as GlobalTempViewSuite. Point PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON at a nonexistent interpreter so the probe is skipped, matching CI. The value is overridable for developers who want to run the Python-dependent suites.
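
An overridable default of this kind could look something like the sketch below. The placeholder path and the function name `configure_pyspark_env` are assumptions for illustration only; the scripts in this PR may use a different path or structure.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: default PySpark to an interpreter that does not exist,
# while keeping any value the developer has already set.

# Placeholder path (assumption); any path with no executable behind it would do.
NONEXISTENT_PYTHON="/nonexistent/python3"

configure_pyspark_env() {
  # ${VAR:-default} keeps an existing non-empty value and otherwise
  # falls back to the placeholder.
  export PYSPARK_PYTHON="${PYSPARK_PYTHON:-$NONEXISTENT_PYTHON}"
  export PYSPARK_DRIVER_PYTHON="${PYSPARK_DRIVER_PYTHON:-$NONEXISTENT_PYTHON}"
}

configure_pyspark_env
echo "PYSPARK_PYTHON=$PYSPARK_PYTHON"
echo "PYSPARK_DRIVER_PYTHON=$PYSPARK_DRIVER_PYTHON"
```

A developer who wants the Python-dependent suites would export PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON with a real interpreter path before running the scripts, and the defaults above would then leave those values alone.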

The local Spark SQL test scripts hardcoded Spark 4.1.1. Select the version with a SPARK_VERSION env var instead, supporting all four versions from the spark_sql_test.yml CI matrix: 3.4.3, 3.5.8, 4.0.2, and 4.1.1 (default 4.1.1). config.sh derives SPARK_SHORT and the CI JDK per version, and mirrors the matrix test-group isolation: every version runs with SERIAL_SBT_TESTS=1 except Spark 4.0, which forks a dedicated JVM per leak-prone Parquet/Orc suite. run.sh builds the sbt environment as an array so the 4.0 case omits SERIAL_SBT_TESTS entirely. The Spark checkout and logs are namespaced by version (apache-spark-<version>, logs/<version>/), so switching versions neither discards another version's build artifacts nor overwrites its logs.
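
The array-based environment might be assembled along these lines. This is a minimal sketch under stated assumptions: the function name `build_sbt_env` and the array name `SBT_ENV` are hypothetical, the array holds only the one variable discussed here, and a small `sh -c` command stands in for the real build/sbt call.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: build the environment for the test command as an array
# so that a variable can be left out entirely rather than set to an empty value.

build_sbt_env() {
  local spark_short="$1"
  SBT_ENV=()  # global array consumed by the caller
  # Every version except 4.0 serializes sbt tests.
  if [[ "$spark_short" != "4.0" ]]; then
    SBT_ENV+=("SERIAL_SBT_TESTS=1")
  fi
}

build_sbt_env "4.1"
# env applies each NAME=VALUE entry before running the command; with an
# empty array it simply runs the command with the inherited environment.
env "${SBT_ENV[@]}" sh -c 'echo "SERIAL_SBT_TESTS=${SERIAL_SBT_TESTS:-<unset>}"'
```

Using an array avoids passing `SERIAL_SBT_TESTS=` with an empty value, which a consumer that only checks whether the variable is defined could still treat as set.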

andygrove added 3 commits May 22, 2026 07:29

andygrove changed the title ~~feat(ci): add scripts to run Spark SQL test suite locally for Spark 4.1~~ feat(ci): add scripts to run Spark SQL test suites locally May 22, 2026

andygrove wants to merge 3 commits into apache:main from andygrove:ci-spark-sql-local-tests
