feat(ci): add scripts to run Spark SQL test suites locally #4405
Draft
andygrove wants to merge 3 commits into
Add bash scripts under dev/ci/spark-sql-tests/ that reproduce the spark_sql_test.yml GitHub Actions workflow on a developer machine for Apache Spark 4.1. They run Spark's own SQL test suites with Comet enabled, which is useful for debugging a Spark SQL test failure locally instead of waiting on CI.

- config.sh: shared configuration and the seven CI module-shard definitions, copied from spark_sql_test.yml
- setup-spark.sh: maintains a persistent apache/spark checkout and applies dev/diffs/4.1.1.diff, preserving build artifacts across runs
- run.sh: builds Comet, runs the selected module shard(s), and prints a PASS/FAIL summary
- README.md: usage, prerequisites, and environment variables

Only Spark 4.1 is supported for now.

[skip ci]
Spark 4.1's DataSourceManager probes for Python data sources during query analysis by spawning a python3 worker. The CI amd64/rust container has no python3, so the probe is skipped there. On a developer machine that has python3 the worker can hang indefinitely, since the JVM-side read has no idle timeout by default, stalling suites such as GlobalTempViewSuite. Point PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON at a nonexistent interpreter so the probe is skipped, matching CI. The value is overridable for developers who want to run the Python-dependent suites.
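The overridable default described above could be written roughly as follows. This is a minimal sketch, not the code from the PR: the placeholder path `/nonexistent/python3` is an assumption chosen for illustration, and the actual value used by the scripts may differ.

```shell
#!/usr/bin/env bash
# Sketch only: /nonexistent/python3 is a hypothetical placeholder path.
# ${VAR:-default} keeps any value the developer has already exported, so the
# Python-dependent suites can still be run by pointing at a real interpreter.
export PYSPARK_PYTHON="${PYSPARK_PYTHON:-/nonexistent/python3}"
export PYSPARK_DRIVER_PYTHON="${PYSPARK_DRIVER_PYTHON:-/nonexistent/python3}"

echo "PYSPARK_PYTHON=${PYSPARK_PYTHON}"
echo "PYSPARK_DRIVER_PYTHON=${PYSPARK_DRIVER_PYTHON}"
```

Exporting `PYSPARK_PYTHON=python3` before invoking the scripts would then restore the probe for developers who want it.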
The local Spark SQL test scripts hardcoded Spark 4.1.1. Select the version with a SPARK_VERSION env var instead, supporting all four versions from the spark_sql_test.yml CI matrix: 3.4.3, 3.5.8, 4.0.2, and 4.1.1 (default 4.1.1). config.sh derives SPARK_SHORT and the CI JDK per version, and mirrors the matrix test-group isolation: every version runs with SERIAL_SBT_TESTS=1 except Spark 4.0, which forks a dedicated JVM per leak-prone Parquet/Orc suite. run.sh builds the sbt environment as an array so the 4.0 case omits SERIAL_SBT_TESTS entirely. The Spark checkout and logs are namespaced by version (apache-spark-<version>, logs/<version>/) so switching versions does not reset away each version's build artifacts or overwrite logs.
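As a rough illustration of the version handling described above, the sketch below validates `SPARK_VERSION`, derives the short version, and builds the sbt environment as an array. The function names `spark_short` and `sbt_env_for` and the variables `SBT_ENV`, `SPARK_DIR`, and `LOG_DIR` are hypothetical and not taken from the PR. The per-version JDK and the per-suite JVM forking used for Spark 4.0 are deliberately left out, because their details are not given here.

```shell
#!/usr/bin/env bash
# Sketch only: helper and variable names below are hypothetical.

# Print major.minor for a supported version; return non-zero otherwise.
spark_short() {
  case "$1" in
    3.4.3|3.5.8|4.0.2|4.1.1) echo "${1%.*}" ;;
    *) echo "unsupported SPARK_VERSION: $1" >&2; return 1 ;;
  esac
}

# Print the extra sbt environment assignments for a version, one per line.
# Spark 4.0 gets none, so SERIAL_SBT_TESTS is omitted entirely in that case.
sbt_env_for() {
  local short
  short="$(spark_short "$1")" || return 1
  if [[ "${short}" != "4.0" ]]; then
    echo "SERIAL_SBT_TESTS=1"
  fi
}

SPARK_VERSION="${SPARK_VERSION:-4.1.1}"
SPARK_SHORT="$(spark_short "${SPARK_VERSION}")" || exit 1

# Collect the assignments into an array so an empty set adds nothing.
SBT_ENV=()
while IFS= read -r assignment; do
  if [[ -n "${assignment}" ]]; then
    SBT_ENV+=("${assignment}")
  fi
done < <(sbt_env_for "${SPARK_VERSION}")

# Namespace the checkout and logs by version (parent directory not shown).
SPARK_DIR="apache-spark-${SPARK_VERSION}"
LOG_DIR="logs/${SPARK_VERSION}"

echo "Spark ${SPARK_VERSION} (short ${SPARK_SHORT}): ${#SBT_ENV[@]} extra sbt env var(s)"
echo "checkout: ${SPARK_DIR}, logs: ${LOG_DIR}"
```

Keeping the assignments in an array means the 4.0 case passes no `SERIAL_SBT_TESTS` variable at all, rather than an empty or zero value that sbt might still interpret.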
Which issue does this PR close?
N/A. This adds local developer tooling and has no associated issue.
Rationale for this change
The `spark_sql_test.yml` workflow runs Apache Spark's own SQL test suites with Comet enabled, but there is no convenient way to reproduce that run on a developer machine. Debugging a Spark SQL test failure currently means reconstructing the steps by hand: clone Spark at a version tag, apply the Comet diff, build Comet, and run the right `build/sbt` shard with the right environment.
What changes are included in this PR?
New bash scripts under `dev/ci/spark-sql-tests/` that reproduce the `spark_sql_test.yml` workflow locally:

- `config.sh`: per-version configuration and the seven CI module-shard definitions, copied from `spark_sql_test.yml`.
- `setup-spark.sh`: maintains a persistent `apache/spark` checkout and applies the matching `dev/diffs/<version>.diff`, preserving Spark's build artifacts across runs.
- `run.sh`: builds Comet, runs the selected module shard(s) with `build/sbt` using the same environment as CI, and prints a PASS/FAIL summary. Supports `SKIP_BUILD` and `SKIP_SPARK_SETUP` for fast iteration.
- `README.md`: usage, prerequisites, and environment variables.

The Spark version is selected with a `SPARK_VERSION` env var (default `4.1.1`), supporting all four versions in the CI matrix: 3.4.3, 3.5.8, 4.0.2, and 4.1.1. `config.sh` derives the build profile and CI JDK per version and mirrors the matrix test-group isolation: every version runs with `SERIAL_SBT_TESTS=1` except Spark 4.0, which forks a dedicated JVM per leak-prone Parquet/Orc suite. The Spark checkout and logs are namespaced by version so switching versions does not discard build artifacts or overwrite logs.

The scripts also point PySpark at a nonexistent interpreter by default. Spark 4.x's `DataSourceManager` probes for Python data sources during query analysis by spawning a Python worker. The CI container has no `python3`, so the probe is skipped there, but on a developer machine that has `python3` the worker can hang indefinitely (the JVM-side read has no idle timeout), stalling suites such as `GlobalTempViewSuite`. Skipping the probe matches CI behavior.
How are these changes tested?
These scripts orchestrate a multi-hour external test run, so they are not exercised end-to-end in CI. They were verified with `bash -n` and `shellcheck -x` (both clean), with smoke tests of `run.sh` argument handling (`--help`, unknown-module rejection, unsupported-version rejection), and by confirming `config.sh` derives the correct build profile, JDK, ref, and test-group settings for each of the four supported versions. Running `GlobalTempViewSuite` locally confirmed it passes with the Python probe skipped, where it otherwise hangs indefinitely. The module definitions and `build/sbt` arguments match `spark_sql_test.yml` exactly.
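For reviewers who want to repeat the static checks, a small helper along these lines would run both tools over a set of scripts. It is a sketch under assumptions: the `check_scripts` function name is hypothetical, and skipping `shellcheck` when it is not installed is a convenience added here, not something the PR describes.

```shell
#!/usr/bin/env bash
# Sketch only: check_scripts is a hypothetical helper, not part of the PR.

# Return 0 only if every given file passes `bash -n`, and also passes
# `shellcheck -x` when shellcheck is available on PATH.
check_scripts() {
  local status=0 script
  for script in "$@"; do
    # Syntax check without executing the script.
    bash -n "${script}" || status=1
    if command -v shellcheck >/dev/null 2>&1; then
      # -x lets shellcheck follow sourced files such as a shared config.
      shellcheck -x "${script}" || status=1
    fi
  done
  return "${status}"
}
```

From the repository root, `check_scripts dev/ci/spark-sql-tests/*.sh` would then cover the three scripts added here.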