How the diff works (high level)

  • User experience: Treat user experience as a primary requirement. Design output, progress, errors, help, and logging with the user in mind.

  • Code style: Keep the code minimal and easy to read. Use descriptive names for variables and functions so their purpose is clear from the name.

  • Platforms: All code must run on Linux, macOS, and Windows. Avoid OS-specific behavior unless guarded (e.g. Linux directory batching); use portable APIs and test on all three platforms where possible.

  • Scripts: The following shell scripts will be created and maintained:

    • run.sh — Run the program (go run .). No arguments.
    • build.sh — Build the most optimized binary to bin/ffd (e.g. go build -ldflags="-s -w" -o bin/ffd .). On Windows the output may be bin/ffd.exe. No arguments. Do not use -race or -gcflags="-N -l".
    • test.sh — Run unit tests (go test .). No arguments.
    • smoke-tests.sh — Run smoke tests. Usage: ./smoke-tests.sh (run all), ./smoke-tests.sh <test-name> (run one), or ./smoke-tests.sh ls (list the names of tests that can be run individually). Uses compiled bin/ffd and test data under ./test. Each test is independent so they can be parallelized later.
    • perf-test.sh — Run performance tests. No arguments. Builds an optimized binary, generates data under ./test, runs timed scenarios at multiple file counts, writes human-readable timing output and appends a row to perf-results.csv. May run for a long time.
  • Scope: Compare two directory trees recursively (recurse into all subdirectories). The first CLI argument is the left directory, the second is the right directory; output uses "left only" / "right only" accordingly. Compare only regular files; include hidden files and dotfiles. For directories that exist only on one side, report them as "left only" or "right only" as appropriate. "Different" means either: only in one side (path exists in one tree but not the other), or content differs (same relative path in both, but file contents not identical). If both roots are empty (no files), report no differences; if only one root is empty, list every path discovered in the other root as "left only" or "right only" as applicable.
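The classification described under Scope can be sketched as follows. This is a minimal illustration, assuming the relative paths of both trees have already been collected; the names Status, LeftOnly, RightOnly, Both, and classify are hypothetical and not part of the spec.

```go
package main

import (
	"fmt"
	"sort"
)

// Status says on which side(s) a relative path was found.
// The type and constant names are hypothetical; only the
// "left only" / "right only" wording comes from the spec.
type Status string

const (
	LeftOnly  Status = "left only"
	RightOnly Status = "right only"
	Both      Status = "both"
)

// classify takes the relative paths found under each root and reports,
// for every path, whether it exists on the left, the right, or both.
func classify(left, right []string) map[string]Status {
	leftSet := make(map[string]struct{}, len(left))
	result := make(map[string]Status, len(left)+len(right))
	for _, p := range left {
		leftSet[p] = struct{}{}
		result[p] = LeftOnly
	}
	for _, p := range right {
		if _, onLeft := leftSet[p]; onLeft {
			result[p] = Both // same relative path on both sides: content still needs comparing
		} else {
			result[p] = RightOnly
		}
	}
	return result
}

func main() {
	left := []string{"a.txt", "sub/b.txt", ".hidden"}
	right := []string{"a.txt", "sub/c.txt"}

	result := classify(left, right)
	paths := make([]string, 0, len(result))
	for p := range result {
		paths = append(paths, p)
	}
	sort.Strings(paths) // byte-wise, case-sensitive order
	for _, p := range paths {
		fmt.Printf("%-10s %s\n", p, result[p])
	}
}
```

Paths classified as Both are the ones that still need a size, timestamp, or content comparison; the other two groups can be reported immediately.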

  • CLI / help: Implement the CLI using Cobra. Running the tool with no arguments must print help and usage information. Help and usage must document all options, including the hash algorithm (e.g. --hash) and the list of available hash names. Exit codes: 0 = success; 1 = usage error (invalid or missing arguments); 2 = fatal error (e.g. I/O, permission); 3 = completed but non-fatal errors occurred (user should check error log). Support an optional quiet or script-friendly mode (e.g. --quiet) that suppresses progress and the final "check the error log" message when piping or scripting.

  • Speed strategy:

    • Read both directory trees at the same time in separate goroutines. Do not apply any back pressure on reading directories; read as fast as possible so we know the total number of files yet to be processed (for progress).
    • Store discovered files in a set (keyed by relative path) so that when the same relative path appears in both trees we can form a pair. As file pairs are discovered, put them in a queue. A fixed number of worker goroutines (N) pull from the queue and compare each pair (size/timestamp, then hash if needed). Do not use a goroutine per directory or per pair without a cap.
    • The number of workers N is the limit on how many file pairs can be compared concurrently. Default N to the number of CPUs on the system. Make it settable via a --workers argument on the command line.
    • For "same path, both sides": if the sizes differ, the files are different and no content comparison is needed. If the two files have the same size and timestamp, consider them the same and skip content comparison. Otherwise (same size, different timestamp) compare by content hash. Support multiple hash algorithms selectable via a CLI argument (e.g. --hash); default to xxHash. Include at least two or three options (e.g. xxhash, sha256, md5) so users can try different hashes. Output the hash as a hex string in the per-file details. Use modification time (mtime) for the timestamp; normalize to a common granularity (e.g. seconds or filesystem minimum) to avoid false differences across filesystems.
    • When hashing file content: for small files load the whole file into memory to hash; for large files stream as we hash. Use a size threshold (default 10MB). Files smaller than the threshold are read in full; files larger are streamed. Use the same size as the threshold for the streaming read buffer (one threshold config for both). Make the threshold settable via a CLI argument (e.g. --threshold or --size-threshold) so different values can be tried. When comparing a pair we already know both files are the same size, so for both files in the pair use the same strategy: either both small (read both in full) or both large (stream both).
    • On Linux, read directory entries in batches (e.g. via getdents64) to reduce syscalls. Use a good default batch size (e.g. 4096 or 8192 entries) and make it settable on the command line (e.g. --dir-batch-size).
    • Minimize syscalls and allocations; reuse buffers where possible.
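The pairing set and the capped worker pool described in the Speed strategy can be sketched as below. This is a simplified illustration under stated assumptions: the tree walkers are replaced by fixed slices of relative paths, the queue is a buffered channel sized to hold every path (a real implementation would need a queue that never blocks the walkers), and the names pairer, sideLeft, sideRight, and comparePairs are hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

const (
	sideLeft  byte = 1
	sideRight byte = 2
)

// pairer records relative paths as each walker reports them.
type pairer struct {
	mu   sync.Mutex
	seen map[string]byte // bit mask of the sides a path was seen on
}

func newPairer() *pairer { return &pairer{seen: make(map[string]byte)} }

// add marks relPath as seen on side and reports whether this call
// completed a pair (the path is now known on both sides).
func (p *pairer) add(relPath string, side byte) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	before := p.seen[relPath]
	after := before | side
	p.seen[relPath] = after
	return before != after && after == sideLeft|sideRight
}

// comparePairs starts a fixed number of workers that drain queue and call
// compare for each relative path. It returns once queue is closed and empty.
func comparePairs(workers int, queue <-chan string, compare func(relPath string)) {
	if workers < 1 {
		workers = 1
	}
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for relPath := range queue {
				compare(relPath)
			}
		}()
	}
	wg.Wait()
}

func main() {
	left := []string{"a.txt", "b.txt", "c.txt"}
	right := []string{"b.txt", "c.txt", "d.txt"}

	p := newPairer()
	queue := make(chan string, len(left)+len(right)) // large enough that walkers never block here

	// One goroutine per tree stands in for the two directory walkers.
	var walkers sync.WaitGroup
	walk := func(paths []string, side byte) {
		defer walkers.Done()
		for _, rel := range paths {
			if p.add(rel, side) {
				queue <- rel
			}
		}
	}
	walkers.Add(2)
	go walk(left, sideLeft)
	go walk(right, sideRight)
	go func() { walkers.Wait(); close(queue) }()

	var mu sync.Mutex
	compared := 0
	comparePairs(runtime.NumCPU(), queue, func(rel string) {
		mu.Lock()
		compared++
		mu.Unlock()
	})
	fmt.Println("pairs compared:", compared) // pairs compared: 2
}
```

Whichever walker sees a path second is the one that enqueues it, so a pair is queued exactly once regardless of the order in which the two trees are read.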
  • Memory strategy: Keep memory use low while processing:

    • Use a single path representation: store the two directory roots once and one relative path string per pair. Do not duplicate path strings; share or intern relative paths so the same path is not stored multiple times in the queue or results.
    • For files above the size threshold, stream reads and hash incrementally (buffer size = threshold). For files below the threshold, read in full to hash (see Speed strategy).
    • Reuse buffers (e.g. a fixed buffer per worker from a pool) for reading and hashing; avoid per-file allocations where possible.
    • Bound concurrency with the worker limit (N) so that at most N pair comparisons are in progress at once. Do not throttle or apply back pressure to the directory readers; they run as fast as possible and feed the queue so we know total pending for the progress indicator.
  • Output: For each file regarded as different, state why it is different and show consistent details. The output format is selectable via a format argument (e.g. --format). Stream diff results to stdout as each result is ready, in the user-chosen format (text, table, json, or yaml); do not hold all differing paths in memory. Progress: Send the progress indicator to stderr so it does not mix with diff output; when stdout is not a TTY (e.g. piped), either omit the progress indicator or keep it on stderr only.

    • Reasons: If size is different, treat the file as changed and do not check content; report that size has changed. If size is the same and modification time is the same, treat as unchanged (skip). If size is the same but modification time differs, compare hashes; if the hashes differ, the file is different.
    • Per-file details: For each differing file show: path/name, size, modification time (mtime), and hash (if computed, as hex for the chosen algorithm). For files that exist only on one side, show the same details (no hash) and indicate "left only" or "right only".
    • Tree and order: Present output as an ASCII tree by default. Sort directory and file names case-sensitively (e.g. C locale) for consistent, reproducible order across platforms.
    • Formats (argument to choose):
      • text: Regular text with ASCII tree as above.
      • table: Text table; no tree.
      • json: Tree structure in JSON.
      • yaml: Tree structure in YAML.
  • Progress: Show a progress indicator that updates on screen as the diff runs. It must show number of files processed vs number of files pending. The number of files pending will increase as the tool progresses (more files are discovered while walking the trees).
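A progress indicator whose pending count grows while the walkers are still discovering files can be sketched with two atomic counters. The Progress type and its method names are hypothetical, and the line format is only an example.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sync/atomic"
)

// Progress tracks how many files have been discovered and processed.
// Walkers call Discovered, workers call Processed; both are safe to call
// from many goroutines at once.
type Progress struct {
	discovered atomic.Int64
	processed  atomic.Int64
}

func (p *Progress) Discovered(n int64) { p.discovered.Add(n) }
func (p *Progress) Processed(n int64)  { p.processed.Add(n) }

// Line formats the current state. processed is loaded before discovered so
// that the pending count can never come out negative.
func (p *Progress) Line() string {
	done := p.processed.Load()
	total := p.discovered.Load()
	return fmt.Sprintf("processed %d, pending %d", done, total-done)
}

// Render overwrites the current terminal line on w with the latest state.
func (p *Progress) Render(w io.Writer) {
	fmt.Fprintf(w, "\r%s", p.Line())
}

func main() {
	var p Progress
	p.Discovered(5)
	p.Processed(2)
	p.Render(os.Stderr) // progress goes to stderr, never stdout
	fmt.Fprintln(os.Stderr)
	fmt.Println(p.Line()) // processed 2, pending 3
}
```

A real indicator would call Render on a timer rather than on every update, and would skip rendering in quiet mode.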

  • Testing:

    • Unit tests: Use test-driven development (TDD). For each function, write a failing unit test first, then write the code so the test passes. Create a unit test for every function. Run the tests with ./test.sh (go test .). If you cannot create a failing test before writing a function, stop and ask for directions.
    • Smoke tests: Implement smoke tests as shell scripts that run against the compiled executable (bin/ffd). Write smoke tests as you go where possible: for each change that introduces CLI-testable behavior, add a smoke test in that same commit; introduce or extend the harness as needed. Use a comprehensive set of small scenarios; put example directory trees and files under ./test. Each smoke test must be independent (so they can be parallelized later). Run all with ./smoke-tests.sh (no arguments); run one with ./smoke-tests.sh <test-name>; list available tests with ./smoke-tests.sh ls. Cover at least: no-args (help), identical dirs (no differences), one file different (size, time, or content), files only on left or only on right, and each output format (text, table, json, yaml).
    • Performance tests: Implement performance tests in a separate shell script (e.g. ./perf-test.sh) because they may take a long time. Run all performance tests against a maximally optimized build of the tool. Build with optimization: use go build with no flags that disable optimization (do not use -race or -gcflags="-N -l"). For a release-style build use -ldflags="-s -w" to strip debug and symbol table (smaller binary; Go has no -O2-style flag—the default is already optimized). The perf script should build this optimized binary (e.g. go build -ldflags="-s -w" -o bin/ffd .) before running the timed scenarios. The script must generate the files and directories to compare under the ./test directory (do not rely on pre-existing files; e.g. use subdirs like ./test/perf/left and ./test/perf/right or a timestamped subdir per run). Test scenarios: all files the same; most files the same; some files the same; no files the same (on either left or right); some or many files on left only; some or many files on right only. Run each scenario at increasing file counts: 0, 1, 10, 100, 1,000, 10,000, and 100,000 files. Write human-readable output showing timing (e.g. total time, time per file); this output can be committed to Git as performance snapshots. Maintain a CSV file (e.g. perf-results.csv): create it with a header row if missing (e.g. date_iso,scenario,file_count,total_sec,time_per_file_sec). Use ISO 8601 date and a period for decimal separators. Append one row per run so performance can be charted over time.
  • Documentation: Keep the README in sync with actual CLI behavior (arguments, options, formats, usage). Comment every function to indicate its intent. Document public functions and packages (e.g. godoc-style comments). Ensure the CLI prints clear help and usage when run with no arguments or with a help flag. Document any non-obvious behavior or constraints in the spec or code comments.

  • Security:

    • Validate inputs: require that the two arguments are existing directories (or fail with a clear error).
    • Resolve paths safely and do not allow path traversal outside the given roots (e.g. reject or normalize paths that escape the directory).
    • Do not execute file contents or interpret them as code; only read for hashing and comparison. Prefer minimal permissions when reading files.
    • Do not pass user-provided paths to exec, shell, or subprocesses; use them only for open/read/stat.
    • For symbolic links: do not follow them, or resolve and reject any path that leaves the declared roots.
    • When a file or directory is unreadable (e.g. permission denied), fail or report it explicitly; do not silently skip.
    • Keep dependencies up to date; run go mod verify and vulnerability checks (e.g. govulncheck) in CI or before release.
    • Prefer vendoring or pinning dependencies for reproducible builds; if distributing binaries, use checksummed downloads and document the expected checksums.
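The path-traversal and symbolic-link rules above can be sketched with the standard library. The helper names withinRoot and isSymlink are hypothetical. Note that withinRoot is a purely lexical check: it does not resolve symbolic links, so it would have to be combined with a link check such as isSymlink (or with resolving the path first).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// withinRoot reports whether candidate, after lexical cleaning, is root
// itself or lies beneath it. It does not touch the filesystem.
func withinRoot(root, candidate string) (bool, error) {
	rel, err := filepath.Rel(root, candidate)
	if err != nil {
		return false, err
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return false, nil // path climbs out of root
	}
	return true, nil
}

// isSymlink reports whether path itself is a symbolic link. os.Lstat is
// used because, unlike os.Stat, it does not follow the link.
func isSymlink(path string) (bool, error) {
	info, err := os.Lstat(path)
	if err != nil {
		return false, err
	}
	return info.Mode()&os.ModeSymlink != 0, nil
}

func main() {
	root := filepath.Join("data", "left")
	candidates := []string{
		filepath.Join("data", "left", "sub", "a.txt"),
		filepath.Join("data", "left", "..", "right", "a.txt"),
		filepath.Join("data", "leftover", "a.txt"), // shares a string prefix with root but is outside it
	}
	for _, c := range candidates {
		ok, err := withinRoot(root, c)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%-35s inside root: %v\n", c, ok)
	}
}
```

Comparing via filepath.Rel rather than a plain string prefix avoids treating a sibling such as data/leftover as if it were inside data/left.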
  • Logging and error handling: Represent the logging system as a single type Logger (one instance per run, created at start and passed to all components that need to log).

    • Logger: A struct that holds: the secure temp directory path; paths to the two open log files (main = all output, errors = errors only); a non-fatal error count; and a mutex so concurrent goroutines can call it safely. Filenames include date and sequence (e.g. ffd-YYYYMMDD-NNN-main.log, ffd-YYYYMMDD-NNN-errors.log) so old runs are not overwritten.
    • API: Log(msg string) — write to main log only (for “directory X”, “file Y”, “comparison Z → outcome”). LogError(err error) — write to both logs, increment non-fatal count, do not abort. Fatal(err error) — write to both logs, print to stderr, then exit with code 2. PrintLogPaths() — print the two log file paths to stderr (skip when stdout is not a TTY). NonFatalCount() int — return the count for exit code 3 and the “check the error log” message.
    • Lifecycle: Create the Logger once after validating args (e.g. in root command); pass the same instance to walkers, workers, and output. Call PrintLogPaths() (and close/flush) before exit.
    • Write logs to a secure temporary directory (e.g. os.MkdirTemp with restricted permissions). The main log must record every directory and file discovered (with details) and every comparison performed (with outcome). The error log must record every error.
    • Fatal errors: Abort the program immediately, print the error to stderr, and write it to both the main log and the error log. Then exit with code 2.
    • Non-fatal errors: Write to the main log and the error log; do not abort. Do not print them immediately; maintain an error count. At the end of the run, if any non-fatal errors occurred, tell the user that errors occurred and they should check the error log for details.
    • Before the program finishes (whether successfully or after reporting fatal errors), print the locations of the main log file and the error log file so the user can inspect them. When stdout is not a TTY (non-interactive or piped), this message can be omitted or written only to the log to keep stdout clean for scripting.
    • Never swallow or ignore an error. Every error must be reported to the log files. Fatal errors must abort immediately; non-fatal errors are only recorded in the logs and reflected in the final error count and end message.
  • CI: Create a GitHub Actions workflow that runs on push/PR (e.g. .github/workflows/ci.yml). Use the existing shell scripts where possible (e.g. ./build.sh, ./test.sh, ./smoke-tests.sh). The workflow must: build for Linux; run unit tests; run smoke tests; and run security checks (e.g. go mod verify, govulncheck or equivalent). All steps must pass for the workflow to succeed.

  • Release: Create a GitHub Actions workflow for releases (e.g. .github/workflows/release.yml), triggered by a release or tag. Use the existing shell scripts where possible for building and testing. Bash is available in GitHub Actions runners; use Bash for building and testing the Windows release too (e.g. cross-compile with GOOS=windows, then run smoke tests or sanity checks on the Windows binary in a Windows job if needed). Before building executables: build the project and run unit tests (e.g. via ./build.sh, ./test.sh). Then build executables for Windows, Linux, and macOS (e.g. GOOS/GOARCH matrix; on Windows the binary is ffd.exe). After building, run smoke tests against the built binaries (e.g. Linux runner runs ./smoke-tests.sh with the Linux binary). Create a GitHub Release and attach the built executables as release assets. As specified in the Security section: provide checksummed downloads—generate and attach checksums (e.g. SHA-256) for each executable and document the expected checksums in the release notes so users can verify downloads.