Skip to content

Fix large GEDCOM files failing to load or freezing the browser#296

Merged
PeWu merged 15 commits into
PeWu:masterfrom
robekl:fix/large-gedcom-loading
Jun 4, 2026
Merged

Fix large GEDCOM files failing to load or freezing the browser#296
PeWu merged 15 commits into
PeWu:masterfrom
robekl:fix/large-gedcom-loading

Conversation

@robekl

@robekl robekl commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Large GEDCOM files (10K+ individuals) either failed silently in Firefox/Safari or froze Chrome for 10–20+ seconds. This PR fixes five bugs in the viewer. The underlying chart rendering bug (exponential pedigree traversal) is addressed separately in PeWu/topola#110.

Bugs fixed

1. Firefox and Safari reject the uploaded file immediately

navigate(..., {state: {data: gedcom}}) passes the GEDCOM through history.pushState. Firefox caps this at 640KB; Safari at ~512KB. A 10MB GEDCOM caused an immediate SecurityError — the chart never loaded in either browser.

Fix: Store the GEDCOM in a bounded in-memory store (gedcom_store.ts, capped to 1 entry — new upload evicts the previous). navigate() carries only a content fingerprint hash.

2. JSON.stringify blocks the browser for 10–20 seconds

prepareData serialized the full TopolaData (33MB total) to sessionStorage on every load — always failing the 5MB limit.

Fix: Skip serialization when the raw GEDCOM exceeds 512KB.

3. Chart re-renders 3–5 times per load, each freezing the browser

The Chart useEffect has no dependency array, so it ran after every React render. Each call re-ran the full D3 layout.

Fix: Return early in renderChart() when neither initialRender nor resetPosition is set.

4. Progress indicator disappears before the chart renders

The "Loading…" pill was shown only during AppState.LOADING, disappearing before the D3 layout ran.

Fix: Persistent fixed overlay driven by loadingStatus state, cleared via onFirstRender callback after the chart DOM is ready (intentionally pre-animation — the chart is interactive at this point, and keeping the pill during the 400ms fade-in would create a confusing overlap).

5. Donatso chart: progress pill never cleared

DonatsoChart did not call onFirstRender, so loadingStatus was never cleared.

Fix: Add onFirstRender? to DonatsoChartProps and call it in the first-render branch of DonatsoChart's useEffect.

Additional improvements

  • Async parsing with progressconvertGedcom is async with await setTimeout(0) yields between the four major steps, keeping the spinner responsive and allowing incremental GC. An onProgress callback threads through the full load pipeline (all five datasources: uploaded, URL, Google Drive, WikiTree, embedded) showing step-by-step status.
  • Search index idle pre-build — Lunr index (~600ms for 23K individuals) is scheduled via requestIdleCallback after data loads. Event listener for idle callbacks uses const-scoped handles to eliminate type casts.
  • Fast file fingerprint — replaced full-file md5() (~240ms for 10MB) with metadata + 4KB content sample (~1ms), extracted to src/util/file_fingerprint.ts.
  • Embedded listener cleanup — the window.addEventListener in EmbeddedDataSource.loadData was never removed, accumulating listeners across calls. Now removed on first settlement.
  • Test isolationgedcom_store.spec.ts adds beforeEach(clearStoredGedcom) to prevent Jest module cache from causing order-dependent test failures.

Tests

All passing: 89 unit tests, 13 visual regression tests (macOS snapshots updated to include new progress overlay).

Dependency note

The root-cause chart rendering fix (exponential pedigree traversal in topola) is in PeWu/topola#110. This viewer PR is independently mergeable.

Co-authored-by: Claude Sonnet 4.6 noreply@anthropic.com

robekl and others added 6 commits June 1, 2026 23:29
…M store

Firefox caps history.pushState state at 640KB; Safari at ~512KB. The upload
handler was passing the raw GEDCOM string through navigate(..., {state}),
which calls history.pushState. A 10MB GEDCOM caused an immediate SecurityError
in Firefox and Safari, preventing the chart from ever loading.

Add a bounded in-memory store (gedcom_store.ts) capped to 1 entry — uploading
a new file evicts the previous one, preventing unbounded memory growth during
a session. navigate() now carries only a file fingerprint hash in the URL;
loadGedcom() retrieves the GEDCOM from the store by hash.

Update embedded.ts to store its injected GEDCOM under the key 'embedded'
before calling loadGedcom(), keeping the data path consistent.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…B sample hash

The upload handler called md5() over the full GEDCOM string before navigating
— ~240ms for a 10MB file, blocking the main thread before the user saw
any feedback.

Replace with a hash over file.name|file.size|file.lastModified plus the first
and last 4KB of content. Runtime drops to ~1ms. The 4KB sample captures the
GEDCOM header (software name, export date, submitter info) which differs
between any two real exports, providing strong collision resistance for a local
session cache without a cryptographic guarantee.

Extract fileFingerprint to src/util/file_fingerprint.ts so it can be unit
tested independently of the React component.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…gress callback

convertGedcom was synchronous, blocking the main thread for the full parse
duration (~1.2s parseGedcom + ~700ms gedcomEntriesToJson in Safari). The
browser spinner could not update and GC could not run between steps.

Make convertGedcom async and insert await setTimeout(0) yields between the
four major steps (parse, convert, normalize, index). This breaks the work
into chunks, allows incremental GC, and keeps the spinner responsive.

Add onProgress?(status: string) => void to convertGedcom, prepareData,
loadGedcom, loadFromUrl, loadAndPrepareFile, and the DataSource interface.
All data source implementations (UploadedDataSource, GedcomUrlDataSource,
GoogleDriveDataSource) forward the callback down the load pipeline.
WikiTree and Embedded accept the parameter but do not use it.

Also skip the sessionStorage JSON.stringify for GEDCOM files over 512KB.
The serialized TopolaData (chartData 7.85MB + raw parse-gedcom tree 25MB
= 33MB) always exceeded the 5MB sessionStorage limit, wasting 10-20s in
Safari on a write that was guaranteed to fail.

Simplify loadGedcom to (hash, onProgress?) now that the GEDCOM string is
always retrieved from the in-memory store rather than passed as a parameter.
Remove console.log timing instrumentation and window.__topolaDebug globals
that were left in from debugging.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rough D3 layout

Two related issues caused the chart to freeze or appear unresponsive:

1. The Chart useEffect has no dependency array, so it ran after every React
   render. Each call re-created a JsonDataProvider from all 23K individuals,
   re-ran the ancestor/descendant layout, and called getComputedTextLength for
   every visible text element (~500ms per call in Chrome, longer in Safari).
   For unrelated state changes this froze the browser 3-5 extra times per load.
   Fix: return early in renderChart() when neither initialRender nor
   resetPosition is set — the SVG is already correct.

2. The "Loading..." progress pill was mounted only in AppState.LOADING. When
   data finished parsing, React transitioned to SHOWING_CHART and unmounted the
   pill before the D3 layout (the slow part) had run. Users saw a blank screen
   with no feedback for several seconds.
   Fix: replace the state-gated pill with a persistent fixed overlay driven by
   a loadingStatus string, visible through both LOADING and the initial chart
   render. Add onFirstRender() callback to Chart; App clears loadingStatus when
   the first D3 layout completes.

Wire the onProgress callback from the load pipeline into setLoadingStatus so
users see step-by-step progress ("Step 1/4: parsing GEDCOM..." through
"Rendering chart (23,909 people)...").

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Building the full-text search index over 23K individuals takes ~600ms
synchronously. This was deferred to the first keystroke, freezing the input
field for up to 3 seconds in Safari with no user feedback.

Schedule the build via requestIdleCallback (5s timeout) after data loads,
with a 200ms setTimeout fallback for browsers without requestIdleCallback.
The index is typically warm before the user types. getOrBuildIndex() still
builds synchronously on the first keystroke if idle scheduling has not yet
fired (e.g. user types immediately after load).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ingerprint

gedcom_store.spec.ts
  - Returns undefined for unknown hash
  - Stores and retrieves GEDCOM by hash
  - Evicts the previous entry when a new file is stored (bounded-to-1 behavior)
  - Returns undefined for a hash that was evicted

load_data_store.spec.ts
  - loadGedcom throws ERROR_LOADING_UPLOADED_FILE when hash is absent from store
  - loadGedcom successfully parses GEDCOM retrieved from the in-memory store
  - onProgress callback receives all four step messages in the correct order

upload_menu.spec.ts (via src/util/file_fingerprint.ts)
  - Returns a valid 32-character hex hash
  - Produces identical hashes for identical inputs
  - Differs for different filename, file size, content beyond 4KB boundary,
    and image file list

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@PeWu

PeWu commented Jun 3, 2026

Copy link
Copy Markdown
Owner

Thank your for all the fixes! Do you think you could ask Claude to split the PR into multiple smaller ones? Like one per fix? It will make reviewing the changes easier.

@PeWu

PeWu commented Jun 3, 2026

Copy link
Copy Markdown
Owner

In the Donatso chart, the "Rendering chart" pill doesn't disappear.
To reproduce, start with npm start and open http://localhost:3000/#/view?url=https%3A%2F%2Fraw.githubusercontent.com%2FPeWu%2Ftopola%2Frefs%2Fheads%2Fmaster%2Fdemo%2Fdata%2Ffamily.ged&view=donatso

Also makes npm run test:visual fail.

@PeWu

PeWu commented Jun 3, 2026

Copy link
Copy Markdown
Owner

I reviewed the code. You don't need to split it (unless you already did 😄 )
I left some minor comments. The most important thing is to make tests pass

The progress pill persisted indefinitely on the Donatso chart because
DonatsoChart did not call onFirstRender — only the D3-based Chart did.

Add onFirstRender? to DonatsoChartProps and call it from the existing
useEffect on the first render, consistent with how Chart handles it.
Pass onFirstRender={() => setLoadingStatus('')} from renderChart in app.tsx.

Update visual regression snapshots for macOS (darwin) to reflect the new
progress pill and its correct dismissal across all chart types. The 12
non-Donatso failures were pre-existing snapshot staleness; chart-donatso
was the one newly introduced by this PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@robekl

robekl commented Jun 3, 2026

Copy link
Copy Markdown
Contributor Author

Fixed the Donatso pill issue — added onFirstRender to DonatsoChartProps (called from its first-render useEffect branch), consistent with how the D3 Chart component already handles it. Updated all visual regression snapshots for macOS; the 12 non-Donatso snapshot mismatches were pre-existing staleness, chart-donatso was the only new failure. All tests pass: unit (89), visual (13).

robekl added 6 commits June 3, 2026 22:38
The _onProgress parameter accepted by loadData was never forwarded to the
onMessage event handler, silently dropping all step-progress callbacks for
embedded GEDCOM loads. The progress pill stayed at 'Loading…' for the full
parse duration instead of showing the four-step breakdown.

Thread onProgress through the event-listener closure by adding it as a
parameter to onMessage and passing it to loadGedcom.
Jest caches module instances, so the module-level let variables persisted
across tests. Running with --randomize or in parallel could cause test 1
('returns undefined for unknown hash') to see a hash left by a previous test.

Export clearStoredGedcom() for testing and call it in beforeEach so each test
starts from a clean slate regardless of execution order.
onFirstRender fires when the chart SVG is in the DOM, before the D3
fade-in animation completes (~400ms). A reviewer noted this as a potential
issue, but moving it into animationPromise.then() would keep the progress
pill visible for 400ms after the chart is already visible and interactive,
creating a confusing overlap. The current timing is correct.

Add a comment explaining the design intent.
Both call sites in the loadData callback omitted the onProgress parameter
that all other sources (UPLOADED, GEDCOM_URL, GOOGLE_DRIVE) already forward.
WikiTree's implementation ignores it today (_onProgress), but wiring it here
ensures parity and makes future progress reporting possible without touching
app.tsx again. Embedded now receives it and threads it to loadGedcom.
The shared 'let handle' variable was typed as a union of setTimeout and
requestIdleCallback return types, requiring unsafe casts (handle as number,
handle as ReturnType<typeof setTimeout>) in the cleanup closures. Each
cleanup closure captured the shared variable by reference rather than value.

Use a const declaration in each branch so TypeScript infers the correct
type automatically and no casts are needed. The early-return form also
makes the two branches structurally symmetric.
The listener registered in loadData was never removed, causing it to
accumulate on every loadData call. In a multi-file workflow the second
load would have two listeners firing on the same message, the first
with a stale resolve/reject pair.

Wrap resolve and reject so the listener removes itself on first
settlement, making the registration effectively once-per-load.
@robekl

robekl commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Fixed — added onFirstRender? to DonatsoChartProps and call it in the first-render branch of DonatsoChart's useEffect, consistent with how the D3 Chart component handles it.

While fixing this I also addressed a few issues surfaced during code review:

  • onProgress now threads through all five datasources — WikiTree and Embedded were previously omitting it at the call site. Embedded additionally needed it wired through the event-listener closure into loadGedcom.
  • Embedded event listener leak fixedwindow.addEventListener in EmbeddedDataSource.loadData was never removed. It now wraps resolve/reject to self-remove on first settlement.
  • gedcom_store test isolation — added beforeEach(clearStoredGedcom) so tests are independent of execution order (Jest caches module-level state).
  • Search useEffect handle scoping — replaced shared let handle union type with const-scoped handles in each branch, eliminating the type casts.

All 89 unit tests and 13 visual tests pass.

@PeWu PeWu merged commit 621d581 into PeWu:master Jun 4, 2026
4 checks passed
@PeWu

PeWu commented Jun 4, 2026

Copy link
Copy Markdown
Owner

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants