Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, but their scalability is increasingly strained by the size of real-world graphs in domains like recommender systems, fraud detection, and molecular biology.
Graph condensation — the task of generating a smaller synthetic graph that retains the performance of models trained on the original — has emerged as a promising solution. However, the dominant approach of gradient matching introduces a fundamental contradiction: it requires training on the full dataset to create the compressed version, thereby undermining the goal of efficiency.
Worse still, these methods suffer from:
- High computational overhead
- Poor generalization across GNN architectures
- Brittle reliance on specific model configurations
Equally concerning is the community's reliance on misleading evaluation protocols such as node compression ratios, which fail to reflect true resource savings, condensation overhead, and illusory application to neural architecture search.
These shortcomings are not incidental — they are systemic, and they obstruct meaningful progress.
| Data | #Nodes | # Edges | #Classes | #Features |
|---|---|---|---|---|
| Cora | 2,708 | 10,556 | 7 | 1,433 |
| Citeseer | 3,327 | 9,104 | 6 | 3,703 |
| Pubmed | 19,717 | 88,648 | 3 | 500 |
| Flickr | 89,250 | 899,756 | 7 | 500 |
| Ogbn-arxiv | 169,343 | 2,315,598 | 40 | 128 |
| 232,965 | 23,213,838 | 41 | 602 | |
| MAG240M | 244,160,499 | 1,728,364,232 | 153 | 768 |
Cora, Citeseer, and Pubmed can be directly downloaded from Planetoid (planetoid). For Flickr, Ogbn-arxiv, and Reddit, we use the datasets provided by GraphSAINT. They are available on this Google Drive link provided by the GraphSAINT team. Download the files and unzip them to datasets at the root directory. For MAG240M, we used link
We have specified the library versions for reproducibilty in requirements.txt
Techniques like Gcond, GDEM, GCSR take as an input param, the compression ratio of nodes since the algorithm requires the final num_nodes param of generated synthetic graph as input. To standardize this wrt techniques using byte size compression, we make the following claim: WLOG, assume size of generated graph as sum of node ids (ints), sum of features (can be ints or double) and sum of edges (written as 2 nodes). Assign space contribution of ints to 2 and features to 4.
With the following in mind, for techniques mentioned above we note the following:
- They generate dense outputs, ie for n synthetic nodes, there would be n^2 directed edges. They use adjacency matrix with weighted edges
- Irrespective of the original feature space, they generate real valued features (hence double)
Therefore we simplify and obtain our Byte Size B = 4n^2 +4nd (d is feature dimensionality). A simple quadratic which can be solved given B for n.
We have used memory_profiler, tracemalloc and time modules on the standard implementations of the used techniques to evaluate various statistics
We use the original implementations of all techniques with above modifications. These implementations can be found at:
- Gcond: https://github.com/ChandlerBang/GCond
- GDEM: https://github.com/liuyang-tian/GDEM
- GCSR: https://github.com/zclzcl0223/GCSR
- Bonsai: https://github.com/idea-iitd/Bonsai/tree/main
- GEOM: https://github.com/NUS-HPC-AI-Lab/GEOM
- GC-SNTK: https://github.com/WANGLin0126/GCSNTK
- ExGC: https://github.com/MangoKiller/EXGC
We have added scripts for train_gcond.py train_gcsr.py, train_gdem.py, train_bonsai.py along with a bash script train_all.sh. We use the edge weight variants of GAT and GIN to train on the synthetic dataset for these methods