GitHub - idea-iitd/Graph-Condensation-Position-paper

Abstract

Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, but their scalability is increasingly strained by the size of real-world graphs in domains like recommender systems, fraud detection, and molecular biology.

Graph condensation — the task of generating a smaller synthetic graph that retains the performance of models trained on the original — has emerged as a promising solution. However, the dominant approach of gradient matching introduces a fundamental contradiction: it requires training on the full dataset to create the compressed version, thereby undermining the goal of efficiency.

Worse still, these methods suffer from:

High computational overhead
Poor generalization across GNN architectures
Brittle reliance on specific model configurations

Equally concerning is the community's reliance on misleading evaluation protocols such as node compression ratios, which fail to reflect true resource savings, condensation overhead, and illusory application to neural architecture search.

These shortcomings are not incidental — they are systemic, and they obstruct meaningful progress.

Datasets Used

Data	#Nodes	# Edges	#Classes	#Features
Cora	2,708	10,556	7	1,433
Citeseer	3,327	9,104	6	3,703
Pubmed	19,717	88,648	3	500
Flickr	89,250	899,756	7	500
Ogbn-arxiv	169,343	2,315,598	40	128
Reddit	232,965	23,213,838	41	602
MAG240M	244,160,499	1,728,364,232	153	768

Cora, Citeseer, and Pubmed can be directly downloaded from Planetoid (planetoid). For Flickr, Ogbn-arxiv, and Reddit, we use the datasets provided by GraphSAINT. They are available on this Google Drive link provided by the GraphSAINT team. Download the files and unzip them to datasets at the root directory. For MAG240M, we used link

Experimentation & Techniques used

We have specified the library versions for reproducibilty in requirements.txt

Techniques like Gcond, GDEM, GCSR take as an input param, the compression ratio of nodes since the algorithm requires the final num_nodes param of generated synthetic graph as input. To standardize this wrt techniques using byte size compression, we make the following claim: WLOG, assume size of generated graph as sum of node ids (ints), sum of features (can be ints or double) and sum of edges (written as 2 nodes). Assign space contribution of ints to 2 and features to 4.

With the following in mind, for techniques mentioned above we note the following:

They generate dense outputs, ie for n synthetic nodes, there would be n^2 directed edges. They use adjacency matrix with weighted edges
Irrespective of the original feature space, they generate real valued features (hence double)

Therefore we simplify and obtain our Byte Size B = 4n^2 +4nd (d is feature dimensionality). A simple quadratic which can be solved given B for n.

We have used memory_profiler, tracemalloc and time modules on the standard implementations of the used techniques to evaluate various statistics

We use the original implementations of all techniques with above modifications. These implementations can be found at:

Gcond: https://github.com/ChandlerBang/GCond
GDEM: https://github.com/liuyang-tian/GDEM
GCSR: https://github.com/zclzcl0223/GCSR
Bonsai: https://github.com/idea-iitd/Bonsai/tree/main
GEOM: https://github.com/NUS-HPC-AI-Lab/GEOM
GC-SNTK: https://github.com/WANGLin0126/GCSNTK
ExGC: https://github.com/MangoKiller/EXGC

We have added scripts for train_gcond.py train_gcsr.py, train_gdem.py, train_bonsai.py along with a bash script train_all.sh. We use the edge weight variants of GAT and GIN to train on the synthetic dataset for these methods

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
README.md		README.md
load_saint_dataset.py		load_saint_dataset.py
models.py		models.py
requirements.txt		requirements.txt
timing.py		timing.py
train_all.sh		train_all.sh
train_bonsai.py		train_bonsai.py
train_gcond.py		train_gcond.py
train_gcsr.py		train_gcsr.py
train_gdem.py		train_gdem.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Abstract

Datasets Used

Experimentation & Techniques used

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Abstract

Datasets Used

Experimentation & Techniques used

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages