Skip to content

idea-iitd/Graph-Condensation-Position-paper

 
 

Repository files navigation

Abstract

Graph Neural Networks (GNNs) are powerful tools for learning from graph-structured data, but their scalability is increasingly strained by the size of real-world graphs in domains like recommender systems, fraud detection, and molecular biology.

Graph condensation — the task of generating a smaller synthetic graph that retains the performance of models trained on the original — has emerged as a promising solution. However, the dominant approach of gradient matching introduces a fundamental contradiction: it requires training on the full dataset to create the compressed version, thereby undermining the goal of efficiency.

Worse still, these methods suffer from:

  • High computational overhead
  • Poor generalization across GNN architectures
  • Brittle reliance on specific model configurations

Equally concerning is the community's reliance on misleading evaluation protocols such as node compression ratios, which fail to reflect true resource savings, condensation overhead, and illusory application to neural architecture search.

These shortcomings are not incidental — they are systemic, and they obstruct meaningful progress.

Datasets Used

Data #Nodes # Edges #Classes #Features
Cora 2,708 10,556 7 1,433
Citeseer 3,327 9,104 6 3,703
Pubmed 19,717 88,648 3 500
Flickr 89,250 899,756 7 500
Ogbn-arxiv 169,343 2,315,598 40 128
Reddit 232,965 23,213,838 41 602
MAG240M 244,160,499 1,728,364,232 153 768

Cora, Citeseer, and Pubmed can be directly downloaded from Planetoid (planetoid). For Flickr, Ogbn-arxiv, and Reddit, we use the datasets provided by GraphSAINT. They are available on this Google Drive link provided by the GraphSAINT team. Download the files and unzip them to datasets at the root directory. For MAG240M, we used link

Experimentation & Techniques used

We have specified the library versions for reproducibilty in requirements.txt

Techniques like Gcond, GDEM, GCSR take as an input param, the compression ratio of nodes since the algorithm requires the final num_nodes param of generated synthetic graph as input. To standardize this wrt techniques using byte size compression, we make the following claim: WLOG, assume size of generated graph as sum of node ids (ints), sum of features (can be ints or double) and sum of edges (written as 2 nodes). Assign space contribution of ints to 2 and features to 4.

With the following in mind, for techniques mentioned above we note the following:

  1. They generate dense outputs, ie for n synthetic nodes, there would be n^2 directed edges. They use adjacency matrix with weighted edges
  2. Irrespective of the original feature space, they generate real valued features (hence double)

Therefore we simplify and obtain our Byte Size B = 4n^2 +4nd (d is feature dimensionality). A simple quadratic which can be solved given B for n.

We have used memory_profiler, tracemalloc and time modules on the standard implementations of the used techniques to evaluate various statistics

We use the original implementations of all techniques with above modifications. These implementations can be found at:

  1. Gcond: https://github.com/ChandlerBang/GCond
  2. GDEM: https://github.com/liuyang-tian/GDEM
  3. GCSR: https://github.com/zclzcl0223/GCSR
  4. Bonsai: https://github.com/idea-iitd/Bonsai/tree/main
  5. GEOM: https://github.com/NUS-HPC-AI-Lab/GEOM
  6. GC-SNTK: https://github.com/WANGLin0126/GCSNTK
  7. ExGC: https://github.com/MangoKiller/EXGC

We have added scripts for train_gcond.py train_gcsr.py, train_gdem.py, train_bonsai.py along with a bash script train_all.sh. We use the edge weight variants of GAT and GIN to train on the synthetic dataset for these methods

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 95.1%
  • Shell 4.9%