Performance Discrepancy in Reproducing Results for AdvHotPotQA and FEVER

First of all, thank you for your great work!

While reproducing the results from the paper, I obtained lower scores than those reported.
I would like to check whether I missed anything in my setup.

I conducted experiments on AdvHotPotQA and FEVER using GPT-3.5-turbo, and the results are as follows:

| Model         | Dataset     | Paper Performance (%) | Reproduced Performance (0-1 scale) |
|---------------|-------------|-----------------------|------------------------------------|
| GPT-3.5-turbo | AdvHotPotQA | 42.9                  | 0.3636                             |
| GPT-3.5-turbo | FEVER       | 63.1                  | 0.605                              |
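
To make the gap easier to read, here is a small sketch that puts both columns on the same scale. It assumes the paper's numbers are percentages and my reproduced numbers are fractions between 0 and 1. The helper name `gap_in_points` is only for illustration and is not part of the repository.

```python
def gap_in_points(paper_pct: float, reproduced_fraction: float) -> float:
    """Return paper score minus reproduced score, both in percentage points.

    Assumption: the paper reports percentages, while the reproduced score
    is a fraction in [0, 1], as the table values suggest.
    """
    reproduced_pct = reproduced_fraction * 100.0
    return round(paper_pct - reproduced_pct, 2)


# (paper score in %, reproduced score as a fraction), copied from the table
results = {
    "AdvHotPotQA": (42.9, 0.3636),
    "FEVER": (63.1, 0.605),
}

for dataset, (paper_pct, reproduced_fraction) in results.items():
    gap = gap_in_points(paper_pct, reproduced_fraction)
    print(f"{dataset}: {gap} points below the paper")
```

Under that assumption, the reproduced scores are roughly 6.5 points lower on AdvHotPotQA and 2.6 points lower on FEVER.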

Below are the scripts I used for my experiment:

- Wikidata DB Script

```
$ wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
$ python preprocess_dump.py --input_file ./latest-all.json.gz --out_dir ./processed_wiki
$ python build_index.py --input_dir ./processed_wiki --output_dir ./processed_wiki/indices --num_chunks 16

$ python server.py --data_dir ./processed_wiki --chunk_number 0 --host_ip <server IP> --port 23546
$ python server.py --data_dir ./processed_wiki --chunk_number 1 --host_ip <server IP> --port 23547
…
$ python server.py --data_dir ./processed_wiki --chunk_number 15 --host_ip <server IP> --port 23562
```

- ToG 2.0 Execution Script

```
$ python main_tog2.py \
--dataset hotpot_e \
--max_length 256 \
--temperature_exploration 0 \
--temperature_reasoning 0 \
--width 3 \
--depth 3 \
--remove_unnecessary_rel True \
--LLM_type_rp gpt-3.5-turbo-16k \
--LLM_type gpt-3.5-turbo \
--opeani_api_keys <openai_api_key> \
--embedding_model_name bge-bi \
--relation_prune_combination True \
--num_sents_for_reasoning 10 \
--topic_prune True \
--self_consistency_threshold 0.8 \
--clue_query True
```

I would appreciate any insights on what might be causing this performance discrepancy. 
Please let me know if there are any additional steps or configurations I should check.
