Performance Discrepancy in Reproducing Results for AdvHotPotQA and FEVER #1

@hipros

First of all, thank you for your great work!

While reproducing the results from the paper, I obtained lower performance than reported.
I would like to check whether I missed anything in my setup.

I conducted experiments on AdvHotPotQA and FEVER using GPT-3.5-turbo, and the results are as follows:

| Model | Dataset | Paper Performance | Reproduced Performance |
| --- | --- | --- | --- |
| GPT-3.5-turbo | AdvHotPotQA | 42.9 | 0.3636 |
| GPT-3.5-turbo | FEVER | 63.1 | 0.605 |
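
The two columns appear to use different scales, so the gap is easier to read once both are on the same one. The sketch below assumes the paper numbers are percentages and the reproduced numbers are fractions in [0, 1]; the helper names `to_percent` and `gaps` are made up for illustration and are not part of the repository.

```python
# Assumption: scores <= 1.0 are fractions, anything larger is already a percentage.
def to_percent(score):
    return score * 100 if score <= 1.0 else score

# Values copied from the table above.
results = {
    "AdvHotPotQA": {"paper": 42.9, "reproduced": 0.3636},
    "FEVER": {"paper": 63.1, "reproduced": 0.605},
}

def gaps(results):
    """Return the paper-minus-reproduced gap per dataset, in percentage points."""
    out = {}
    for dataset, r in results.items():
        gap = to_percent(r["paper"]) - to_percent(r["reproduced"])
        out[dataset] = round(gap, 2)
    return out

for dataset, gap in gaps(results).items():
    print(f"{dataset}: {gap} points below the paper")
```

Under that assumption the shortfall is roughly 6.5 points on AdvHotPotQA and 2.6 points on FEVER.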

Below are the scripts I used for my experiment:

  • Wikidata DB Script
$ wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
$ python preprocess_dump.py --input_file ./latest-all.json.gz --out_dir ./processed_wiki
$ python build_index.py --input_dir ./processed_wiki --output_dir ./processed_wiki/indices --num_chunks 16

$ python server.py --data_dir ./processed_wiki --chunk_number 0 --host_ip <server IP> --port 23546
$ python server.py --data_dir ./processed_wiki --chunk_number 1 --host_ip <server IP> --port 23547
…
$ python server.py --data_dir ./processed_wiki --chunk_number 15 --host_ip <server IP> --port 23562
  • ToG 2.0 Execution Script
$ python main_tog2.py \
--dataset hotpot_e \
--max_length 256 \
--temperature_exploration 0 \
--temperature_reasoning 0 \
--width 3 \
--depth 3 \
--remove_unnecessary_rel True \
--LLM_type_rp gpt-3.5-turbo-16k \
--LLM_type gpt-3.5-turbo \
--opeani_api_keys <openai_api_key> \
--embedding_model_name bge-bi \
--relation_prune_combination True \
--num_sents_for_reasoning 10 \
--topic_prune True \
--self_consistency_threshold 0.8 \
--clue_query True
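
For reference, the 16 server launches can be written out programmatically to rule out a port mismatch. This is only a sketch: it assumes ports increase by one per chunk starting at 23546, and the function name `server_commands` is hypothetical. The flags are the ones shown in the Wikidata DB script above.

```python
def server_commands(num_chunks=16, base_port=23546,
                    host_ip="<server IP>", data_dir="./processed_wiki"):
    """Build one server.py launch command per index chunk.

    Assumption: chunk i listens on base_port + i.
    """
    commands = []
    for chunk in range(num_chunks):
        commands.append(
            f"python server.py --data_dir {data_dir} "
            f"--chunk_number {chunk} --host_ip {host_ip} "
            f"--port {base_port + chunk}"
        )
    return commands

for cmd in server_commands():
    print(cmd)
```

With one port per chunk, chunk 15 lands on port 23561, while the listing above uses 23562, so the port list given to the client may be worth double-checking against the ports the servers actually listen on.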

I would appreciate any insights on what might be causing this performance discrepancy.
Please let me know if there are any additional steps or configurations I should check.
