
Commit b4b66a0

committed
docs: Improved README
1 parent 5d67345 commit b4b66a0

1 file changed

Lines changed: 47 additions & 20 deletions

File tree

README.md

@@ -1,46 +1,73 @@
1-
# Apistemic Benchmarks
1+
# Company-data focused LLM Benchmarks
2+
We've spent thousands of dollars evaluating LLM performance with company data - so you can skip straight to the results.
3+
Watch this repo to get a notification as soon as there's a new benchmark or model.
24

35
Since we do a lot of LLM-based company analysis at [apistemic](https://apistemic.com),
4-
we decided to have one central place to keep track of all the benchmarks we do.
6+
we decided to have one central place to keep track of all the benchmarks.
57
This repo thus covers many business/company-related LLM benchmarks.
68

7-
## Implicit Company Knowledge in Embeddings
8-
**Goal**:
9-
With this benchmark, we want to explore how well LLMs understand companies (and markets).
9+
## How well do LLMs understand companies?
10+
Firstly, we want to evaluate how much inherent knowledge LLMs have about companies and markets.
11+
To do this, we just use company names in all benchmarks without any further context provided.
12+
13+
### Benchmark: Measuring company knowledge inherent in embeddings {#embeddings-benchmark}
14+
To measure the LLMs' company knowledge in both breadth and depth,
15+
we embed company names in this benchmark.
16+
The assumption is that the more inherent knowledge an LLM has about companies,
17+
the more information its embeddings contain.
1018

1119
**Methodology**:
1220
To measure inherent company knowledge, we pass each company's name to the model to obtain an embedding.
13-
These embeddings are then used as the only input for a complex regression task,
14-
i.e. scoring the competitiveness of two companies.
21+
These embeddings are then used as the only inputs for a complex regression task,
22+
namely scoring the competitiveness of two companies via an SVM.
1523
This task requires a broad and deep understanding of markets, individual companies, business models, and more.
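
As a rough illustration of this setup (not the repository's actual code), the sketch below feeds the embeddings of two company names into a support vector regressor and reports one R² score per cross-validation fold. The embeddings and ratings are synthetic stand-ins, and the concatenation of the two embeddings, the RBF kernel, `C=1.0`, and the 5-fold split are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR


def pair_features(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Concatenate two name embeddings into one feature row per company pair.

    Concatenation is an assumed encoding; other pair encodings are possible.
    """
    return np.concatenate([emb_a, emb_b], axis=1)


# Synthetic stand-ins: 200 company pairs with 16-dimensional name embeddings.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(200, 16))
emb_b = rng.normal(size=(200, 16))

# Synthetic competitiveness ratings, clipped to the five-point scale.
noise = rng.normal(scale=0.3, size=200)
ratings = np.clip(3.0 + emb_a[:, 0] * emb_b[:, 0] + noise, 1.0, 5.0)

X = pair_features(emb_a, emb_b)

# Support vector regression; kernel and C are illustrative, not tuned values.
model = SVR(kernel="rbf", C=1.0)

# One R² score per fold.
r2_per_fold = cross_val_score(model, X, ratings, cv=5, scoring="r2")
print(r2_per_fold.round(3))
```

With real data, `emb_a` and `emb_b` would hold the embeddings each model returns for the two company names, and `ratings` the human competitiveness ratings.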
1624

1725
**Dataset**:
18-
The dataset for this benchmark is provided by [apistemic markets](https://markets.apistemic.com).
19-
The data contains expert evaluations of competitive positioning between company pairs,
20-
where industry professionals assessed relative competitiveness using a standardized five-point scale.
21-
These assessments span diverse sectors, encompassing companies of varying sizes and geographic locations
22-
to ensure comprehensive coverage across different market contexts.
26+
See [Competitive Positioning Dataset from Apistemic Markets](#competitive-positioning-dataset).
2327

2428
**Results**:
2529
![benchmark of LLM embeddings](.data/plots/r2-scores-boxplot.png)
2630

27-
## LLM-based Company Competitiveness Scoring
28-
**Goal**:
29-
Evaluate how well different LLMs can directly assess company competitiveness compared to human raters.
31+
### Benchmark: Measuring inherent company knowledge by rating competitiveness {#rating-benchmark}
32+
As a second benchmark to measure company knowledge,
33+
we use the same task as before and prompt the LLMs directly this time.
34+
We thus provide each LLM with the same instructions a human rater got
35+
and ask it to rate the competitiveness of two companies.
36+
Our assumption is that the more knowledge (and understanding) an LLM has, both in breadth and depth,
37+
the better it can perform a competitiveness evaluation.
3038

3139
**Methodology**:
32-
We prompt LLMs to rate the competitiveness between company pairs on a 1-5 scale.
40+
This benchmark prompts LLMs to rate the competitiveness of company pairs on a 1-5 scale.
3341
We previously prompted human raters to do the same with the same prompts.
34-
3542
The LLMs receive only company names and must use their internal knowledge to assess competitive relationships.
36-
Performance is measured using R² scores and Spearman correlations against expert human evaluations.
37-
While R² should rate overall similarity to human raters, Spearman correlations between human and LLM ratings should indicate directional correctness,
43+
Performance is then measured using R² scores and Spearman correlations against expert human evaluations.
44+
While R² should rate overall similarity to human raters,
45+
Spearman correlations between human and LLM ratings should indicate directional correctness,
3846
i.e. whether the LLM has a sense of competitiveness more generally.
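
To make the difference between the two metrics concrete, here is a minimal, dependency-free sketch of both. The function names and the example ratings are hypothetical and only serve the illustration; they are not taken from the benchmark code or data.

```python
from math import sqrt


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings: the LLM orders the pairs sensibly but rates too high.
human = [1, 2, 3, 4, 5]
llm = [3, 3, 4, 5, 5]
r2 = r2_score(human, llm)
rho = spearman(human, llm)
print(f"R2 = {r2:.3f}, Spearman = {rho:.3f}")
```

In this made-up example the LLM's ordering closely follows the human one (Spearman of about 0.95), while its systematically higher ratings pull R² down to 0.3.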
3947

4048
**Dataset**:
41-
Same expert-evaluated competitive positioning dataset from [apistemic markets](https://markets.apistemic.com).
49+
See [Competitive Positioning Dataset from Apistemic Markets](#competitive-positioning-dataset).
4250

4351
**Results**:
4452
![LLM R² scores](.data/plots/r2-scores-barplot.png)
4553

4654
![LLM Spearman correlations](.data/plots/spearman-correlations-barplot.png)
55+
56+
## Datasets
57+
58+
Our benchmarks are based on proprietary datasets.
59+
This section describes each dataset used.
60+
61+
### Competitive Positioning Dataset from Apistemic Markets {#competitive-positioning-dataset}
62+
63+
**Source**: [apistemic markets](https://markets.apistemic.com)
64+
65+
**Description**:
66+
Expert evaluations of competitive positioning between company pairs,
67+
where industry professionals assessed relative competitiveness using a standardized five-point scale.
68+
These assessments span diverse sectors, encompassing companies of varying sizes and geographic locations
69+
to ensure comprehensive coverage across different market contexts.
70+
71+
**Used in**:
72+
- [Benchmark: Measuring company knowledge inherent in embeddings](#embeddings-benchmark)
73+
- [Benchmark: Measuring inherent company knowledge by rating competitiveness](#rating-benchmark)
