|
1 | | -# Apistemic Benchmarks |
| 1 | +# Company-data focused LLM Benchmarks |
| 2 | +We've spent thousands of dollars evaluating LLM performance with company data - so you can skip straight to the results. |
| 3 | +Watch this repo to get a notification as soon as there's a new benchmark or model. |
2 | 4 |
|
3 | 5 | Since we do a lot of LLM-based company analysis at [apistemic](https://apistemic.com), |
4 | | -we decided to have one central place to keep track of all the benchmarks we do. |
| 6 | +we decided to have one central place to keep track of all the benchmarks. |
5 | 7 | This repo thus covers many business/company-related LLM benchmarks. |
6 | 8 |
|
7 | | -## Implicit Company Knowledge in Embeddings |
8 | | -**Goal**: |
9 | | -With this benchmark, we want to explore how well LLMs understand companies (and markets). |
| 9 | +## How well do LLMs understand companies? |
| 10 | +Firstly, we want to evaluate how much inherent knowledge LLMs have about companies and markets. |
| 11 | +To do this, we just use company names in all benchmarks without any further context provided. |
| 12 | + |
| 13 | +### Benchmark: Measuring company knowledge inherent in embeddings {#embeddings-benchmark} |
| 14 | +To measure the LLMs' company knowledge in both breadth and depth,
| 15 | +we embed company names in this benchmark.
| 16 | +The assumption is that the more inherent knowledge an LLM has about companies,
| 17 | +the more information its embeddings contain.
10 | 18 |
|
11 | 19 | **Methodology**: |
12 | 20 | To measure inherent company knowledge, we prompt the model with each company name to get an embedding.
13 | | -These embeddings are then used as the only input for a complex regression task, |
14 | | -i.e. scoring the competitiveness of two companies. |
| 21 | +These embeddings are then used as the only inputs for a complex regression task, |
| 22 | +namely scoring the competitiveness of two companies via an SVM. |
15 | 23 | This task requires a broad and deep understanding of markets, individual companies, business models, and more.
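
The methodology above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the random vectors stand in for real model embeddings, the ratings are synthetic, and the choices of scikit-learn's `SVR` with an RBF kernel and of concatenating the two embeddings into one feature vector are assumptions (the text only states that an SVM is used).

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Hypothetical stand-in for real embeddings: one random vector per company name.
companies = [f"company_{i}" for i in range(40)]
dim = 16
embeddings = {name: rng.normal(size=dim) for name in companies}


def pair_features(a: str, b: str) -> np.ndarray:
    """Concatenate both company embeddings into one feature vector (assumed design)."""
    return np.concatenate([embeddings[a], embeddings[b]])


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# All unordered company pairs; real data would contain expert-rated pairs instead.
pairs = [(companies[i], companies[j]) for i in range(40) for j in range(i + 1, 40)]
X = np.stack([pair_features(a, b) for a, b in pairs])

# Synthetic ratings on the 1-5 scale, derived from cosine similarity for illustration only.
y = np.array([1 + 4 * (cosine(embeddings[a], embeddings[b]) + 1) / 2 for a, b in pairs])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# The embeddings are the only inputs to the regressor.
model = SVR(kernel="rbf", C=1.0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("held-out R²:", r2_score(y_test, pred))
```

Concatenation makes the features order-dependent, so a real setup might instead use a symmetric combination (for example the element-wise product) if the rating of a pair should not depend on which company comes first.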
16 | 24 |
|
17 | 25 | **Dataset**: |
18 | | -The dataset for this benchmark is provided by [apistemic markets](https://markets.apistemic.com). |
19 | | -The data contains expert evaluations of competitive positioning between company pairs, |
20 | | -where industry professionals assessed relative competitiveness using a standardized five-point scale. |
21 | | -These assessments span diverse sectors, encompassing companies of varying sizes and geographic locations |
22 | | -to ensure comprehensive coverage across different market contexts. |
| 26 | +See [Competitive Positioning Dataset from Apistemic Markets](#competitive-positioning-dataset). |
23 | 27 |
|
24 | 28 | **Results**: |
25 | 29 |  |
26 | 30 |
|
27 | | -## LLM-based Company Competitiveness Scoring |
28 | | -**Goal**: |
29 | | -Evaluate how well different LLMs can directly assess company competitiveness compared to human raters. |
| 31 | +### Benchmark: Measuring inherent company knowledge by rating competitiveness {#rating-benchmark} |
| 32 | +As a second benchmark to measure company knowledge, |
| 33 | +we use the same task as before and prompt the LLMs directly this time. |
| 34 | +We thus provide each LLM with the same instructions a human rater got |
| 35 | +and ask it to rate the competitiveness of two companies.
| 36 | +Our assumption is that the more knowledge (and understanding) an LLM has, both in breadth and depth,
| 37 | +the better it can perform a competitiveness evaluation. |
30 | 38 |
|
31 | 39 | **Methodology**: |
32 | | -We prompt LLMs to rate the competitiveness between company pairs on a 1-5 scale. |
| 40 | +This benchmark prompts LLMs to rate the competitiveness of company pairs on a 1-5 scale. |
33 | 41 | We previously prompted human raters to do the same with the same prompts. |
34 | | - |
35 | 42 | The LLMs receive only company names and must use their internal knowledge to assess competitive relationships. |
36 | | -Performance is measured using R² scores and Spearman correlations against expert human evaluations. |
37 | | -While R² should rate overall similarity to human raters, Spearman correlations between human and LLM ratings should indicate directional correctness, |
| 43 | +Performance is then measured using R² scores and Spearman correlations against expert human evaluations. |
| 44 | +While R² measures how closely the LLM ratings match the human ratings in absolute terms,
| 45 | +Spearman correlations between human and LLM ratings indicate directional correctness,
38 | 46 | i.e. whether the LLM has a sense of competitiveness more generally. |
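
To illustrate why both metrics are reported, here is a small self-contained sketch with made-up ratings (not benchmark data). The helper functions are plain-Python implementations written for this example. An LLM that orders pairs exactly like the human raters but rates everything two points too high gets a perfect Spearman correlation and a strongly negative R².

```python
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    m = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - m) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the (tie-averaged) ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up ratings on the 1-5 scale.
human = [1, 2, 3, 3, 2, 1]
llm = [3, 4, 5, 5, 4, 3]  # same ordering as the humans, but shifted up by two points

print("Spearman:", spearman(human, llm))
print("R²:", r2_score(human, llm))
```

In this toy case the ordering agrees completely while the absolute ratings do not, which is the kind of gap the two metrics are meant to separate.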
39 | 47 |
|
40 | 48 | **Dataset**: |
41 | | -Same expert-evaluated competitive positioning dataset from [apistemic markets](https://markets.apistemic.com). |
| 49 | +See [Competitive Positioning Dataset from Apistemic Markets](#competitive-positioning-dataset). |
42 | 50 |
|
43 | 51 | **Results**: |
44 | 52 |  |
45 | 53 |
|
46 | 54 |  |
| 55 | + |
| 56 | +## Datasets |
| 57 | + |
| 58 | +Our benchmarks are based on proprietary datasets. |
| 59 | +This section describes each dataset used.
| 60 | + |
| 61 | +### Competitive Positioning Dataset from Apistemic Markets {#competitive-positioning-dataset} |
| 62 | + |
| 63 | +**Source**: [apistemic markets](https://markets.apistemic.com) |
| 64 | + |
| 65 | +**Description**: |
| 66 | +Expert evaluations of competitive positioning between company pairs, |
| 67 | +where industry professionals assessed relative competitiveness using a standardized five-point scale. |
| 68 | +These assessments span diverse sectors, encompassing companies of varying sizes and geographic locations |
| 69 | +to ensure comprehensive coverage across different market contexts. |
| 70 | + |
| 71 | +**Used in**: |
| 72 | +- [Benchmark: Measuring company knowledge inherent in embeddings](#embeddings-benchmark) |
| 73 | +- [Benchmark: Measuring inherent company knowledge by rating competitiveness](#rating-benchmark) |