Motivation for New Benchmarks
Popular benchmarks for reporting model performance today are seriously lacking.
Most benchmarks are based on contrived academic datasets. It is far more relevant to study how models perform on the industry-specific tasks where they will actually be used.
Live leaderboards are often compromised. Researchers release evaluation datasets openly, but that data then finds its way into pre-training corpora, inflating the reported results. Bad actors can also fine-tune their models directly on evaluation sets, rendering openly hosted leaderboards unreliable.
Results posted by the companies building the models are biased. When large language model providers share results for a new model, they do so with cherry-picked demo examples or with an evaluation regimen the model has been optimized to perform well on.
Our Plans
To address these problems, we are building custom benchmarks for specific tasks that mirror real industry use cases. To avoid dataset leakage, we keep the data we use private and secure. We review these models as a neutral third party: we provide unbiased evaluations and do not cherry-pick tasks. We work closely with researchers and industry partners, but intend our reports to be accessible to general audiences.
We are continually expanding the scope of our benchmarks to cover more domains and task types, and we evaluate new language models and methods as they become available. Reach out if you are interested in contributing or have ideas we should consider.
Vals AI Platform
We use our own evaluation infrastructure to create these benchmarks. It allows us to collect review criteria from subject-matter experts and then evaluate any LLM at scale. Not only can the platform expose model performance on these general domains, it can also evaluate any LLM application on task-specific data. We are currently extending early access to the platform on a case-by-case basis. If this is of interest, get in touch.
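As a rough, hypothetical sketch of what expert-rubric evaluation at this level can look like (this is not the Vals AI API; `call_model`, `Criterion`, and the example rubric are illustrative placeholders):

```python
# Hypothetical sketch of a rubric-based evaluation loop -- not the Vals AI API.
# `call_model` stands in for any LLM client; the rubric and dataset are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    """A review criterion supplied by a subject-matter expert."""
    name: str
    check: Callable[[str], bool]  # maps a model response to pass/fail

def call_model(prompt: str) -> str:
    """Placeholder for an actual LLM call (e.g., a vendor SDK or a local model)."""
    raise NotImplementedError

def evaluate(dataset: list[dict], rubric: list[Criterion]) -> dict[str, float]:
    """Score every example against every criterion and return pass rates."""
    passes = {c.name: 0 for c in rubric}
    for example in dataset:
        response = call_model(example["prompt"])
        for criterion in rubric:
            if criterion.check(response):
                passes[criterion.name] += 1
    return {name: count / len(dataset) for name, count in passes.items()}

# Example: a toy criterion an expert might write for a legal task.
rubric = [Criterion("cites_statute", lambda response: "U.S.C." in response)]
```

Keeping the dataset and rubric on the evaluation side, rather than publishing them, is what prevents the leakage problems described above.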