method | model name | accuracy | overall score | consistency |
---|---|---|---|---|
DRAGON | gpt-3.5-turbo | 4.058 | 3.632 | 3.735 |
DRAGON | gpt-4 | 3.97 | 3.567 | 3.689 |
DRAGON | nous-hermes-13b | 3.776 | 3.389 | 3.566 |
curator | human | 4.326 | 4.069 | 4.13 |
- A comparison of DRAGON's base performance on definition generation against existing editor-provided ontology definitions. Evaluator scores are shown for three categories (accuracy, consistency, and overall score). Evaluators rated definitions generated by three different models alongside the existing ontology definitions, and were not shown the source of each definition until after evaluation.