GENIE: A Fine-Grained Measure for Novelty

Large language models have consistently shown a lack of creativity and diversity across tasks. We propose GENIE (Granular Evaluation of Novel Ideas with Explainability), a fine-grained evaluation metric that measures the novelty of responses along task-specific features with respect to a population of responses.


How GENIE works

GENIE evaluates novelty in four stages: automatic feature discovery, question generation, population building, and population-relative dissimilarity scoring. The figure shows how GENIE is instantiated on the creative writing task.

Q-GENIE Visualizer

Q-GENIE scores help pinpoint the details that make a response unique with respect to the population. Given a question, explore how novel the answers extracted from target responses are relative to the answers extracted from population responses. Click the population bubble to view its contents. Target answers that lie near the population are less novel (that is, more in-distribution with the population).
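As a rough illustration of what "near the population means less novel" can look like in code, the sketch below scores one target answer against a set of population answers. It is not GENIE's implementation: the bag-of-words representation, cosine distance, and mean aggregation are assumptions chosen to keep the example self-contained, and the names `bag_of_words`, `cosine_similarity`, and `novelty_score` are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Toy stand-in for an embedding (assumption): lowercase token counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty_score(target_answer, population_answers):
    """Mean cosine distance from the target answer to every population answer.

    Higher values mean the target sits farther from the population.
    The mean is an assumed aggregation, not necessarily the one GENIE uses.
    """
    if not population_answers:
        raise ValueError("population_answers must not be empty")
    target_vec = bag_of_words(target_answer)
    distances = [
        1.0 - cosine_similarity(target_vec, bag_of_words(answer))
        for answer in population_answers
    ]
    return sum(distances) / len(distances)


# Made-up answers to a hypothetical question such as "Where is the story set?"
population = ["a haunted lighthouse", "a haunted castle", "a haunted lighthouse"]
common = novelty_score("a haunted lighthouse", population)
unusual = novelty_score("an orbital greenhouse", population)
print(common < unusual)
```

In this toy setup an answer that repeats what the population already says scores low, while an answer sharing no tokens with the population scores at the maximum distance. A real system would likely swap the token counts for sentence embeddings.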

Dataset statistics

Creative writing features: 6
Population responses: 3,404
Target responses: 4,500
Models used: 39
Writing prompts: 50

Novelty score distribution by feature

Kernel density estimates of GENIE scores across the target document set, per feature.
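For readers unfamiliar with kernel density estimates, the following is a minimal sketch of a Gaussian KDE over a list of scores. The fixed `bandwidth` parameter, the function name `gaussian_kde`, and the sample scores are assumptions for illustration; the page does not state which kernel or bandwidth was used for the plots.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from samples with a Gaussian kernel.

    bandwidth is a fixed smoothing width (an assumed, hand-picked value here).
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, each centered on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up scores for a single feature: three clustered low, one high.
scores = [0.2, 0.25, 0.3, 0.8]
density = gaussian_kde(scores, bandwidth=0.05)
print(density(0.25) > density(0.6))
```

A smaller bandwidth yields a spikier curve that follows individual scores, while a larger one smooths the distribution at the cost of blurring separate modes.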

Model novelty scores

We computed the GENIE scores for target responses generated by 18 models across 50 creative writing prompts, with respect to a population of 4,500 responses generated by 21 models. Select a feature to view model rankings and use the filters to narrow by model family or type.
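One plausible way to turn per-response scores into a per-feature model ranking is to average each model's scores for the selected feature and sort. The sketch below assumes that aggregation; the record layout, the function name `rank_models`, and the model and feature names are placeholders rather than the actual data schema.

```python
from collections import defaultdict


def rank_models(records, feature):
    """Rank models by mean score on one feature, highest first.

    records: iterable of (model, feature, score) tuples (assumed layout).
    """
    totals = defaultdict(lambda: [0.0, 0])  # model -> [score sum, count]
    for model, record_feature, score in records:
        if record_feature == feature:
            totals[model][0] += score
            totals[model][1] += 1
    means = {model: total / count for model, (total, count) in totals.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Placeholder records; real scores would come from the released data.
records = [
    ("model-a", "plot", 0.5),
    ("model-a", "plot", 0.75),
    ("model-b", "plot", 0.25),
    ("model-a", "setting", 0.125),
]
for model, mean_score in rank_models(records, "plot"):
    print(model, mean_score)
```

Filtering by model family or type would amount to dropping records before calling the function, so the ranking logic itself stays unchanged.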
