The data loader module provides a standard way to feed evaluation data into tests. Use it to:
- Build reusable input sources (adapters, files, generators)
- Parameterize datasets with clear variant labeling
- Preprocess inputs consistently (e.g., expand multi-turn data)
Components
DynamicDataLoader
Takes zero-argument callables that each return a list of EvaluationRow. Each callable becomes a labeled variant.
from eval_protocol import DynamicDataLoader
from eval_protocol.models import EvaluationRow

def my_generator() -> list[EvaluationRow]:
    # Fetch or generate rows here (adapters, DB, etc.)
    return []

data_loader = DynamicDataLoader(
    generators=[my_generator],
)
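Because each callable becomes its own variant, descriptive function names and docstrings make the variants easier to tell apart. This page describes the variant id as the "callable name" and the description as the "docstring/description"; the sketch below uses only the standard library to show what those two values look like for an ordinary function. `describe_generator` and the two generator functions are hypothetical names for illustration and are not part of eval_protocol, whose internals may derive these values differently.

```python
import inspect
from typing import Callable


def describe_generator(fn: Callable[[], list]) -> dict:
    """Hypothetical helper: read the name and docstring a callable exposes."""
    return {
        "variant_id": fn.__name__,
        "variant_description": inspect.getdoc(fn),
    }


def recent_traces() -> list:
    """Rows sampled from recent production traces."""
    return []


def regression_cases() -> list:
    """Hand-written regression cases."""
    return []


# One label per generator, in the order the generators are listed.
labels = [describe_generator(g) for g in (recent_traces, regression_cases)]
```

A lambda or a functools.partial object carries no useful name or docstring, so a named function is likely the safer choice for a generator.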
InlineDataLoader
Use this when the rows, or the raw message lists, are already defined inline in your code.
from eval_protocol import InlineDataLoader
from eval_protocol.models import EvaluationRow, Message

inline_rows = [
    EvaluationRow(messages=[
        Message(role="user", content="Hello"),
        Message(role="assistant", content="Hi there!"),
    ])
]
loader = InlineDataLoader(rows=inline_rows, id="demo", description="Two-turn chat")
Preprocessing
All loaders support an optional preprocess_fn applied before returning rows. For example, expand multi-turn traces into multiple test cases:
from eval_protocol import DynamicDataLoader, multi_turn_assistant_to_ground_truth

DynamicDataLoader(
    generators=[my_generator],
    preprocess_fn=multi_turn_assistant_to_ground_truth,
)
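To make "expand multi-turn traces into multiple test cases" concrete, here is a minimal sketch of one way such an expansion could work. It is a guess at the idea the function name suggests, not the implementation of multi_turn_assistant_to_ground_truth. `Msg`, `Row`, the `ground_truth` field, and `split_on_assistant_turns` are hypothetical stand-ins, so the block runs without eval_protocol installed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Msg:
    """Stand-in for a chat message (not eval_protocol.models.Message)."""
    role: str
    content: str


@dataclass
class Row:
    """Stand-in for an evaluation row (not eval_protocol.models.EvaluationRow)."""
    messages: list = field(default_factory=list)
    ground_truth: Optional[str] = None


def split_on_assistant_turns(rows: list) -> list:
    """Emit one row per assistant message.

    The messages before the assistant turn become the input, and the
    assistant message content becomes the (assumed) ground truth.
    """
    expanded = []
    for row in rows:
        for i, msg in enumerate(row.messages):
            if msg.role == "assistant":
                expanded.append(
                    Row(messages=list(row.messages[:i]), ground_truth=msg.content)
                )
    return expanded


conversation = Row(messages=[
    Msg("user", "Hi"),
    Msg("assistant", "Hello!"),
    Msg("user", "What is 2 + 2?"),
    Msg("assistant", "4"),
])
expanded = split_on_assistant_turns([conversation])
```

The function has the same shape as preprocess_fn (a list of rows in, a list of rows out), and the output list can be longer than the input, which is why the loader records row counts both before and after preprocessing.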
Using with evaluation_test
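from eval_protocol import evaluation_test, aha_judge, SingleTurnRolloutProcessor
from eval_protocol.models import EvaluationRow

# data_loader is the DynamicDataLoader built in the DynamicDataLoader example
@evaluation_test(
    data_loaders=data_loader,
    rollout_processor=SingleTurnRolloutProcessor(),
)
async def test_llm_judge(row: EvaluationRow) -> EvaluationRow:
    return await aha_judge(row)
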
Each loader emits one or more variants. For each variant, Eval Protocol stores metadata on every row under row.input_metadata.dataset_info:
- data_loader_type: loader class (e.g., DynamicDataLoader)
- data_loader_variant_id: callable name or inline id
- data_loader_variant_description: docstring/description
- data_loader_num_rows: original count before preprocessing
- data_loader_num_rows_after_preprocessing: final count
This lets you trace each result in the UI back to the input variant that produced it.
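The same metadata can also be used outside the UI, for example to aggregate scores per variant. The sketch below assumes results have been flattened into plain dictionaries. The dictionary layout, the `score` key, the score values, and `mean_score_by_variant` are all made up for illustration; only the `data_loader_variant_id` key name comes from the list above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical flattened results with illustrative scores (not real data).
results = [
    {"dataset_info": {"data_loader_variant_id": "langfuse_data_generator"}, "score": 1.0},
    {"dataset_info": {"data_loader_variant_id": "langfuse_data_generator"}, "score": 0.5},
    {"dataset_info": {"data_loader_variant_id": "demo"}, "score": 1.0},
]


def mean_score_by_variant(rows: list) -> dict:
    """Group scores by variant id and average each group."""
    buckets = defaultdict(list)
    for row in rows:
        variant_id = row["dataset_info"]["data_loader_variant_id"]
        buckets[variant_id].append(row["score"])
    return {variant_id: mean(scores) for variant_id, scores in buckets.items()}


summary = mean_score_by_variant(results)
```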
Example with an Adapter
from eval_protocol import evaluation_test, aha_judge, DynamicDataLoader, SingleTurnRolloutProcessor
from eval_protocol.adapters.langfuse import create_langfuse_adapter
from eval_protocol.models import EvaluationRow

def langfuse_data_generator():
    # Pull evaluation rows from Langfuse through the adapter
    adapter = create_langfuse_adapter()
    return adapter.get_evaluation_rows(limit=50, sample_size=10)

@evaluation_test(
    data_loaders=DynamicDataLoader(generators=[langfuse_data_generator]),
    rollout_processor=SingleTurnRolloutProcessor(),
)
async def test_llm_judge(row: EvaluationRow) -> EvaluationRow:
    return await aha_judge(row)

API Reference
DynamicDataLoader
class DynamicDataLoader(EvaluationDataLoader):
    generators: Sequence[Callable[[], list[EvaluationRow]]]

InlineDataLoader
class InlineDataLoader(EvaluationDataLoader):
    rows: list[EvaluationRow] | None
    messages: Sequence[list[Message]] | None
    id: str
    description: str | None

EvaluationDataLoader
class EvaluationDataLoader(ABC):
    preprocess_fn: Callable[[list[EvaluationRow]], list[EvaluationRow]] | None

    def variants(self) -> Sequence[DataLoaderVariant]: ...
    def load(self) -> list[DataLoaderResult]: ...
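The reference above shows a single preprocess_fn slot whose type is a function from a list of rows to a list of rows. If several preprocessing steps are needed, one option is to chain them into one callable of that shape before passing it in. The sketch below is plain Python and uses strings as stand-in rows. `compose_preprocessors`, `drop_empty`, and `dedupe` are hypothetical helpers and are not provided by eval_protocol.

```python
from typing import Callable


def compose_preprocessors(*steps: Callable[[list], list]) -> Callable[[list], list]:
    """Chain list-to-list steps into one callable, applied left to right."""
    def run(rows: list) -> list:
        for step in steps:
            rows = step(rows)
        return rows
    return run


def drop_empty(rows: list) -> list:
    """Remove falsy entries (empty strings in this stand-in example)."""
    return [r for r in rows if r]


def dedupe(rows: list) -> list:
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(rows))


pipeline = compose_preprocessors(drop_empty, dedupe)
cleaned = pipeline(["a", "", "b", "a"])
```

Under these assumptions the composed function could be passed as preprocess_fn, provided each step accepts and returns a list of EvaluationRow.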
Source Code
See the Python source for full details: eval_protocol/data_loader/models.py