Show HN: Local RAG Eval Harness – reproducible benchmarks for retrieval pipelines


Demo: REPLACE_WITH_DEMO_URL (no login)
Repo: REPLACE_WITH_GITHUB_REPO_URL (profile: https://github.com/myroslav-abdeljawwad)

What it is: A small toolkit for running reproducible, local evaluations of retrieval-augmented generation (RAG) pipelines. It ships with a CLI + notebooks, fixed seeds, and a baseline config so results are easy to compare across machines.
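
To make "fixed seeds + config-locked runs" concrete, here is a minimal sketch of how a run could pin its randomness and fingerprint its config. The file name, field names, and helpers are illustrative assumptions, not the toolkit's actual API:

    # Sketch: pin seeds and hash the run config so results are comparable
    # across machines. Illustrative only -- not the harness's real modules.
    import hashlib
    import json
    import random

    import numpy as np
    import yaml  # PyYAML


    def load_locked_config(path: str) -> tuple[dict, str]:
        """Load a YAML config and return it with a stable hash for the run report."""
        with open(path) as f:
            cfg = yaml.safe_load(f)
        digest = hashlib.sha256(
            json.dumps(cfg, sort_keys=True).encode()
        ).hexdigest()[:12]
        return cfg, digest


    def seed_everything(seed: int) -> None:
        """Seed the RNGs the pipeline touches so reruns are reproducible."""
        random.seed(seed)
        np.random.seed(seed)


    cfg, cfg_hash = load_locked_config("baseline.yaml")  # hypothetical filename
    seed_everything(cfg.get("seed", 42))
    print(f"run config {cfg_hash} with seed {cfg.get('seed', 42)}")

Recording the config hash next to each result is what makes two runs on different machines directly comparable.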

Why: Most RAG repos don’t show repeatable benchmarks. This tries to make “evals-first” the default.

Features:
- Metrics: Hit@K, MRR, Exact Match, grounded accuracy, latency, token cost
- Seeds + config-locked runs (YAML)
- Plug-in chunkers (by structure/semantics), retrievers, and rerankers
- One-command local runs (Docker optional)
- Minimal HTML report + CSV/Parquet exports
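
For reference, Hit@K and MRR are cheap to compute from ranked retrieval results. A minimal sketch using the standard definitions (not the harness's internal code; doc IDs below are made up):

    # Hit@K and MRR over ranked retrieval results (standard definitions).

    def hit_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
        """1.0 if any relevant doc appears in the top-k results, else 0.0."""
        return 1.0 if any(doc in relevant_ids for doc in ranked_ids[:k]) else 0.0


    def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
        """1 / rank of the first relevant doc (0.0 if none retrieved)."""
        for rank, doc in enumerate(ranked_ids, start=1):
            if doc in relevant_ids:
                return 1.0 / rank
        return 0.0


    # Aggregate over a query set: mean of the per-query values.
    queries = [
        (["d3", "d1", "d7"], {"d1"}),  # relevant doc at rank 2
        (["d9", "d4", "d2"], {"d8"}),  # relevant doc not retrieved
    ]
    mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
    hit3 = sum(hit_at_k(r, rel, 3) for r, rel in queries) / len(queries)
    print(f"MRR={mrr:.2f}  Hit@3={hit3:.2f}")  # MRR=0.25  Hit@3=0.50
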

Try it with public data:
- Harvard Dataverse (DVN/6TI8KI): https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi...

Stack: Python, Typer CLI, pytest, SQLite/Postgres, Docker (optional)
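
Since the stack is Typer-based, a one-command local run could look roughly like the sketch below. The command name, options, and defaults are hypothetical, not the tool's real interface:

    # Hypothetical Typer CLI surface for a one-command local run.
    import typer

    app = typer.Typer(help="Run a reproducible RAG evaluation from a locked config.")


    @app.command()
    def run(
        config: str = typer.Option("baseline.yaml", help="Path to the locked YAML config."),
        seed: int = typer.Option(42, help="Override the seed recorded in the config."),
        report: str = typer.Option("report.html", help="Where to write the HTML report."),
    ) -> None:
        """Execute one evaluation run and emit the HTML/CSV exports."""
        typer.echo(f"running eval: config={config} seed={seed} report={report}")
        # ... load config, seed RNGs, run retrieval + scoring, write exports ...


    if __name__ == "__main__":
        app()
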

What I’d love feedback on:
1) Which metrics you actually trust in day-to-day work
2) Chunking heuristics that generalize across domains
3) Reranker swaps that improve grounded accuracy without killing latency

Roadmap:
- Dataset adapter registry
- Built-in failure-mode explorer
- Tiny web UI for run diffs

License: MIT

About me: Myroslav Mokhammad Abdeljawwad (GitHub: https://github.com/myroslav-abdeljawwad)