Show HN: Local RAG Eval Harness – reproducible benchmarks for retrieval pipelines
Demo: REPLACE_WITH_DEMO_URL (no login)
Repo: REPLACE_WITH_GITHUB_REPO_URL (profile: https://github.com/myroslav-abdeljawwad)
What it is
A small toolkit to run reproducible, local evaluations for retrieval-augmented generation (RAG). It ships with a CLI + notebooks, fixed seeds, and a baseline config so results are easy to compare across machines.
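To make "fixed seeds + config-locked runs" concrete, here is a minimal sketch of the kind of helper a harness like this relies on; the function name and config fields are my own illustration, not the repo's actual API:

  import hashlib
  import json
  import random

  import numpy as np

  def lock_run(config: dict, seed: int = 42) -> str:
      """Seed the common RNGs and fingerprint the config so a run can be reproduced."""
      random.seed(seed)
      np.random.seed(seed)
      # Hash the exact config used for this run and store it next to the results,
      # so two runs are only comparable when their fingerprints match.
      return hashlib.sha256(
          json.dumps(config, sort_keys=True).encode()
      ).hexdigest()[:12]

  run_id = lock_run({"retriever": "bm25", "k": 5, "chunker": "semantic"}, seed=42)
  print(f"config fingerprint: {run_id}")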
Why
Most RAG repos don’t show repeatable benchmarks. This tries to make “evals-first” the default.
Features
- Metrics: Hit@K, MRR, Exact Match, grounded accuracy, latency, token cost (sketch below)
- Seeds + config-locked runs (YAML)
- Plug-in chunkers (by structure/semantics), retrievers, and rerankers
- One-command local runs (Docker optional)
- Minimal HTML report + CSV/Parquet exports
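For reference, this is roughly what I mean by Hit@K and MRR over a set of queries. It is a standalone sketch using the standard definitions, not the harness's actual implementation, and the ranked-ID lists are made-up inputs:

  def hit_at_k(ranked_ids, relevant_ids, k=5):
      # 1.0 if any relevant document appears in the top-k results, else 0.0
      return float(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))

  def reciprocal_rank(ranked_ids, relevant_ids):
      # 1 / rank of the first relevant document, 0.0 if none is retrieved
      for rank, doc_id in enumerate(ranked_ids, start=1):
          if doc_id in relevant_ids:
              return 1.0 / rank
      return 0.0

  # Toy example: two queries, gold labels on the right
  runs = [
      (["d3", "d7", "d1"], {"d1"}),   # first relevant hit at rank 3
      (["d2", "d9", "d4"], {"d8"}),   # no relevant document retrieved
  ]
  hit5 = sum(hit_at_k(r, g, k=5) for r, g in runs) / len(runs)   # 0.5
  mrr = sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)  # (1/3 + 0) / 2 ≈ 0.17
  print(f"Hit@5={hit5:.2f}  MRR={mrr:.2f}")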
Try it with public data
- Harvard Dataverse (DVN/6TI8KI): https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi...
Stack
Python, Typer CLI, pytest, SQLite/Postgres, Docker (optional)
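To give a sense of the run interface, here is a minimal Typer skeleton of what a one-command, config-locked eval could look like. The command, option names, and the placeholder body are illustrative assumptions, not the repo's real CLI:

  # Illustrative Typer CLI skeleton; names are assumptions, not the actual interface.
  from pathlib import Path

  import typer

  app = typer.Typer()

  @app.command()
  def run(
      config: Path = typer.Option(..., help="YAML config that fully defines the run"),
      seed: int = typer.Option(42, help="RNG seed for reproducible results"),
      report: Path = typer.Option(Path("report.html"), help="Where to write the HTML report"),
  ):
      # Placeholder body: load the config, seed RNGs, evaluate, write report + CSV/Parquet.
      typer.echo(f"evaluating with {config} (seed={seed}) -> {report}")

  if __name__ == "__main__":
      app()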
What I’d love feedback on
1) Which metrics you actually trust in day-to-day work
2) Chunking heuristics that generalize across domains
3) Reranker swaps that improve grounded accuracy without killing latency
Roadmap
- Dataset adapter registry
- Built-in failure-mode explorer
- Tiny web UI for run diffs
License: MIT
About me: Myroslav Mokhammad Abdeljawwad (GitHub: https://github.com/myroslav-abdeljawwad)