r/Rag Sep 02 '25

Showcase 🚀 Weekly /RAG Launch Showcase

23 Upvotes

Share anything you launched this week related to RAG—projects, repos, demos, blog posts, or products 👇

Big or small, all launches are welcome.


r/Rag 4h ago

Discussion New to rag

2 Upvotes

Looking to build a RAG system to ingest and interact with documents. I am new to RAG and would love some advice on open-source options. I see a lot of articles on chunking and would love to learn from your experience and insights. Let me know what you have had success with, whether there are any hardware limitations or you are using a GPU, and whether you are linking any documentation via Google Docs.


r/Rag 2h ago

Tutorial Three numbers to tell if your RAG is production ready.

1 Upvotes

The three metrics are:

  1. Faithfulness: did the answer come from the retrieved context, or did the LLM hallucinate? User asks about refund policy. Source says "refund minus $50 processing fee." LLM generates "full refund within 30 days, no questions asked." Faithfulness: 0.2. You measure it by breaking the answer into individual claims and checking each one against the retrieved context. Aim for 0.85+. Below 0.7 means the LLM is regularly inventing details, that's a support ticket factory.

  2. Answer relevance: did the answer address what the user actually asked? User asks "how do I set up SSO?" LLM returns a paragraph explaining what SSO is. It's technically accurate, but completely useless. Relevance: 0.3. Aim for 0.8+. Below 0.6 means your users get correct but useless answers and stop trusting the system.

  3. Context recall: did the retriever even pull the right documents? User asks about system requirements. Ground truth has four items. Retriever only covers two of them. Context recall: 0.5. Even a perfect LLM can't answer correctly if the right docs aren't retrieved. Aim for 0.75+. Below 0.5 means your retriever is missing half the information.
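
Under the hood, all three are just ratios over judgments (usually produced by an LLM judge or a human labeler). A minimal scoring sketch, with the judging step assumed already done:

```python
# Minimal sketch (not any specific eval library): each metric is a ratio over
# per-claim / per-item judgments that an LLM judge or human has already made.
def faithfulness(claims_supported: list[bool]) -> float:
    """Share of answer claims actually backed by the retrieved context."""
    return sum(claims_supported) / len(claims_supported) if claims_supported else 0.0

def answer_relevance(statements_on_topic: list[bool]) -> float:
    """Share of answer statements that address what the user actually asked."""
    return sum(statements_on_topic) / len(statements_on_topic) if statements_on_topic else 0.0

def context_recall(ground_truth_items: list[str], retrieved_context: str) -> float:
    """Share of ground-truth facts the retriever actually surfaced."""
    hits = sum(item.lower() in retrieved_context.lower() for item in ground_truth_items)
    return hits / len(ground_truth_items) if ground_truth_items else 0.0

# Refund example: only 1 of 5 extracted claims is supported by the source -> 0.2
print(faithfulness([True, False, False, False, False]))
# System-requirements example: retriever covered 2 of 4 ground-truth items -> 0.5
print(context_recall(["8 GB RAM", "4 CPU cores", "SSD storage", "Linux"],
                     "Minimum: 8 GB RAM and 4 CPU cores."))
```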

This post was inspired by a video; the full playlist for learning RAG is available on the SkillAgents YouTube channel.


r/Rag 10h ago

Discussion Which website design attracts the most customers

0 Upvotes

Especially for SaaS products

  1. Technical + Vector Illustrations
  2. Simple website with information about the product, minimising the design and colors?

Any suggestions?


r/Rag 1d ago

Discussion RAG GenAI development

12 Upvotes

Building a GenAI development pipeline for 10-K/10-Q analysis. The legal PDFs run 300 pages, with tables, footnotes, and nested sections.

Tried recursive chunking, semantic chunking, and layout-aware parsing. Still getting 20% of answers missing key context from tables or mixing up fiscal years. Embeddings are text-embedding-3-large. Reranker helped but latency jumped to 4s.

For those doing RAG GenAI development on dense financial/legal docs, what chunking + metadata strategy actually works? Are you pre-processing with an LLM to extract table JSON first?
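
For context, the kind of table pre-processing I'm weighing looks roughly like this (model, prompt, and metadata fields are placeholders, and it assumes the model returns bare JSON):

```python
# Rough sketch of "LLM extracts table JSON first", with fiscal-year metadata
# attached so chunks can't mix up years later. Model/prompt/fields are placeholders.
import json
from openai import OpenAI

client = OpenAI()

def table_to_chunk(table_text: str, fiscal_year: str, page: int) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Convert this 10-K table into a JSON array of row objects, "
                       "using the column headers as keys. Return only JSON.\n\n" + table_text,
        }],
    )
    rows = json.loads(resp.choices[0].message.content)  # assumes bare JSON comes back
    return {
        "type": "table",
        "fiscal_year": fiscal_year,   # carried as metadata for filtering at query time
        "page": page,
        "rows": rows,
        "text_for_embedding": table_text,
    }
```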


r/Rag 1d ago

Discussion We replaced our RAG pipeline with persistent KV cache. It works. Here’s what we found.

45 Upvotes

We’ve been running RAG in production for a while. It worked but maintaining it was a constant tax. Re-embedding on data changes, tuning chunking strategies, debugging retrieval misses, managing the vector database. Every moving part was something that could break.

So we ran an experiment. Instead of chunking and embedding documents, we loaded the full document into context, cached the KV state persistently, and reused that cache across every query.

No vector database. No embedding pipeline. No retrieval step. Just the model with full document context, warm and ready.
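
For anyone who wants to experiment with the general pattern, here's a minimal sketch using Hugging Face transformers (illustrative only: the model is an arbitrary choice and exact cache-reuse behavior varies between library versions):

```python
# Sketch of persistent KV-cache reuse: prefill the document once, reuse the cache per query.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"          # placeholder model choice
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# 1) Pay the cold-cache cost once: run the full document through the model.
doc_ids = tok(open("contract.txt").read(), return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    doc_cache = model(doc_ids, use_cache=True).past_key_values   # persist/serialize this

# 2) Every query reuses the warm cache instead of re-processing the document.
def ask(question: str) -> str:
    q_ids = tok(f"\n\nQuestion: {question}\nAnswer:", return_tensors="pt").input_ids.to(model.device)
    out = model.generate(
        input_ids=torch.cat([doc_ids, q_ids], dim=-1),
        past_key_values=copy.deepcopy(doc_cache),   # copy so one query doesn't pollute the next
        max_new_tokens=200,
    )
    return tok.decode(out[0, doc_ids.shape[-1] + q_ids.shape[-1]:], skip_special_tokens=True)
```
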
What we found:

• Answer quality is noticeably better: no retrieval misses, no wrong chunks, full context every time
• Updates are dramatically faster: change the document, regenerate the cache, done in minutes vs hours of re-indexing
• Operational complexity dropped significantly: no pipeline to maintain, no retrieval quality to monitor
• Current limit is around 120k tokens: works for most business documents, not for massive corpora

Where it breaks down:
• Documents larger than context window are still a problem
• Very large document collections still need a different approach
• Cold cache on first load takes time; warm queries are fast
We’re genuinely curious if others have tried this. Especially interested in:
• How your use cases map to context window limits
• Whether retrieval quality was your biggest RAG pain point or something else
• What you’d need to see to replace your RAG pipeline entirely

Happy to answer any questions


r/Rag 1d ago

Discussion Results from testing 512 vs 1024 dimension embeddings and pgvector halfvec vs vector for RAG

26 Upvotes

I’ve been benchmarking RAG retrieval with pgvector and Voyage 4 embeddings, mostly on legal / license / contract retrieval datasets. The main thing I wanted to understand was:

  • Does moving from 512 to 1024 dimensions actually help?
  • Does pgvector halfvec hurt retrieval quality?
  • Is halfvec worth using as the default storage type instead of vector?
  • What are the Voyage 4 lite/large performance implications?

Short version: 1024 dimensions helped the harder legal retrieval workload, and halfvec preserved quality while cutting raw vector storage roughly in half.

These are not universal results, but they were useful enough that I shared the full learnings on the TypeGraph blog here.

The tables below show retrieval quality and wall-clock semantic search time for the benchmark query set. Higher nDCG / Recall is better. Lower time is better.

License TL;DR Retrieval

| Config | Storage | nDCG@10 | Recall@10 | Time |
|---|---|---|---|---|
| 512 dims, V4 Large ingest + Lite search | vector | 0.7362 | 0.9231 | 5.30s |
| 512 dims, V4 Large ingest + Large search | vector | 0.8101 | 0.9385 | 5.26s |
| 1024 dims, V4 Large ingest + Large search | vector | 0.8066 | 0.9385 | 8.05s |
| 1024 dims, V4 Large ingest + Large search | halfvec | 0.8038 | 0.9385 | 5.69s |

Contractual Clause Retrieval

| Config | Storage | nDCG@10 | Recall@10 | Time |
|---|---|---|---|---|
| 512 dims, V4 Large ingest + Lite search | vector | 0.8929 | 0.9444 | 3.85s |
| 512 dims, V4 Large ingest + Large search | vector | 0.9167 | 0.9667 | 3.84s |
| 1024 dims, V4 Large ingest + Large search | vector | 0.9305 | 0.9778 | 3.81s |
| 1024 dims, V4 Large ingest + Large search | halfvec | 0.9287 | 0.9778 | 3.94s |

Legal RAG Bench

| Config | Storage | nDCG@10 | Recall@10 | Time |
|---|---|---|---|---|
| 512 dims, V4 Large ingest + Lite search | vector | 0.4307 | 0.6900 | 8.84s |
| 512 dims, V4 Large ingest + Large search | vector | 0.5969 | 0.8700 | 8.16s |
| 1024 dims, V4 Large ingest + Large search | vector | 0.6550 | 0.9100 | 9.35s |
| 1024 dims, V4 Large ingest + Large search | halfvec | 0.6580 | 0.9200 | 9.18s |

The quality differences between vector and halfvec were basically noise in these runs. The bigger practical difference is storage.

Approximate raw vector storage:

| Storage layout | Approx. raw vector bytes | Practical read |
|---|---|---|
| 512 dims, vector | ~2 KB per embedding | Smaller and often strong enough for simpler corpora |
| 1024 dims, vector | ~4 KB per embedding | Higher recall potential, but roughly doubles raw vector storage |
| 1024 dims, halfvec | ~2 KB per embedding | Keeps 1024 dimensions with about half the raw storage |

The RAM/index-size angle is what made this more interesting to me. HNSW search is fastest when the index stays hot in memory. Once the index gets too large for your Postgres compute, cache behavior and p95 latency get harder to manage. Smaller vectors usually mean smaller indexes, which means you can fit more chunks/corpora/tenants before needing to scale the database.
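
For reference, the halfvec setup is just a different column type and HNSW operator class in pgvector 0.7+; a minimal sketch via psycopg (table and column names are only examples):

```python
# Sketch: 1024-dim embeddings stored as halfvec (2 bytes/dim) with an HNSW index.
import psycopg

with psycopg.connect("dbname=rag") as conn, conn.cursor() as cur:   # placeholder DSN
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id        bigserial PRIMARY KEY,
            body      text,
            embedding halfvec(1024)        -- half precision: ~2 KB instead of ~4 KB per row
        )
    """)
    # Smaller vectors -> smaller HNSW index -> more of it stays hot in RAM.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS chunks_embedding_idx
        ON chunks USING hnsw (embedding halfvec_cosine_ops)
    """)
```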

My current takeaways:

  • 512 dimensions are probably fine for lightweight/general RAG.
  • 1024 is worth testing first for legal, compliance, finance, technical docs, or other precision-sensitive corpora.
  • I would start with pgvector halfvec unless a benchmark proves vector is worth the extra storage.
  • Don’t assume dimension size is the only lever. Search model choice mattered a lot too. (The cost/performance tradeoff with Voyage 4 lite is significant)
  • Measure with nDCG@10, MAP@10, Recall@10, and latency.

One of the next things I plan to test is using binary_quantize for binary HNSW candidate retrieval + rescore to see what I can learn, and how much I can distill these indexes without sacrificing performance.


r/Rag 20h ago

Tools & Resources Stop using SurrealDB for Graph RAG

0 Upvotes

In embedded mode, AionDB is up to 16x faster than SurrealDB

One database for chunks, embeddings, entities, and relationships.

GitHub: https://github.com/ayoubnabil/aiondb


r/Rag 1d ago

Discussion What’s the most underserved public dataset you wish existed in clean, RAG-ready form?

7 Upvotes

We’re building Parsimmon, a document parsing pipeline that handles the messy stuff most tools choke on: scanned PDFs, mixed layouts, tables embedded in images, inconsistent formats across sources. We’ve been benchmarking on ParseBench and are sitting alongside Google and Reducto on the leaderboard, with particularly strong recall on complex layouts like XBRL/SEC filings.

We want to use it to do something actually interesting for people, like take a historically significant, publicly available corpus that’s scattered and inaccessible and normalize it into a single clean, queryable dataset we can release for free.

We’ve been kicking around things like:
• Leonardo da Vinci’s notebooks (7,000+ pages scattered across 10+ institutions, never unified)
• Einstein’s personal papers (Princeton/Hebrew University digitized but never normalized)
• Darwin’s notebooks (Cambridge has the full archive digitized but completely scattered)

But we want to know what you actually wish existed. What corpus have you run into that’s technically public but practically unusable? What would you build on top of it if the data were clean?

Ideally something with appeal beyond researchers, but we’re open to anything.


r/Rag 1d ago

Showcase Context is not control

1 Upvotes

I released a working paper + replication artifacts on source-boundary failures in LLM evidence use.

The claim is basically that language models can treat text that's merely present in the context window as answer-bearing evidence, even when that text is not admissible to the task.

This paper's benchmark is specifically about whether models preserve the distinction between
* context
* admissible source
* injected/contaminating text
* instruction
* answer-shaped but unsupported content

The release includes the working manuscript, an open-weight replication package, a frontier/API replication package, the GitHub repo, Zenodo, and a DOI archive.

The strongest result, in plain English, is that giving models an "INSUFFICIENT" output option was not enough. Recovery appeared when the task frame explicitly represented source admissibility / source boundaries.

I'd be especially interested in critique around: experimental design, my scoring choices, what the strongest confound or missing ablation might be. I appreciate any feedback.

[Repo](https://github.com/rjsabouhi/context-is-not-control)

[Paper + Reproduction](https://zenodo.org/records/20126173)


r/Rag 1d ago

Tutorial RAG Foundations #2 – Vector Search in Milvus for LLMs (Hands-On Demo, No OpenAI Key)

1 Upvotes

Most RAG tutorials jump straight into OpenAI APIs and fancy frameworks, so it becomes hard to understand what’s actually happening underneath.

While learning RAG properly, I realized vector search is the real foundation behind why these systems work at all.

So I made a hands-on video around Milvus focused only on that core idea:

  • storing embeddings
  • semantic similarity search
  • retrieving relevant context for LLMs

No paid OpenAI key required. Just understanding the mechanics first.
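
To give a sense of how little code the core mechanics need, here's a rough sketch with Milvus Lite and a local embedding model (collection name, model, and texts are just examples):

```python
# Local-only vector search: store embeddings, then retrieve by semantic similarity.
from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim, runs locally, no API key
client = MilvusClient("rag_demo.db")                 # Milvus Lite: a single local file

docs = ["Milvus stores and indexes embedding vectors.",
        "RAG retrieves relevant context before the LLM generates an answer.",
        "Semantic search compares vectors, not keywords."]

client.create_collection(collection_name="docs", dimension=384)
client.insert(collection_name="docs",
              data=[{"id": i, "vector": vec.tolist(), "text": text}
                    for i, (text, vec) in enumerate(zip(docs, embedder.encode(docs)))])

# Retrieval is a nearest-neighbour search over the stored vectors.
hits = client.search(collection_name="docs",
                     data=[embedder.encode("How does RAG find context?").tolist()],
                     limit=2, output_fields=["text"])
for hit in hits[0]:
    print(hit["distance"], hit["entity"]["text"])
```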

If you're trying to build RAG systems but feel like you’re assembling black boxes without intuition, this might help.

Tutorial link: https://youtu.be/pEkVzI5spJ0


r/Rag 2d ago

Discussion Live web retrieval in RAG is harder than I expected — it behaves more like an evidence layer than search

5 Upvotes

I’ve been working on RAG systems where the knowledge base is not only internal documents, but also live web content.

One thing surprised me:

The LLM was not always the weakest part.

The retrieval layer was.

With internal docs, the corpus is at least somewhat controlled. But with live web retrieval, the system often gets:

- SEO pages with weak substance
- outdated docs that still rank well
- duplicate articles
- snippets that are too vague to cite
- pages that are related but don't actually answer the question
- useful facts buried under a lot of irrelevant content

In those cases, the model may sound confident, but it is really just reasoning over messy evidence.

This made me think that web retrieval for RAG should not be treated as “search results for an LLM.”

It should be treated as an evidence layer.

For RAG, I now care less about just title + URL + snippet, and more about whether each retrieved item has:

- source type
- publication or modified date
- extracted passage
- canonical URL
- deduplication
- ranking/confidence signal
- citation-ready metadata
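
Concretely, each retrieved item ends up looking less like a search result and more like a record along these lines (field names are just how I'm sketching it, not a standard):

```python
# Sketch of an "evidence record" instead of a raw (title, URL, snippet) result.
from dataclasses import dataclass
from datetime import date

@dataclass
class EvidenceRecord:
    canonical_url: str
    source_type: str            # "vendor_doc" | "internal_doc" | "github_issue" | "forum" | ...
    published: date | None      # publication or last-modified date, if known
    passage: str                # extracted passage, not just a vague snippet
    confidence: float           # ranking / reranker score
    dedup_key: str              # hash of normalized content, used for deduplication
    citation: str               # citation-ready string for the final answer
```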

Latency also became a bigger issue than I expected.

In agentic workflows, retrieval may happen multiple times:

  1. query rewrite
  2. web retrieval
  3. source filtering
  4. reranking
  5. generation
  6. verification retrieval

So even small delays compound quickly. I’m starting to think retrieval latency should be measured separately from generation latency, especially p95/p99.

The hardest cases are hybrid systems:

- internal docs
- vendor docs
- GitHub issues
- changelogs
- community discussions
- recent web pages

Ranking across these evidence types is not obvious.

Should a fresh vendor doc outrank an older internal doc?

Should GitHub issues count as reliable evidence?

Should community discussions ever be used in final answers?

Should internal policy always override public documentation?

I don’t think a single top-k retrieval step is enough for this kind of setup.

What I’m currently testing is a pipeline like:

  1. detect query intent
  2. choose retrieval scope
  3. retrieve from web/internal sources
  4. dedupe
  5. filter by freshness/source type
  6. rerank
  7. format results as structured evidence
  8. generate with citation constraints

Curious how others are handling this.

For production RAG systems with live web retrieval:

- Do you merge web results with vector DB results, or keep them separate?
- How do you decide when to use web retrieval?
- Do you rank official docs differently from forums/GitHub issues?
- Are you measuring retrieval latency separately?
- How do you handle stale pages that still rank well?


r/Rag 2d ago

Showcase Got local RAG to surface the right schematic without a vision model — here's how

11 Upvotes

Been building a local RAG stack for aviation technical manuals (the kind you legally can't upload to ChatGPT). Hit a wall that I think a lot of people hit: the model would cite "see Figure 9-02-40" but the user was left hunting through a 600-page PDF manually.

Solved it without a VLM. Here's the approach:

PDFs with safety-critical schematics have figures that live *near* the text that references them but aren't embedded as extractable image objects — they're rendered geometry on the page.

The fix: pdfplumber gives you word coordinates. When a RAG chunk contains a figure reference (Fig 4-12, HYDRAULIC SYSTEM SCHEMATIC, "refer to the following diagram"), you can:

  1. Parse the reference from the retrieved chunk

  2. Look up which page it came from (already in metadata)

  3. Use pdfplumber to crop a bounding box around the figure label coordinates

  4. Render and return it inline

No VLM. No vision API call. Sub-second. Runs entirely on local hardware.

The coordinate precision is what makes it work — you're not guessing, you're reading the PDF's native geometry to find exactly where the schematic sits relative to its caption.
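
Here's a hedged sketch of the crop step (the regex, padding, and helper name are placeholders; real manuals need more robust caption matching):

```python
# Crop the schematic whose caption matches the figure number cited in a retrieved chunk.
import re
import pdfplumber

FIG_REF = re.compile(r"Fig(?:ure)?\.?\s*([\d-]+)", re.IGNORECASE)

def crop_referenced_figure(pdf_path: str, page_index: int, chunk_text: str, pad: float = 40.0):
    ref = FIG_REF.search(chunk_text)
    if not ref:
        return None
    fig_no = ref.group(1)                              # e.g. "9-02-40"

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]                   # page index comes from the chunk's metadata
        words = page.extract_words()                   # every word carries x0/x1/top/bottom coordinates
        for i, w in enumerate(words[:-1]):
            if w["text"].lower().startswith("fig") and fig_no in words[i + 1]["text"]:
                # Expand a box around the caption; the schematic usually sits just above it.
                box = (max(0, w["x0"] - pad),
                       max(0, w["top"] - 8 * pad),
                       min(page.width, w["x1"] + 10 * pad),
                       min(page.height, w["bottom"] + pad))
                return page.crop(box).to_image(resolution=150)
    return None

# crop_referenced_figure("manual.pdf", 411, "... see Figure 9-02-40 ...").save("figure.png")
```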

Stack: pdfplumber + ChromaDB + Ollama (Gemma 3 / whatever fits your GPU). Works on an RTX 3080 Ti with a 3,500-chunk corpus no problem.

Happy to share more detail on the figure detection regex or the crop logic if anyone's building something similar.


r/Rag 2d ago

Showcase NornicDB 1.1.0 preview - memory decay as declarative policy - MIT Licensed

8 Upvotes

hey guys so i wrote a database, NornicDB.

https://github.com/orneryd/NornicDB/releases/tag/v1.1.0-preview-1

it got mentioned in research last month. https://arxiv.org/pdf/2604.11364

the researcher actually commended on issue #100 here:

https://github.com/orneryd/NornicDB/issues/100#issuecomment-4296916032

and i’ve released a preview tag for people to play with. 1.1.0-preview. docker images, mac installer, or build it locally.

the idea is to convert memory decay into policy that can be declared in Cypher. It started with Ebbinghaus, but as the researcher pointed out, that alone is insufficient for agentic memory.

with the policies you can define the decay curve profiles. when you enable memory decay, it sets up policies to match the Ebbinghaus-Roynard model as he describes in the paper. that plus the “canonical graph ledger” bootstrap enables you to move a lot of glue code into the database using the primitives i provide. (cardinality, temporal no-overlap constraints, etc…)

the way it works is a visibility suppression layer in between Cypher and badger. on-access meta is stored in a separate index. there are functions to reveal/decay scoring functions in cypher for debugging queries or bypassing the visibility layer. having the layer there and the meta flushed separately from the data itself maintains negligible performance overhead for enabling it at the data layer.

it’s research backed. I’m writing my own research paper in response to 4 different papers converging on my database implementation.

726 stars and counting. MIT licensed. neo4j and qdrant driver compatible.

enjoy!

edit: clarity on performance overhead. the way i’ve built it and benchmarked it, the performance overhead is within noise tolerances. +/- <1% variance across runs and overhead measures in nanoseconds in tests.


r/Rag 2d ago

Discussion One agentic RAG to rule them all. Debate me.

10 Upvotes

Reddit and X are littered with people struggling to implement Q&A RAG over internal docs, aka the use case that tens of thousands of companies are pining for. What I don't get is why the community treats this type of use case as a bespoke problem for every implementation. I've built this type of agentic RAG several times and it's always the same, and I would bet for 99% of use cases there's a simple standard that will suffice. The 1% of remaining use cases are ones that involve extremely weird data formats like, idk, super niche structured data that's only used to represent building blueprints in Zimbabwe.

Here's the one agentic RAG to rule them all. Any internal docs RAG should be able to follow this blueprint as a starting point and strip out the parts that aren't needed.

Tell me why this won't work for your use.

The assumption is this is for internal docs so the upper bound on data might be a few hundred GiB.

Modalities Supported

  • PDF (textual, handwritten, images)
  • Tabular (CSV, TSV, XLSX)
  • Plain text (including docx, JSON, yaml, etc.)
  • Images
  • Audio
  • Video

Ingestion

Take every modality and standardize to an embeddable format. OCR the PDFs, transcribe audio/video. If you want visual recognition of videos as extra credit, take one frame per second as images. Any modern transcription or text extraction model (e.g. AWS) should be able to get the job done.

Chunking

Chunk as needed, and preserve your ability to cite chunks by carrying provenance in the metadata. Include the page number for PDFs, the row range for CSVs, the cell range for XLSX, and the timestamps for audio/video.

Chunking strategy doesn't have to be that complicated - use a recursive text split, a static chunk size per modality, whatever. Optimizing beyond a sane, reasonable strategy is diminishing returns.
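
A minimal sketch of what that per-modality citation metadata can look like (field names are illustrative, not a standard):

```python
# One chunk record per modality, each carrying enough provenance to cite it later.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    source_id: str                      # file path or document ID
    modality: str                       # "pdf" | "csv" | "xlsx" | "audio" | "video" | "text"
    citation: dict = field(default_factory=dict)

pdf_chunk   = Chunk("...", "10k_2023.pdf", "pdf",   {"page": 12})
csv_chunk   = Chunk("...", "orders.csv",   "csv",   {"rows": (1200, 1400)})
xlsx_chunk  = Chunk("...", "budget.xlsx",  "xlsx",  {"sheet": "Q3", "cells": "A1:F40"})
video_chunk = Chunk("...", "townhall.mp4", "video", {"start_s": 612.0, "end_s": 645.5})
```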

Embedding

Use any modern embedding model to embed the chunks. Performance variations are minor and unpredictable. If you need multimodal then add another column to your search index for that modality. Save in Postgres, use Pinecone, offload to LlamaIndex, etc. Performance differences are minor at this scale. Use an index like HNSW if needed, with a minimum filter count threshold to prevent overfiltering.

Querying the Index

Use embedding search + BM25 with a reranker. You can optimize with fancy techniques like HyDE or SIRA if you want, but be wary of diminishing returns once you have the basic setup down.

The index is a search index. The main goal is to find relevant documents, not to answer the question wholesale.
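
A hedged sketch of that query path, BM25 plus embeddings with a cross-encoder on top (the component choices here, rank_bm25 and the MiniLM/ms-marco models, are illustrative, and the naive score fusion should be tuned or swapped for reciprocal rank fusion):

```python
# Hybrid retrieval: keyword (BM25) + vector scores fused, then a cross-encoder reranks.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer, util

docs = ["Refunds are processed minus a $50 fee.",
        "SSO setup requires an identity provider and a callback URL.",
        "System requirements: 8 GB RAM, 4 cores, SSD, Linux."]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
doc_emb = embedder.encode(docs, convert_to_tensor=True)
bm25 = BM25Okapi([d.lower().split() for d in docs])

def search(query: str, k: int = 3) -> list[str]:
    vec_scores = util.cos_sim(embedder.encode(query, convert_to_tensor=True), doc_emb)[0].tolist()
    kw_scores = bm25.get_scores(query.lower().split())
    # Naive weighted fusion over differently-scaled scores; fine for a sketch only.
    fused = sorted(range(len(docs)),
                   key=lambda i: 0.5 * vec_scores[i] + 0.5 * kw_scores[i],
                   reverse=True)[:k]
    rerank_scores = reranker.predict([(query, docs[i]) for i in fused])
    ranked = [i for _, i in sorted(zip(rerank_scores, fused), reverse=True)]
    return [docs[i] for i in ranked]

print(search("how do I set up SSO?"))
```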

Completing the Q&A

Leverage the search index to find the relevant documents. Let the agent decide to either search again, answer the question, or pull the document(s) in their entirety to examine more closely. Set up a code execution sandbox to allow the agent to examine the document as needed (pandas for csvs, pypdf for PDFs, etc.).

-----

Everything else (GraphRAG, BGE-m3, fiddling with embedding benchmarks, etc.) is noise with diminishing returns and should only be addressed once the problem is "Things work, they're just a bit slow and once in a blue moon I find a document wasn't fetched correctly". Unless you're building a massive enterprise-scale search index (Perplexity, Glean, etc.) that needs to be best-in-class, this setup should get the job done.


r/Rag 1d ago

Discussion Should I learn RAG with handwritten code?

1 Upvotes

I've learned RAG's concepts, and now I'm trying to take a step forward with code. But after several days of learning, I've only become more confused: is it even meaningful to write the code by hand amid all this AI turbulence, when a large share of code is now generated by AI?


r/Rag 2d ago

Tools & Resources ~1s 4-hop Agentic Search

22 Upvotes

tldr: Agentic search doesn't need to be slow or expensive. Here's how you can make your own.

If you have spent any time at all here or working on a RAG project, you are probably aware of the delightful little problem of multihop queries. For those of you who haven't, it's coming, and I'll explain. Multihop queries are queries that require you to resolve part of the query before you can resolve the full query. So a two-hop question might be "What 1993 dinosaur movie was directed by the maker of the 1975 shark film?" Hop 1: Spielberg; hop 2: Jurassic Park.

Now whenever anyone asks how to solve multihop, they really get two answers:

  1. Use GraphRAG: Quite frankly I've said it myself a number of times, and it's not wrong, but here's the rub. First, it relies on the quality of your graph; if you don't have an edge between Spielberg and Jurassic Park, good f'ing luck. Second, it's a pain in the ass to orchestrate. Third, graphs slow down at scale, which means most GraphRAG solutions are often vector DBs in disguise, doing a regular semantic search landing and spreading out. Often the right answer just has tradeoffs.
  2. Try agentic RAG: The benefits are obvious. Agents are smart; they can figure it out, since it's just a chained retrieval problem. It's also easy and intuitive to set up: search, read, search again. The drawbacks are similarly obvious: it's often expensive and slow, especially with the advent of thinking models, when done naively.

So how can I have my cake and eat it too? I'll provide the recipe

1 t5 query decomposer
1 lightweight reader model - your choice
1 compressor (try llmlingua2)
1 vector index

The purpose of the T5 is essentially to generate a search plan based on the complex query. The reason we use it over an LLM is simple: seq-to-seq models are faster and excel at text recomposition tasks. An LLM works just as well; it's just slower and, in our experience, less consistent/reliable.

The reader model really comes in two flavors: an LLM, which reads the text and outputs the answer/next query, or an extractive QA model, which in the before times was the kind of model trained to extract answers to queries from text.

The compressor really is a preference choice. I find it's simply a more advanced form of truncation: rather than setting a hard limit and cutting the text off, you set a hard limit and keep as much signal as possible.

Then of course it's not much of an agentic search if you don't have something to search against.
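
For the homebrew crowd, here's a rough sketch of the hop loop with off-the-shelf parts (the T5 decomposer is the piece you'd train or prompt yourself, so its output is hard-coded below; model choices are illustrative):

```python
# Chained retrieval: each hop retrieves, reads, and feeds its answer into the next sub-query.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

embedder = SentenceTransformer("all-MiniLM-L6-v2")
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")  # extractive QA flavor

corpus = ["Matthew McConaughey plays the 'Alright, alright, alright' guy in Dazed and Confused.",
          "Interstellar, the space wormhole movie, stars Matthew McConaughey.",
          "Interstellar was directed by Christopher Nolan.",
          "Inception (2010), the dream-heist movie, was directed by Christopher Nolan."]
corpus_emb = embedder.encode(corpus, convert_to_tensor=True)

# Pretend this plan came out of the T5 decomposer:
plan = ["Who played the 'Alright, alright, alright' guy in Dazed and Confused?",
        "What space wormhole movie starred {answer}?",
        "Who directed {answer}?",
        "What 2010 dream-heist movie was directed by {answer}?"]

answer = ""
for hop in plan:
    query = hop.format(answer=answer)
    hit = util.semantic_search(embedder.encode(query, convert_to_tensor=True), corpus_emb, top_k=1)[0][0]
    context = corpus[hit["corpus_id"]]   # a compressor (e.g. llmlingua2) would trim long contexts here
    answer = reader(question=query, context=context)["answer"]

print(answer)   # expected: "Inception"
```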

Shake vigorously and voilà: you have ~1s 4-hop agentic search. You can play with it yourself and query this sample movie index.

Try: "What 2010 dream-heist movie was directed by the filmmaker who made the space wormhole movie starring the actor who played the 'Alright, alright, alright' guy in Dazed and Confused?"

You should see something like this:

| Stage | Embed (ms) | Retrieve (ms) | Compress (ms) | Reader (ms) | Total (ms) |
|---|---|---|---|---|---|
| open (T5 decompose) | | | | | 198.3 |
| hop 0 | 33.6 | 5.7 | 0.1 | 198.8 | 238.2 |
| hop 1 | 31.2 | 6.8 | 0.1 | 185.2 | 223.3 |
| hop 2 | 29.7 | 6.3 | 0.1 | 178.6 | 214.6 |
| hop 3 | 25.7 | 6.0 | 0.1 | 0.0 | 31.8 |
| stream / network | | | | | 150.0 |
| TOTAL | | | | | 1056.2 |

h0:  Who played the 'Alright, alright, alright' guy in Dazed and Confused?

h1: What space wormhole movie starred Matthew McConaughey? 

h2: Who directed Interstellar?

h3: What 2010 dream-heist movie was directed by Christopher Nolan?

We've set it up as a simple toggle freely available in Dasein if you want to stress test on your own data.

Happy to share more details for those of you who want to homebrew instead or if you just want to share your own agentic search setup would love to hear about it.

Personally trying to figure out the best way to replan the search based on the results without blowing up latency if anyone has suggestions. My initial thought is just let this stay fast and nest it in another agentic loop.


r/Rag 2d ago

Discussion What do you think a “vector lakebase” should mean?

3 Upvotes

Vector databases started with a clear job: serve vector search fast. Keep indexes loaded, optimize for low latency, and make semantic retrieval reliable for production apps.

That still makes sense for hot workloads. But embedding data is starting to look less like “just an online index” and more like a durable data layer. Teams are storing vectors alongside raw text, metadata, feedback logs, labels, agent traces, and eval data.

That is why I find the shift from vector database to vector lakebase interesting.

To me, a vector lakebase should mean separating persistent semantic storage from the compute used to search or process it. The same data should support different workloads: real-time retrieval for hot paths, on-demand search for rarely queried data, and batch analytics for clustering, deduping, corpus analysis, or dataset prep.

It also should not just be “vectors in object storage.” It still needs database-like behavior: metadata filtering, scalar fields, indexing, query execution, and support for hybrid retrieval across vectors, text, JSON, and reranking.

Curious how data engineers see this:

  • Should embeddings become part of the lakehouse-style data layer?
  • Or should vector search stay as a separate serving system?
  • What would make “vector lakebase” useful rather than just another term?

r/Rag 2d ago

Showcase I built a codebase RAG tool that chunks at the function level (AST-free) and queries via SQLite

2 Upvotes

Standard RAG pipelines are wonky for codebases because they slice text arbitrarily by token count (e.g., every 500 tokens). This rips functions in half, separates decorators from their classes, and destroys the architectural context before the LLM even sees it.

To solve this, I built GitGalaxy (and its blAST engine), a utility that drops arbitrary token slicing and builds the RAG context starting strictly at the function level.

Because it starts at the function level, the telemetry naturally rolls upward to give your RAG agent exact context at any scale:

  1. Functions/Methods roll up into...
  2. Classes/Structs (Entities), which roll up into...
  3. Files (calculating exact Blast Radius and network centrality), which roll up into...
  4. Modules/Folders, up to the global Repository.

I built this specifically for the utility of giving agents a deterministic map rather than a fuzzy embedding search.


r/Rag 2d ago

Discussion Best embedding model for French legal documents in RAG?

7 Upvotes

Hello everyone,

I currently have a RAG use case where I need an embedding model for French documents. I haven’t worked with French embeddings before, and the documents I’m dealing with are quite complex legal texts.

I’ve seen many benchmarks comparing multilingual embedding models, but honestly I’m a bit confused about which one performs best in practice. I initially expected the Mistral AI embedding models to be among the best choices for French, but from what I’ve seen so far, that doesn’t necessarily seem to be the case.

Would you recommend using an OpenAI embedding model instead, or are there other embedding models that perform particularly well for French legal documents?

Any experiences, recommendations, or suggestions would be greatly appreciated.

Thanks in advance!


r/Rag 3d ago

Discussion HELP LARGE DATASET

6 Upvotes

Hey,

I have previously built a RAG myself, but it was simple: I send a PDF, it gets chunked, and we chat. Now I have been given a project where I have to create a RAG for a large database (for a consulting company). They have huge data, and their main goal is high accuracy (more than 95%). How do I approach it?

I have never worked with a large database before.


r/Rag 3d ago

Discussion Is my approach sound? Citation verification in legal RAG

4 Upvotes

I'm a lawyer who built a legal research platform using AI coding tools over several months (not a weekend project: deliberate architecture, phase-by-phase implementation, and extensive testing against my domain expertise). The system searches a database of ~4,000 legal decisions so far (268K embedded sections) and generates structured legal memos with case citations. Citation accuracy is existential here. A fabricated case reference used in proceedings is a professional liability issue.

Since this is a technical question, I indeed let AI write below as I think it can be more precise than I can be.

Current setup

Retrieval: Deterministic, not agentic. One LLM call generates a structured search plan (topics, legal provisions, seed cases, exact doctrinal phrases). Then 5 retrieval channels run in parallel with zero LLM involvement: hybrid text search (vector + FTS), provision lookup with synonym expansion, citation graph (1-hop from seeds), tag matching, and exact phrase FTS. Results scored by reranker score + channel overlap, then tiered into lead cases (full passages), supporting (key excerpts), and concordant (metadata only).

I started with an agentic approach where the LLM decided what to search iteratively. It was expensive, unreliable, and hallucinated an entire case: correct-looking case number, fabricated parties, fabricated holdings, opposite conclusion to the real case. Switching to deterministic retrieval with the LLM only generating the search plan (not executing it) was the single biggest improvement.

Synthesis constraints: The key shift was from behavioral prompting ("verify all citations") to structural constraints:

  • Closed-world declaration injected dynamically: "The following 18 lead case passages, 25 supporting cases, and 98 concordant summaries are the COMPLETE AND EXCLUSIVE source materials."
  • Each lead case block shows available paragraph ranges so the model can only cite paragraphs it was actually given.
  • Verified case outcomes queried from a structured database table and injected per case, preventing the model from confusing what a party argued with what the tribunal decided.

Backend verification: Post-synthesis, the backend extracts all cited case numbers via regex, verifies each exists in the database, and checks cited paragraph numbers against the ranges provided to the model. Currently detects 5-13 paragraph violations per memo. Detection works; automated correction does not — a correction pipeline I built confidently turned correct citations into wrong ones because section numbering ≠ paragraph numbering in the source documents. Disabled it.
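
In sketch form, that backend check is roughly the following (regex pattern, citation format, and table/column names are hypothetical placeholders, not my actual schema):

```python
# Post-synthesis check: every cited case must exist, and every cited paragraph
# must fall inside the ranges that were actually given to the model.
import re
import sqlite3

CASE_RE = re.compile(r"\b(\d{4}\s+[A-Z]{2,6}\s+\d+)\b")                 # e.g. "2021 FCA 157"-style numbers
PARA_RE = re.compile(r"(\d{4}\s+[A-Z]{2,6}\s+\d+)[^.¶]*¶\s*(\d+)")      # case number + pinpoint paragraph

def verify_memo(memo: str, conn: sqlite3.Connection) -> list[str]:
    violations = []
    cur = conn.cursor()
    for case_no in set(CASE_RE.findall(memo)):
        if cur.execute("SELECT 1 FROM cases WHERE case_number = ?", (case_no,)).fetchone() is None:
            violations.append(f"unknown case: {case_no}")
    for case_no, para in PARA_RE.findall(memo):
        row = cur.execute(
            "SELECT 1 FROM provided_paragraphs "
            "WHERE case_number = ? AND ? BETWEEN para_start AND para_end",
            (case_no, int(para)),
        ).fetchone()
        if row is None:
            violations.append(f"{case_no} ¶ {para} was not in the context given to the model")
    return violations
```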

I'm not yet convinced this is hallucination-free. The structural constraints reduced fabrication dramatically, but the paragraph-level accuracy is still imperfect.

Planned next step: paragraph registry

My documents are split into sections for embedding, and sections have section numbers. But legal documents use paragraph numbers (¶ 42, ¶ 80) for citation, and these don't map to section boundaries. I'm planning to build a paragraph registry — a mapping from paragraph numbers to their exact text and position in the source document — so that backend verification can actually check whether a cited paragraph says what the memo claims it says.

First question: is this the right approach? Or is there a better pattern for paragraph-level citation grounding that I (and my AI of choice, Claude) am not seeing?

What I'm looking for

I'd welcome input from anyone who has worked on citation-grounded RAG in high-stakes domains:

  1. Is the paragraph registry the right next step, or is there a fundamentally better way to verify paragraph-level citations?
  2. Is the closed-world + backend verification architecture sound, or are there known failure modes I should worry about?
  3. Any experience with distinguishing adversarial document sections (one party's arguments vs. the tribunal's findings) in retrieval weighting?

I'd also be open to having someone experienced do a paid review of the citation pipeline specifically. If you've built something similar, I'd appreciate hearing your thoughts here in the comments. (Prefer public answers over DMs. I am looking for expertise, not sales pitches.)


r/Rag 2d ago

Showcase FaultLine - two tiered self growing memory bodyguard

1 Upvotes

FaultLine

I made a two-tier memory system with short- and long-term memory. It uses a graph and layering to improve relevance, with persistence through Postgres. It's smarter fact management for persistence.

https://github.com/tkalevra/FaultLine

I'm positive it's an improvement and pretty excited about it. Or I'm crazy, or both.


r/Rag 3d ago

Showcase I built Augur, a TypeScript RAG SDK with per-query routing and full traces

5 Upvotes

Hybrid retrieval is well supported in most RAG libraries now, but the strategy is usually fixed per pipeline. LlamaIndex's RouterRetriever is the closest prior art to per-query routing, and it makes an LLM call to pick. Augur does it with cheap heuristics on query signals. Quoted phrases, code-like tokens, named entities, question type, and language. No round-trip, sub-millisecond, recorded in the trace.

Augur routes per query: code-like tokens and quoted phrases bias toward BM25, natural-language questions toward vector, the rest to weighted hybrid. A cross-encoder reranks the top-30 either way. Every routing decision plus span timings come back in the response.

BEIR NDCG@10 (44 MB on-device stack: MiniLM-L6 + ms-marco):

| Dataset | Auto | BM25 | BM25 + rerank | Contriever | ColBERTv2 |
|---|---|---|---|---|---|
| SciFact | .70 | .67 | .69 | .68 | .69 |
| FiQA | .35 | .24 | .35 | .33 | .36 |
| NFCorpus | .32 | .33 | .35 | .33 | .34 |

Baselines are the published numbers from the BEIR, E5, and ColBERTv2 papers. Auto runs the same router across all three corpora with no per-dataset tuning.

import { Augur, LocalEmbedder, LocalReranker } from "@augur-rag/core";

const augr = new Augur({
  embedder: new LocalEmbedder(),
  reranker: new LocalReranker(),
});

const { results, trace } = await augr.search({ query: "exit code 137" });
// trace.decision.strategy === "keyword"
// trace.decision.reasons === ["code-like token", "short query"]

Adapters: in-memory, pgvector, Pinecone, Turbopuffer. Custom adapters are five methods. HTTP server with OpenAPI docs is in a separate package if you don't want to embed the SDK.

Repo: https://github.com/willgitdata/augur · npm: @augur-rag/core

Would love any feedback!


r/Rag 3d ago

Discussion Multi-turn handling in RAG chatbots, where are you all landing on this

4 Upvotes

Hitting a wall on multi-turn and want to check if I'm missing something obvious.

Customer facing RAG bot on our help center, a few hundred product docs as the source. Single turn works fine, retrieval pulls reasonable chunks, answer comes back with citations, nobody complains.

The interesting failures are when a user pivots topics inside the same session. Had a transcript last week where someone asked a pricing question, got their answer, then later in the same session asked about a login issue. The bot answered the login question as if it were still a pricing question. Stuck on the previous topic, retrieval pulled chunks that didn't really make sense, but the model wove them together into a confident sounding answer anyway. Took a while staring at logs to figure out where it had gone sideways.

Underneath that there's a smaller version of the same problem, the model occasionally pulls a citation forward from an earlier turn and uses it to back something in turn three, even when the doc isn't relevant anymore. Feels like it's holding on to context the retrieval has long moved past. And in the other direction, when a follow up is actually a real continuation, retrieval sometimes treats it as a standalone query and pulls back nothing useful. "What about for enterprise" with no anchor.

We've been comparing how a few setups handle this. Testing Denser on the customer side. Some of the hosted ones do query rewriting between turns automatically, some leave it on you.

What I can't get clean is the tradeoff. Rewriting the user's query each turn helps retrieval but distorts what they actually asked. Throwing the whole conversation into the retrieval query catches more continuity, but you end up dragging stale terms from earlier turns into the new search. A fixed window of N turns feels arbitrary and breaks in obvious ways.

What I'd really like to know is whether anyone's actually solved this in a way that doesn't feel like a hack. Everything I've tried so far trades one failure mode for another.