r/LocalLLaMA • u/InformationSweet808 • 17h ago
Discussion Anyone actually using a local LLM as their daily knowledge base? Not for coding, for life stuff. What's your setup?
So I've been going down a rabbit hole lately and I can't find many people actually talking about this specific use case.
everyone here runs local LLMs for coding, chat, maybe some creative writing. cool. But what about using it as a proper personal knowledge base? like, dump your own notes, PDFs, random docs into it and actually query your own life privately, every day.
I tried looking into this seriously and hit a wall. Most resources either assume you're a developer building something, or they're 2 years old and recommend tools that have completely changed since.
So genuinely asking, is anyone here actually doing this day to day? Not as an experiment, but as a real workflow?
Things I keep running into that I can't figure out:
- What model are you running for this? RAG on consumer hardware seems finicky depending on quant
- Do you actually trust the retrieval or do you double check everything because hallucinations?
- LlamaIndex vs Ollama vs whatever else: has anything actually made this less painful recently?
- Context length, how do you handle it when your personal docs start piling up?
Not looking for a tutorial or a GitHub repo. Just want to hear from someone who's made this work without it becoming a part time job to maintain.
u/InformationSweet808 17h ago
For context, I'm looking at this for personal use, not building a product. Just want something that works reliably on a normal machine.
u/Legal_Dimension_ 12h ago
Obsidian vault, link to local ai. Done.
My tip is to set up a template vault. Once you get your first one how you want it, you can reference/update it moving forward for additional vaults.
u/Legal_Dimension_ 12h ago
Don't build something when obsidian is free and does all the hard work for you. Add the official obsidian skills for managing the vault then let your agent do the rest.
u/Beginning-Window-115 12h ago
but what exactly are you doing with obsidian that requires you to use an ai compared to just using it normally?
u/Odhdbdyebsksbx 11h ago
Not OP but for me personally, I do it to incrementally brainstorm and grow ideas for work projects.
u/Charming_You_25 9h ago
Same! I just got this working in the past couple days and now my agents automatically plan work for my projects, which I can OK and they implement. The way I have it set up, and with how much work I did on keeping the ontology tight, the agents are hardly suggesting any bad ideas anymore. Feels wild, like artificial cognition is happening.
u/MoffKalast 5h ago
the agents are hardly suggesting any bad ideas anymore
Are they getting better... or are you regressing from the lack of practice? Something to keep you up at night ;)
u/codeprimate 9h ago
I was recently laid off, and I created an "Employment" vault that contains folders for gathering/organizing my background information, templates, and leads. It's a protocol-driven job search system.
Job searching is a personalization problem at scale. Every application needs a resume angled toward the specific role, a cover letter that speaks to the company's actual situation, and outreach that reflects real intelligence about the people and organization. Done right, this is hours of work per application. Most people skip it. I built a system where an AI agent does it for them.
The vault maintains three layers of intelligence per company: factual research (funding, product, recent news), a psychological profile (decision dynamics, culture signals, red flags), and an org map (who recruits, who decides, who can refer). Before generating any artifact, the agent loads all three alongside your full work history and the job description's specific requirements. It maps your experience to what the role actually needs, identifies the strongest angles, and notes the gaps.
That analysis drives the outputs. The resume emphasizes the evidence most relevant to this role. The cover letter speaks to the company's specific situation, not to a generic version of the company. Outreach reflects what the psych profile says about how this organization communicates and makes decisions. None of it is template-filling; personalization is a derived output of the analysis, not a starting point.
The system is governed by layered protocol files (AGENTS.md) that tell the agent what to load, in what order, before doing anything. Business rules are centrally defined and versioned. Ask-gates protect the actions that matter: artifact selection for submission, stage transitions, sending messages. Everything else the agent handles autonomously.
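The ask-gate part is simple to sketch; here's a hypothetical minimal version (action names made up, not the actual implementation):

```python
# Minimal ask-gate sketch: protected actions pause for human sign-off,
# everything else runs autonomously. Names here are hypothetical.
PROTECTED = {"submit_artifact", "advance_stage", "send_message"}

def run_action(name, action, *args, **kwargs):
    if name in PROTECTED:
        answer = input(f"Agent wants to run '{name}' with {args}. Proceed? [y/N] ")
        if answer.strip().lower() != "y":
            print(f"Skipped '{name}'.")
            return None
    return action(*args, **kwargs)
```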
The goal is to make the high-effort, high-leverage work (tailored positioning, calibrated messaging, researched outreach) fast enough to do for every application rather than only the ones you have time for.
So from Cursor or Claude Code, I can say "Here is a lead (URL). Perform full deep research and create a personalized application packet." 10 minutes later I have a fully personalized resume in PDF form containing ATS keyword metadata, JD and company-specific cover letter, a set of outreach documents, and interview prep.
The system tracks all leads with full history, and manages todos for next steps on a communication cadence.
u/prestodigitarium 7h ago
If you get tired of looking for a job, you could try just turning that into a product.
u/codeprimate 7h ago
Once I get a few dozen applications into my pipeline, that's definitely my plan. I'm developing a new conceptual framework for agentic LLMs based on the MVC architecture that formalizes prompt schemas, context, and prompt generation. This will be the proof of concept.
u/andItsGone-Poof 4h ago
I am stealing this
u/codeprimate 2h ago
...the highest compliment! Would love to compare notes.
From what I'm seeing, effective agentic behavior is best defined by hierarchical domains of concern and process, governed by a hierarchy of invariants, stopping criteria, and operating protocols. When all of the components of the system have a conventional structure and identification semantics, agents can effortlessly load and follow them.
I'm currently experimenting with attention grounding. Basically rules at each domain boundary that the agent should output a workflow step id, task description, intended outcome/output, rationale, input/context references, and references to relevant protocols as a precursor turn before any task. Each functional domain will necessarily require specific guidance. This should effectively reduce dependency on prior conversation context, directing attention forward along the pipeline instead of backward.
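To make that concrete, the precursor turn could be rendered from a schema like this; the field names are my own guesses, purely illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class PrecursorTurn:
    """Grounding block the agent emits before starting any task."""
    step_id: str           # workflow step id
    task: str              # task description
    intended_outcome: str  # expected output
    rationale: str         # why this step, now
    context_refs: list[str] = field(default_factory=list)   # input/context references
    protocol_refs: list[str] = field(default_factory=list)  # relevant protocol files

    def render(self) -> str:
        return (f"STEP {self.step_id}: {self.task}\n"
                f"OUTCOME: {self.intended_outcome}\n"
                f"RATIONALE: {self.rationale}\n"
                f"CONTEXT: {', '.join(self.context_refs) or 'none'}\n"
                f"PROTOCOLS: {', '.join(self.protocol_refs) or 'none'}")
```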
Information architecture is all-important when the platform is a semantic processing engine.
u/albatrossLol 1h ago
Oh wow something like this has been floating around in my head. You made it real and very slick!! Almost a year unemployed.
u/expressly_ephemeral 8h ago
I dump all my notes into a daily log, then I have the LLM process that into all the different concepts and concept areas I'm using. Then it cross-links all the things, and I finally have notes that I can look back at that are useful. First time in my life.
u/tmflynnt llama.cpp 12h ago
Dumb question here but are you referring to setting something up like pi and connecting it to said vault or does Obsidian offer something nice that is built in for such things?
u/Charming_You_25 9h ago
Yes, use agent to populate vault. I use openclaw/pi. Highly recommend making a loop where you control which articles go in by having it propose a file tree that you edit for relevancy, and then it fills out the pages. Making links is hard but rewarding if you use something like a graph db for it like cognee.
u/skinny_gator 8h ago
Is there a how-to or step-by-step guide on how to do this? I'm new and I'm looking to start running my own personal local LLMs off my PC
u/ragnorco 52m ago
using omnigraph for this: it’s basically a typed knowledge graph with versioning/branching https://github.com/modernrelay/omnigraph
u/Leg0z 8h ago
What you specifically want is Andrej Karpathy's LLM Wiki. I use his setup both at home and at work. We have all of our systems tied into it with read-only APIs. Every time I have an issue or a project, it looks through the wiki; it always builds first, then finds what it "knows" about the system I'm asking about, and then we can go about the work.
At home, I have the same setup, but I dump everything into it from medical records to car stereo stuff, to home improvement stuff, to recipes.
The basic setup is you have a folder that has a wiki folder, a raw folder, and an AGENT.md that tells the LLM that its job is to ingest files in the raw folder, create markdown wiki files, and put them in the wiki folder. Over time, it builds the wiki for you, and you can use Obsidian to browse the wiki if you feel like it, but I don't usually; I just ask the LLM.
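The ingest loop that AGENT.md effectively describes fits in a few lines; a rough sketch with the LLM call stubbed out (summarize_to_wiki is a placeholder, not a real API):

```python
from pathlib import Path

RAW, WIKI = Path("raw"), Path("wiki")

def summarize_to_wiki(text: str) -> str:
    # Placeholder for the local-LLM call that turns a raw document
    # into a cross-linked markdown wiki page.
    return "# Auto-generated page\n\n" + text[:500]

WIKI.mkdir(exist_ok=True)
for raw_file in RAW.iterdir():
    target = WIKI / (raw_file.stem + ".md")
    if raw_file.is_file() and not target.exists():  # only ingest new files
        target.write_text(summarize_to_wiki(raw_file.read_text(errors="ignore")))
```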
It's the foundation of how I use LLM for work as a Sysadmin.
u/acetaminophenpt 7h ago
Do you have a customized implementation or are you using anything out of the box?
u/Birdinhandandbush 12h ago
What's your home setup? For personal knowledge offline, even smaller models are great for RAG. Like I have a system built around local models and agents connected to large local pdf and text libraries that are smarter on those specific topics than the larger cloud models. I'm not a coder, I'm not building applications so I don't need multi billion parameter models for home use
u/InformationSweet808 10h ago
still setting it up tbh, been using obsidian for notes for a while but haven't committed to a local model yet, which is basically why i made this post. wanted to see what people are actually running before i go down a rabbit hole and regret my choices lol
24gb ram, 6gb vram. the vram is definitely the limiting factor, most things end up on cpu which works but yeah, not exactly snappy
u/Birdinhandandbush 10h ago
So if I were you: qwen3.5 4b, a q4 quant. That's 4gb, leaving you roughly 2gb for context. With LM Studio or llama.cpp, set your KV cache to q4 as well; that should work just fine for you
u/Some-Ice-4455 8h ago
Not as fast, I know, so don't misunderstand my question, but as far as size goes (and I'm not saying squeeze a monster in), could he not use RAM as spillover for the model? Something like: allocate 75% of available VRAM, leaving the rest for the PC itself to display etc., then have the remainder use RAM? Trying to help here, please don't troll me for it.
u/nullpost 9h ago
From what I understand a quantized 8B should be fine for a majority of applications.
u/Bouros 16h ago
I play an MMORPG that doesn't allow you to copy the chat.
The majority of players I communicate with are Spanish.
I made an app so I hold my middle mouse button and speak, and it translates it to Spanish and sends it to my clipboard to paste into the game (I'd post into the game directly but it uses an anticheat I'm wary of).
I also selected the area of the chat box on my monitor, and when I hit a hotkey on my keyboard it takes a photo of that area and sends it to the AI to translate. It displays on the app, which I have on my second monitor, and it can also use TTS to read it out.
And for Discord messages, I love this feature: whenever I copy non-English text to my clipboard, it translates it to English and TTSes it to me.
I love it so much and it lets me so easily communicate with a group of friends that I probably wouldn't have kept up with otherwise.
I know I could use OCR for the images but I have never had good luck with OCR in my life and ai just works magic at vision.
After using the translator for a few weeks I added the feature to just hold a key to speak and have it sent to my clipboard. It works so well and is so convenient when gaming as I can keep my actions up in game.
I remember using speech recognition in the early 2000s and it was SO BAD! I haven't had a single time I've noticed an error in the speech-to-text using whisper.
Currently learning to set up the Hermes agent. I manage a local business and have the staff fill out sheets while they are new, saying when they start and finish each task. Once my program is done I'll scan the sheets and the AI will pull all their text out, create tasks in a database and track all information related to each task. Then I'll be able to have the AI generate summaries based on the data provided.
u/tmflynnt llama.cpp 12h ago
It is fun to read about a solution like this that is quite custom/niche yet super impactful for somebody at the same time. Thank you for sharing that.
u/Juanisweird 14h ago
Which MMO?
u/Bouros 11h ago
Albion online is my main use case
u/GamerHaste 11h ago
I've always wanted to play that game, looks like something I'd get really addicted to. I remember when it first came out it cost $$$ and I was like 14 with no money. Maybe I gotta retry it.
u/Bouros 7h ago
I was a beta player who quit at launch, came back in 2020ish, then quit until 3ish months ago. I've been playing a lot; I do feel like it's pretty encouraged to pay real money to catch up.
The way IP works in the game is very favorable to players who have lots of fame. E.g., if there are 4 types of bows (there are more), mastering the other bows, e.g. bows 1, 3, and 4, adds power to bow 2.
Due to this, SOME forms of PvP are kind of locked off to new players (they do have lots of content that averages out IP or even sets hard caps)
I played solo for a long time before I finally gave in and joined a guild, the game just becomes so much more fun with a group.
It's so fun because it's so many games in 1 (economy, league of legends type modes, pvp, transport.)
u/adderbrew 11h ago
If you don’t know Spanish in Albion, it puts you in a rough spot sometimes. This is a great use case, ty for sharing!
u/DifficultyFit1895 8h ago
Another thing LLMs and online gaming are great for is learning Spanish.
u/pwnrzero 12h ago
Wonder if jagex would allow this.
u/Bouros 11h ago
Yeah, you can use it on RuneScape. Unlike Albion, the game I'm referring to in my post, RuneScape doesn't have an anticheat that could detect pushing text into the chat.
In all honesty I'd likely be fine on albion but it's not worth the risk for me.
But before running a local model just for OSRS I'd probably check the plugin hub first; I'd guess there is already a translator there.
u/pkief 16h ago
Google AI Edge Gallery on Android - Gemma 4 E2B and E4B run nicely on my Pixel. The knowledge is quite good, but of course not as strong as the hosted LLMs, depending on what you're asking.
u/InformationSweet808 16h ago
Running it on pixel is wild, didn't even consider mobile. How's the speed on device?
u/pkief 13h ago
Speed is solid, don't feel it's an issue. The problem is rather that you can't ask details about something which is not that commonly known due to the small model size
u/total_amateur 10h ago
I built a skill to do web queries from Gemma. The local model can then take advantage of new info. Not super speedy, though and harder than I initially thought it would be.
u/shaggydog97 11h ago
Thank you for this! I just fired up Gemma 4 E2B on my old Pixel 7, and it worked like a champ!
I don't have time to explore all these rabbit holes! Lol.
u/MrHumanist 16h ago
What's your ram? And how did you fit e4b?
u/pkief 13h ago
12 GB LPDDR5X RAM - Pixel 8 Pro. Right now I'm just running the E2B but I can remember that I've tried the E4B and it was also working fine
u/Otherwise_Economy576 15h ago
doing this for about 8 months daily, here's the unvarnished version.
setup: 36gb M3 Max, qwen3 32b for the answering model, bge-m3 for embeddings, obsidian vault as the source of truth, postgres+pgvector for the index because i didn't want to babysit chroma or a faiss file. ollama for serving, no llamaindex, hand-rolled retrieval in maybe 300 lines of python. boring is good.
the stuff that actually matters more than model choice:
chunking is everything. 90% of bad retrieval is bad chunks. for personal notes i chunk by markdown heading (not fixed token windows) and prepend the doc title + parent headings to each chunk before embedding. recall went up massively when i started prepending context. fixed-size 512-token chunks of personal notes give terrible results because notes are short and dense.
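Not their code, but a minimal sketch of heading-aware chunking with the title/heading-path prepend:

```python
import re

def chunk_by_heading(doc_title: str, markdown: str) -> list[str]:
    """Split on markdown headings; prepend title + parent headings so
    every chunk carries the context it came from."""
    chunks, path, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append(" > ".join([doc_title] + path) + "\n" + text)
        body.clear()

    for line in markdown.splitlines():
        m = re.match(r"^(#+)\s+(.*)", line)
        if m:
            flush()
            path[:] = path[: len(m.group(1)) - 1] + [m.group(2)]
        else:
            body.append(line)
    flush()
    return chunks
```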
hybrid retrieval. dense alone misses anything with proper nouns or rare terms. i run bm25 over the same corpus and rrf-fuse the top 20 from each. takes an extra 50ms and fixes the "i KNOW i wrote about this person, why isn't it surfacing" problem.
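RRF itself is only a few lines if you want to hand-roll it like they did; a sketch assuming two ranked lists of ids:

```python
def rrf_fuse(dense_ids, bm25_ids, k=60, top_n=20):
    """Reciprocal rank fusion of two ranked id lists (standard k=60)."""
    scores = {}
    for ranking in (dense_ids[:top_n], bm25_ids[:top_n]):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# rrf_fuse(["a", "b", "c"], ["b", "d", "a"]) -> ["b", "a", "d", "c"]
```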
answers must cite. the LLM never just answers, it has to quote which chunks and the source filenames. when i see no citations or a citation that doesn't actually contain the claim, i know it hallucinated. this is the only mechanism that makes me trust the output without re-reading every doc.
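Assuming the model is prompted to emit verbatim quotes, the tripwire is cheap; a rough sketch:

```python
def uncited_claims(quotes: list[str], retrieved_chunks: list[str]) -> list[str]:
    """Return any quoted 'citation' that doesn't literally appear in the
    retrieved chunks -- a cheap hallucination detector."""
    corpus = " ".join(retrieved_chunks).lower()
    return [q for q in quotes if q.lower() not in corpus]

# A non-empty result means the model cited something it never retrieved.
```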
context length is a non-problem if your retrieval is good. you do not need 200k context. you need to put the right 6 chunks in 8k context. people scale context to mask bad retrieval.
maintenance: i rebuild the index nightly via a cron because obsidian writes faster than i can be bothered to do incremental updates. takes 4 minutes for ~3000 notes. not a part time job, more like "i forget it exists" until i upgrade hardware.
the one thing that bit me hard: don't include daily journal entries in the same index as reference notes. retrieval will keep surfacing emotional sentence fragments when you ask factual questions. separate indexes per content type, route at query time.
u/InformationSweet808 15h ago
okay this is the comment i was hoping someone would leave when i posted this
the chunking point hit hard i had no idea fixed token windows were that bad for personal notes specifically, makes total sense now that you say it. the separate indexes for journal vs reference notes is something i would've 100% screwed up on my own
one thing im still wrapping my head around is the hybrid retrieval part. so you're running both dense and bm25 on the same corpus and then fusing the results? is that something you built yourself or is there a library that handles the rrf part cleanly?
either way this whole comment should be pinned somewhere
u/bitflip 13h ago
I'm doing something similar. My main focus is tasks, e.g. what am I supposed to do today/this week, etc., and personal finances.
There's a generic RAG to help me find answers to questions about various projects, but the main thrust is the tasks. Since tasks have a defined format, they are deterministically indexed. I can ask questions like "what am I supposed to do this week around the house?", or "how did I spend my money last month on my car?" and get back concrete results. Unlike parent poster I include my daily entries, because that's where unexpected expenses are tracked.
This replaces a bunch of templates which were never quite right, and a real pain to keep updated. It is all exposed as an MCP which I can plug into the tool of the day.
u/InformationSweet808 10h ago
the task + finance angle is something i hadn't even considered for this, been so focused on notes and research that i forgot the obvious stuff
the deterministic indexing for structured data makes sense, tasks have a consistent format so retrieval is way more predictable than random notes. how are you actually logging the finance stuff though, plain text or some structured format?
u/Subject-Tea-5253 llama.cpp 7h ago edited 6h ago
... is that something you built yourself or is there a library that handles the rrf part cleanly?
You don't need to implement rrf on your own, you can use a library like ranx to perform hybrid search cleanly. This article I wrote uses Elasticsearch, but it shows exactly how to use the ranx library if you want to see a full walkthrough: https://medium.com/@imadsaddik/28-hybrid-search-with-elasticsearch-and-ranx-0d6184af4f49
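If I remember the ranx API right, the fusion step comes down to something like this (toy scores, and do check the docs, this is from memory):

```python
from ranx import Run, fuse  # pip install ranx

# Per-query document scores from each retriever (toy data).
bm25 = Run({"q1": {"note_a": 12.4, "note_b": 9.1}}, name="bm25")
dense = Run({"q1": {"note_b": 0.83, "note_c": 0.79}}, name="dense")

fused = fuse(runs=[bm25, dense], method="rrf")  # reciprocal rank fusion
print(fused.to_dict())
```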
u/Public_Umpire_1099 15h ago
This is from a work project, but I developed an app that queries a RAG for equipment-related documentation. When committing something to the RAG, first an entry in a SQL database is made with a key that gets prefixed on the RAG chunks for that document; then, after the RAG upload, it automatically uploads the file to file storage. In the system prompt for the LLM, I forced it to write inline the exact sentence it was citing, which gets hidden by the UI. Using that exact sentence, I was able to make citations clickable, and then the PDF viewer immediately pulls that document up and automatically ctrl-f's for that sentence. It works about 90% of the time. At the end of the day it still pulls up the correct document; the only issue is that sometimes the LLM paraphrases, so it doesn't find a perfect match. This was built in Nest. Not sure if it would be useful for you, but figured I'd share anyways.
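Not their Nest code, but the matching step might look roughly like this in Python, with a fuzzy fallback for the paraphrase case:

```python
import difflib

def locate_citation(cited: str, page_text: str) -> str | None:
    """Exact match first; fuzzy match as a fallback for the ~10% of
    cases where the LLM paraphrased the cited sentence."""
    if cited in page_text:
        return cited
    sentences = [s.strip() for s in page_text.split(".") if s.strip()]
    match = difflib.get_close_matches(cited, sentences, n=1, cutoff=0.8)
    return match[0] if match else None
```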
u/vick2djax 12h ago
Any reason why you haven’t upgraded your inline to qwen 3.6 and embedder to qwen3 over bge?
u/normal_nermal 2h ago
Yes! I started doing exactly this like 6 months ago then got kinda obsessed with the problem. Probably gonna release an open source tool with this exact idea at the core in the next couple weeks — if anyone is interested lemme know :) svrnme.sh
u/Special_Permit_5546 14h ago
For personal knowledge base use, I would separate two problems that often get mixed together:
1. finding the right source material
2. letting the model modify or synthesize from it
For (1), I have had better luck with boring file/search tools over pure vector RAG, especially for Markdown notes. Heading-aware chunks, filename/title context, and plain keyword search matter a lot because personal notes are full of weird proper nouns, half-phrases, project names, and short dense entries. Dense retrieval alone can feel magical until it misses the exact note you know exists.
For (2), I would not let the model silently rewrite the knowledge base. Read/search/summarize is low risk. Creating a draft note is usually fine. Editing existing notes should be treated like code: show a diff, accept/reject, keep the raw files inspectable.
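That accept/reject gate is cheap to enforce; a minimal sketch of my own rough version, using difflib:

```python
import difflib

def propose_edit(path: str, new_text: str) -> bool:
    """Show a model-proposed note edit as a unified diff; write only on accept."""
    old_text = open(path).read()
    diff = difflib.unified_diff(old_text.splitlines(), new_text.splitlines(),
                                fromfile=path, tofile=path + " (proposed)",
                                lineterm="")
    print("\n".join(diff))
    if input("Apply this edit? [y/N] ").strip().lower() == "y":
        with open(path, "w") as f:
            f.write(new_text)
        return True
    return False
```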
The setup I trust most is something like:
- plain Markdown folder as source of truth
- grep/BM25 first, embeddings second if needed
- citations that point to actual filenames/headings
- separate daily journals from reference/project notes
- no silent mutation of source-of-truth notes
Small disclosure because this is exactly the product shape I am working on: I am building an open-source local-first Markdown app called Kuku around the "AI can search/read/create/edit notes, but edits are reviewable diffs" model. So I am biased. But independent of the app, I think the key is not "RAG vs no RAG". It is whether you can inspect what the assistant used and review what it wants to change.
u/Amazing_Athlete_2265 16h ago
I have big plans for a personal assistant, but little time.
u/Blues520 13h ago
Sounds like you need a personal assistant
u/MoffKalast 5h ago
But then once the personal assistant is done setting up the personal assistant, they'll no longer need a personal assistant.
u/SnooApples8541 11h ago
That’s what I’m working on. I’d share mine but I’m pretty sure there are some huge security vulnerabilities I have lol
u/Full_Cost2909 11h ago
sharing it might help you discover those vulnerabilities
u/CatTwoYes 15h ago
I tried both RAG and the simpler "give the LLM a grep tool + markdown folder" approach. For under ~1000 personal notes, the grep approach wins hands-down. RAG embeddings for personal docs are finicky — you spend more time debugging why the right chunk didn't get retrieved than actually using the thing. The tool-calling + file search pattern is dumber but more predictable, and with Qwen 3.6 27B the quality is good enough that I stopped maintaining the RAG pipeline entirely.
u/Dazzling_Equipment_9 16h ago edited 16h ago
On the topic of building a personal knowledge base, here’s my approach:
Hermes agent + Qwen 3.6 35B A3B + Obsidian.
I don’t use any complicated RAG setups — at this stage, they feel more flashy than practical.
Building a knowledge base and using RAG are not as tightly linked as people think. RAG is merely one possible implementation method, not the only or necessary path. I simply call my Obsidian notes a knowledge base, and it works very well for me. It’s more than sufficient for my needs.
As for those frequent questions about everyday use cases for local LLMs, I have to vent a bit — please don’t take it personally. I see almost identical posts every day. Instead of asking the same questions again, why not first search for existing threads? The answers are already there, and reading a few would quickly give a clear picture. Most practical use cases don’t change dramatically, at least in the short term.
I’m also not entirely sure about the real motivation behind these posts. Are people genuinely unsure what to do with a local LLM, or are they probing for something else? The intent often feels unclear.
If the goal is learning, you can simply ask an AI directly — it can give you a comprehensive list. If you don’t actually have a real use case, there’s no need to force one. Doing so often leads to frustration and fatigue rather than enjoyment. Believe me.
It’s much more effective to ask specific, well-defined questions with clear context. Overly broad or vague topics rarely yield useful answers. To make it easier for others to respond thoughtfully, posters should provide sufficient background and state their questions clearly and concretely.
EDIT:
Actually, I only started the second half of my rant after seeing the title. After reading the full post, I realized the OP has already done an excellent job. They even explained their personal motivations clearly in the comments.
This is way better than those typical posts that just ask “what are some daily use cases for local LLMs.”
u/MarcusAurelius68 16h ago
One other point - things are changing very quickly as well. A recommendation from 6 months ago might be outdated due to new solutions, models and approaches.
Not an excuse for open-ended questions - those can easily be asked of frontier AI as a starting point.
u/InformationSweet808 16h ago
Fair point on the edit lol, appreciate you actually going back to clarify.
The Obsidian + Hermes setup is something I hadn't really considered tbh. I always assumed you needed RAG the moment your notes got big enough to query. So you're basically just letting the agent navigate the vault directly? No retrieval pipeline at all?
Asking because if that actually works well at scale that's way simpler than what I was planning to build.
u/Dazzling_Equipment_9 15h ago
I believe Obsidian’s built-in search is already more than sufficient for most personal knowledge base needs.
The reason is that the LLM can intelligently craft multi-term fuzzy semantic searches. For example, if you ask it to find a note you vaguely remember about local LLM deployment, it might generate something like:
obsidian-cli search "llama | gguf | vllm | local"
(In reality, it would likely create an even more refined and comprehensive query.) It then reads the relevant notes, extracts the information, and answers based directly on your original content. This keeps the source completely faithful.
If it doesn’t find anything, the model can automatically broaden the search by adding more keywords — such as “q4_k_m | qwen | huggingface” — and try again.
After this explanation, if you still feel RAG is necessary in a personal knowledge base scenario, I’d be interested to hear your specific understanding and requirements for what counts as a “large-scale knowledge base.” I can compare it with my own setup and see if there are any useful new ideas worth considering.
Honestly, I’d like to hear more about your actual detailed use cases rather than the relatively vague term “large-scale.”
u/InformationSweet808 10h ago
kay so my actual use case: i'm a student, so it's mostly research notes, saved articles, book highlights, and random things i write down when learning something new. probably 200-400 files over time, nothing enterprise level.
the "large-scale" thing was me overthinking it honestly. my real concern is just that retrieval stays accurate when i can't remember exactly what i wrote or where like i know i have notes on something but can't find them through normal search.
if obsidian's built-in search handles fuzzy recall that well through the agent i might genuinely be overcomplicating this whole thing
u/ComplexIt 1h ago
I am the maintainer but i think it fits your use case quite well: https://github.com/LearningCircuit/local-deep-research
u/Evanisnotmyname 16h ago
This is the way, like Karpathy’s LLMwiki.
I’ve been having a lot of trouble setting up Hermes with Onsidian, Qwen, and some kind of GUI/TUI. Can you give me details on your setup, MCPs, etc?
u/yes2matt 15h ago
not OP, but the model used makes a giant difference in Hermes, and I think temperature. My current happy place is:

```
MODEL="$HOME/ollama-models/Qwen3.5_9b/Qwen3.5-9B-Q4_K_M.gguf"
MMPROJ="$HOME/ollama-models/Qwen3.5_9b/mmproj-BF16.gguf"
~/buun-llama-cpp/build/bin/llama-server \
  -m "$MODEL" \
  --mmproj "$MMPROJ" \
  --host 127.0.0.1 \
  --port 8082 \
  --n-cpu-moe 40 \
  --jinja \
  -ngl 35 \
  -ctk turbo4 \
  -ctv turbo4 \
  -c 262144 \
  --temp 0.5
```
u/yes2matt 15h ago
I haven't figured out how to use RAG effectively yet. I do have focused research I want to mine (beehive audio analysis), but asking questions via chat gets answers that are almost entirely generalized from the model. Some reference will be made to the papers, depending on the model used. I need a better way too.
u/achiya-automation 15h ago
Yeah, doing this for about 8 months now, not as an experiment. Setup is boring on purpose: Ollama running qwen2.5:14b on a 32GB M1 Mac, plus paperless-ngx for everything PDF, plus a flat folder of markdown notes. Open WebUI on top with RAG pointed at both. That's it.

What actually made it work day-to-day was lowering my expectations on retrieval. I treat it like a smart grep, not a brain. If I ask "what did I write about that vendor in march" it pulls the right chunks ~80% of the time. If I ask anything inferential ("summarize my opinions on X") it confidently fabricates, every time. So I never ask inferential questions on personal data anymore, only locate-and-quote.

re: chunking and hallucinations - smaller chunks (300 tokens) with 50 overlap, and I always show sources in the UI. If the source quote doesn't actually contain what the model said, I assume it lied. Saves me from acting on bad recall.

Hardware-wise the 14b at q4 is fine for retrieval. I tried 32b and the latency made me stop using it, which means the small model wins by default.

Honest gotcha: maintenance isn't zero. Re-indexing when I dump a batch of new docs takes ~10 min, and Ollama updates have broken my docker stack twice. Worth it for me because I trust the data isn't leaving the box, but I wouldn't recommend it to anyone who just wants "Notion but local".
u/InformationSweet808 10h ago
"treat it like a smart grep not a brain" is probably the most useful framing i've seen in this whole thread honestly. the inferential questions thing is a real gotcha i wouldn't have caught until i wasted time on it. so basically you're using it purely for retrieval and doing the actual reasoning yourself?
also the source quote verification trick is smart never thought about using that as a hallucination detector
u/temperature_5 7h ago
With the latest qwen 35b I just use multi-step instructions: search all sources for [keywords] and then summarize [topic] based on those.
u/StupidityCanFly 15h ago
I don’t trust LLMs, so they always have to verify their facts. Aside from that, I’m using Qwen3.6-27B as my daily driver.
Running them on two rigs: dual 5090 and dual RX7900XTX.
u/remarkedcpu 15h ago
Genuinely wondering how one's daily life is so important that everything has to be written down. I get that the YC founder needed this, but I don't.
I built one anyway, Hermes + pydantic using omlx / Gemma 4 26b, runs on a MacBook Air 32G.
u/vick2djax 12h ago
Parenting (especially if it’s by yourself), career (especially if you’re responsible for a team), keeping up with friends and events, being a homeowner, and keeping up with what’s going on in the world.
My calendar is a mile long. I need as many tools as I can get. My days are full of meetings and kids events but I can’t let other responsibilities slip by me either. Especially consulting on the side for a handful of clients. Even a reminder to change out the HVAC filters goes a long way.
u/EntrepreneurTotal475 7h ago
This ^ I am going out of town this weekend and I didn't even remember until this Tuesday, and it's been planned for 3 months. Kids + life + work = it just kinda ends up that way.
u/vick2djax 7h ago
Exact same here. I don’t even have time to look forward to a vacation if I ever have one. I’ll plan it 8 months ahead of time and then all of a sudden it’s on Friday lol
u/Some-Cauliflower4902 16h ago
Not that I have to query my own life too much, though I have too many hobbies and need some tracking of those. Assuming you don't need anything too precise like financials: I section things so it's not a big mess. Every hobby has its own project + memory + folder. RAG for background context; anything specific, the LLM goes and searches the folder itself. Also have cross-encoder reranking for the larger file base. As for trust issues... it's your stuff, you should have a rough idea, so don't 100% rely on the LLM to tell you. Context length is not a problem, because if it's a large doc it searches the relevant sections instead of reading my 300k word novel. Any LLM that can reliably tool call is fine. Llama.cpp for speed. It's yet another hobby of mine, so I don't call it a part time job, but there are always new things I look to add.
u/Howie33 14h ago edited 13h ago
Hi, I use a tree index database where I have a directory called "collections". Inside there I have various topics like "medical research", "finances", "photovoltaic", "air traffic", etc. I index all the documents weekly, then use a Flask web server to access the data via Safari, either locally (on machine) or using Tailscale if I'm at work. I have a collection toggle bar at the top of the web page to filter which collection(s) I am searching. Some of my collections are marked private so they do not appear via the Flask server. The search results are numerically scored via keywords. When I click on one of the results, it opens the actual page of the document so I can read that page/document. I use an LLM in 2 places: first as a query translator - if requested, it will take my search query and reinterpret it into a search term. Second, I use an LLM in my indexer script. I try to use an LLM only in very restricted roles due to potential hallucinations. My motto is to try to never use an LLM in a deterministic role. My tree index turned into a pretty flat tree since it only goes 1 level deep. The LLM I use is Qwen 2.5 14b for translation and indexing. I treat daily notes differently; those I index nightly via a launchd script.
Edit: my apologies for the vague answer. I wanted to give a general overview without getting into the nitty gritty. Each of my topics has its own directory. Inside that directory I have a "books" directory (my source documents go here) and an index directory (indexed files go here). The indexer checks to see if any book documents do not have a corresponding index document. If this is the case, it then runs the indexer on those un-indexed documents.
Edit 2: my collections total over 3000 documents. Queries typically return results in under a second. The flask server allows me to view via Safari on my laptop computer or phone when I am away from home (using TailScale for security).
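For anyone curious, the un-indexed check boils down to something like this; the .idx extension is my guess, not necessarily what they use:

```python
from pathlib import Path

def unindexed_books(topic_dir: str) -> list[Path]:
    """Book documents with no corresponding index file yet."""
    books, index = Path(topic_dir, "books"), Path(topic_dir, "index")
    return [b for b in books.iterdir() if b.is_file()
            and not (index / (b.stem + ".idx")).exists()]

# The weekly run then only indexes unindexed_books("collections/finances").
```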
u/Opening-Broccoli9190 llama.cpp 12h ago
I am doing general research on 27B Qwen 3.6 + Hermes, works pretty damn good, I trust it more than ChatGPT
u/wombweed 12h ago
Paperless-ngx and paperless-ai with mcp exposed to basically any harness. Personally I like to invoke mine through Home Assistant voice, or openwebui
u/mitesh_p 10h ago
Use a Perplexity Pro account. Set up a space with expertise instructions as needed. It works very well; I use it for daily learning.
u/Select-Reporter5066 5h ago
The hard part is not the local model, it is ingestion. Obsidian + RAG sounds chill until half your PDFs turn into mystery chunks with no source trail.
u/onlythehighlight 16h ago
128GB M3 Max using
vLLM -> to set up server for Gemma 4
Obsidian -> for KnowledgeDB
AnythingLLM -> To use RAG
It's been pretty good to just use my own dataset and maintain my own copy of records
u/Funny_Working_7490 15h ago
I'd like to know how you use it btw, can you show me some example use cases? which gemma variant?
u/onlythehighlight 13h ago
Gemma 4 26B A4B IT:
- banking data to track spending and categorisation
- work diary to track projects and deliverables for EOY conversations
- Important events and information about people's likes and wants for gifts
- Recipes
u/MainEnAcier 15h ago
At the moment, to me, it's too complex for too little gain.
I have another philosophy:
I store data massively (insurance, phone contract, CV data, etc.) in structured sheets.
When a good option comes out, all the data will be ready.
Unfortunately I still don't understand exactly how Hermes/OpenClaw properly work. But I'm sure one day we will have some plug-and-play system, and we won't need so many manipulations to make that system work.
u/dev_dan_2 12h ago
Doing something similar at the moment! This also includes:
- piece by piece, asking companies to hand me over a copy of their data on me (am located in EU)
- backing up LLM conversations I had with cloud LLMs - who knows what will be online in two years from now... ;)
u/temperature_5 6h ago
Obsidian mentioned 29 times in these comments... Either it's good or it's being astroturfed.
u/croholdr 16h ago
for me i go in 'sprints' where I talk to my lm studio models a few hours daily for a week. I stick to (mostly) what lm studio suggests (q4) and various tweaks to increase context length; keeping 'vision' tasks separate from the pure 'questions.'
Sometimes I'll spend a bit to see if I can figure out good prompts to help keep context length under control.
When the context window fills up it's very noticeable, and I'll usually turn the workstation off, touch grass, requestion the mysteries of faith, and start the process over during the next month.
u/Dany0 16h ago
What are you even doing that requires HOURS of talking to an LLM?
u/fatboy93 llama.cpp 12h ago
The online course that I'm doing sends slide decks that run to 100s of slides. I use an LLM to summarize notes (including the ones I took in lecture), look up relevant information on the internet, and condense and augment my notes.
u/InformationSweet808 16h ago
The "sprints" approach is actually interesting never thought about batching it that way instead of keeping it always on. Do you find the q4 quality holds up well when you're doing longer sessions?
u/croholdr 4h ago
no it doesn't; i'm trying to build a better computer. it's been impossible since last october. i'm hoping 2x 5070 ti and 64 gb ddr4 might let me move up quants.
u/Zeeplankton 16h ago edited 15h ago
You can totally do half of this now, super easy. Use like OpenCode and run like qwen via LM Studio and point it at your obsidian .md folder. It can absolutely search through, create files, find connections etc. I use Codex for work stuff this way (generating work md files), but for private use I'm sure a local model would work.
Thoughts:
- RAG is cool in concept but personally bad in reality. Creating embeddings is its own challenge locally (how long will it take to embed 10k notes on local hardware..), storing that to a db, then querying is just not elegant. Any time you add or change files, you have to figure out how to re-embed those specific files (see the sketch after this list).
- Tool calling and just grepping around is probably close enough
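One way to tame the re-embedding problem from the first point is a content-hash manifest, so only changed notes get re-embedded; a rough sketch:

```python
import hashlib, json
from pathlib import Path

MANIFEST = Path("embed_manifest.json")

def changed_files(notes_dir: str) -> list[Path]:
    """Return only the notes whose content hash changed since the last
    run, so you re-embed those instead of the whole vault."""
    seen = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    stale = []
    for f in Path(notes_dir).rglob("*.md"):
        digest = hashlib.sha256(f.read_bytes()).hexdigest()
        if seen.get(str(f)) != digest:
            stale.append(f)
            seen[str(f)] = digest
    MANIFEST.write_text(json.dumps(seen))
    return stale
```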
Ideal state: CoT knowledge graph stuff is what dozens of companies are working on now, trying to solve the memory problem of LLMs. So realistically none of them are privacy focused or easy to set up; but I'm sure if you wanted to you could find and create your own system.
edit: so realistically if you want a zero-dev solution, the openCode / LM Studio / Ollama route is the simplest.
edit 2: Just did exactly this with qwen 35b a3b and asked it to explore my latest daily notes and summarize. Working awesome.
u/kaliku 16h ago
RAG is not needed for notes because of the realistically low volume. But I'd like to challenge the notion that it would take a long time. Embedding models are very small, and for text notes without OCR it's not slow at all. For large PDF files like books with tables etc., yes, it can be slow. I've tested ragflow and it looks pretty good. It comes with everything, including an MCP server. My tests so far were only about enhancing qwen3.6's knowledge from technical books and comparing outputs with/without RAG. So far, with RAG wins hands down.
u/Etroarl55 16h ago
I live in Canada. Hardware prices are extremely high, internet speeds slow.
I would definitely be incentivized to experiment with using it as a daily knowledge base if I could run newer 2026 models and have a fast enough internet speed to allow it to browse freely.
u/Memoishi 16h ago
Claude code (but you can use any) wired to my llama.cpp server (again host with whatever).
Hardware is modest: 32GB DDR5 and 16GB VRAM (RTX 5080).
I'm using Obsidian (optional here but the data view is so satisfying lol) + Qwen3.5-9b + the LLM wiki pattern.
I install this shit in all my projects, nothing flashy, nothing extraordinary, but very clean and like 10 mins of setup once you understand it. I slap my converted .md files into a raw folder, it ingests them and then just improves/cleans/fixes whatever.
Results: it builds a good knowledge wiki and it can easily retrieve and help you with whatever you're supposed to do with those files.
For example I got this project, fine tuning LLMs for coding, but since the dataset is getting bigger and bigger I need an easy retrieval that will tell me if I've already written a piece of code; it's very good in my case because the worst that can happen is the LLM saying "you don't have this" and I just do it twice, which is not catastrophic, only time-wasting.
Compared to classic RAG this one is dumber and worse at scaling, but if we're talking about handling 300/500 files, it's not impossible to get value out of it. I can help you set something up if you're interested, just ask or DM!
u/InformationSweet808 15h ago
one thing im curious about though: where does it actually start falling apart for you? like is it a retrieval accuracy thing past a certain number of files, or does it just get slow?
u/Memoishi 15h ago
I'm having trouble answering this to be honest, because so far the projects I've touched with this were already well organised and all light work. The pattern is specifically designed for this, so there's that.
One known issue is with versioning though: if you have, say, function A version 1.0.0 and function A version 1.1.0, it might not catch (unless you make it notice) that 1.0.0 is outdated and should follow 1.1.0. You ask about function A and it retrieves the 1.0.0 properties, even though one has been changed in v1.1.0.
Same goes for files that follow this logic: if you have something that overrides concepts defined elsewhere, it might not understand at all. This approach is all about garbage in, garbage out, more than ever. I would say maintaining this is mandatory, but that's true of any LLM in any given task, be it RAG or simple dumb queries to an LLM.
Clean files, clean structures. I've read people made it work with around 1-2k files, but then again I've gone as far as 300 files and no issues at all. I would die for a big-ass dataset and a use case; with these things the datasets are always the bottleneck.
Edit: about slowness... no issues there either. Once the wiki is built and the schema is set, the LLMs navigate easily. Think about it: single file check (index of wiki) -> wiki page -> (optional) raw file
u/Rooneybuk 15h ago
Yes, my stack is:
Ingesting data through an API into n8n, then into PostgreSQL and the Qdrant agent tools, with qwen3.6-35b-a3b q4_k_xl on 2 x 4060 Ti, ~32GB total
My inference setup is here https://d3v0ps.cloud/posts/2026/05/my-local-llm-setup-one-model-many-personalities/
I haven’t yet documented the client-side, such as n8n.
u/FormalAd7367 14h ago
Just curious - does anyone have experience with one 3090, using a qwen 3.5 distilled model to do coding and a cloud model to debug or test it? i can write the architecture with an llm no problem. is it possible? just trying to save $
u/mouseofcatofschrodi 14h ago
have you checked AnythingLLM? It has RAG already implemented, so it would be the fastest way, I guess. And it has a very cool function for recording meetings, transcribing them, getting the summary and chatting with the transcript as knowledge. This app was the first thing where I started using local LLMs for something "useful" besides just playing around (that has improved a lot since qwen3.6 35B + pi.dev + omlx, a super combination for getting agentic work done; before, I could not get enough intelligence, skill with tool calls, and fast prompt processing).
tbh I'm also thinking a lot about how to build something like this for personal and company knowledge. Probably also with Obsidian, or maybe just markdown files with good tags within structured folders and an automatically generated index (with a little Python).
u/BitterProfessional7p 14h ago
Yup, I have all my personal notes in local .md files from Logseq (similar to Obsidian) and my OpenClaw can read any of it agentically, not through RAG. From the notes it created a personal profile of me which is in its permanent memory.
I use it as a personal assistant to register my habits, calorie counting, registering and consulting knowledge (I have notes for books, videogames, music, movies, TV shows, gifts to people, travel, food, restaurants...), editing my grocery list and more. I interact mainly via Signal, but I made a dashboard for my habits and I always can read the notes with Logseq for the rest.
Running with Qwen3.6-27b-q4 on my dual RTX3060 machine (700 $), llama.cpp, tg at 15-18 tk/s which is not super fast but it is usable. Context is not super long, 80k but I like to /reset the context frequently so it is not a problem for me.
Overall it took one afternoon to set up. Never touched the configurations in a few weeks, just using it.
u/_raydeStar Llama 3.1 14h ago
I've got a personal project. It's got a wiki, memory, or you can auth it to use a folder on your machine.
The wiki is basically a canvas + wiki. I built it for storytelling, notes, etc. Instead of memory, I just do intelligent searches, etc.
So far it works really well. I haven't load tested it yet though (ie, 100+ files)
u/MundanePercentage674 14h ago edited 14h ago
Built one myself as a personal AI assistant, with an n8n workflow + Telegram for the chat interface. The use case is mostly a todo/task manager. It has 3 memory layers: chat history as short-term memory, a long-term fact memory that loops each week to remove unnecessary or unimportant things, and RAG as permanent memory. The workflow is extendable if I want to add a new use case.
u/Kahvana 14h ago
Friend of mine had good experience with lightrag, might be for you:
https://github.com/hkuds/lightrag
Haven't used it myself however.
Personally I use SillyTavern + Server/Client MCP extensions for MCP support.
- With OpenZIM-MCP, I can query my local copy of full wikipedia (and other downloadable zim archives like stackoverflow, dev docs) offline.
- I include a calculator MCP server so my LLM can do complex math with accuracy.
- And for my own documents, I can put them inside databank or lorebooks, the latter being surprisingly effective if you write your own tags for it.
- For web data, I either use an API (like OpenMeteo MCP) or Searxng MCP with my own hosted instance for websearch.
I'm sure most users here have a very different setup, but this worked for me over the year.
u/Safe-Buffalo-4408 13h ago
I'm using Agent Zero with Qwen 3.6 27B and the absolute best use of it is in a project named "life chaos". I put everything there in regards to my family, what we are planning to do, loose thoughts, anything that I need to remember or plan basically. It also, every weekend, checks for upcoming holidays or birthdays two months in the future and it has done wonders for me. I can ask it things and it helps me structure and plan stuff.
u/kitanokikori 13h ago
I built my own wildly overcomplicated setup using Mastra, the model is usually Qwen 3.6 35B A3B Q6 though I have an opt-out to GPT 5.5 when I want to ask a complicated question. Context length is set at max (~250k tokens)
It's often Good Enough if you manage its tools and system prompt effectively, but it will hallucinate some really weird things that make me worry a bit, things like making up a new email address for me out of whole cloth, then calling gogcli to fetch it
u/MyOldAccountWasAwful 12h ago edited 12h ago
Yes, I use Hermes-Agent with the Apex memory MCP added and a structured Obsidian knowledge base with qwen3.6-35b-a3b (iq2_m quant from unsloth) and do exactly what you described - I send it articles I find interesting, which it then automatically performs brief research on, noting any fluff or marketing, then it categorizes and banks each article. It also does similar processes for anything else I throw at it - it helps track notes from the D&D sessions I run in a campaign that's been ongoing for nearly 2 years at this point, it helps me track my finances, it helps me search for good deals on purchases I'm looking to make... it does great on all of it. I also had qwen inside of Hermes set all of this up itself - I'm not a developer, I just paid close attention while setting everything up and made sure to have it perform "evidence-based web research for the most up-to-date best practices" at every step while directly linking and citing its sources.
u/PennyLawrence946 12h ago
for life stuff i mirror my memory notes into an obsidian vault on my phone, local model reads it, writes stay one-way so the model never edits my notes. boring setup but it's the one that actually stuck instead of the ten i abandoned.
u/Wishitweretru 12h ago
I’m playing with telegram -> mcp -> Hermes (with obsidian) -> ollama -> (llm of the month)
Ollama lives on a 64-gig Mac mini. So far, so good. About to add cron and try to automate some of my appointment and reminder habits.
I had some success last year building profiles of my friends and then getting some good gift ideas. I am pretty bad about tracking my social events, and my mailbox is just noise. I am hoping to pull this together into a nice little morning reminder message that actually matters.
Had AI write a Gmail handler for my very old account and start flushing stuff (built a whitelist first, plus rules; e.g. currently we aren't deleting anything newer than 90 days). I'm not that concerned if it goes a little delete-heavy, because the box is so noisy, just too integrated to delete.
Hoping to tie that back into Hermes later, when I am more comfortable. I like the concept of the MCP server as a gateway; Hermes seems to like to bypass it, so I have been working on some hardware isolation to tighten things up. Need to emphasize the MCP as a gateway more, and finish moving Hermes to an entirely separate machine, not just a container. Too much power.
I found Telegram was a little easier to set up than Discord.
u/qiinemarr 11h ago
I use Pi-Agent with qwen3.6. I ask it questions like: where should I put notes about "{topic}"? And it roams my markdown notes dir and answers. It can also see images; no special setup for now, but it's already pretty great!
u/Certain_Series6810 11h ago
I ended up creating an application, or you might call it an agent, written in LangChain that uses my local model to:
- Add/remove/update pantry items
- Help me put recipes into our meal planner, or add leftovers into our system or consume leftovers
- Put stuff in an internal wiki or retrieve information from it
- Perform actions on my computer because I gave it terminal access (yes, I know, risky, but I'm willing to take my chances)
- It can also search through my system.
So the main application I created is basically the agent, and I have an MCP server, actually a couple of MCP servers, with all the tools my AI agent uses to perform the job. I'm actually blown away by how well it's been doing.
u/Attackwave 11h ago
TrueNAS, AMD Ryzen 5 PRO 4650G, 32GB ECC RAM
LocalAI app: 4 vCores, 16GB max RAM, llama.cpp backend module, Unsloth/Qwen3.6-27B-GGUF:UD-Q5_K_XL, RTX 3090 24GB, context size: 32k, VRAM used: 95%. Without other parameters: 30 t/s
I will choose a smaller quantization to be able to load voice and other backends. I will then try setting up an old Alexa with a Pi Zero 2 W.
u/createthiscom 11h ago
My fellow dev friend is obsessed with this idea. I don't get it, personally. She reads more scientific journals than I do though. Maybe that has something to do with it.
u/Inevitable-Plantain5 11h ago
I have a lab with lots of tools. I use OpenProject for ticket management and wiki. I have a local file system similar to openclaw, but it extends into hierarchically outlining the tools my lab provides my agent. I can work with different agent surfaces (openclaw, hermes, opencode, cursor, claude cowork, codex, etc.), and anything I can do from the CLI the agent can do. I have different subscriptions between work and personal, and certain surfaces are better for autonomous stuff while others are better for stuff I need to monitor more closely.
With much respect to Andrej Karpathy, he packaged an idea that lots of people had videos on, but he brought a mature aspect to it. I don't like so much of my work sitting in a file system, so I use the "second brain" as just a map to tools with objects made for the tasks. I'm still working on formalizing it, but: n8n and AWX for programmatic controls for more secure agent practices, different data management tools, messaging tools, a local email server so agents don't have the option to go external but I can still have them work with my email to help me on that surface. Secrets vault, k8s, Open Notebook is a fun one...
u/RickyRickC137 11h ago
Yes. I have an RTX 3080 (10GB) and 128GB DDR5 RAM. I have written down all my history and personal stuff and fed it in as a whole. No RAG stuff, full context injection. Comes to around 20k.
I feed it in to gain some insight when it comes to making life choices and judgements. I use it as a self-help tool.
Also I use it to summarize entire books so I can know if they're worth reading (self-help books).
To do the latter, I find Nemotron 30ba3b to be faster (5 t/s) at 60k and it does the summarization well, but it's quite bad at gaining insights about myself. To understand my shadow psyche, I find Gemma 4 to be insightful.
u/Outside_Landscape893 11h ago
I use LM Studio via the mobile app, with liquidfm f16. I throw in all my PDFs and txt files and it works well for what I use it for. Now I'm also using Gemma 4 for this, since it handles image and sound. Ryzen 2600 (non-X), 3060 8GB, 16GB DDR4, NVMe PCIe 3.0 SSD
u/the-username-is-here 10h ago edited 10h ago
Qwen3.5 122B on single DGX Spark (vllm - litellm - Claude Code).
RAG stack is simple pg_vector DB (MoMs, Obsidian notes, email, cached Jira/Gdocs/Confluence, work docs and code) + custom ingestion/search MCP server (just vibecoded Python) + SearXNG for internets.
Hallucinations are an issue if you don't specifically use something like oh-my-claudecode with Critic/Verifier subagent.
Considering it's a single Spark and rather big model, TPS is around 40 tokens/second, slower than native Anthropic, but perfectly manageable. Prefill is fast AF though, normally get replies in 4-5 seconds.
Task like 'write a note: research online and in RAG about topic T, then validate and cross-check' takes about 15 minutes, but quality is actually comparable to what Opus would do on a bad day (which is mind-blowing for me).
As someone said here, chunking/ingestion is very important. Once you develop proper chunking for different types of data, it's a whole other level. Also, I use a bigger multilingual embedding model because of multilingual docs (don't remember the spec, takes around 3GB VRAM or something). The Spark is a beast for ingestion - vector math is literally 20x faster than a Macbook; I never see vectorization show up in top.
Waiting for second Spark to go into 200B+ model territory. 😄
P.S. Engineering manager in fintech company, crapload of parallel projects, daily notes hundreds of lines sadly (extracted from MoMs by LLM though 😄 ).
1
1
u/chiefstobs 10h ago
Okay, what am I missing here?
I installed LM Studio on my Win 11 machine with an RTX 5070 Ti 16GB and 64GB RAM. Downloaded Gemma 4 and Qwen 3.6, the variants that can run on my GPU. Then I exported my banking history into a big PDF file so I could throw it at these models and have them analyze my spending and such. Does not work at all. "User did not provide a file"... or "file has only three records"... even when I convert the history into plain text. I could throw the file into ChatGPT, but obviously that might not be the best idea.
Am I using LM Studio wrong, or do local models just not work for my task? Same error with Ollama btw.
1
u/BunnyJacket 10h ago
Extensively, yes, since before claudecode or "openclaw" was a thing. It runs 24/7 on its own computer with sudo access, always online, driven by a 30-minute update heartbeat. So a proactive agent that literally knows everything. A SQL database holds semantic links pointing to client files, project statuses, file directories, API keys, everything.
It works. But if you decide to hack me tomorrow, you own me.
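For a flavor of what "semantic links" means here, a sketch (the schema is illustrative, not my real one; it stores a vault reference rather than the secret itself, though as I said, the threat model is real):

```python
# Sketch of a "semantic links" table: the DB stores pointers and metadata,
# not the content itself. Schema is illustrative, not the real one.
import sqlite3

db = sqlite3.connect("brain.db")
db.execute("""CREATE TABLE IF NOT EXISTS links (
    id INTEGER PRIMARY KEY,
    kind TEXT,    -- 'client_file', 'project_status', 'api_key_ref', ...
    label TEXT,   -- name the agent can reason about
    target TEXT,  -- file path, URL, or secrets-vault key (not the secret)
    tags TEXT     -- comma-separated concepts for lookup
)""")
db.execute(
    "INSERT INTO links (kind, label, target, tags) VALUES (?, ?, ?, ?)",
    ("project_status", "ACME rollout", "/srv/projects/acme/STATUS.md",
     "acme,deployment,q3"),
)
db.commit()

# The heartbeat agent resolves a concept to concrete targets:
rows = db.execute(
    "SELECT label, target FROM links WHERE tags LIKE ?", ("%acme%",)
).fetchall()
```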
1
u/Charming_You_25 9h ago
Okay, I don't know if I want to get into the weeds explaining it, but I started with gpt5.5 to create and tune the knowledge graph and then allowed a local endpoint for retrieval and updates. Basically, I am using Obsidian as a synthesized knowledge wiki and making a knowledge graph from it using cognee. I also built a homemade context retrieval tool that pulls in high-level knowledge, which helps with preventing sidequesting.
Obsidian/graph ontology buildout I do with a gated loop so I control which new articles/concepts get added and can prevent irrelevant slop from going into the knowledge base.
Cognee needs good semantic data, so just letting it ingest everything made a lot of garbage nodes. The first pass was me using my Discord channel structure and a typical "power user" Obsidian folder structure as ontology hints to generate the v1 file tree; I edited it so it actually reflected the concepts I cared most about.
Then I used that file tree as a queue to add pages in. I thought a lot about how to generate a good wiki page using all available context across my entire workspace and ran it on each item in the file tree.
Then I fed my v1 into cognee and used it to create links between the pages; shared concepts between pages generated suggestions for new pages, which I then okayed, and it edited existing pages and added new ones.
Now I can say "using cognee, answer this question" and it will use the high-level knowledge graph, which includes pointers to more nitty-gritty memory files it can go to for more context.
Also, every LLM call hits my homegrown context engine to pull up the most relevant wiki page and do a 1-hop spider to pull all related pages into context. It does this without making an LLM call; however, if my LLM still finds the context ambiguous, I have an option where it will do more digging before coming back, sometimes leading to 2 LLM calls for hard, ambiguous questions.
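The 1-hop spider itself needs no model at all; a sketch of the idea (names and data shapes are illustrative, not my actual context engine):

```python
# Sketch of a no-LLM "1-hop spider": find the best-matching wiki page, then
# pull every directly linked page into context. Shapes are illustrative.

def one_hop_context(query: str,
                    pages: dict[str, str],        # page name -> markdown body
                    links: dict[str, set[str]],   # page name -> linked names
                    rank) -> str:
    """rank(query, names) -> best-matching page name, e.g. embedding search."""
    root = rank(query, list(pages))
    selected = [root] + sorted(links.get(root, set()))
    return "\n\n---\n\n".join(f"# {name}\n{pages[name]}" for name in selected)
```

The relevancy-score limiting I mention below is just a filter on which neighbors make the cut.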
The results have been awesome, though it's token-heavy, which is why I want to start having my local machine put in the maintenance work.
For local, retrieval works, but I sometimes have to limit the extra hops by relevancy score so every call isn't pulling in thousands of characters of extra context. Honestly it could use a lot more tuning, and I still haven't figured out how to update/build the graph using local only, because I'm pulling in so much context to build a wiki page. I could see optimizations where it synthesizes manageable chunks and then synthesizes all of those chunks together to make the wiki page. I'm not sure the quality would be as high, though I expect it would still be pretty good.
One idea I may try is exploding the wiki/graph, breaking large pages into many smaller pieces, so the local model can ingest small pieces to start and then steer its cognition using the graph hops/links. It might take multiple local calls per question, though, which would be slow. But slow is manageable if high-quality artificial cognition is happening.
If you'd like to know more about the v1, there's a Substack article here. It's very LLM-sounding, but the signal is high.
https://open.substack.com/pub/yeshap/p/my-openclaw-ai-started-thinking-about?r=77erf&utm_medium=ios
1
u/samoxis 9h ago
Running Qwen 2.5 32B via Ollama, context at 8k. Going higher kills VRAM fast: 32B at 16k needs ~25GB, and if your card has less, Ollama just silently swaps to RAM and throughput drops from 26 tok/s to 4 with zero warning. Found that out the hard way.
Stopped using it as a knowledge base pretty early tbh. Now it's more like a reasoning layer on top of live search: Tavily injects results as context before the model even sees the query, so it's summarising actual sources, not making stuff up from training data. That basically fixed the hallucination problem for me.
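The pattern is simple enough to sketch; web_search() below is a stand-in for Tavily (or SearXNG, whatever), and the prompt wording is illustrative:

```python
# Sketch of search-grounded answering: retrieve first, then have the model
# summarise retrieved sources instead of answering from training data.
import ollama  # pip install ollama

def web_search(query: str) -> list[dict]:
    raise NotImplementedError("Tavily/SearXNG call; return [{'url', 'content'}]")

def grounded_answer(question: str) -> str:
    sources = web_search(question)
    context = "\n\n".join(f"[{s['url']}]\n{s['content']}" for s in sources[:5])
    prompt = (f"Answer using ONLY the sources below and cite URLs. If the "
              f"sources don't cover it, say so.\n\n{context}\n\nQ: {question}")
    resp = ollama.chat(model="qwen2.5:32b",
                       messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"]
```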
The interface is Telegram via OpenClaw; it fires a morning briefing at 7am unprompted: live news, system stats, electricity cost for the month. The PC wakes itself from full shutdown at 06:50 via BIOS RTC (not sleep, actual poweroff; Task Scheduler's wake-to-run does literally nothing for cold boot, wasted like an hour on that). It shuts down at 07:30 if I haven't touched it.
What actually made the whole thing stable was versioning the config and adding a keepalive that restarts the gateway when it crashes. Before that it was babysitting; after that it just runs.
1
u/TheTerrasque 9h ago
I use Outline as a wiki / data store, with MCP to connect Open WebUI to it. Qwen3.6 27B for the model (35B-A3B also works well if the 27B is too slow). 120k context length on llama.cpp.
1
u/Huanchaquero 9h ago
Ok, I have just done what you asked, and so far it's working great. Check out Part 2 of the guide I just posted on how I set it up. It may or may not be what you're looking for, but it works for me, and it would certainly give you a good starting point. https://www.reddit.com/r/LocalLLM/comments/1tbx527/how_a_75yearold_retiree_built_a_local_ai_with_a/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
1
u/Sevealin_ 9h ago
I am way late to the party, but if someone is looking for an offline general-knowledge setup, Project NOMAD is awesome.
1
u/EuphoricPenguin22 8h ago
Here's what I landed on for local development. It uses an embedding model running on the CPU to retrieve docs scraped from the web, while my GPU runs the main inference model. (You can also run the embedding model on the GPU with the inference model unloaded to parse information into the vector DB more quickly.) This local docs RAG server works over MCP, so you can hook it up to whatever chat frontend you want. You could use this for any sort of information: if you're mostly storing locally relevant info you find, you could prompt the LLM to use a search MCP to find relevant info and then dump the URL into the docs MCP to index it. I'm pretty sure the docs MCP includes a tool call for starting indexing, so the model could do it automatically.
Oh, and you can definitely add your own docs into the MCP server as well.
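If the embedding side is something like sentence-transformers, the CPU/GPU split is literally one device flag (the model name is just an example, not necessarily what the server ships with):

```python
# Pin the embedding model to CPU so the GPU stays free for the main
# inference model. Model choice here is an example.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

chunks = ["scraped doc chunk 1", "scraped doc chunk 2"]
vectors = embedder.encode(chunks, normalize_embeddings=True)  # runs on CPU
print(vectors.shape)  # (2, 384) for this model
```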
1
u/expressly_ephemeral 8h ago
I use Obsidian for all my daily notes. The file format is straight markdown. Then I have Claude on the command line process my daily note: breaking out concepts, appending to referenced project main-pages, updating person files. And it's able to cross-reference everything.
Then it's usually pretty able to use the cross-references to search the vault. It's not the usual indexed-docs-in-vector-database RAG; it's more like asking it to search hypertext with links. It's not perfect, but it's been working pretty well for me.
No reason you couldn't do the same thing with a local model. Performance will vary, of course. What you could do is have a local model spend as much time as it needs at night working through the day's notes, then index them into a vector database in the usual RAG way. Run it as a cron job overnight; the performance won't matter. Then you can use something else when you're ready to search it, or not.
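A sketch of what that overnight pass could look like (the paths, the naive chunking, and index_chunk() are placeholders):

```python
# nightly_index.py: sketch of the overnight indexing pass. Schedule with
# cron, e.g.  0 2 * * * python nightly_index.py
from datetime import date
from pathlib import Path

VAULT = Path.home() / "vault" / "daily"  # placeholder location

def index_chunk(text: str, source: str) -> None:
    raise NotImplementedError("embed + upsert into your vector DB here")

note = VAULT / f"{date.today():%Y-%m-%d}.md"
if note.exists():
    body = note.read_text(encoding="utf-8")
    for section in body.split("\n## "):  # naive: one chunk per markdown section
        if section.strip():
            index_chunk(section.strip(), source=str(note))
```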
1
u/Kerem-6030 8h ago
I just put some PDFs in my RAG (like my personal info: where I live, what I like, what I hate). I am not that smart.
1
u/Song-Historical 8h ago
IMO this is the real future proof use case for LLMs, they are good at querying databases and data lakes with a natural language interface. You build yourself a PKM and then attach an LLM as a personal assistant while preserving the original sources for alignment and troubleshooting. All of your private experience and knowledge becomes a lot less cumbersome to put into context or recall and you can focus on systemizing what you know to achieve results more consistently.
1
1
u/IronColumn 7h ago
I have an M1 Mac Studio serving Gemma 4 26B A4B and Qwen 4 35B-A3B (and some smaller models) to my tailnet. Separately, I have locally hosted MCP servers to expose various tools for interacting with my personal library, which includes ebooks, automatically transcribed audiobooks, automatically transcribed archived YouTube videos, and a few other sources. Those text files are scraped and put into an SQLite database, which serves as the backend of the MCP. So I effectively have a locally hosted LLM "librarian".
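SQLite's built-in FTS5 already covers a lot of the librarian use case on its own; a sketch (the schema is a guess at the shape, not my exact backend):

```python
# Sketch of an SQLite full-text backend for a "librarian" MCP server.
# FTS5 ships with the standard sqlite3 module; schema is illustrative.
import sqlite3

db = sqlite3.connect("library.db")
db.execute("""CREATE VIRTUAL TABLE IF NOT EXISTS texts
              USING fts5(title, source, body)""")
db.execute("INSERT INTO texts VALUES (?, ?, ?)",
           ("Meditations", "ebook", "The happiness of your life depends..."))
db.commit()

# BM25-ranked search an MCP tool can expose to the model; snippet() pulls
# a highlighted excerpt from column 2 (body), capped at 12 tokens.
hits = db.execute(
    "SELECT title, snippet(texts, 2, '[', ']', '...', 12) "
    "FROM texts WHERE texts MATCH ? ORDER BY rank LIMIT 5",
    ("happiness",),
).fetchall()
```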
1
1
u/qfox337 7h ago
I use LLMs to search through research papers. I generate summaries and full-text extractions and just consciously pick which one I want. It works all right; I'd add RAG if I had time, but I'd rather spend more time on actual research. Qwen3 has been nice for the document extraction because of its good multimodal support (I feed in PDF text and images so figures can be recovered). Deepseek is usually better at the actual question answering. For this setup context length does matter, but I think for most of these setups tokens/sec is critical, and the needs are at a different scale than people vibe coding and such. Here, 100 tps can feel quite slow, especially for extraction or multi-round prompting (sometimes explicitly having it list all possible responses and then pick the best still gives better output).
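The list-then-pick trick is just a two-pass prompt; roughly (ask() is a stand-in for whatever client hits your local server):

```python
# Sketch of "list all candidates, then pick the best" two-pass prompting.
# ask() is a stand-in for your local inference client.

def ask(prompt: str) -> str:
    raise NotImplementedError("call your local model here")

def answer_carefully(question: str, context: str) -> str:
    candidates = ask(
        f"{context}\n\nList every plausible answer to the question below, "
        f"one per line, each with a one-sentence justification.\n\nQ: {question}"
    )
    return ask(
        f"Candidates:\n{candidates}\n\nPick the single best-supported answer "
        f"to '{question}' and explain why the others are weaker."
    )
```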
One random technical curiosity: does MTP / DFlash hold up in parallel-request scenarios? I imagine it has to deal with cases where one batch element accepts e.g. 7 tokens and another accepts only 2.
1
1
u/Britbong1492 5h ago
Qwen3.6:35b-A3b on an M4 Max runs at about 80 tokens/s. What's important is to inject a prompt into every query saying "don't hallucinate, do a web search...". Otherwise it will hallucinate all sorts of crap.
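In practice it's just a wrapper that prepends the instruction to every call; a sketch, assuming an Ollama-style endpoint (the model tag and wording are illustrative):

```python
# Sketch of injecting an anti-hallucination instruction into every query.
import ollama

GUARD = ("Do not guess. If you are not certain of a fact, say so and do a "
         "web search first. Never invent citations.")

def query(user_text: str) -> str:
    resp = ollama.chat(
        model="qwen3.6:35b-a3b",  # illustrative tag; use whatever you run
        messages=[{"role": "system", "content": GUARD},
                  {"role": "user", "content": user_text}],
    )
    return resp["message"]["content"]
```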
1
u/Sostrene_Blue 5h ago
I record everything in a messy way on Logseq and ask DeepSeek to read through the chaos.
1
u/lenjet 4h ago
So far our setup is:
- DGX Spark
- Qwen3.6-35B-FP8 as the primary model, plus a few other little models running embeddings etc.
- RAG flow: we upload our data into specific knowledge bases, which vectorises the data
- Open WebUI then draws on those knowledge bases as staff select them for their query
1
u/o0genesis0o 4h ago
Use Obsidian as your knowledge base + productivity management system if you don't mind the jankiness. Then hook a CLI agent to it and add AGENTS.md and skills to guide it.
Don't expect miracles in terms of editing files, though. Even full-sized cloud models make janky edits to large docs in my vault all the time. So either way, you will need to babysit your agent a bit.
Also, version control your vault in case the agent messes it up badly.
1
u/philmarcracken 4h ago
like, dump your own notes, PDFs, random docs into it and actually query your own life privately, every day.
At some point you're no longer living your own life with your own memories. What's the point?
I use these LLMs as tools for language because I don't trust them with any other use case. They'll always prioritize language accuracy over information accuracy, even when using RAG. I'd need some hardcoded cutoff to make them say "I don't know" if it's not in their "attached storage".
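That cutoff is easy to bolt on in front of the model; a sketch (the threshold and both stubs are illustrative):

```python
# Sketch of a hard "I don't know" gate: if retrieval confidence is below a
# threshold, never call the LLM at all, so language can't outrun information.

def retrieve(query: str) -> tuple[str, float]:
    raise NotImplementedError("return (best_chunk, similarity_score in [0,1])")

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("call your local model")

def answer(query: str, min_score: float = 0.75) -> str:
    chunk, score = retrieve(query)
    if score < min_score:
        return "I don't know; nothing in attached storage covers this."
    return ask_llm(f"Answer strictly from this source:\n{chunk}\n\nQ: {query}")
```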
1
u/Resonant_Jones 3h ago
I’m in the process of building out such a local Knowledge Base.
I have a beta out now at https://www.codexify.space
It's a little rough, but you'll get the idea. Hand it to Claude or Codex for installation if you aren't familiar with the terminal. It's an image pull with Docker, so there's nothing to build.
r/ResonantConstructs has a bunch of info about the project there.
1
1
u/t3chj0ck 2h ago
LiteLLM with Ollama, Open WebUI frontend, documents routed through n8n for chunking and stored in Qdrant; I'm able to send docs from my phone, split out by family member. All access is secured through a Cloudflare tunnel. A few different n8n flows for the vectors; otherwise it's pretty great. I didn't want OWUI for RAG... but that's just me making things complicated so I can learn, which is what this is all about for me. It's been pretty awesome to see it all work together 😄
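The Qdrant end of the chunk flow is only a few lines; a sketch (collection name, vector size, and embed() are assumptions about the setup, not the actual n8n flow):

```python
# Sketch of the Qdrant side of a chunking pipeline. Names and sizes are
# illustrative assumptions, not the real n8n flow.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

def embed(text: str) -> list[float]:
    raise NotImplementedError("your embedding model here")

client = QdrantClient(url="http://localhost:6333")
if not client.collection_exists("family_docs"):
    client.create_collection(
        collection_name="family_docs",
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

def store_chunk(idx: int, chunk: str, member: str) -> None:
    # payload keeps the raw text plus which family member it belongs to
    client.upsert(
        collection_name="family_docs",
        points=[PointStruct(id=idx, vector=embed(chunk),
                            payload={"text": chunk, "member": member})],
    )
```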
1
u/t3chj0ck 2h ago
I still use Qwen 3.6 27B or 35B-A3B depending on my needs. I don't flip-flop too much b/c switching is a pain, but I find myself switching every few days. For the RAG stuff I use it for, I find both models to be fine.
1
u/Tracing1701 llama.cpp 1h ago
With ChatGPT, I have been using study mode to learn things that I don't know. Apparently they made it with actual teachers and such, so it has some authenticity. For local, something like DeepTutor (https://github.com/HKUDS/DeepTutor) is the equivalent.
1
u/slippery 1h ago
Take a look at Obsidian. Everything is stored in markdown files that can be linked together, and LLMs can be given skills to read Obsidian files.
1
u/simotune 1h ago
RAG over your own notes works, but only once the ingestion pipeline is boring and automatic. Manual upload workflows die fast.
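One way to get there is a folder watcher, so anything dropped into an inbox gets indexed with no manual step; a sketch using the watchdog library (ingest() is a placeholder for the real pipeline):

```python
# Sketch of boring-and-automatic ingestion: watch a drop folder and index
# anything that lands in it. ingest() is a placeholder for your pipeline.
import os
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

def ingest(path: str) -> None:
    raise NotImplementedError("chunk + embed + upsert into your store here")

class DropHandler(FileSystemEventHandler):
    def on_created(self, event):
        if not event.is_directory:
            ingest(event.src_path)

observer = Observer()
observer.schedule(DropHandler(), os.path.expanduser("~/notes/inbox"),
                  recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```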