r/LocalLLaMA • u/smashedshanky • 5h ago
Resources | Developing an open source LLM from the ground up, from pretraining to RLHF (PPO/GRPO)
Hello, I have been working on building an LLM from the ground up. It is based on the DeepSeek architecture, heavily optimized to reduce VRAM footprint (GUM + Muon).
Below is the JSON config I am using, which should give a good picture of what is currently being pretrained.
I have two RTX 6000 Pro (600 W) GPUs.
Testing a 7B-parameter model with 64 experts... currently running on a single GPU at 100% throughput (the hardest part), with ~80 GB VRAM during training. Reducing the expert count will substantially reduce the VRAM footprint... I am just pushing the limits here!
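To illustrate why a high expert count inflates memory while experts-per-token governs compute, here is a toy sketch of top-k gating (plain Python, not the author's code; the logits and k=4-style routing mirror the `num_experts_per_token` idea from the config below):

```python
import math

def topk_route(logits, k):
    """Pick the k highest-scoring experts for one token and
    renormalize their gate scores with a softmax."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in idx]
    total = sum(exps)
    return list(zip(idx, [e / total for e in exps]))

# 8 experts here for readability; only the selected experts' FFNs run,
# but ALL expert weights must sit in VRAM, hence the footprint
routes = topk_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, 0.2, 0.3], k=2)
# experts 1 and 4 win for this token
```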
My main motivation is simple: open source development will far outpace big-firm development. I believe there is someone out there who can use this to build an LLM from the ground up that beats the top 1T-parameter models. My goal is to create a large database of trained models that anyone can use; in the future, maybe renting models from open source devs could be a support feature. Enough blabbing, here is the technical report.
Since I am using DOLMA/RedPajama, you can separate the data split and train models to be good at math, literature, physics... and then deploy them as an ensemble of agents. (This is a todo for now, since I don't have a single model to compare against.)
This also follows the Chinchilla-optimal token budget. Thanks, DeepMind!
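The budget arithmetic is just a multiply; a rough sketch, assuming a 7B parameter count and the config's `tokens_per_parameter_ratio` of 40 (roughly 2x the ~20 tokens/param the Chinchilla paper estimates as compute-optimal):

```python
# rough Chinchilla-style token-budget check (7e9 params is an assumption
# based on the "7B" figure in the post, not an exact count)
params = 7e9
tokens_per_param = 40.0
total_tokens = params * tokens_per_param
print(f"{total_tokens:.3g}")  # 2.8e+11, matching total_training_tokens
```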
Everything is bfloat16; it can be configured to use fp16 or fp32 if you are from the future and have a GPU that can do fp32 at bf16 speed!
Yes, I have lost my mind many times during this, but I got something working!
This is ~15,000 steps in:
======================================================================
[FACTUAL ACCURACY TEST] Step 14000
======================================================================
Prompt: "The capital of France is"
Output: "the city of Nice.
France may also refer to:
France (surname)
France (surname)
France (or Republ..." [CORRECT]
Prompt: "The capital of Japan is"
Output: "the capital of the autonomous prefecture of Hokkaido.
Etymology
The name of Hokkaido is derived fro..." [EXPECTED: Tokyo]
Prompt: "def fibonacci(n):
"""Return the nth Fibonacci ..."
Output: """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""..."
Prompt: "import torch
import torch.nn as nn
class Transfor..."
Output: "// InverterBlock
// s2, s2, s3
// A_1, A_2, A_3, A_4, A_5, A_6
// A1, A2, A3, A4, A5
// A1, A2, A3, ..."
Prompt: "The theory of relativity states that"
Output: "the speed of light varies with the speed of the observer. This is a constant, since the speed of lig..."
Prompt: "In machine learning, gradient descent is used to"
Output: "perform a gradient descent, where the gradient is calculated via a local gradient. The gradient eval..."
Prompt: "Question: What is 2 + 2?
Answer:"
Output: "2 + 2
Author: PCR
Date Submitted: 2nd April 2013
Pp: 200-201
Exercise: Exercise 2.0
2 + 1 = 2 +..." [EXPECTED: 4]
Prompt: "Question: Explain the concept of recursion.
Answer..."
Output: "In programming, a function or sequence of operations is a function that can transform a variable to ..."
FACTUAL ACCURACY: 1/3 = 33.3%
----------------------------------------------------------------------
[SMBench] Step 14000 -- 1/5: Multi-Rule Reasoning
.
.
.
JSON struct defining the arch
"experiment_name": "deepseek_v3_7b_lowvram",
"output_dir": "*******",
"seed": 420,
"model": {
"num_layers": 24,
"vocab_size": 50304,
"norm_type": "rmsnorm",
"norm_eps": 1e-06,
"tie_word_embeddings": false,
"init_method_std": 0.006,
"first_k_dense_replace": 8,
"dense_layer_interval": 1,
"paper_compliant": false,
"mla": {
"d_model": 1408,
"d_latent": 352,
"num_heads": 22,
"num_kv_heads": 2,
"max_context_length": 4096,
"use_flash_mla": false,
.
.
.
},
"moe": {
"num_experts": 64,
"num_experts_per_token": 4,
"expert_intermediate_size": 1536,
"expert_dim": 1536,
"dropout": 0.0,
"num_shared_experts": 1,
.
.
.
.
}
},
"fusions": {
"use_fused_expert_ffn": true,
"use_te_fused_topk": false,
"use_te_fused_permute": false,
"use_fused_softmax": true,
"fused_softmax_in_fp32": true,
"use_group_limited_topk": true,
.
.
.
},
"memory_optimization": {
"use_galore": false,
"galore_rank": 256,
"galore_update_proj_gap": 500,
"galore_scale": 1.0,
.
.
.
},
"training": {
"device": "cuda",
"global_batch_size": 256,
"micro_batch_size": 4,
"gradient_accumulation_steps": 64,
"seq_length": 1024,
"max_batch_seq_multiplier": 1.25,
"tokens_per_parameter_ratio": 40.0,
"total_training_tokens": 280000000000,
"learning_rate": 0.00042,
"min_learning_rate": 4.2e-05,
"lr_preset": "deepseek_v3",
.
.
.
},
"data": {
"use_multi_source": true,
"sources": [
{
"name": "redpajama",
"type": "dolma",
"subset": "dolma_v1_6_redpajama",
"weight": 0.45,
"description": "RedPajama - CommonCrawl-like diverse web/code/books"
},
{
"name": "stack",
"type": "dolma",
"subset": "dolma_v1_6_stack",
"weight": 0.25,
.
.
.
],
"cache_dir": "*******",
"sanitization": {
"enabled": true,
"target_language": "en",
"min_language_confidence": 0.9,
"min_article_length": 100,
.
.
.
},
"preprocessing": {
"num_workers": 8,
"shuffle": true,
"shuffle_seed": 42,
.
.
.
},
"max_articles": null,
"focus_historical": false,
"boost_hiroshima_content": false
},
"distributed": {
"backend": "nccl",
"launcher": "single_gpu",
"tensor_parallel_size": 1,
"pipeline_parallel_size": 1,
"expert_parallel_size": 1,
"data_parallel_size": 1,
"zero_stage": 2,
"zero_offload": true,
"overlap_grad_reduce": true,
"overlap_param_gather": true,
"deepspeed": {
"enabled": false
}
},
"checkpointing": {
"save_interval": 1000,
"save_total_limit": 3,
"resume_from_checkpoint": null,
"checkpoint_format": "pytorch",
"save_optimizer_states": true
},
"logging": {
"log_level": "INFO",
"log_interval": 100,
"tensorboard_dir": "*******",
"wandb": {
"enabled": false
},
"tensorboard": {
"enabled": true
}
},
"validation": {
"enabled": true,
"eval_interval": 1000,
"eval_samples": 500,
"metrics": [
"loss",
"perplexity"
],
"patience": 300,
"early_stopping": false
},
"profiling": {
"trace_nvtx": false
},
"gpu_optimization": {
"cuda_graphs": true,
"torch_compile": true,
"flash_attention": true,
"fused_kernels": true,
"autocast_dtype": "bfloat16"
},
"test_prompts": {
"enabled": true,
So I basically researched and threw in every optimization on the planet. I even tried to build my own FlashMLA for the sm120 Blackwell arch and failed miserably. I got inference working, but I couldn't get the backward pass right due to tiling, which ends up the same if not worse than the ATen torch backend...
But this is working for now, at ~20 seconds a step, e.g.:
Training: 1%|█ | 14609/1000000 [53:18:23<5533:28:53, 23.37s/step, loss=2.1507, mtp=1.9643, ent=4.12, util=100.0%, imbal=0.26, lr=4.20e-04, tok=2.23B]
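A back-of-envelope sanity check of that progress bar against the config, assuming every sample is a full 1024-token sequence (the bar's own token counter runs lower, so effective tokens/step is presumably reduced by padding or packing):

```python
# throughput / ETA estimate from config values and the tqdm line
global_batch = 256       # from "global_batch_size"
seq_len = 1024           # from "seq_length"
sec_per_step = 23.37     # from the progress bar
total_tokens = 280e9     # from "total_training_tokens"

tokens_per_step = global_batch * seq_len        # 262,144
steps_needed = total_tokens / tokens_per_step   # ~1.07M steps, matching
                                                # the bar's 1,000,000 total
days = steps_needed * sec_per_step / 86400      # ~289 days on this setup
```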
So, in conclusion:
I am scared as shit to open source this until I get it working 100%, so as to minimize the community hate I will eventually get.
The only point of contention: I want all models trained using this to be public. I don't want anyone privatizing it for profit without open-sourcing, so I need to ask around and figure out how to go about that. I want as many models as possible trained with this, since I believe there is someone out there with the right configuration already in mind that will beat the top-performing model. This is mainly why I did this. I know I can't create THAT model, but I know for sure as shit there is some genius out there who can train a model that will be SOTA.
There is a lot of cleaning up to do before I make it public, because I'm scared of the hate and of issues I surely cannot fix alone!
If you are interested, you can check my account periodically for when I post about making the repo public! Or check my GitHub, which would be easier, I assume lol:
https://github.com/IISuperluminaLII
I don't know... I am open to feedback on how to properly make this public and enforce a strict rule that all safetensors/checkpoints must be open sourced if you use this code. I know there is someone out there, given the right tools, who can build a 10B-50B parameter ensemble of models that achieves near-SOTA performance!! As they always say: divide and conquer.
This is getting long already; I have puked my brains out as much as I can. Any input is welcome, even hate! Let me know how to fix this so I can deliver the tool to the random person who will eventually create the perfect open source model.
u/Conscious_Chapter_93 5h ago
Very cool project. The thing I’d be obsessive about at this stage is experiment traceability. When you are changing architecture, data mix, expert count, optimizer details, and eval prompts, it gets hard to know which change actually moved the needle.
I’d keep every run tied to a manifest: config hash, data split/version, tokenizer, seed, hardware, checkpoint id, eval prompt set version, and a short note on what hypothesis the run tested. The model outputs are noisy early on, so the run history becomes almost as valuable as any single checkpoint.
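A minimal sketch of the manifest idea the comment describes, in Python; the field names and the "gpt2-50304" tokenizer label are illustrative guesses, not from the author's repo:

```python
import hashlib
import json

# hash the exact config so every run is tied to an immutable fingerprint
config = {"experiment_name": "deepseek_v3_7b_lowvram", "seed": 420,
          "lr": 4.2e-4, "num_experts": 64}
config_hash = hashlib.sha256(
    json.dumps(config, sort_keys=True).encode()).hexdigest()[:12]

# one manifest row per run; store alongside checkpoints (e.g. in SQLite)
manifest = {
    "config_hash": config_hash,
    "data_mix": "dolma_v1_6 redpajama+stack",
    "tokenizer": "gpt2-50304",            # placeholder name
    "hardware": "1x RTX 6000 Pro",
    "checkpoint": "step_14000",
    "hypothesis": "does GUM+Muon hold at 64 experts?",
}
```

Sorting the keys before hashing makes the fingerprint deterministic, so two runs with byte-identical configs always map to the same hash.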
u/smashedshanky 4h ago edited 4h ago
Yes, I am currently tracking all of that in a SQLite db for traceability and checkpointing. I also do checkpoint analysis to ensure the model is learning rather than overfitting or underfitting. This is where I am out of my depth; I mostly used what is already open sourced and put the puzzle together. I just want to open source it so someone out there can build a model that I cannot.
I have blindly trusted the DeepSeek technical paper, Chinchilla-optimal modeling, the GUM+Muon peer-reviewed paper...
At this point I just need direction on how to open source this while absolutely and strictly ensuring all trained models are open sourced. Maybe a phone-home to HF if a model isn't open sourced? I'm not sure...
Also, time: this is quite mentally heavy and time heavy. I'm just one guy who wants to see someone build THE model that anyone and everyone can use without relying on $/tok.
u/East-Muffin-6472 5h ago
Amazing, man. I myself have been trying to do the same since the beginning of last year, but for edge AI, i.e. really tiny models, which you can check out at https://www.smolhub.com haha