Try an interactive version of this dialog: Sign up at solve.it.com, click Upload, and paste this URL.
Local Research Group
The goal of this project is to find out whether LoRA and its variants, such as rsLoRA and DoRA, learn less and forget less compared to full fine-tuning. This project is based on the paper *LoRA Learns Less and Forgets Less*. Local Research Group is a fastai study group.
- Training library: LLM Foundry, PyTorch
- Logging: Weights & Biases
Setup dialog
Some utilities for the dialogue.
from IPython.display import Markdown, display
from contextkit import read_file
from pathlib import Path
import json
import re
def cell_to_markdown(cell):
    src = ''.join(cell['source'])
    if cell['cell_type'] == 'markdown':
        if '🤖Reply🤖' in src:
            parts = src.split('##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->')
            return f"**👤 User:**\n\n{parts[0].strip()}\n\n**🤖 Assistant:**\n\n{parts[1].strip()}"
        return src
    if cell['cell_type'] == 'code':
        outs = []
        for o in cell.get('outputs', []):
            if o.get('type') == 'stream': outs.append(''.join(o.get('text', [])))
            elif 'data' in o:
                tp = o['data'].get('text/plain', '')
                outs.append(''.join(tp) if isinstance(tp, list) else tp)
            elif 'text' in o: outs.append(''.join(o['text']) if isinstance(o['text'], list) else o['text'])
        outputs = re.sub(r'\x1b\[[0-9;]*m', '', '\n'.join(outs))  # strip ANSI color codes
        return f"**Input:**\n```python\n{src}\n```" + (f"\n\n**Output:**\n```\n{outputs}\n```" if outputs else "")
    return ""

def read_nb(nb_path):
    nb = json.loads(read_file(nb_path)) if isinstance(nb_path, (str, Path)) else nb_path
    return '\n\n'.join(cell_to_markdown(c) for c in nb['cells'])
Adding other useful dialogs where I read blogs or papers.
no_regret_md = read_nb('blog/LoRA Without Regret.ipynb')
display(Markdown(no_regret_md[:1000]))
Title: LoRA Without Regret
URL Source: https://thinkingmachines.ai/blog/lora/
Markdown Content:
- Optimal learning rate and rank
- [Parametrization invariances](https://thinkingmachines.ai/blog/lora/#p
lora_learns_md = read_nb('../paper/LoRa Learns Less and Forgets Less.ipynb')
display(Markdown(lora_learns_md[:1000]))
Input:
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("lora_learns_less.pdf")
print(md_text[:300])
Output:
Published in Transactions on Machine Learning Research (08/2024)
## **LoRA Learns Less and Forgets Less**
**Dan Biderman** [1] _[,]_ [2] **, Jacob Portes** [2] **, Jose Javier Gonzalez Ortiz** [2] **, Mansheej Paul** [2] **, Philip**
**Greengard** [1] **, Connor Jennings** [2] **, Daniel King** [2]
👤 User:
I am reading a paper. You can look at it in $md_text. Can you provide an amazing outline for me before diving into the paper to ask questions?
🤖 Assistant:
Great! I've looked at the paper "LoRA Learns Less and Forgets Less" in your md_text variable. Here's an outline to help you navigate it:
Paper Outline
Main Topic: Comparing LoRA (Low-Rank Adaptation) vs. full fine-tuning for large language models
Key Sections:
- Introduction & Motivation - Why study LoRA's performance trade-of
Datasets
| repo | dataset | size | split | total_tokens | num_sequences | avg_length | median_length | std_length | min_length | max_length | seqs_over_max | pct_over_max | p25 | p50 | p75 | p95 | p99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| split-avelina-python-edu | avelinapythonedu | full | train | 3,452,655,475 | 6,910,602 | 499.62 | 273 | 750.00 | 6 | 8,192 | 0 | 0 | 143 | 273 | 541 | 1,664 | 3,817 |
| split-avelina-python-edu | avelinapythonedu | full | test | 382,424,047 | 767,845 | 498.05 | 273 | 744.83 | 14 | 8,192 | 0 | 0 | 143 | 273 | 540 | 1,655 | 3,789 |
| split-finemath | finemath | full | train | 8,745,300,721 | 6,029,543 | 1,450.41 | 919 | 1,575.68 | 34 | 8,192 | 0 | 0 | 562 | 919 | 1,652 | 4,855 | 8,192 |
| split-finemath | finemath | full | test | 970,519,958 | 669,950 | 1,448.65 | 919 | 1,573.62 | 38 | 8,192 | 0 | 0 | 562 | 919 | 1,653 | 4,840 | 8,192 |
Input:
3_452_655_475 * 0.05
Output:
172632773.75

Input:
8_745_300_721 * 0.1
Output:
874530072.1

Input:
87_453_007_210
Output:
87453007210.0

Dataset decontamination
To decontaminate the dataset, we use Elasticsearch following the Tulu paper's approach. We want to make sure our training data does not contain evaluation sets.
Read the decontamination readme and tell me about how elastic search works.
I'll read the decontamination readme from the repository you mentioned to explain how they use Elasticsearch for dataset decontamination.
{
"id": "toolu_01QPPZSpHU8dMy18gMzJSGpK",
"call": {
"function": "read_url",
"arguments": {
"url": "https://github.com/allenai/open-instruct/tree/main/decontamination"
}
},
  "result": "(GitHub navigation markup only; the page did not return the README content)"
}
Let me try to access the README file directly:
{
"id": "toolu_01NNugoYVKWb4zmP8wz5gRw1",
"call": {
"function": "read_url",
"arguments": {
"url": "https://raw.githubusercontent.com/allenai/open-instruct/main/decontamination/README.md"
}
},
"result": "'# Scripts for computing overlap between train and test sets\\n\\nThese scripts are for creating Elasticsearch indices over training datasets, particularly instruction tuning datasets, and querying them with test sets to compute overlap. They can be used for quantifying and analyzing training dataset contamination.\\n\\n## Running Elasticsearch\\n\\nElasticsearch needs to up and running for creating and querying indices. You can run it locally by following the steps [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/run-elasticsearch-locally.html). Make sure to keep track of the password and save it as an environment variable, `ELASTIC_PASSWORD`, e.g.:\\n\\n```bash\\nexport ELASTIC_PASSWORD=[password]\\n```\\n\\n## Indexing\\n\\nYou can index the training sets either as text or as dense vectors. The indexing script assumes that the training dataset is a Huggingface dataset, and has a field that contains prompt-response pairs in a conversational format, e.g. a `messages` field that looks like\\n\\n```json\\n[\\n {\\n \"role\": \"user\",\\n \"content\": \"Write me a poem.\"\\n },\\n {\\n \"role\": \"assistant\",\\n \"content\": \"Sorry, I cannot help you with that.\"\\n }\\n]\\n```\\n\\nThe script indexes each turn as a separate Elasticsearch document, and importantly only indexes the messages of one specific role. The assumption is that you would want to index only the prompts for quantifying contamination. 
You can control this behavior using the `--messages_field`, `--query_filter`, and `--query_field` options as follows:\\n\\n```bash\\npython index.py --messages_field messages --query_filter role:user --query_field content\\n```\\n\\nThe setting above looks for the `messages` field in the dataset, finds messages where the `role` is `user` and indexes their `content`.\\n\\n### Indexing multiple datasets\\n\\nYou can index one dataset at a time as follows\\n\\n```bash\\npython index.py --dataset HF_DATASET_NAME\\n```\\n\\nAlternatively, you can pass a training configuration yaml with a `dataset_mixer` field to index all the datasets in the mix.\\n\\n```bash\\npython index.py --dataset_mixer_config config.yaml\\n```\\n\\n### Indexing vector representations\\n\\nBy default, the indexing script indexes the text in the datasets. If you want to perform soft matching, you can change `--index_type`, and specify an embedding model (defaults to [NV-Embed-v2](https://huggingface.co/nvidia/NV-Embed-v2)).\\n\\n```bash\\npython index.py --index_type vector --model nvidia/NV-Embed-v2\\n```\\n\\nThe script assumes you are running this on GPUs and uses all the available devices. You can adjust `--max_batch_tokens` to a suitable value if you run into OOM errors or if you want to use your GPUs more effectively.\\n\\n## Searching\\n\\nThe searching script lets you query one or more `text` or a `vector` indices with a test set. When querying a `text` index, you can perform an ngram match, a full text match, or an embedding-based match of a specified field(s) in the test set. 
The basic usage looks like\\n\\n```bash\\npython search.py --train_dataset_names allenai/tulu-2-sft-mixture allenai/wildchat-1m --dataset tatsu-lab/alpaca_eval --split eval --field instruction --output_dir /path/to/output\\n```\\n\\nThe command above queries the indices corresponding to the two training sets, `allenai/tulu-2-sft-mixture` and `allenai/wildchat-1m` (assuming these were indexed earlier) with the AlpacaEval dataset, particularly the `instruction` field in the `eval` split.\\n\\nThe script will create in the output directory one `jsonl` file per each pair of index and evaluation dataset with instance-level information about the matches, and a TSV file called `contamination_report.tsv` with a table of contamination scores for all the pairs.\\n\\nLike with the indexing script, a dataset mixer configuration can be passed with the `--dataset_mixer_config` option instead of `--train_dataset_names`.\\n\\n### Checking for contamination against the Tulu 3 evaluation suite\\n\\nIf no evaluation dataset is specified using the `--dataset` option, the entire Tulu 3 evaluation suite will be used to query the specified indices.\\n\\n### Matching ngrams\\n\\nText indexes can be queried for ngram matches instead of full field matches (default) as follows\\n\\n```bash\\npython search.py --train_dataset_names TRAIN_DATASET_NAME --ngram_size SIZE [--match_threshold THRESHOLD]\\n```\\n\\nMatching scores are then computed as follows:\\n- For each token in the test instance, all matching training documents are retrieved. A training document is considered a match for a token if it is part of an ngram of the specified `SIZE` in the test instance, that also occurs in the training document.\\n- The single training document that covers the most number of tokens in the test instance is considered the largest match.\\n- If no threshold is specified, the match score for the test instance is the proportion of the matched tokens. 
If a threshold is specified, the score is `0` or `1` depending on the threshold.\\n- The evaluation dataset level match (or contamination) score is the average of instance level match scores.\\n\\n### Embedding-based matching\\n\\nIf the index is created using `--index_type vector`, the same option needs to be specified for searching as well, along with the same `--model MODEL_NAME`. The searching script also assumes you are running this on GPUs.\\n\\nYou can specify a `--match_threshold` here as well, and the behavior is similar to that in ngram matching, except that the match scores here come from embedding similarity.\\n\\n### Decontamination\\n\\nIf you need to remove instances from the training sets that match any of the test instances, just pass a `--decontaminate` option to `search.py`. The output directory will contain one decontaminated `jsonl` file per training dataset. If you pass a `--match_treshold`, only those train instances that have a matching score greater than the threshold with *any* of the test instances will be removed.\\n\\nNote that elasticsearch retrieves a limited number of hits each time you search. You can increase this by requesting a larger number of results by passing a different value to `--search_size` (default is 100). Setting this to a larger number (e.g. 10000) is a good idea if you are decontaminating datasets. Since elasticsearch does not necessarily retrieve all the documents that match, it is not guaranteed that decontamination removes all the matching training instances. You can always check for contamination after decontaminating a dataset to see how effective it was. '"
}
Great! I found the README. Let me explain how Elasticsearch works for dataset decontamination based on what I just read.
Elasticsearch is used here as a search engine to detect overlap between training and test datasets. Here's how it works:
Basic concept: They create searchable indices of the training data, then query those indices with test data to find matches (contamination).
The two-step process:
Indexing - They load training datasets into Elasticsearch, storing either:
- Raw text (for exact/ngram matching)
- Dense vector embeddings (for semantic similarity matching)
Searching - They query the indices with test set examples to find overlaps
Key matching strategies available:
- Full text match (exact field matching)
- N-gram matching (looks for overlapping sequences of N tokens)
- Embedding-based matching (semantic similarity using models like NV-Embed-v2)
Does this make sense so far? Would you like me to explain more about any particular aspect - like how the n-gram matching works, or how they calculate contamination scores?
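The index-then-search flow above can be sketched with a toy in-memory n-gram inverted index. This is purely illustrative: the repo uses a real Elasticsearch index, and `build_index`/`search_index` are hypothetical helpers, not functions from the repo.

```python
from collections import defaultdict

def build_index(docs, n=8):
    """Map each n-gram of tokens to the ids of training docs containing it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].add(doc_id)
    return index

def search_index(index, query_tokens, n=8):
    """Return ids of training docs sharing at least one n-gram with the query."""
    hits = set()
    for i in range(len(query_tokens) - n + 1):
        hits |= index.get(tuple(query_tokens[i:i + n]), set())
    return hits

train_docs = {
    "doc0": "the quick brown fox jumps over the lazy dog today".split(),
    "doc1": "completely unrelated text about cooking pasta al dente tonight".split(),
}
index = build_index(train_docs)
hits = search_index(index, "see the quick brown fox jumps over the lazy dog".split())
```

Here `hits` contains only `doc0`, since the query shares an 8-gram with it but not with `doc1`.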
We follow this from the tulu paper:
We used 8-gram matching for our contamination checks following (Dubey et al., 2024; Singh et al., 2024a). For each token in a test instance, we consider it to match a token in a train instance if the two instances share an 8-gram containing that token, and we consider the test instance itself to have significant overlap with a train instance if more than 50% of the test tokens have 8-gram matches with the same training instance.
Decontamination.
We consider a training set to be contaminated if any number of its instances overlap with more than 2% of the instances in any of the evaluations in our development and unseen suites. We remove all the training sets that were contaminated with our unseen evaluations. For training sets that were contaminated with our development evaluations, we removed the entire dataset if doing so did not significantly impact the performance of the resulting model; otherwise, we removed the specific instances that match any test instance. The list of datasets we decontaminated and the versions we released with overlapping samples removed is shown in Table 8. The full list of public datasets that we found to be significantly contaminated with our evaluation sets can be found in Table 37.
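The dataset-level rule in the quote can be sketched in a few lines. This is a hedged sketch of the decision logic only; `dataset_contaminated` is a hypothetical helper, not part of the Tulu codebase.

```python
def dataset_contaminated(instance_scores, instance_threshold=0.5, dataset_threshold=0.02):
    """A training set counts as contaminated if more than 2% of an eval set's
    instances have significant (>50% token) overlap with some training instance."""
    flagged = sum(score > instance_threshold for score in instance_scores)
    return flagged / len(instance_scores) > dataset_threshold

# 30 of 1000 eval instances (3%) significantly overlap -> contaminated.
scores = [0.9] * 30 + [0.1] * 970
result = dataset_contaminated(scores)
```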
Thanks for sharing that! So you're implementing the same approach as the Tulu paper. Let me break down what they're doing:
The 8-gram matching criteria:
- A token in the test set "matches" if it's part of an 8-gram (8 consecutive tokens) that also appears in a training instance
- A test instance is considered "significantly overlapping" if >50% of its tokens match the same training instance
The decontamination thresholds:
- A training dataset is "contaminated" if it overlaps with >2% of instances in ANY evaluation set
- For unseen eval contamination: remove the entire training dataset
- For development eval contamination: remove the whole dataset if possible, otherwise just remove the specific overlapping instances
So when you run the Elasticsearch decontamination, you'll want to use:
--ngram_size 8 and --match_threshold 0.5 (for the 50% overlap criterion)
Then you'll need to analyze the contamination report to see which training datasets exceed the 2% threshold for each eval set.
Have you already indexed your training datasets (avelinapythonedu and finemath) in Elasticsearch, or do you need help with that step first?
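As a sanity check, the per-instance 8-gram criterion can be sketched in plain Python. This is an illustrative reimplementation (`ngram_match_fraction` is a made-up helper); the actual matching runs inside Elasticsearch via search.py.

```python
def ngram_match_fraction(test_tokens, train_tokens, n=8):
    """Fraction of test tokens that are part of an n-gram shared with the train instance."""
    train_ngrams = {tuple(train_tokens[i:i + n]) for i in range(len(train_tokens) - n + 1)}
    matched = [False] * len(test_tokens)
    for i in range(len(test_tokens) - n + 1):
        if tuple(test_tokens[i:i + n]) in train_ngrams:
            for j in range(i, i + n):
                matched[j] = True
    return sum(matched) / len(test_tokens) if test_tokens else 0.0

test = "what is the sum of two and three apples".split()
train = "question : what is the sum of two and three".split()
frac = ngram_match_fraction(test, train)          # 8 of 9 tokens covered
is_contaminated = frac > 0.5                      # the 50% overlap criterion
```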
setup tools
Jae L created an awesome repo with all the tools. I couldn't follow it very easily at first, so let's go through it together.
Let's go ahead and clone git@github.com:zzehli/local-research-decontaminate.git and cd into the directory.
First, we have to set up Docker. I am on an Ubuntu machine (Pop!_OS). This page explains how to install Docker Desktop: https://docs.docker.com/desktop/setup/install/linux/ubuntu/. But I didn't want the Desktop version because my machine has no monitor attached. Instead, I followed the "Install using the apt repository" section of https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository.
sudo systemctl status docker
sudo systemctl start docker
This started Docker. Checking with docker compose version passed.
But we cannot use Docker from our user due to a permissions issue. Run these commands:
sudo usermod -aG docker galopy # add user to docker group
newgrp docker # activate the docker group membership without log out and in.
docker ps # test docker permission
After setting up Docker, we set up Elasticsearch following this quickstart: https://www.elastic.co/docs/deploy-manage/deploy/self-managed/local-development-installation-quickstart.
When running curl -fsSL https://elastic.co/start-local | sh, it may complain that the directory already exists. Uninstall the existing installation first; we need to set up a new one.
I ran it, but got this error:
galopy@pop-os:~/git/local-research-decontaminate$ curl -fsSL https://elastic.co/start-local | sh -s -- --esonly
______ _ _ _
| ____| | | | (_)
| |__ | | __ _ ___| |_ _ ___
| __| | |/ _` / __| __| |/ __|
| |____| | (_| \__ \ |_| | (__
|______|_|\__,_|___/\__|_|\___|
-------------------------------------------------
🚀 Run Elasticsearch and Kibana for local testing
-------------------------------------------------
ℹ️ Do not use this script in a production environment
⌛️ Setting up Elasticsearch v9.2.0...
- Generated random passwords
- Created the elastic-start-local folder containing the files:
- .env, with settings
- docker-compose.yml, for Docker services
- start/stop/uninstall commands
- Running docker compose up --wait
[+] Running 3/3
 ✔ Network elastic-start-local_default            Created    1.9s
 ✔ Volume elastic-start-local_dev-elasticsearch   Created    0.1s
 ✔ Container es-local-dev                         Healthy   46.0s
🎉 Congrats, Elasticsearch is installed and running in Docker!
🔌 Elasticsearch API endpoint: http://localhost:9200
🔑 API key: ZGx3X1pKb0JMTWZERS16bUVMLWw6YUprYU11QkhKMzd2Xy1kSVJXYWR3dw==
Learn more at https://github.com/elastic/start-local
(modernbert) galopy@pop-os:~/git/local-research-decontaminate$ export ELASTIC_PASSWORD=ZGx3X1pKb0JMTWZERS16bUVMLWw6YUprYU11QkhKMzd2Xy1kSVJXYWR3dw==
(modernbert) galopy@pop-os:~/git/local-research-decontaminate$ uv run index.py --dataset LocalResearchGroup/split-finemath --messages_field text --subset full --index_type text --parallel --chunk_size 3200 --max_chunk_bytes 10000000 --queue_size 16 --thread_count 16
Processing dataset 1 / 1: LocalResearchGroup/split-finemath
Reading LocalResearchGroup/split-finemath subset full split train
Resolving data files: 100%|██████████████████████████| 118/118 [00:00<00:00, 259.69it/s]
Loading dataset shards: 100%|███████████████████████████| 71/71 [00:19<00:00, 3.62it/s]
Reading text from LocalResearchGroup/split-finemath
6029543it [09:22, 10727.48it/s]
Read LocalResearchGroup/split-finemath for indexing. Has 6029543 instances and 6029543 messages.
6029543
Traceback (most recent call last):
File "/home/galopy/git/local-research-decontaminate/index.py", line 341, in <module>
main()
~~~~^^
File "/home/galopy/git/local-research-decontaminate/index.py", line 314, in main
if not es.indices.exists(index=index_name):
~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/utils.py", line 415, in wrapped
return api(*args, **kwargs)
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/indices.py", line 1597, in exists
return self.perform_request( # type: ignore[return-value]
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
"HEAD",
^^^^^^^
...<4 lines>...
path_parts=__path_parts,
^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/_base.py", line 422, in perform_request
return self._client.perform_request(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
method,
^^^^^^^
...<5 lines>...
path_parts=path_parts,
^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/_base.py", line 271, in perform_request
response = self._perform_request(
method,
...<4 lines>...
otel_span=otel_span,
)
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/_base.py", line 351, in _perform_request
raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
message=message, meta=meta, body=resp_body
)
elasticsearch.AuthenticationException: AuthenticationException(401, 'None')
The error shows an authentication issue (401). The problem is that what you exported as ELASTIC_PASSWORD is actually an API key, not a password!
Look at the Elasticsearch setup output - it says:
🔑 API key: ZGx3X1pKb0JMTWZERS16bUVMLWw6YUprYU11QkhKMzd2Xy1kSVJXYWR3dw==
But you need to find the actual password. When Elasticsearch starts, it should have generated a password for the elastic user.
Try checking the .env file that was created:
cat elastic-start-local/.env
Look for a line with ELASTIC_PASSWORD= in that file. That's the actual password you need to export.
Alternatively, you might need to configure the indexing script to use the API key instead of password authentication. Does the index.py script have options for API key authentication?
galopy@pop-os:~/git/local-research-decontaminate$ cat elastic-start-local/.env
START_LOCAL_VERSION=0.12.0
ES_LOCAL_VERSION=9.2.0
ES_LOCAL_CONTAINER_NAME=es-local-dev
ES_LOCAL_PASSWORD=cXnaRZIB
ES_LOCAL_PORT=9200
ES_LOCAL_URL=http://localhost:${ES_LOCAL_PORT}
ES_LOCAL_DISK_SPACE_REQUIRED=1gb
ES_LOCAL_JAVA_OPTS="-Xms128m -Xmx2g"
ES_LOCAL_API_KEY=ZGx3X1pKb0JMTWZERS16bUVMLWw6YUprYU11QkhKMzd2Xy1kSVJXYWR3dw==
(modernbert) galopy@pop-os:~/git/local-research-decontaminate$ export ELASTIC_PASSWORD=cXnaRZIB
Perfect! You found the actual password (cXnaRZIB) and exported it correctly.
Now try running your indexing command again:
uv run index.py --dataset LocalResearchGroup/split-finemath --messages_field text --subset full --index_type text --parallel --chunk_size 3200 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16
Does it work now, or do you get a different error?
Can I check it first to see if it works before running this command?
Good idea! Yes, you can test the Elasticsearch connection first. Try this simple command to check if authentication is working:
curl -u elastic:$ELASTIC_PASSWORD http://localhost:9200
This should return some JSON with cluster information if everything is set up correctly. What do you get?
galopy@pop-os:~/git/local-research-decontaminate$ curl -u elastic:$ELASTIC_PASSWORD http://localhost:9200
{
"name" : "be576f909e56",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "WkaOp27OQSGi4Xb2-UjxkQ",
"version" : {
"number" : "9.2.0",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "25d88452371273dd27356c98598287b669a03eae",
"build_date" : "2025-10-21T10:06:21.288851013Z",
"build_snapshot" : false,
"lucene_version" : "10.3.1",
"minimum_wire_compatibility_version" : "8.19.0",
"minimum_index_compatibility_version" : "8.0.0"
},
"tagline" : "You Know, for Search"
}
Excellent! Elasticsearch is working and authenticated properly. You can see it's running version 9.2.0 and responding to your credentials.
Now you're ready to run your indexing command. Go ahead and run it:
uv run index.py --dataset LocalResearchGroup/split-finemath --messages_field text --subset full --index_type text --parallel --chunk_size 3200 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16
Let me know how it goes!
I was dumb and saved the API key as ELASTIC_PASSWORD. Make sure to export the actual password (the ES_LOCAL_PASSWORD value from the .env file) as ELASTIC_PASSWORD.
Then we can get the virtual environment set up. For some reason, I was using Python 3.14 with uv, which was not compatible with PyTorch 2.8. I created a venv with 3.13:
galopy@pop-os:~/git/local-research-decontaminate$ uv venv --python 3.13
Using CPython 3.13.9
Creating virtual environment at: .venv
✔ A virtual environment already exists at `.venv`. Do you want to replace it? · yes
Activate with: source .venv/bin/activate
galopy@pop-os:~/git/local-research-decontaminate$ source .venv/bin/activate
(local-research-decontaminate) galopy@pop-os:~/git/local-research-decontaminate$ uv sync
This replaced the 3.14 venv with 3.13; then I activated the venv and installed the dependencies with uv.
Also install these:
uv run python -m ensurepip --upgrade
uv run spacy download en_core_web_lg
Now we are ready to start!
Index the dataset:
uv run index.py --dataset LocalResearchGroup/split-finemath --messages_field text --subset full --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16
Search and decontaminate:
uv run search.py --train_dataset_names LocalResearchGroup/split-finemath --train_split train --dataset openai/gsm8k --field question --output_dir data/finemath/ --ngram_size 8 --match_threshold 0.5 --decontaminate
It removes contaminated instances, saves the decontaminated set to a parquet file, then pushes it to the Hugging Face Hub.
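As a rough illustration of what the 8-gram matching does (a hypothetical re-implementation for intuition only; `search.py`'s actual Elasticsearch-backed scoring may differ):

```python
def ngrams(text, n=8):
    """All word-level n-grams of a text, as a set of tuples."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def match_score(eval_text, train_text, n=8):
    """Fraction of the eval example's n-grams found in a train document."""
    evs = ngrams(eval_text, n)
    if not evs:
        return 0.0
    return len(evs & ngrams(train_text, n)) / len(evs)

# A train document whose score exceeds --match_threshold (0.5 in the
# commands above) would be treated as contaminated and removed.
score = match_score("a b c d e f g h i j", "x a b c d e f g h i j y")
# score == 1.0: every 8-gram of the eval text appears in the train doc
```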
Running on the test split:
uv run search.py --train_dataset_names LocalResearchGroup/split-finemath --train_split test --dataset openai/gsm8k --field question --output_dir data/finemath/ --ngram_size 8 --match_threshold 0.5 --decontaminate
galopy@pop-os:~/git/local-research-decontaminate$ uv run search.py --train_dataset_names LocalResearchGroup/split-finemath --train_split train --dataset openai/gsm8k --field question --output_dir data/finemath/ --ngram_size 8 --match_threshold 0.5 --decontaminate
Querying localresearchgroup_split-finemath_train_text for openai/gsm8k.
0%| | 0/2638 [00:10<?, ?it/s]
Traceback (most recent call last):
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/connectionpool.py", line 534, in _make_request
response = conn.getresponse()
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/connection.py", line 565, in getresponse
httplib_response = super().getresponse()
File "/home/galopy/.local/share/uv/python/cpython-3.13.9-linux-x86_64-gnu/lib/python3.13/http/client.py", line 1430, in getresponse
response.begin()
~~~~~~~~~~~~~~^^
File "/home/galopy/.local/share/uv/python/cpython-3.13.9-linux-x86_64-gnu/lib/python3.13/http/client.py", line 331, in begin
version, status, reason = self._read_status()
~~~~~~~~~~~~~~~~~^^
File "/home/galopy/.local/share/uv/python/cpython-3.13.9-linux-x86_64-gnu/lib/python3.13/http/client.py", line 292, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
File "/home/galopy/.local/share/uv/python/cpython-3.13.9-linux-x86_64-gnu/lib/python3.13/socket.py", line 719, in readinto
return self._sock.recv_into(b)
~~~~~~~~~~~~~~~~~~~~^^^
TimeoutError: timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elastic_transport/_node/_http_urllib3.py", line 167, in perform_request
response = self.pool.urlopen(
method,
...<4 lines>...
**kw, # type: ignore[arg-type]
)
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/connectionpool.py", line 841, in urlopen
retries = retries.increment(
method, url, error=new_e, _pool=self, _stacktrace=sys.exc_info()[2]
)
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/util/retry.py", line 449, in increment
raise reraise(type(error), error, _stacktrace)
~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/util/util.py", line 39, in reraise
raise value
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/connectionpool.py", line 787, in urlopen
response = self._make_request(
conn,
...<10 lines>...
**response_kw,
)
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/connectionpool.py", line 536, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/connectionpool.py", line 367, in _raise_timeout
raise ReadTimeoutError(
self, url, f"Read timed out. (read timeout={timeout_value})"
) from err
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=9.99971655599984)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/galopy/git/local-research-decontaminate/search.py", line 413, in <module>
main()
~~~~^^
File "/home/galopy/git/local-research-decontaminate/search.py", line 313, in main
match_scores, output_data, train_indices_with_scores = ngram_match(es, index_name, query_dataset, fields, args.ngram_size, args.search_size)
~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/galopy/git/local-research-decontaminate/search.py", line 114, in ngram_match
search_output = es.search(
index=index_name,
...<13 lines>...
}
)
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/utils.py", line 415, in wrapped
return api(*args, **kwargs)
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/__init__.py", line 4796, in search
return self.perform_request( # type: ignore[return-value]
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
"POST",
^^^^^^^
...<5 lines>...
path_parts=__path_parts,
^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/_base.py", line 271, in perform_request
response = self._perform_request(
method,
...<4 lines>...
otel_span=otel_span,
)
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/_base.py", line 315, in _perform_request
meta, resp_body = self.transport.perform_request(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
method,
^^^^^^^
...<8 lines>...
otel_span=otel_span,
^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elastic_transport/_transport.py", line 344, in perform_request
resp = node.perform_request(
method,
...<3 lines>...
request_timeout=request_timeout,
)
File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elastic_transport/_node/_http_urllib3.py", line 202, in perform_request
raise err from e
elastic_transport.ConnectionTimeout: Connection timed out
The search is timing out - the n-gram matching is taking longer than the 10-second default timeout. This can happen with large indices like finemath.
Let's check a few things. First, verify the index was created successfully:
curl -u elastic:$ELASTIC_PASSWORD "http://localhost:9200/_cat/indices?v"
This will show all your indices and their sizes. Do you see the finemath index listed?
galopy@pop-os:~/git/local-research-decontaminate$ curl -u elastic:$ELASTIC_PASSWORD http://localhost:9200/_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green open localresearchgroup_split-finemath_train_text BxJknffNS-CIJkujMh5B0g 1 0 3578414 22 19.2gb 19.2gb 19.2gb
Good! The index exists with 3.5 million documents and is about 19GB. The timeout is happening because searching through that many documents for 8-grams is slow.
You have a few options:
- Increase the timeout - The search script might have a timeout parameter, or you could modify the Elasticsearch client timeout setting
- Increase Elasticsearch memory - Your `.env` file shows it's using `-Xmx2g` (2GB max heap). With a 19GB index, this might be too small
- Test with a smaller eval set first - Try a subset to verify everything works before running the full gsm8k
Want to check if search.py has a timeout parameter you can adjust?
uv run search.py --help | grep -i timeout
What do you see?
There is nothing..
OK, so no timeout parameter exposed. You have two main options:
Option 1: Increase Elasticsearch's memory allocation. Edit the .env file and change ES_LOCAL_JAVA_OPTS from -Xmx2g to something like -Xmx8g or -Xmx16g (depending on your machine's RAM), then restart Elasticsearch.
Option 2: Look at the search.py code to see where the Elasticsearch client is initialized and manually increase the timeout there.
Which approach do you want to try first? How much RAM does your machine have available?
where is my .env?
It's in the elastic-start-local directory that was created when you ran the Elasticsearch setup. You saw it earlier when you checked for the password:
cat elastic-start-local/.env
To edit it, you can use:
nano elastic-start-local/.env
or whatever text editor you prefer. Which option do you want to try - increasing the memory, or looking at modifying the search.py timeout?
what does -Xmx2g mean?
-Xmx2g is a Java Virtual Machine (JVM) argument that sets the maximum heap memory that Elasticsearch can use.
- `-Xmx` = maximum heap size
- `2g` = 2 gigabytes
So Elasticsearch is currently limited to using at most 2GB of RAM for its operations. With a 19GB index, this is pretty tight and likely why searches are timing out.
How much RAM does your machine have total?
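For reference, the change itself is a one-line edit to the variable mentioned earlier (the `16g` here is only an illustration; size it to your machine's RAM, and restart Elasticsearch afterwards):

```
# elastic-start-local/.env (fragment)
ES_LOCAL_JAVA_OPTS=-Xmx16g
```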
Cool! It is working. Increasing it to 16GB solved it, but the run would have taken an hour. I increased it to 64GB, and now it takes less than 30 minutes.
Excellent! That's a huge performance improvement. With 64GB allocated, Elasticsearch can keep much more of the index in memory, which makes those n-gram searches much faster.
While it's running, are you planning to decontaminate other evaluation datasets too, or just gsm8k for now? You mentioned wanting to check against multiple eval sets to follow the Tulu paper's approach (the 2% threshold across all evals).
Uploading the dataset shards: 100%|███████████████████| 71/71 [46:29<00:00, 39.29s/ shards]
Creating parquet from Arrow format: 100%|██████████████| 6030/6030 [06:15<00:00, 16.06ba/s]
Wrote parquet files to data/finemath/LocalResearchGroup_split-finemath_train_decontaminated
Removed 64 train instances.
Kept 100.00% of the original data.
Ok I got that to work. Can we try to do it for the avelina python?
Sure! First, you'll need to index the avelina python dataset. Based on what you did with finemath, the command would be:
uv run index.py --dataset LocalResearchGroup/split-avelina-python-edu --messages_field text --subset full --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16
Does that look right, or do you need to check the dataset structure first to confirm the field name is text?
That looks correct to me. But I think the previous run is still in memory and using my resources. How do I fix this?
You can check what's currently running and stop it if needed. Try:
docker ps
This will show you the running containers. Do you see the Elasticsearch container listed?
galopy@pop-os:~/git/local-research-decontaminate$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
be576f909e56 docker.elastic.co/elasticsearch/elasticsearch:9.2.0 "/bin/tini -- /usr/l…" 3 hours ago Up 3 hours (healthy) 127.0.0.1:9200->9200/tcp, 9300/tcp es-local-dev
Good, Elasticsearch is running. The search process that finished should have freed up resources automatically. You can go ahead and start indexing the avelina python dataset now:
uv run index.py --dataset LocalResearchGroup/split-avelina-python-edu --messages_field text --subset full --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16
Does it start running without issues?
It is done! But something seems off..
full/test-00004-of-00005.parquet: 100%|█████████████████| 127M/127M [00:11<00:00, 10.7MB/s]
Generating train split: 100%|██████████| 6910602/6910602 [01:32<00:00, 74369.97 examples/s]
Generating test split: 100%|█████████████| 767845/767845 [00:09<00:00, 77718.93 examples/s]
Loading dataset shards: 100%|█████████████████████████████| 25/25 [00:00<00:00, 625.10it/s]
Reading text from LocalResearchGroup/split-avelina-python-edu
6910602it [03:23, 33883.18it/s]
Read LocalResearchGroup/split-avelina-python-edu for indexing. Has 6910602 instances and 6910602 messages.
6910602
Created a new text index: localresearchgroup_split-avelina-python-edu_train_text
^C^C^C^C^C^C^C^C^C^C^C^C (repeated many more times)
I can't exit!!
The indexing process is stuck! It created the index but seems to be hanging while trying to populate it. Try pressing Ctrl+Z to suspend the process, then kill it:
kill %1
Or if that doesn't work, find the process ID and kill it:
ps aux | grep index.py
Then use kill -9 <PID> with the process ID you find. Can you try that?
I did the kill %1, but I don't think it worked..
galopy@pop-os:~/git/local-research-decontaminate$ ps aux | grep index.py
galopy 100439 0.0 0.0 19028 2300 pts/2 SN+ 13:42 0:00 grep --color=auto index.py
[1]+ Exit 143 uv run index.py --dataset LocalResearchGroup/split-avelina-python-edu --messages_field text --subset full --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16
Good news! It actually did work. The Exit 143 message means the process was terminated successfully (143 = 128 + 15, where 15 is SIGTERM). The process you see in the ps output is just your grep command itself, not the indexing process.
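The exit-status arithmetic can be checked directly (a tiny Python sanity check, separate from the workflow itself):

```python
import signal

# A process killed by signal N exits with status 128 + N;
# SIGTERM is signal 15, hence Exit 143 above.
assert signal.SIGTERM == 15
assert 128 + signal.SIGTERM == 143
```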
The indexing seemed to hang after creating the index. Let's check if the index was created and if it has any data:
curl -u elastic:$ELASTIC_PASSWORD "http://localhost:9200/_cat/indices?v"
What do you see?
I see
galopy@pop-os:~/git/local-research-decontaminate$ curl -u elastic:$ELASTIC_PASSWORD "http://localhost:9200/_cat/indices?v"
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green open localresearchgroup_split-avelina-python-edu_train_text ptW6i5OgTrCxc8Unyhyn1A 1 0 3672872 30 5.9gb 5.9gb 5.9gb
green open localresearchgroup_split-finemath_train_text BxJknffNS-CIJkujMh5B0g 1 0 3578414 1 15.9gb 15.9gb 15.9gb
Great! The avelina-python index was actually created successfully with 3.67 million documents (5.9GB). It must have completed but hung at the end when trying to exit cleanly.
Now you can run the search and decontamination:
uv run search.py --train_dataset_names LocalResearchGroup/split-avelina-python-edu --train_split train --dataset openai/gsm8k --field question --output_dir data/avelina-python/ --ngram_size 8 --match_threshold 0.5 --decontaminate
Does it run successfully?
yep let's search now
Great! Now run the search and decontamination command for the avelina python dataset:
uv run search.py --train_dataset_names LocalResearchGroup/split-avelina-python-edu --train_split train --dataset openai/openai_humaneval --field prompt --output_dir data/avelina-python/ --ngram_size 8 --match_threshold 0.5 --decontaminate
Let me know how it goes!
Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 6911/6911 [03:00<00:00, 38.24ba/s]
Wrote parquet files to data/avelina-python/LocalResearchGroup_split-avelina-python-edu_train_decontaminated
Removed 20 train instances.
Kept 100.00% of the original data.
Good. Now we have to do those steps with test split. Can you write the code for them? I need to index then search on those datasets
Sure! Here are the commands for the test splits:
Index finemath test split:
uv run index.py --dataset LocalResearchGroup/split-finemath --messages_field text --subset full --split test --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16
Index avelina-python-edu test split:
uv run index.py --dataset LocalResearchGroup/split-avelina-python-edu --messages_field text --subset full --split test --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16
Search finemath test against gsm8k:
uv run search.py --train_dataset_names LocalResearchGroup/split-finemath --train_split test --dataset openai/gsm8k --field question --output_dir data/finemath/ --ngram_size 8 --match_threshold 0.5 --decontaminate
Search avelina-python-edu test against humaneval:
uv run search.py --train_dataset_names LocalResearchGroup/split-avelina-python-edu --train_split test --dataset openai/openai_humaneval --field prompt --output_dir data/avelina-python/ --ngram_size 8 --match_threshold 0.5 --decontaminate
Run these in order. Let me know when you're done!
README.md: 100%|██████████████████████████████████████████| 894/894 [00:00<00:00, 10.8MB/s]
Creating parquet from Arrow format: 100%|████████████████| 670/670 [00:38<00:00, 17.25ba/s]
Wrote parquet files to data/finemath/LocalResearchGroup_split-finemath_test_decontaminated
Removed 12 train instances.
Kept 100.00% of the original data.
[1]+ Exit 143 uv run index.py --dataset LocalResearchGroup/split-finemath --messages_field text --subset full --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16 --split test
README.md: 100%|██████████████████████████████████████████| 521/521 [00:00<00:00, 6.62MB/s]
Creating parquet from Arrow format: 100%|████████████████| 768/768 [00:20<00:00, 37.94ba/s]
Wrote parquet files to data/avelina-python/LocalResearchGroup_split-avelina-python-edu_test_decontaminated
Removed 7 train instances.
Kept 100.00% of the original data.
[1]+ Exit 143 uv run index.py --dataset LocalResearchGroup/split-avelina-python-edu --messages_field text --subset full --split test --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16
Training
Modal
Use this command to train peft on modal:
MODAL_GPU=L4 TRAIN_YAML=scripts/train/yamls/pretrain/custom_smollm2-135m_peft.yaml IS_PEFT=true OUTPUT_PRECISION=bf16 modal run --detach custom_model_training.py
Custom model
Why build custom model?
Does Hugging Face support unpadding of sequences for training the model? Do a web search.
Yes, Hugging Face does support unpadding of sequences for training! Here's what I found:
* The current implementation does an unpad operation just before the FlashAttention kernel is executed, converting the input to a concatenated 2D tensor. However, this approach has limitations.
* The current integration of FlashAttention in most libraries including HuggingFace transformers is suboptimal when using sequences of variable lengths in a batch (when there's padding).
More recently, there are better approaches:
ModernBERT (added December 2024):
* ModernBERT brings improvements including unpadding to avoid wasting compute on padding tokens

DataCollatorWithFlattening:
* This data collator concatenates the entire mini batch into a single long sequence and uses a separator to distinguish sequences, with no padding added

The key parameters for unpadding in ModernBERT include `indices`, `cu_seqlens`, and `max_seqlen`, which help handle unpadded sequences efficiently.
Would you like more details on how to implement unpadding in your training setup, or are you interested in a specific approach for your LoRA research?
Here is vishal's blog on using flash attention 2. Can summarize this blog regarding the unpadding and speed of the training?
`read_url({"url": "https://vishalbakshi.github.io/blog/posts/2025-06-30-debug-fa2-llm-foundry/"})`
'[ Vishal Bakshi’s Blog ](../../index.html)\n\n * [ Home](../../index.html)\n * [ About](../../about.html)\n\n * [ __ ](../../index.xml)\n\n## On this page\n\n * Background\n * What is the value of the attention_mask?\n * When is flash_attn_varlen_func called?\n * How do we check the value of attention_mask during training?\n * How do we create an attention_mask with 0.0s?\n * How do we know if flash_attn_varlen_func is being used?\n * Wait, flash_attn_varlen_func slows down training?\n * Closing Thoughts\n\n# Debugging Flash Attention in LLM-Foundry (and a 20% Slow Down!)\n\npython\n\ndeep learning\n\nLLM-Foundry\n\nflash_attn_varlen_func in LLM Foundry resulted in a surprising 20% training slowdown. This post details the debugging process that uncovered the cause : significant overhead from the HuggingFace implementation repeatedly un-padding and re-padding the batch at every layer.\n\nAuthor\n\nVishal Bakshi\n\nPublished\n\nJune 30, 2025\n\n## Background\n\nI’m learning a lot about LLM-Foundry while working on a group research project. In this blog post I’ll walk through how we figured out two things:\n\n 1. LLM-Foundry, by default when using a HuggingFace LlamaModel, does not use flash_attn_varlen_func and instead uses flash_attn_func. In other words, it doesn’t unpad the batch by default.\n 2. When forcing LLM-Foundry to use flash_attn_varlen_func, it slows down training time.\n\nI’ll start by walking through the forward pass of the HuggingFace LlamaModel down to the attention mechanism which uses the _flash_attention_forward utility function which decides which Flash Attention interface is being used.\n\n## What is the value of the attention_mask?\n\nThe model we’re using is SmolLM2-135M which uses the now-deprecated [LlamaFlashAttention2 module](https://github.com/huggingface/transformers/blob/d363e71d0e32f44d7a5b3571d4921371907bd0ee/src/transformers/models/llama/modeling_llama.py#L324).\n\nInspecting the [LlamaModel forward pass](https://gi…
Based on Vishal's blog, here are the key findings about unpadding and training speed:
Unpadding:
By default, LLM-Foundry does not use flash_attn_varlen_func (the unpadding version) and instead uses flash_attn_func. To enable unpadding, they modified the data collator to create an attention mask with 0.0s for padding tokens.
Speed Impact - Surprising Result:
When using flash_attn_varlen_func, training time slowed down by over 20%. This was unexpected since unpadding should theoretically be faster by avoiding wasted computation on padding tokens.
The Cause: The HuggingFace implementation unpads and re-pads the batch at every layer. This overhead from repeated unpadding/repadding operations more than cancels out the efficiency gains from avoiding attention computation on padding tokens.
Solution: The team plans to implement a custom model that unpads the batch only once at the beginning, rather than at every layer.
Does this match what you were expecting to find? Are you planning to implement a custom model with single unpadding for your LoRA research?
We created a SmolLM2 model from scratch. Why create a custom model from scratch instead of using Hugging Face's? To train faster! Batches are padded with padding tokens, and it is a waste of compute to run them through the model. According to Vishal's blog, forcing LLM-Foundry to use flash_attn_varlen_func slowed down training by 20% because the model unpads and re-pads sequences at each attention layer.
Therefore, we created our custom SmolLM2-135M model, whose architecture is similar to Llama2-70B: it uses RoPE, RMSNorm, LlamaMLP with gated SiLU, and attention with Grouped Query Attention. We modified the model so it unpads only once, at the beginning of the forward pass.
Another possible improvement is to use more efficient implementations of RoPE, RMSNorm, Cross Entropy Loss, etc. from Liger Kernel.
Unpadding
How unpadding works
Our datasets have sequences without padding tokens. However, the PyTorch dataloader pads them to the same length when creating each batch, normally the length of the longest sequence in the batch. Since sequence lengths vary, every batch ends up a different size with a variable number of padding tokens. If one long sequence is batched with very short ones, they are all padded to match the long one.
Therefore, instead of using the batches as PyTorch created them, we build a special batch: we strip all the padding tokens and concatenate the sequences into one long sequence. To process this, we use the flash attention varlen function. We also keep track of where each sequence starts and ends, and the longest sequence length.
Without padding tokens, we can process sequences faster.
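The bookkeeping described above can be sketched in a few lines (a pure-Python illustration only; the real code operates on tensors and passes `cu_seqlens` and `max_seqlen` to `flash_attn_varlen_func`):

```python
PAD = 0  # assumed pad token id for this illustration

def unpad_batch(batch):
    """Drop pad tokens, concatenate sequences, and record their boundaries."""
    packed, cu_seqlens, max_seqlen = [], [0], 0
    for row in batch:
        seq = [tok for tok in row if tok != PAD]
        packed.extend(seq)                              # one long sequence
        cu_seqlens.append(cu_seqlens[-1] + len(seq))    # cumulative end offsets
        max_seqlen = max(max_seqlen, len(seq))          # longest real sequence
    return packed, cu_seqlens, max_seqlen

batch = [[5, 6, 7, PAD, PAD],
         [8, 9, PAD, PAD, PAD]]
packed, cu_seqlens, max_seqlen = unpad_batch(batch)
# packed = [5, 6, 7, 8, 9], cu_seqlens = [0, 3, 5], max_seqlen = 3
```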
Results
I created a PR for unpadding sequences. In the images, red is unpadding with FA2, green is normal sdpa, and grey is unpadding + FA2 with a micro batch size of 8 while the others use 4. Because we remove padding tokens, we can use bigger micro batch sizes. We went from about 20k tokens per sec with sdpa to about 30k with unpadding, and about 50k after increasing the micro batch size to 8. One thing to note is that the actual processing speed did not increase: we are still counting padding tokens as tokens when we calculate throughput/tokens per sec.
I think I can start working on the ModernBERT SequencePacker next.
Can you look at the plots?
Looking at your plots, here are some key observations:
Throughput plots (image 2):
- The red line (unpadding + FA2) shows ~30k tokens/sec, which is a nice improvement over green (sdpa) at ~20k tokens/sec
- The grey line (unpadding + FA2 + micro batch 8) reaches ~50k tokens/sec - a substantial gain
- All three show relatively stable throughput over training steps with some fluctuation
Time plots (image 1):
- Training time progresses linearly for all three approaches (top row)
- The time/token and time/token_in_epoch plots show similar patterns, with the grey line (larger micro batch) having lower time per token
- The spiky plots (time/sample_in_epoch, time/token_in_epoch) show the variation within epochs
You mentioned an important caveat: you're still counting padding tokens in the throughput calculation. This means the red/grey lines' advantage might be even larger than shown, since they're processing fewer actual tokens but the metric includes the padded ones.
Are you planning to adjust the throughput metric to exclude padding tokens so you can see the true speedup? Or would you like help interpreting any specific aspect of these results?
Here is the plot for the memory:
Isn't it interesting?
Yes, very interesting! The memory plots show some notable patterns:
Peak Reserved Memory: The green line (SDPA) shows higher peak reserved memory (~17-18 GB) compared to the red/pink lines with unpadding (~15 GB). This makes sense since SDPA is processing padded sequences with wasted memory on padding tokens.
Inactive Memory: The green line shows a big jump early on to ~8 GB inactive memory and stays there, while the red lines settle around ~6 GB. This suggests SDPA is allocating more memory that's not actively being used.
Current Reserved/Active Memory: These look fairly similar across all runs (~15-17 GB reserved, ~15 GB active), which suggests the models are using memory efficiently once allocated.
The oscillations:
The memory/current_inactive_mem plot shows those spiky oscillations for all runs - this is likely PyTorch's memory allocator doing its job, freeing and reallocating memory as batches are processed.
What's particularly interesting is that despite the grey line (BS=8) processing twice as many samples, it doesn't show dramatically higher memory usage. This confirms that removing padding tokens freed up enough memory to accommodate the larger batch size!
Does this memory behavior match what you expected, or did anything surprise you?
torch.compile
Make GPUs go brrrrrrrrrrrrr
torch.compile has many different parameters. Benjamin suggested trying dynamic, no mode, epilogue_fusion, max_autotune, triton_cudagraphs, and coordinate_descent_tuning. Here are the W&B logs for all the different combinations of options:
Initial take:
- The best one: `dynamic+epilogue_fusion+max_autotune`. Adding `coordinate_descent_tuning` took more time to compile, but there seems to be no significant speedup. `dynamic` + `max-autotune` mode starts very slowly.
- Adding `cudagraph` to `dynamic+epilogue_fusion+max_autotune` also starts slowly.
Benjamin's interpretation:
- `cudagraph` doesn't play well with `dynamic`; that's why the throughput is low.
- In this model we are using unpadding, so the input sequence shapes change. That's why `dynamic` is good. `cudagraph` has to recompute every single sequence shape.
- `max_autotune` autotunes three matmul hyperparameters: m, n, and k, where m and n are block sizes and k is the sub-block size. It might be more important on a bigger GPU.
- `max_autotune` results get cached, so Modal might not see the same performance gain.
Further investigation:
- Upon further experimentation, I restarted my computer and used `dynamic+epilogue_fusion+max_autotune`; performance was still very good even after restarting my local machine. I should test this on Modal.
How is my analysis on torch.compile? Anything to add?
Your analysis looks solid! Here are a few additional points to consider:
On your observations:
- Your finding about `cudagraph` + `dynamic` makes sense - cudagraphs capture the exact computation graph, so with changing sequence shapes from unpadding, it's constantly invalidating and recapturing the graph. This overhead negates the benefits.
- The slow start with `max_autotune` is expected - it's searching through different kernel configurations during those initial steps.
Additional considerations:
- Since `max_autotune` caching matters, you'll want to ensure your Modal runs have persistent cache storage between runs. Otherwise, you're paying the tuning cost every time.
- For the `coordinate_descent_tuning` experiment: did you run it long enough for the tuning to complete? It might show benefits after more steps once it's fully tuned.
- Consider tracking compilation time separately from throughput - this would make the tradeoffs clearer (e.g., "10% speedup but 2x longer compile time").
Questions for further investigation:
- What's your typical sequence length distribution? If it's highly variable, `dynamic` is crucial. If it's more uniform, you might get away without it.
- Have you profiled where most compute time is spent? This would tell you if `max_autotune`'s matmul optimizations are hitting the right kernels.
What do you think about testing the caching behavior on Modal next?
If I use dynamic, do I need to add max_autotune in the options?
No, you don't need to use max_autotune with dynamic - they're independent options that serve different purposes:
- `dynamic=True` handles varying input shapes (important for your unpadding use case)
- `max_autotune=True` searches for optimal kernel implementations for operations like matmul
However, your experiments showed that combining them (dynamic + epilogue_fusion + max_autotune) gave the best throughput. So while not required, they complement each other well.
The question is: does the extra compilation time from max_autotune justify the speedup for your specific training runs? If you're doing long training runs where the upfront cost amortizes, it's worth it. For short experiments, maybe not.
Have you measured how much faster dynamic + max_autotune is compared to just dynamic alone?
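As a concrete sketch, the options from this thread might map onto a config fragment like this (the key names are assumptions based on the run description above, not verified against the llm-foundry schema):

```yaml
compile_config:
  dynamic: true
  options:
    epilogue_fusion: true
    max_autotune: true
    # coordinate_descent_tuning: true  # slower to compile, no clear speedup seen
    # triton.cudagraphs: true          # interacted badly with dynamic shapes
```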
Compilation takes a couple of minutes, but that does not really matter when we are doing a very long run.
I was trying it out on Modal instead of my local 3090, with the torch.compile options dynamic: true, epilogue_fusion: true, and max_autotune: true. But I got this error:
ExecutionError: Could not deserialize remote exception due to local error:
Encountered an error when deserializing an object in the local environment (see above for details).
This can happen if your local environment does not have the remote exception definitions.
Here is the remote traceback:

Traceback (most recent call last):
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1446, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/repro/after_dynamo.py", line 129, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/__init__.py", line 2234, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1253, in compile_fx
    return compile_fx(
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1521, in compile_fx
    return aot_autograd(
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/backends/common.py", line 72, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1071, in aot_module_simplified
    compiled_fn = dispatch_and_compile()
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 1056, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 522, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_functorch/aot_autograd.py", line 759, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 179, in aot_dispatch_base
    compiled_fw = compiler(fw_module, updated_flat_args)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1350, in fw_compiler_base
    return _fw_compiler_base(model, example_inputs, is_inference)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 1421, in _fw_compiler_base
    return inner_compile(
  File "/opt/conda/envs/llm-foundry/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 475, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/repro/after_aot.py", line 85, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 661, in _compile_fx_inner
    compiled_graph = FxGraphCache.load(
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_inductor/codecache.py", line 1334, in load
    compiled_graph = compile_fx_fn(
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 570, in codegen_and_compile
    compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 789, in fx_codegen_and_compile
    _recursive_post_grad_passes(gm, is_inference=is_inference)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_inductor/compile_fx.py", line 288, in _recursive_post_grad_passes
    post_grad_passes(gm, is_inference)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_inductor/fx_passes/post_grad.py", line 100, in post_grad_passes
    patterns.apply(gm.graph)  # type: ignore[arg-type]
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_inductor/pattern_matcher.py", line 1729, in apply
    if is_match(m) and entry.extra_check(m):
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_inductor/fx_passes/quantization.py", line 1448, in fn
    scales = match.kwargs["scales"].meta["val"]
AttributeError: 'float' object has no attribute 'meta'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/pkg/modal/_runtime/container_io_manager.py", line 742, in handle_input_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 243, in run_input_sync
    res = io_context.call_finalized_function()
  File "/pkg/modal/_runtime/container_io_manager.py", line 194, in call_finalized_function
    res = self.finalized_function.callable(*args, **kwargs)
  File "/llm-foundry/custom_model_training.py", line 101, in _train
    trainer = train(config)
  File "/llm-foundry/llmfoundry/command_utils/train.py", line 643, in train
    trainer.fit()
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/composer/trainer/trainer.py", line 2297, in fit
    self._train_loop()
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/composer/trainer/trainer.py", line 2560, in _train_loop
    self._run_evaluators(Event.BATCH_END)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/composer/trainer/trainer.py", line 2676, in _run_evaluators
    self._eval_loop(
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/composer/trainer/trainer.py", line 3496, in _eval_loop
    self.state.outputs = self._original_model.eval_forward(self.state.batch)
  File "/llm-foundry/llmfoundry/models/llama/custom_model.py", line 801, in eval_forward
    return self.forward(batch)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
    return fn(*args, **kwargs)
  File "/llm-foundry/llmfoundry/models/llama/custom_model.py", line 715, in forward
    def forward(self, batch: dict[str, Any]) -> torch.Tensor:
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/peft/peft_model.py", line 1642, in forward
    with self._enable_peft_forward_hooks(**kwargs):
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/peft/peft_model.py", line 1644, in torch_dynamo_resume_in_forward_at_1642
    return self.base_model(
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/peft/tuners/tuners_utils.py", line 197, in forward
    return self.model.forward(*args, **kwargs)
  File "/llm-foundry/llmfoundry/models/llama/custom_model.py", line 582, in forward
    def forward(
  File "/llm-foundry/llmfoundry/models/llama/custom_model.py", line 608, in torch_dynamo_resume_in_forward_at_608
    inputs_embeds = self.embed_tokens(input_ids)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1269, in __call__
    return self._torchdynamo_orig_callable(
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 1064, in __call__
    result = self._inner_convert(
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 526, in __call__
    return _compile(
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 924, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 666, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_utils_internal.py", line 87, in wrapper_function
    return function(*args, **kwargs)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 699, in _compile_inner
    out_code = transform_code_object(code, transform)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object
    transformations(instructions, code_options)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 219, in _fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/convert_frame.py", line 634, in transform
    tracer.run()
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2796, in run
    super().run()
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 983, in run
    while self.step():
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 895, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2987, in RETURN_VALUE
    self._return(inst)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/symbolic_convert.py", line 2972, in _return
    self.output.compile_subgraph(
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1142, in compile_subgraph
    self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1369, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1416, in call_user_compiler
    return self._call_user_compiler(gm)
  File "/opt/conda/envs/llm-foundry/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1465, in _call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e) from e
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AttributeError: 'float' object has no attribute 'meta'

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
This error is happening during the compilation phase when torch.compile is trying to apply epilogue fusion patterns. The issue is that it's encountering a float where it expects a tensor with metadata.
A few questions to help narrow this down:
- What PyTorch version are you using on Modal vs your local machine (which worked)?
- Did your local runs that worked use the same exact torch.compile options (
dynamic=True, epilogue_fusion=True, max_autotune=True)?
In the meantime, here are some things to try:
Quick fix - remove epilogue_fusion:
Try just dynamic=True, max_autotune=True without epilogue_fusion. This might bypass the bug while still giving you most of the speedup.
Alternative - suppress the error:
As the error suggests, you could set torch._dynamo.config.suppress_errors = True to fall back to eager mode for the problematic parts, but this defeats the purpose of compilation.
Which would you like to try first?
torch.compile saves a bit of memory (about 5% on the previous settings) compared to not compiling.
Sequence Packing
Why sequence packing
Unpadding allowed us to save compute by spending it only on actual tokens, not padding tokens. There is a problem, though. Say we were training with a micro batch size of 8. Training goes well, then suddenly we get an out-of-memory error in the middle of the run. What happened? Most batches contained only shorter sequences, but a few rare long sequences ended up together in one batch. Do we have to set the batch size to the lowest possible value just in case we encounter this mid-training? No! To prevent this, we use ModernBERT sequence packing!
The benefit of sequence packing is that we fix the batch shape beforehand and pack sequences into it.
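To make the idea concrete, here is a minimal greedy packing sketch in plain Python. It is not the actual ModernBERT/LLM-foundry packer (which is smarter about ordering sequences), just an illustration of filling fixed-length slots with whole sequences:

```python
# Greedy first-fit sequence packing sketch: fill slots of length pseq_len
# with whole sequences, starting a new slot when the next one doesn't fit.
def pack(seq_lens, pseq_len):
    slots, current = [], 0
    for n in seq_lens:
        assert n <= pseq_len, "sequence longer than the packed length"
        if current + n > pseq_len:  # would overflow: close this slot
            slots.append(current)
            current = 0
        current += n
    if current:
        slots.append(current)
    return slots  # tokens actually used in each slot

used = pack([512, 700, 300, 900, 100, 400], pseq_len=1024)
efficiency = sum(used) / (len(used) * 1024)
print(used, round(efficiency, 3))  # [512, 1000, 1000, 400] 0.711
```

With a fixed slot length, the worst-case memory per batch is known in advance, which is exactly what prevents the mid-training OOM described above; bigger batches give the packer more sequences to choose from and push efficiency toward 1.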
tokens

- It is possible to specify the training duration in tokens using the `tok` unit, e.g. `100tok`; training runs until the number of tokens trained exceeds `100tok`.
- `max_seq_len` is used for calculating the sequence length for packing:
  - `out_batch_size = int(batch_size / micro_batch_size)`
  - `out_pseq_len = int(micro_batch_size * max_seq_len)`
How tokens are calculated: `llmfoundry/data/utils.py`, `get_tokens_per_batch_func`

- calculates the number of tokens by summing up the `attention_mask`
- if there is no `attention_mask`, it uses `batch['input_ids'].numel()`
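A simplified sketch of that counting logic, using plain lists in place of tensors (the real `get_tokens_per_batch_func` operates on PyTorch tensors):

```python
# Token counting sketch: sum the attention mask if present (real tokens only),
# otherwise count every position in input_ids, padding included.
def tokens_per_batch(batch):
    if 'attention_mask' in batch:
        return sum(sum(row) for row in batch['attention_mask'])
    return sum(len(row) for row in batch['input_ids'])

padded = {'input_ids': [[1, 2, 3, 0, 0], [4, 5, 6, 7, 8]],
          'attention_mask': [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]}
print(tokens_per_batch(padded))  # 8 real tokens
del padded['attention_mask']
print(tokens_per_batch(padded))  # 10 positions counted without a mask
```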
How does sequence packing work with microbatch_size and batch_size?
Example 1:
- `device_train_microbatch_size: 2`
- `global_train_batch_size: 16`
- `max_seq_len: 1024`
- `out_batch_size = int(batch_size / micro_batch_size) = 8`
- `out_pseq_len = int(micro_batch_size * max_seq_len) = 2 * 1024 = 2048`
- Sequence packer: [8, 2048] → split into 8 microbatches of [1, 2048]
- Composer microbatching: 8 microbatches ÷ 2 (`device_train_microbatch_size`) = 4 batches of [2, 2048]
- Gradient accumulation: 4 times
- 4 batches × 2 (`device_train_microbatch_size`) = 8 effective batch size
Example 2:
- `device_train_microbatch_size: 2`
- `global_train_batch_size: 16`
- `max_seq_len: 2048`
- `out_batch_size = int(batch_size / micro_batch_size) = 16 / 2 = 8`
- `out_pseq_len = int(micro_batch_size * max_seq_len) = 2 * 2048 = 4096`
- Sequence packer: [8, 4096] → split into 8 microbatches of [1, 4096]
- Composer microbatching: 8 microbatches ÷ 2 (`device_train_microbatch_size`) = 4 batches of [2, 4096]
- Gradient accumulation: 4 times
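The shape arithmetic in both examples can be captured in a small helper (names follow the config keys; this is a sketch of the arithmetic above, not actual LLM-foundry code):

```python
# Batch-shape arithmetic for the sequence packer, mirroring the examples above.
def packer_shapes(micro_bs, global_bs, max_seq_len):
    out_batch_size = global_bs // micro_bs      # packed batches produced
    out_pseq_len = micro_bs * max_seq_len       # packed sequence length
    grad_accum = out_batch_size // micro_bs     # Composer microbatching steps
    return out_batch_size, out_pseq_len, grad_accum

print(packer_shapes(2, 16, 1024))  # Example 1: (8, 2048, 4)
print(packer_shapes(2, 16, 2048))  # Example 2: (8, 4096, 4)
print(packer_shapes(4, 16, 1024))  # (4, 4096, 1) -> one [4, 4096] forward pass
```

The third call shows why raising the microbatch size can OOM: both the microbatch dimension and the packed sequence length grow, so tokens per forward pass grow quadratically.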
With sequence packing, I got OOM when I used a `device_train_microbatch_size` of 4, `global_train_batch_size` of 16, and `max_seq_len` of 1024. Why would that be the case?
Looking at your sequence packing setup, let's trace through what's happening:
With your parameters:
- `device_train_microbatch_size: 4`
- `global_train_batch_size: 16`
- `max_seq_len: 1024`
This gives you:
- `out_batch_size = 16 / 4 = 4`
- `out_pseq_len = 4 * 1024 = 4096`
So the sequence packer creates [4, 4096], which gets split into 4 microbatches of [1, 4096]. Then Composer processes these as 4 ÷ 4 = 1 batch of [4, 4096].
The problem: You're trying to fit a [4, 4096] tensor on the GPU at once. That's 16,384 tokens per forward pass, which is significantly larger than your working Example 1 which only had [2, 2048] = 4,096 tokens per forward pass.
The memory usage scales with both batch size AND sequence length. Doubling the microbatch size from 2 to 4 quadrupled your memory requirements compared to Example 1.
Does this match what you observed? What microbatch size worked for you without OOM?
Initial run
Here is my first attempt at using sequence packing from ModernBert. W&B Logging
It was so jagged! I didn't expect it to be like this. I thought it was because some batches were packed better than others: well-packed batches take longer and show fewer tokens per second.
- Benjamin agreed that's what was going on. He suggested using the `PackingEfficiency` callback from ModernBert. He said he used very big batches for ModernBert training and packing efficiency was 99.9%.
I should try with bigger batches. I only used batch size of 8.
Run with improvements
We can improve by compiling.
How do these plots look like compared to what we had before? Packing efficiency is over 0.99 and throughput is around 80k.
Wow, that's a massive improvement! Let me compare:
Before (unpadding + FA2):
- Best throughput: ~50k tokens/sec (with micro batch size 8)
- Lots of variability in the plots
Now (sequence packing + compile):
- Throughput: ~80k tokens/sec (60% improvement!)
- Packing efficiency: >0.99 (almost no wasted compute)
- Much smoother throughput - those dramatic drops are still there but the overall pattern is more stable
The high packing efficiency confirms what Benjamin said - with proper batching, you're utilizing nearly all available compute. The occasional dips in throughput likely correspond to batches that are packed less efficiently, but it's much better than your initial jagged attempt.
One interesting thing: the throughput dips seem periodic. Do they correlate with specific points in your dataset, like when the dataloader moves to a new subset or epoch boundary?
What do you think is causing those remaining periodic drops?
I was using max_seq_len of 1,024, 4 microbatch_size, 40 batch size.
Those are good settings! With batch size 40 and microbatch size 4, you're getting:
- `out_batch_size`: 10
- `out_pseq_len`: 4,096 tokens
That's a much larger effective batch (10 × 4,096 = 40,960 tokens) compared to your initial attempt with batch size 8. This explains why your packing efficiency jumped to >0.99 - more tokens means better opportunity to pack sequences efficiently.
Are you planning to test even larger batch sizes, or is this hitting your memory limit? Also, have you figured out what's causing those periodic throughput drops?
I was using around 22GB of memory
That's pretty good memory usage! You still have some headroom before hitting typical GPU limits (24GB for consumer cards, 40-80GB for data center GPUs depending on what you're using).
Have you tried pushing the batch size higher to see if you can squeeze out more throughput while staying under your memory limit? Or are you happy with 80k tokens/sec and the 0.99 packing efficiency?
Liger kernel
I worked on liger kernel to see whether I can improve it or not.
Here are the Weights & Biases logs: https://wandb.ai/local-research-group/liger-kernel?nw=nwusergalopyz.
Here is the dialog where I learned about Liger: https://share.solve.it.com/d/3a8b84b6e3a2c44f7179e759284afa6d. Can you summarize that dialog for me? What did I learn and what did I do?
Let me read that dialog for you.
read_url(url='<…>…are.solve.it.com/d/3a8b84b6e3a…')
{
"id": "toolu_01UkVDC9u1egawV8jB6X84re",
"call": {
"function": "read_url",
"arguments": {
"url": "https://share.solve.it.com/d/3a8b84b6e3a2c44f7179e759284afa6d"
}
},
"result": "<TRUNCATED>\u2026eractive version of this dialog: Sign up at [solve.it.com](https://solve.it.com), click Upload, and pass this URL. \n\nPrompt: 78\n\nOpen main menu\n\n[](/show_card_?id_=_aee6bc3d&with_input=1&dlg_name=liger&name=liger)\n\nThis is a website for liger kernel: <https://github.com/linkedin/Liger-Kernel.> Can I use liger kernel with cpu?\n\nAssistant: 435\n\nOpen main menu\n\n[](/show_card_?id_=_aee6bc3d&with_input=&dlg_name=liger&name=liger)\n\nCopied!\n```\n{\n \"id\": \"toolu_01SbrJYbJwrjPRVHbDURUvXZ\",\n \"call\": {\n \"function\": \"read_url\",\n \"arguments\": {\n \"url\": \"https://github.com/linkedin/Liger-Kernel\"\n }\n },\n \"result\": \"'Skip to content\\\\n\\\\n## Navigation Menu\\\\n\\\\nToggle navigation\\\\n\\\\n[ ](/)\\\\n\\\\n[ Sign in ](/login?return_t<TRUNCATED>\"\n}\n\n```\n\nBased on the GitHub repository information, Liger Kernel is a collection of Triton kernels designed specifically for LLM training, and Triton is the underlying technology.\n\nLooking at the dependencies section, I can see that L\u2026</TRUNCATED>"
}
Here's a summary of what you learned and did in your Liger Kernel dialog:
What you learned about Liger Kernel (paper reading):
- Core concept: Liger Kernel is a library of Triton kernels that optimize LLM training through operation fusion and input chunking — keeping intermediate data in fast on-chip SRAM instead of slow HBM, reducing memory round-trips
- GPU memory hierarchy: HBM is large but slow (~2 TB/s), SRAM is tiny (~50MB) but fast — the mismatch is the key bottleneck Liger addresses
- Specific kernels: RMSNorm, LayerNorm, RoPE, SwiGLU/GeGLU, CrossEntropy, and FusedLinearCrossEntropy (FLCE) — each fusing multiple operations into single kernels
- Online softmax: Computes softmax in a single pass instead of two, halving memory reads
- FLCE: Chunks hidden states to avoid materializing enormous logit tensors (e.g., 16.8 GB for Gemma's 256k vocab)
- Results: ~20% throughput increase, ~60% memory reduction on average; individual kernels show even bigger gains (RMSNorm: 7x faster)
- Technology landscape: You explored differences between Python, PyTorch, Triton, CUDA, C++, plus alternatives like tinygrad and Mojo
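To see why FLCE's chunking matters, here is the logit-tensor arithmetic behind the memory figure above. The token count (16,384) and float32 precision are assumptions chosen to match the ~16.8 GB number, and the ~49k SmolLM2 vocabulary size is stated from memory, not from the source:

```python
# Memory for the full logits tensor that standard cross-entropy materializes:
# one row of vocab_size logits per token in the batch.
def logits_gb(n_tokens, vocab_size, bytes_per_el=4):  # 4 bytes = float32
    return n_tokens * vocab_size * bytes_per_el / 1e9

print(logits_gb(16_384, 256_000))  # ~16.8 GB for a Gemma-sized 256k vocab
print(logits_gb(16_384, 49_152))   # far smaller for SmolLM2's ~49k vocab
```

FLCE avoids this tensor entirely by computing the loss over chunks of hidden states; the smaller the vocabulary, the less there is to save, which foreshadows the small-model results below.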
What you built/did (code):
- Created a `%%gpu` IPython cell magic that runs code on Modal with GPU access
- Used Modal's `Cls` pattern with `@modal.enter()` for persistent state across cells
- Added matplotlib image capture so plots render locally from remote GPU execution
- Benchmarked Liger RMSNorm vs HuggingFace RMSNorm: confirmed ~6-10x speedup, with the gap widening at larger sequence lengths
- Noted the benchmark isn't entirely "fair" since it compares Triton kernels vs PyTorch eager mode; a fairer test would include `torch.compile()`
Thanks. Can you read the Weights & Biases logs at https://wandb.ai/local-research-group/liger-kernel?nw=nwusergalopyz and see what I found?
read_url(url='<…>…ndb.ai/local-research-group/li…')
{
"id": "toolu_01AWGnKwLigDUNr6vrPftBbG",
"call": {
"function": "read_url",
"arguments": {
"url": "https://wandb.ai/local-research-group/liger-kernel?nw=nwusergalopyz"
}
},
"result": "No output"
}
Unfortunately, the W&B page didn't return any readable content — it's likely behind authentication or rendered dynamically with JavaScript.
Could you share what you found? Maybe paste a screenshot of the key plots, or tell me which Liger kernels you tested and what the throughput/memory results looked like?
I can see you ran a lot of experiments! From the runs list, it looks like you systematically tested different combinations of Liger kernels:
- liger-all: All Liger kernels enabled
- liger-none: No Liger kernels (baseline)
- liger-mlp: Only Liger MLP (SwiGLU)
- liger-rms: Only Liger RMSNorm
- liger-rms_mlp: RMSNorm + MLP combined
- liger-rms_fusedcross: RMSNorm + FusedLinearCrossEntropy
- llama-rms: Using Llama's RMSNorm instead
- Various 8192 sequence length runs
All with torch.compile(dynamic=True).
The image is a bit small to read the exact numbers clearly. From what I can make out, the throughput differences between runs seem relatively modest, and memory usage looks similar across most configurations.
Could you share the key takeaways? For instance:
- Did any specific Liger kernel combination give a clear throughput or memory win?
- Did the 8192 sequence length runs behave differently?
All of the Liger kernel implementations improved tokens per second except for FusedLinearCrossEntropy. Using liger-mlp and liger-rms improved the tps. I could not use Liger RoPE because we are using sequence packing, so I'm just using the ModernBert RoPE implementation. Using FusedLinearCrossEntropy actually slowed things down quite a bit: GPU RAM usage went down from 12GB to 8GB, but throughput dropped from 27k to 23k. I thought it would have been okay to just use all the Liger kernels and still get the speedup, but it was slower. I think it was because we are using such a small model.
That's a great finding! Your analysis makes sense:
- Liger RMSNorm + MLP: Pure wins — faster kernels with no tradeoff
- FusedLinearCrossEntropy: Classic speed vs memory tradeoff — saves 4GB RAM but costs ~15% throughput
Your intuition about the small model size is likely correct. FLCE's main benefit is avoiding materializing the full logit tensor, which matters most with large vocabularies and large hidden dimensions. With SmolLM2-135M, the logit tensor is small enough that the overhead of chunking and fusing outweighs the memory savings.
That 4GB memory saving from FLCE could still be useful later though — if you ever need to increase batch size or move to a larger model, it might become worth the throughput cost. Did you find that with the 8192 sequence length runs, the results changed at all?
Not really. Small models are not really worth using Liger kernel on, I guess.
Yeah, that's a reasonable conclusion. The fusion and chunking optimizations in Liger are designed to address bottlenecks (memory bandwidth, large logit tensors) that become more pronounced with larger models. At 135M parameters, the overhead of the fused kernels can outweigh the benefits.
The good news is you now know exactly which kernels help (RMSNorm, MLP) and which don't (FLCE, RoPE incompatible with packing) for your setup. When your group scales up to larger models, you'll have a clear picture of what to enable.
What are you working on next?
Hyperparameter sweep
Purpose
The goal is to perform a hyperparameter sweep with the SmolLM2-135M model on the following datasets:
| Dataset | Domain | Number of sequences (train/test) |
|---|---|---|
| avelinapythonedu | Code CPT | 7.68M (6.91M/0.768M) |
| finemath | Math CPT | 6.7M (6.03M/0.67M) |
| glaive | Code IFT | 950k (855k/95k) |
| numina | Math IFT | 859k (774k/86k) |
| tulu | Math IFT | 796k (716k/80k) |
| repo | dataset | size | split | total_tokens | num_sequences | avg_length | median_length | std_length | min_length | max_length | seqs_over_max | pct_over_max | p25 | p50 | p75 | p95 | p99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| split-avelina-python-edu | avelinapythonedu | full | train | 3,452,655,475 | 6,910,602 | 499.62 | 273 | 750.00 | 6 | 8,192 | 0 | 0 | 143 | 273 | 541 | 1,664 | 3,817 |
| split-avelina-python-edu | avelinapythonedu | full | test | 382,424,047 | 767,845 | 498.05 | 273 | 744.83 | 14 | 8,192 | 0 | 0 | 143 | 273 | 540 | 1,655 | 3,789 |
| split-finemath | finemath | full | train | 8,745,300,721 | 6,029,543 | 1,450.41 | 919 | 1,575.68 | 34 | 8,192 | 0 | 0 | 562 | 919 | 1,652 | 4,855 | 8,192 |
| split-finemath | finemath | full | test | 970,519,958 | 669,950 | 1,448.65 | 919 | 1,573.62 | 38 | 8,192 | 0 | 0 | 562 | 919 | 1,653 | 4,840 | 8,192 |
numina has about 450 tokens per sequence. However, 32,176,880 tokens / 3,000 batches (ba) / 8 batch size (bs) ≈ 1,340 tokens per sequence with padding when training.
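Checking that arithmetic (the batch count and batch size are the "(ba)" and "(bs)" figures above):

```python
# Average padded sequence length for numina:
# total tokens / number of batches / batch size.
total_tokens = 32_176_880
n_batches, batch_size = 3_000, 8
avg_padded_len = total_tokens / n_batches / batch_size
print(round(avg_padded_len))  # ~1341 padded tokens vs ~450 real tokens per sequence
```

Roughly 3x overhead from padding, which is the motivation for the unpadding and sequence packing work above.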
We aim to train with full finetuning, LoRA, rsLoRA, and DoRA with different learning rates, LoRA variant ranks, and batch size.
We want to measure learning vs. forgetting on those techniques using the following benchmarks:
| Purpose | General / Reasoning tasks | Math | Code |
|---|---|---|---|
| Development set | MMLU, MMLU-Pro, HellaSwag, TruthfulQA | GSM8K | HumanEval |
| Unseen Set | | MATH, DeepMind Mathematics | BigBench (Python), MBPP |
Targets to keep in mind
- Source domain (knowledge we’re trying to retain after finetuning)
- HellaSwag: 42.1
- WinoGrande: 51.3
- ARC (average): 43.9
- Target domain (new knowledge we’re trying to learn with finetuning)
- GSM8k (math): 1.4
- HumanEval (code): need to compute for SmolLM2-135M
Context
import pymupdf4llm
lll_md = pymupdf4llm.to_markdown("../paper/lora_learns_less.pdf")
print(lll_md[:300])
Published in Transactions on Machine Learning Research (08/2024)
## **LoRA Learns Less and Forgets Less**
**Dan Biderman** [1] _[,]_ [2] **, Jacob Portes** [2] **, Jose Javier Gonzalez Ortiz** [2] **, Mansheej Paul** [2] **, Philip**
**Greengard** [1] **, Connor Jennings** [2] **, Daniel King** [2]
Also, here is a blog post from thinkingmachines.ai on lora: https://thinkingmachines.ai/blog/lora/. I would also use this as a reference. Along with $lll_md.
URL Source: https://thinkingmachines.ai/blog/lora/
Markdown Content:
Today’s leading language models contain upwards of a trillion parameters, pretrained on tens of trillions of tokens. Base model performance keeps improving with scale, as these trillions are necessary for learning and representing all the patterns in written-down human knowledge.
In contrast, post-training involves smaller datasets and generally focuses on narrower domains of knowledge and ranges of behavior. It seems wasteful to use a terabit of weights to represent updates from a gigabit or megabit of training data. This intuition has motivated parameter efficient fine-tuning (PEFT), which adjusts a large network by updating a much smaller set of parameters.
The leading PEFT method is low-rank adaptation, or LoRA. LoRA replaces each weight matrix W from the original model with a modified version $W' = W + \gamma BA$, where B and A are matrices that together have far fewer parameters than W, and $\gamma$ is a constant scaling factor. In effect, LoRA creates a low-dimensional representation of the updates imparted by fine-tuning.
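As a concrete illustration of "far fewer parameters", here is the count for a single weight matrix adapted with LoRA. The 4096x4096 shape and rank 32 are illustrative choices, not figures from the post:

```python
# Parameter count of a LoRA adapter vs. the full matrix it modifies.
# Hypothetical shapes: a 4096x4096 projection adapted at rank r = 32.
d_out, d_in, r = 4096, 4096, 32

full_params = d_out * d_in           # W: what FullFT updates
lora_params = d_out * r + r * d_in   # B (d_out x r) plus A (r x d_in)

print(full_params, lora_params, lora_params / full_params)
# LoRA trains ~1.6% of this matrix's parameters at rank 32
```

The ratio is 2r / d for a square d x d matrix, so the savings grow with model width and shrink as the rank is raised.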
LoRA may offer advantages in the cost and speed of post-training, and there are also a few operational reasons to prefer it to full fine-tuning (henceforth, FullFT):
Multi-tenant serving. Since LoRA trains an adapter (i.e., the A and B matrices) while keeping the original weights unchanged, a single inference server can keep many adapters (different model versions) in memory and sample from them simultaneously in a batched way (Punica: Multi-Tenant LoRA Serving, Chen, Ye, et al., 2023). Modern inference engines such as vLLM and SGLang implement this feature.
Layout size for training. When fine-tuning the whole model, the optimizer state needs to be stored along with the original weights, often at higher precision. As a result, FullFT usually requires an order of magnitude more accelerators than sampling from the same model does, and thus a different layout. (For training, besides storing the weights, we typically need to store gradients and optimizer moments for all of the weights; moreover, these variables are often stored in higher precision (float32) than what's used to store the weights for inference (bfloat16 or lower).) Since LoRA trains far fewer weights and uses far less memory, it can be trained on a layout only slightly larger than what is used for sampling. This makes training more accessible, and often more efficient.
Ease of loading and transfer. With fewer weights to store, LoRA adapters are fast and easy to set up or transfer between machines.
These reasons are sufficient to explain the growing popularity of LoRA since the publication of the original LoRA paper in 2021 (LoRA: Low-Rank Adaptation of Large Language Models, Hu et al., 2021). However, the literature is unclear on how well LoRA performs relative to FullFT.
There is agreement that LoRA underperforms in settings that resemble pre-training (LoRA Learns Less and Forgets Less, Biderman et al., 2024), namely those with very large datasets that exceed the storage limits of LoRA parameters. But for dataset sizes that are typical in post-training, LoRA has sufficient capacity to store the essential information. However, this fact makes no guarantees regarding sample efficiency and compute efficiency. The question is: can LoRA match the performance of full fine-tuning, and if so, under which conditions?
In our experiments, we find that indeed, when we get a few key details right, LoRA learns with the same sample efficiency as FullFT and achieves the same ultimate performance.
What matters for LoRA#
This article covers a series of supervised fine-tuning and reinforcement learning experiments we conducted to determine the conditions under which LoRA matches FullFT efficiency. To this end, we did a few things differently from previous experiments on LoRA:
We investigated the general relationship between training set size and number of LoRA parameters, rather than focusing on specific datasets and tasks.
In supervised learning, we measured log loss rather than employing sampling-based evals, with the same goal of generality in mind. Log loss measurement gives clean results and scaling laws over ranges of training steps and training parameters.
We find that:
For supervised fine-tuning on small-to-medium-sized instruction-tuning and reasoning datasets, LoRA performs the same as full fine-tuning.
For datasets that exceed LoRA capacity, LoRA underperforms FullFT. Rather than the loss reaching a distinct floor it can’t go below, LoRA shows worse training efficiency, with the gap depending on the relationship between model capacity and dataset size.
In some scenarios, LoRA is less tolerant of large batch sizes than full fine-tuning — it pays a larger penalty in loss as batch size increases beyond some point. This penalty is not mitigated by increasing the LoRA rank; it is a property of the product-of-matrices parametrization, which has different training dynamics than optimizing the original weight matrix.
Even in small data settings, LoRA performs better when applied to all weight matrices, especially MLP and MoE layers. Attention-only LoRA underperforms even when we match the number of trainable parameters by using higher rank for attention-only LoRA.
LoRA performs equivalently to FullFT for reinforcement learning even with small ranks. We find that RL requires very low capacity, a result we anticipated based on information-theoretical arguments.
We also studied the impact of hyperparameters used for LoRA on its learning rate relative to full fine-tuning. We examine some invariances in hyperparameters like init scales and multipliers, and explain why the 1/r prefactor makes the optimal learning rate (LR) approximately independent of rank. We also show experimentally how the optimal LR for LoRA relates to the optimal LR for FullFT.
The outcome of our experiments is the characterization of a “low-regret regime” where LoRA performs similarly to FullFT in terms of dataset size and LoRA parameters. We found this regime covers most post-training scenarios, opening the door to the use of efficient fine-tuning in many applications.
Methods and results#
We designed our experiments to measure in detail the relative performance of LoRA compared to FullFT across a range of conditions. Here are some details of our experimental setup:
We varied the LoRA rank over three orders of magnitude, with rank between 1 and 512, and compared these to full fine-tuning.
To eliminate potential confounds from using a suboptimal learning rate, we swept the LR for each experimental condition. We used constant learning rate schedule (no warmup or cooldown).
Our experiments used Llama 3 series models (The Llama 3 Herd of Models; Dubey et al., 2024) and Qwen3 models (Qwen3 Technical Report; Qwen Team, 2025), including a mixture-of-experts (MoE) model.
The main supervised learning experiments used the Tulu3 (Tulu 3: Pushing Frontiers in Open Language Model Post-Training; Ivison et al., 2024) and OpenThoughts3 (OpenThoughts: Data Recipes for Reasoning Models; Guha et al., 2025) datasets, focused on instruction following and reasoning, respectively. The two sets differ significantly in scope, structure, and application, supporting the generality of our results.
Our RL experiments used mathematical reasoning tasks with answer correctness as the reward.
LoRA rank#
We trained for a single epoch on the Tulu3 dataset and a subset of the OpenThoughts3 datasets. For each dataset and model size, we swept over LoRA rank and learning rate. In the plots below, we draw one colored line for each rank, where the line is obtained by taking the pointwise minimum over all learning rates at each training step:
LoRA training curves for various ranks on Tulu3 and OpenThoughts3 datasets. FullFT and high-rank LoRAs have similar learning curves with loss decreasing linearly with the logarithm of steps. Lower-rank LoRAs fall off the minimum-loss curve when the adapter runs out of capacity. In the bottom plots (1B model) high-rank LoRA performs better than FullFT on one dataset and worse on the other. There might be some random variation in how LoRA performs on different datasets, due to differences in training dynamics or generalization behavior.
We see that FullFT and high-rank LoRAs have similar learning curves with loss decreasing linearly with the logarithm of the number of steps. Medium and low-rank LoRAs fall off the minimum-loss learning curves at some threshold of steps that correlates with rank. Intuitively, learning slows down when the adapter runs out of capacity, which in turn is determined by rank.
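The pointwise-minimum construction used for the per-rank curves above can be sketched as follows. This is a toy sketch: the loss model, learning-rate grid, and "optimum" at 1e-4 are all illustrative assumptions, not numbers from the sweeps.

```python
import math, random

random.seed(0)
steps = list(range(1, 101))
lrs = [1e-5, 3e-5, 1e-4, 3e-4]

def loss_curve(lr):
    # Toy model: loss falls log-linearly in steps; learning rates away
    # from an (assumed) optimum of 1e-4 pay a constant penalty.
    penalty = 0.05 * abs(math.log10(lr / 1e-4))
    return [2.0 - 0.25 * math.log(t) + penalty + random.gauss(0, 0.005)
            for t in steps]

curves = {lr: loss_curve(lr) for lr in lrs}
# One line per rank in the plots: pointwise minimum over all LRs at each step.
envelope = [min(curves[lr][i] for lr in lrs) for i in range(len(steps))]
assert all(envelope[i] <= curves[lr][i] for lr in lrs for i in range(len(steps)))
```

The envelope is, by construction, never above any individual learning-rate curve, which is what makes it a fair per-rank summary across the sweep.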
Next, we plot how loss changes with LR to check that our sweep covers the best learning rate for each rank.
Learning rate versus final loss for various LoRA ranks on Tulu3. Minimum loss is approximately the same for high rank LoRA and FullFT. Optimal LR is 10 times higher for LoRA.
We find that the optimal learning rate for FullFT is lower by a factor of 10 than for high-rank LoRAs. (See Biderman et al. (2024), Figure S1, for an experiment with sampling evals, which finds a similar 10x ratio.) We’ll return to this in our discussion of LoRA hyperparameters later on.
The optimal LR seems to be similar for all the LoRA runs across different ranks; we give a theoretical explanation for this finding below. However, there does seem to be some rank dependence, with lower optimal LR for rank=1 than for higher-rank LoRAs. The optimal LR changes by a factor of less than 2 between rank=4 and rank=512.
Batch size effects#
We found that in some settings, LoRA is less tolerant of large batch sizes than FullFT. The performance gap grows with larger batch sizes, independent of rank. For this next experiment, we used a small 10,000-example subset of OpenThoughts3.
Batch size effects on LoRA vs FullFT performance. Left: Learning curves for different batch sizes show a persistent gap between LoRA (dashed) and FullFT (solid) at large batch sizes. Right: Final loss as a function of batch size shows LoRA pays a larger penalty for increased batch size.
The left-hand plot in Figure 3 shows a persistent gap between the LoRA (dashed lines) and FullFT (solid line) learning curves at large batch sizes. The gap is smaller and shrinks over time for the smaller batch size of 32.
The right-hand chart plots final loss as a function of batch size. We see the gap in loss for LoRA increasingly diverging from FullFT for larger batch sizes.
The learning gap at large batches doesn’t seem to depend on rank, but rather seems to be a property of LoRA. The likely reason is that the product-of-matrices parametrization (BA) has less favorable optimization dynamics on this dataset than the full matrix (W). However, both LoRA and FullFT achieve their best loss at smaller batch sizes, so this gap may not matter as much in practice.
Layers Where LoRA Is Applied#
We investigated the effects of applying LoRA to different layers in the network. The original paper by Hu et al. recommended applying LoRA only to the attention matrices, and many subsequent papers followed suit, though a recent trend has been to apply it to all layers. Indeed, we achieved far better results when applying LoRA to all layers, in particular the MLP (including MoE) layers. In fact, applying LoRA to the attention matrices shows no additional benefit beyond applying it to the MLPs only. (Similar to our results, the QLoRA paper found that attention-only LoRA performed worse than MLP or MLP+attention, though they found MLP+attention > MLP > attention, whereas we found the first two to be roughly equal. Biderman et al. (2024) obtained a similar result, with attention-only LoRA providing no additional benefit on top of MLP-only.)
Attention-only LoRA significantly underperforms MLP-only LoRA, and does not further improve performance on top of LoRA-on-MLP. This effect holds for a dense model (Llama-3.1-8B) and a sparse MoE (Qwen3-30B-A3B-Base).
The underperformance of attention-only LoRA is not explained by having fewer parameters. In this particular case, attention-only with rank 256 underperforms MLP-only with rank 128, despite them having approximately the same number of parameters. (Compare the bold numbers in the table below.)
| LoRA configuration | Params |
|---|---|
| mlp, rank=256 | 0.49B |
| **attn, rank=256** | **0.25B** |
| all, rank=256 | 0.70B |
| **mlp, rank=128** | **0.24B** |
Parameter counts for LoRA on Llama-3.1-8B
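These counts can be roughly reproduced from the architecture shapes. The sketch below assumes Llama-3.1-8B dimensions (hidden 4096, MLP intermediate 14336, 8 KV heads of dim 128, 32 layers) and counts only the transformer-block matrices, so it lands slightly below the table’s figures, but it reproduces the key comparison: attention-only at rank 256 has roughly the same parameter count as MLP-only at rank 128.

```python
# LoRA adds rank * (d_in + d_out) parameters per adapted weight matrix.
HIDDEN, INTER, LAYERS = 4096, 14336, 32   # assumed Llama-3.1-8B shapes
KV = 1024                                 # 8 KV heads x 128 head dim (GQA)

ATTN = [(HIDDEN, HIDDEN), (HIDDEN, KV), (HIDDEN, KV), (HIDDEN, HIDDEN)]  # q,k,v,o
MLP = [(HIDDEN, INTER), (HIDDEN, INTER), (INTER, HIDDEN)]                # gate,up,down

def lora_params(mats, rank):
    return LAYERS * sum(rank * (d_in + d_out) for d_in, d_out in mats)

for name, mats, rank in [("mlp", MLP, 256), ("attn", ATTN, 256),
                         ("all", ATTN + MLP, 256), ("mlp", MLP, 128)]:
    print(f"{name}, rank={rank}: {lora_params(mats, rank) / 1e9:.2f}B")

# Attention-only at rank 256 ~ MLP-only at rank 128 (within ~10%).
assert abs(lora_params(ATTN, 256) - lora_params(MLP, 128)) < 0.1 * lora_params(MLP, 128)
```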
For the MoE experiment, we trained a separate LoRA on each expert, with the rank of each equal to the total rank divided by the number of active experts (equal to 8 for Qwen3 MoE). This scaling keeps the ratio of LoRA parameters to FullFT parameters the same for MoE layers as for other layers.
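The per-expert rank split is a one-line calculation (a minimal sketch; the division rule is from the text, the helper name is ours):

```python
# Each expert gets total_rank / active_experts, keeping the ratio of LoRA
# parameters to full parameters the same for MoE layers as for other layers.
def per_expert_rank(total_rank, active_experts):
    assert total_rank % active_experts == 0
    return total_rank // active_experts

# Qwen3 MoE has 8 active experts, so a rank-256 run trains rank-32 adapters per expert.
assert per_expert_rank(256, 8) == 32
```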
We did similar experiments comparing different LoRA layers in two additional settings: (1) supervised learning on a small subset of the OpenThoughts3 dataset with rank=256, and (2) reinforcement learning on the MATH dataset. We describe our experimental setup in the following section. Attention-only LoRA underperforms MLP-only LoRA (which performs similarly to MLP+attention) in these settings as well.
Learning rate vs final loss or reward, when varying which layers we apply LoRA to.
Reinforcement learning#
A key finding from our experiments is that LoRA fully matches the learning performance of FullFT when running policy gradient algorithms for reinforcement learning, even with ranks as low as 1.
For these experiments, we used a basic policy gradient algorithm with an importance sampling correction: $\text{objective} = \sum_t \frac{p_{\text{learner}}}{p_{\text{sampler}}} \text{Adv}_t$ (see Your Efficient RL Framework Secretly Brings You Off-Policy RL Training). We used a GRPO-like centering scheme (DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models; Shao et al., 2024), where we sample multiple completions per problem and subtract the mean reward per group.
Figure 6 (below) shows LR sweeps on the MATH (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021) datasets, using typical hyperparameters for each. We used the Llama-3.1-8B base model, as Qwen2.5 and Qwen3 are known to have been pretrained on data that improves their math performance, as described in the Qwen technical reports (e.g., Qwen2.5 Technical Report; Qwen Team, 2024), which makes it harder to measure what is learned only during RL.
LoRA shows a wider range of performant learning rates and arrives at the same peak performance as FullFT (black line), at least within the precision limits afforded by the noisiness of RL.
Learning rate vs final reward (accuracy) when doing RL on grade school math (GSM, left) or MATH (right) dataset.
This result is anticipated by an information-theoretic argument. Supervised learning arguably provides O(number of tokens) bits per episode. In contrast, in policy gradient methods, learning is driven by the advantage function which provides only O(1) bits per episode. When each episode contains thousands of tokens, RL absorbs ~1000 times less information per token in training than supervised learning does.
We can use more precise numbers based on our experiments. In the MATH example, we trained on ~10,000 problems with 32 samples per problem. Assuming each completion yields a single bit of information, the whole training process only needs to absorb 320,000 bits. Rank-1 LoRA for Llama-3.1-8B already has 3M parameters (calculated by adding up $r \cdot d_{\text{in}}$ for matrix $A$ and $r \cdot d_{\text{out}}$ for matrix $B$ over all weight matrices in the model), almost 10 times that number. Even at rank 1, LoRA has more than enough capacity to absorb all the information provided during training.
As another point of comparison, DeepSeek-R1-Zero was trained on 5.3M episodes (10,400 steps, each consisting of 32 unique questions, each question sampled 16 times), corresponding to 5.3M bits of information. This is less than the number of parameters in a low-rank LoRA, and we predict that the results can be replicated with LoRA.
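The arithmetic behind both comparisons, with the numbers quoted in the text:

```python
# Policy-gradient RL absorbs ~1 bit per episode.
math_bits = 10_000 * 32                # MATH: problems x samples = 320,000 bits
r1_zero_episodes = 10_400 * 32 * 16    # DeepSeek-R1-Zero: steps x questions x samples

rank1_lora_params = 3_000_000          # rank-1 LoRA on Llama-3.1-8B (from the text)
assert math_bits < rank1_lora_params   # the MATH run fits with ~10x headroom
assert r1_zero_episodes == 5_324_800   # ~5.3M episodes, i.e. ~5.3M bits
```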
For additional validation of our findings on LoRA’s effectiveness in reasoning RL, we carried out larger-scale experiments with Qwen3-8b-base on the DeepMath dataset (DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning; He et al., 2025), which is much larger than the MATH dataset and in general contains harder problems. To speed up experiments, we restricted samples to a length of 8192 tokens for training and evaluation. This sample length allows for backtracking and reasoning but limits performance relative to longer chains of thought.
Experiments on the DeepMath dataset with Qwen3-8b-base. In the left plot, we show the learning curve for different ranks and full fine-tuning. For each of these settings, we show the best learning rate, which results in the highest final performance. On the right, we plot learning rate vs final performance. As in our previous math experiments, LoRA seems to have a wider peak of near-optimal learning rates.
Additional plots from experiments on the DeepMath dataset with Qwen3-8b-Base. The left plot shows the benchmark scores on the AIME test set, which is more challenging than the training set. The right plot shows the chain-of-thought (CoT) length over training steps, which can be seen as a sign of learning to reason.
We observe that when picking the optimal learning rates for each setting, training progresses in an almost identical way for LoRAs with different sizes and full fine-tuning. Moreover, we see similar findings when we evaluate the models on the held-out problems of AIME 2024 and AIME 2025. Furthermore, we observe similar qualitative behavior from the LoRA and full-finetuning runs: both develop advanced reasoning behaviors such as backtracking, self-verification and in-context exploration, which is visible in the lengthening of the model CoTs.
Setting LoRA hyperparameters#
One barrier to LoRA adoption is the necessity to choose optimal hyperparameters, which are different from ones optimized for FullFT. In this section, we show that this problem isn’t as daunting as it appears at first glance and discuss our findings related to hyperparameter choice.
Optimal learning rate and rank#
Following Hu et al., we consider the following parametrization for LoRA:
$$W' = W + \frac{\alpha}{r}BA$$
where $r$ is the LoRA rank, $\alpha$ is the LoRA scaling factor, and $A$, $B$ are the LoRA weight matrices (of rank $r$). We use $\alpha = 32$ for the experiments in this article, following standard practice from other implementations.
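A minimal numerical sketch of this parametrization (illustrative shapes; the initialization follows the standard scheme described later in the article: random $A$, zero $B$):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 48, 8, 32

W = rng.normal(0, 0.02, (d_out, d_in))             # frozen base weights
A = rng.uniform(-1, 1, (r, d_in)) / np.sqrt(d_in)  # trainable down-projection
B = np.zeros((d_out, r))                           # trainable up-projection, zero init

def lora_forward(x):
    # W' x = W x + (alpha / r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B = 0, the adapter contributes nothing: W' = W at initialization.
assert np.allclose(lora_forward(x), W @ x)
```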
The $1/r$ scaling factor makes the optimal learning rate approximately independent of rank. In fact, a stronger condition holds: the learning curve is exactly the same at the beginning of training, regardless of rank. This effect is striking, and in our experiments the closeness of the learning curves for different ranks had us worried that a bug caused the rank parameter to be ignored. It follows that in a short training regime, the optimal LR is also independent of rank. However, as we showed above in our plots of learning rate vs loss (Figure 2), the optimal LR has some rank dependence in the longer-training regime.
These plots look at the differences in the learning curves, early in training, for different ranks with the same learning rate. On the left, we show the learning curves. The right shows the difference between rank 16 and 256, which grows over time. Strangely, it is negative (though tiny) for the first few steps, so that part of the curve is missing from the plot.
We can partly explain this result by looking at the expected update to the LoRA matrix after the very first training update. We can think of the LoRA product $BA$ as the sum of $r$ rank-1 outer products: $BA = \sum_{i=1}^r b_i a_i^T = \sum_{i=1}^r \Delta_i$, where we define $\Delta_i = b_i a_i^T$. Here, $\partial \text{Loss}/\partial \Delta_i$ is the same for all $i$; however, the gradients $\partial \text{Loss}/\partial b_i$ and $\partial \text{Loss}/\partial a_i$ will depend on the initialization ($\partial \text{Loss}/\partial b_i$ depends on $a_i$, for example). Since the initializations of $a_i$ and $b_i$ do not depend on rank, it follows that $\mathbb{E}[\Delta_i]$ is the same for all $i$ and does not depend on rank. At the first step of training, the expected update from each of these terms is equal and independent of the rank. It follows that $(1/r)\sum_{i=1}^r \Delta_i$ is just a sample average of $r$ terms with the same expectation, so the expectation of the average, i.e., the change to the adapter $(1/r)BA$, doesn’t depend on the rank.
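This argument can be sanity-checked by Monte Carlo. The sketch below is assumption-laden: a one-step signSGD update stands in for Adam's first step, the output gradient and toy dimensions are made up, and we average over random draws of $A$. The estimated expected first change to the adapter $(\alpha/r)BA$ then comes out with nearly the same magnitude across a wide range of ranks:

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha, lr, trials = 128, 32, 1e-2, 400
x = rng.normal(size=d)      # layer input
g = rng.normal(size=d)      # gradient of the loss w.r.t. the layer output

def mean_first_update(r):
    total = np.zeros((d, d))
    for _ in range(trials):
        A = rng.normal(0, 1 / np.sqrt(d), (r, d))
        grad_B = (alpha / r) * np.outer(g, A @ x)   # dLoss/dB at B = 0
        B = -lr * np.sign(grad_B)                   # sign update ~ Adam's 1st step
        total += (alpha / r) * B @ A                # resulting adapter change
    return total / trials

norms = [np.linalg.norm(mean_first_update(r)) for r in (4, 64)]
assert max(norms) / min(norms) < 1.3   # expected update scale ~ rank-independent
```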
Parametrization invariances#
There are four hyperparameters potentially applicable to LoRA:
The scale factor $\alpha$, which appears in $\alpha/r$.
The learning rate for the down-projection matrix $A$, $LR_A$.
The learning rate for the up-projection matrix $B$, $LR_B$.
The initialization scale of matrix $A$, $\text{init}_A$. For a random initialization, this is the standard deviation of $A$’s initial elements. Matrix $B$ is initialized to zero, so there is no need to define $\text{init}_B$.
Having to tune four different parameters may seem overwhelming. However, invariances in the training dynamics mean that two of these are redundant, and learning behavior is determined by just two. We show this by noting that when training with Adam and $\varepsilon = 0$, the optimization process is invariant to the following two-parameter transformation. (We can extend this result to $\varepsilon > 0$; we need to scale $\varepsilon$ by $1/q$ because the gradients are scaled by that factor.)
For p,q>0 p, q > 0:
$\alpha \to \frac{1}{pq} \cdot \alpha$
$\text{init}_A \to p \cdot \text{init}_A$
$LR_A \to p \cdot LR_A$
$LR_B \to q \cdot LR_B$
Since two degrees of freedom out of the four don’t affect the learning process, we are left with a 2D parameter space. We can choose different bases for this 2D space, such as the following one which lends itself to a straightforward interpretation:
$\alpha \cdot \text{init}_A \cdot LR_B$. This determines the scale of initial updates, or, equivalently, the initial slope of the learning curve. Since $B$ is initialized to zero, $LR_A$ and the initial updates to $A$ are irrelevant.
$\text{init}_A / LR_A$. Since Adam updates the elements of $A$ by approximately $LR_A$ at each step, this timescale parameter determines the number of steps it takes to significantly transform $A$ away from its initial state.
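The invariance can be checked numerically. The sketch below is a toy under stated assumptions: a single linear layer with squared-error loss, and signSGD in place of Adam with $\varepsilon = 0$ (sign updates are likewise invariant to positive rescaling of the gradient). Applying the $(p, q)$ transformation leaves the learned function unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4
W = rng.normal(0, 0.1, (d, d))
x, target = rng.normal(size=d), rng.normal(size=d)
A0 = rng.normal(0, 1 / np.sqrt(d), (r, d))

def train(alpha, init_a, lr_a, lr_b, steps=20):
    A, B = init_a * A0, np.zeros((d, r))
    for _ in range(steps):
        err = W @ x + (alpha / r) * B @ (A @ x) - target
        grad_B = (alpha / r) * np.outer(err, A @ x)   # dLoss/dB
        grad_A = (alpha / r) * B.T @ np.outer(err, x) # dLoss/dA
        A, B = A - lr_a * np.sign(grad_A), B - lr_b * np.sign(grad_B)
    return W @ x + (alpha / r) * B @ (A @ x)

p, q = 3.0, 0.5
base = train(alpha=32, init_a=1.0, lr_a=1e-3, lr_b=1e-3)
scaled = train(alpha=32 / (p * q), init_a=p, lr_a=p * 1e-3, lr_b=q * 1e-3)
assert np.allclose(base, scaled)   # same outputs, up to float roundoff
```

Under the transformation, $A$ stays exactly $p$ times its base-run value and $B$ stays $q$ times its base-run value at every step, so the adapter $\frac{\alpha}{pq}\cdot(qB)(pA)$ is unchanged.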
We can reinterpret some proposals from previous work on LoRA in terms of this basis.
LoRA+ (LoRA+: Efficient Low Rank Adaptation of Large Models; Hayou et al., 2024) proposes to use different LRs on $A$ and $B$, with a higher rate for $B$. Expressed in terms of our basis above, increasing $LR_B$ is equivalent to increasing $\text{init}_A/LR_A$, so that $A$ changes on a longer timescale.
Unsloth’s LoRA Hyperparameter Guide recommends using higher values of $\alpha$ for high-rank LoRA, e.g. by avoiding the $1/r$ scaling. This is also equivalent to increasing $\text{init}_A/LR_A$: when we increase $\alpha$, $LR_A$ and $LR_B$ need to be lowered in compensation to get the same update size, which in turn makes $LR_A$ smaller relative to $\text{init}_A$.
In our experiments, we used the standard parametrization from the Huggingface peft library (PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods; Mangrulkar et al., 2022), as proposed by Hu et al.: a uniform distribution for $A$ with scale $1/\sqrt{d_{\text{in}}}$, zero initialization for $B$, the same LR for both, and $\alpha = 32$. We were unable to improve on these hyperparameters in our experimentation.
Optimal learning rates for LoRA vs. FullFT#
Our experiments showed that the optimal LR for LoRA is consistently 10x the one used for FullFT in the same application, for both supervised learning and reinforcement learning. This shows up in every U-shaped plot of performance (loss or reward) charted against learning rate. This observation should make it more straightforward to transfer learning hyperparameters from FullFT to LoRA.
We don’t yet have an adequate theoretical explanation for this observation. We can attempt to derive it from the facts that the optimal LoRA LR is invariant to rank and that full-rank LoRA is directly comparable to FullFT. This analysis suggests an LR ratio of $\text{hidden size}/(2\alpha)$, which doesn’t match the empirical result of the optimal ratio being fixed at 10, independent of the base model.
For our empirical analysis, we conducted an LR sweep of 14 different Llama and Qwen models for both LoRA and FullFT on the Tulu3 dataset. From those sweeps, we fit a function that predicts the optimal learning rate based on the model’s hidden size and an indicator of whether it’s Llama or Qwen. The functional form used was:
$$\text{LR} = M_{\text{LoRA}} \cdot \left(\frac{2000}{\text{hidden size}}\right)^{\text{model\_pow} + \text{LoRA\_pow}}$$
Where:
$M_{\text{LoRA}}$ is a multiplier applied when LoRA is used (1 if FullFT)
$\text{model\_pow}$ is an exponent adjustment, calculated separately for each model source (Llama and Qwen)
$\text{LoRA\_pow}$ is an additional exponent adjustment for LoRA
$\text{hidden size}$ is the dimension of the model’s residual stream.
We scored a predicted learning rate by using linear interpolation to predict the loss based on the data from our sweep, and rated the parameters by summing the predicted loss over the 14 problems. Our optimization found a multiplier of 9.8 for LoRA over FullFT, and different dependence on hidden size for Qwen3 and Llama models, but LoRA LRs had the same dependence on hidden size as FullFT LRs, i.e., the optimization found $\text{LoRA\_pow} = 0$.
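The fitted form can be written as a small helper. This is a sketch: the overall base-LR constant is omitted (only the functional shape and the LoRA/FullFT ratio matter here), while the 9.8 multiplier and $\text{LoRA\_pow} = 0$ come from the text.

```python
def relative_lr(hidden_size, model_pow, lora=False, m_lora=9.8, lora_pow=0.0):
    # LR ~ M_LoRA * (2000 / hidden_size) ** (model_pow + LoRA_pow)
    mult = m_lora if lora else 1.0
    power = model_pow + (lora_pow if lora else 0.0)
    return mult * (2000 / hidden_size) ** power

# With LoRA_pow = 0, the LoRA/FullFT ratio is the same multiplier at any width.
for h in (2048, 4096, 8192):
    ratio = relative_lr(h, 1.0, lora=True) / relative_lr(h, 1.0)
    assert abs(ratio - 9.8) < 1e-9
```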
Learning rates in short and long runs#
The typical initialization of LoRA creates an implicit schedule of change in the effective learning rate. This leads to differences between short and long training runs, and some differences in the shape of learning curves compared to FullFT.
At the start of training, $B$ is initialized to zero. While $B$ is very small, changes in $A$ have negligible effect on the adapter $BA$ that is added to the original network weights. As $B$ grows, updates to $A$ start to have a bigger impact on the network outputs, with the effective learning rate increasing over the course of training as $B$ approaches $A$ in scale. We found that by the end of the full training runs on the Tulu3 and OpenThoughts datasets, the $B$ matrices ended up with larger spectral norms than the $A$ matrices.
This implies that the optimal LR should be set higher for shorter training runs. Preliminary evidence suggests an optimal multiplier of around 15x over FullFT for short runs (anecdotally, the higher multiplier is effective under ~100 steps or so), converging to the aforementioned 10x multiplier for longer runs.
Discussion#
We want to move beyond our empirical results to discuss some broader considerations related to LoRA performance and applicability that would be of interest to both researchers and builders.
First, let us examine in more depth our main result, namely the two conditions under which LoRA performs similarly to full fine-tuning:
LoRA is applied to all layers of the network, especially the MLP/MoE layers which house most of the parameters.
LoRA works well when not capacity constrained, i.e., when the number of trainable parameters exceeds the amount of information to be learned, which can be estimated in terms of dataset size.
When (1) is satisfied, we get similar learning dynamics to FullFT at the very start of training. Then, as per (2), LoRA continues to look like FullFT until we start reaching capacity limits.
Why LoRA might be needed on all layers#
As we showed earlier, if we put LoRA on only the attention layers, we get slower learning even in the tiny-data regime.
One possible explanation could come from thinking about the empirical neural tangent kernel (eNTK) as an approximation of what happens when we do a small amount of fine-tuning, following Malladi et al. (A Kernel-Based View of Language Model Fine-Tuning; Malladi et al., 2022). The eNTK is based on dot products of gradients, specifically $g_i = \partial/\partial\theta \, \log p(\text{token}_i \mid \text{prefix}_i)$ and $K(i, j) = g_i \cdot g_j$. As a consequence, the layers with the most parameters will typically have the most influence on the kernel. The paper also points out that the eNTK for LoRA is approximately the same as that for full fine-tuning, when you train all the layers. So LoRA training $\approx$ eNTK(LoRA) $\approx$ eNTK(FullFT) $\approx$ FullFT. The approximation eNTK(LoRA) $\approx$ eNTK(FullFT) only holds when we apply LoRA to the layers that contain most of the parameters making up the dot products.
How much capacity is needed by supervised and reinforcement learning?#
Past work (Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws; Allen-Zhu and Li, 2024) has shown that neural networks can store 2 bits per parameter. These results pertain to the maximum amount of information absorbed in the long-training limit, not to the compute efficiency or rate of learning.
The 2-bits-per-parameter result relied on synthetic datasets cleverly constructed to contain a precise amount of information. It’s not as straightforward to estimate the information content required for a given realistic learning problem. One classic observation is that when minimizing log-loss, the total log-loss measured over the first epoch of training gives a measurement of the dataset’s description length: that is, an upper bound on the number of bits required to memorize the dataset. LLM datasets usually have a loss of around 1 bit (0.69 nats) per token, depending on dataset and model size.
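In numbers (a sketch; the token count is illustrative, the per-token loss is the figure from the text):

```python
import math

nats_per_token = 0.69                        # typical first-epoch log-loss
bits_per_token = nats_per_token / math.log(2)
n_tokens = 50_000_000                        # illustrative dataset size

# Summed first-epoch log-loss = description length, an upper bound on the
# bits needed to memorize the dataset.
description_length_bits = bits_per_token * n_tokens
print(f"{bits_per_token:.3f} bits/token, "
      f"{description_length_bits / 1e6:.0f}M bits to memorize the dataset")
assert abs(bits_per_token - 1.0) < 0.01      # ~1 bit per token
```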
This estimate measures the capacity required to perfectly memorize the dataset, which overestimates the actual capacity needed for “generalizable” learning that reduces log-loss on test data. Measuring the capacity requirements of supervised learning and how these interact with the number of trainable parameters is an open question for future work.
For RL, we claimed that policy gradient algorithms learn roughly 1 bit of information per episode, given that there’s a single reward value at the end of the episode. This isn’t a fundamental property of RL, as other algorithms could conceivably learn a lot more from each episode. For example, model-based RL algorithms train the learning agent to predict the observations and build a world model, potentially extracting more information per episode. The claim of 1 bit per episode may only apply narrowly to policy gradient algorithms.
We can sharpen the bits-counting argument in information-theoretic terms. Consider an episode, consisting of a trajectory $\tau$ and final reward, as a message (i.e., a noisy channel) that provides some information about the unknown reward function $R$. We’ll condition on the current policy and training history and look at the mutual information between the policy gradient estimator and $R$. The REINFORCE update is $G = S \cdot \text{Adv}$ with $S = \nabla \log p_\theta(\tau)$. $S$ is independent of $R$ given the history, so the only $R$-dependent component is the scalar advantage.
By the data processing inequality:
$$I(G; R \mid \text{history}) \leq I((S, \text{Adv}); R \mid \text{history}) = I(\text{Adv}; R \mid S, \text{history}) \leq H(\text{Adv}).$$
If we quantize the advantage into $B$ bins, then $H(\text{Adv}) \lesssim \log(B)$. That is, the number of bits of useful information gleaned per episode is $O(1)$, independent of model size. These bits tell us which member of a discrete set of reward functions (or, equivalently, optimal-policy classes) we’re in. This analysis of mutual information mirrors what’s used in some theoretical analyses of optimization algorithms (Information Complexity of Black-Box Convex Optimization: A New Look via Feedback Information Theory; Raginsky and Rakhlin, 2009). Note that this estimate is an upper bound on the information absorbed by training; the actual amount learned will depend on the policy initialization and other details. For example, if we initialize with a policy that doesn’t get any reward, then the entropy of the advantage is zero (not $\log(B)$), and it won’t learn anything.
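A quick empirical illustration of the $H(\text{Adv}) \leq \log(B)$ cap (a sketch with synthetic Gaussian advantages):

```python
import math, random
from collections import Counter

random.seed(0)
B = 8                                          # number of quantization bins
advs = [random.gauss(0, 1) for _ in range(10_000)]
lo, hi = min(advs), max(advs)
bins = [min(int((a - lo) / (hi - lo) * B), B - 1) for a in advs]

n = len(bins)
entropy = -sum(c / n * math.log2(c / n) for c in Counter(bins).values())
# However the advantages are distributed, the per-episode information carried
# by the quantized advantage cannot exceed log2(B) bits (here, 3 bits).
assert entropy <= math.log2(B)
```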
Compute efficiency advantage of LoRA#
Our experiments above measured learning progress against the number of training steps, but we may also be interested in the compute efficiency of different methods. We calculate that LoRA takes slightly more than ⅔ of the FLOPs that full fine-tuning does per pass. As a result, it will often outperform FullFT on compute efficiency overall.
We derive this ⅔ ratio by analyzing the FLOPs used in the forward–backward pass on a given weight matrix. These operations account for the vast majority of FLOPs in neural network models. We use the following notation:
- $W \in \mathbb{R}^{N \times N}$ is a weight matrix
- $x \in \mathbb{R}^N$ is an input vector
- $y = Wx \in \mathbb{R}^N$ is an output vector
- $\bar{x}, \bar{y} \in \mathbb{R}^N$ are the gradients of the loss with respect to $x$ and $y$, computed in the backward pass
- $\bar{W} \in \mathbb{R}^{N \times N}$ is the gradient of the loss with respect to $W$
Full fine-tuning performs the following operations:
Forward
- $y = Wx$ ($N^2$ multiply-adds)
Backward
- $\bar{x} = W^T \bar{y}$ ($N^2$ multiply-adds)
- $\bar{W} \mathrel{+}= \bar{y} x^T$ ($N^2$ multiply-adds)
The forward pass requires $N^2$ multiply-adds, and the backward pass requires another $2N^2$, for $3N^2$ total. Training, which requires both, thus uses 3 times the FLOPs of forward-only inference.
With LoRA, we replace $W$ by $W + BA$, where $B \in \mathbb{R}^{N \times R}$ and $A \in \mathbb{R}^{R \times N}$, with $R \ll N$. Since we only update $A$ and $B$, we replace the third step of updating $\bar{W}$ with a much cheaper operation. $A$ and $B$ each have $N \cdot R$ parameters, so the full forward-backward computation on each requires $3NR$ multiply-adds instead of $3N^2$ for $W$; the total for both is $6NR$. We also perform the forward and backward passes through the frozen $W$, equivalent to the first two steps of FullFT. The total number of multiply-adds is $2N^2 + 6NR$. With $R \ll N$, this is slightly more than $\frac{2}{3}$ of $3N^2$.
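The derivation above reduces to a two-line calculation:

```python
def fullft_flops(n):          # forward (N^2) + backward (2 N^2) on W
    return 3 * n * n

def lora_flops(n, r):         # frozen W fwd/bwd (2 N^2) + LoRA A and B (6 N R)
    return 2 * n * n + 6 * n * r

n, r = 4096, 64
ratio = lora_flops(n, r) / fullft_flops(n)
print(f"LoRA/FullFT per-matrix FLOPs at N={n}, R={r}: {ratio:.3f}")
assert 2 / 3 < ratio < 0.70   # slightly more than 2/3 when R << N
```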
If we plotted LoRA performance against FLOPs instead of training steps, it would show a clear advantage over FullFT. (This analysis omits FLOPs used for attention, which could be significant in long-context settings.)
Open questions#
There are several questions related to our results that we would love to see investigated in the future:
Sharpening our predictions of LoRA performance and the precise conditions under which it matches full fine-tuning. We have roughly characterized the regime of equal performance and can estimate the required capacity in terms of tokens or episodes, but we can’t yet make accurate forecasts.
Our theoretical understanding of LoRA learning rates and training dynamics is limited. A fuller theory that explains the ratio between LoRA and FullFT learning rates would be valuable.
How do LoRA variants such as PiSSA ("PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models", Meng, Wang & Zhang, 2024) perform when measured according to the methodology in this article?
There are various options for applying LoRA to MoE layers. LoRA users would benefit from an investigation into how well they perform, and how compatible each approach is with methods like tensor parallelism and expert parallelism that are important for large MoE models.
Closing thoughts
At Thinking Machines, we believe in the power of fine-tuning to advance AI usefulness in many domains of expertise. Our interest in LoRA is driven by a goal of making this power widely accessible and easily customizable to specific needs.
Aside from its practical uses, research on LoRA has also led us to deeper investigations of model capacity, dataset complexity, and sample efficiency. Looking at how learning speed and performance depend on capacity provides a lens for studying fundamental questions in machine learning. We look forward to advancing this study in the future.
Acknowledgements
We thank Dan Alexander Biderman, Weizhu Chen, Daniel Han, and Sadhika Malladi for their insightful feedback on an earlier draft of this post.
Citation
Please cite this work as:
Schulman, John and Thinking Machines Lab, "LoRA Without Regret",
Thinking Machines Lab: Connectionism, Sep 2025.
Or use the BibTeX citation:
@article{schulman2025lora,
author = {John Schulman and Thinking Machines Lab},
title = {LoRA Without Regret},
journal = {Thinking Machines Lab: Connectionism},
year = {2025},
note = {https://thinkingmachines.ai/blog/lora/},
doi = {10.64434/tml.20250929},
}
I want to perform hyperparameter sweeps. Should I start with learning rate, LoRA variant ranks, or batch size?
Based on the research context you've provided, I'd recommend starting with learning rate sweeps first. Here's why:
Learning rate should come first because:
- It has the most direct impact on whether training works at all
- The optimal LR for LoRA is consistently ~10x higher than FullFT (as shown in both papers)
- You need to find the optimal LR before you can properly evaluate other hyperparameters
- The "LoRA Without Regret" blog shows that optimal LR is relatively independent of rank (especially for ranks ≥4)
Suggested order:
- Learning rate - sweep this first for a baseline LoRA configuration (e.g., rank=128, batch_size=40)
- Rank - once you have a good LR, sweep rank to find the minimum that doesn't hurt performance
- Batch size - test last, since the blog showed LoRA can be less tolerant of large batch sizes in some settings
One practical tip from the blog: the optimal LR has a fairly wide peak for LoRA, so you don't need an extremely fine-grained sweep.
What model size and dataset are you planning to start with? This might affect the specific LR range you should explore.
I want to perform hyperparameter sweeps. Should I start with learning rate, LoRA variant ranks, or batch size?
Based on the research you've provided, I'd recommend starting with learning rate sweeps first. Here's why:
Learning rate is the most critical hyperparameter - it has the biggest impact on whether your training succeeds or fails. The research shows that LoRA needs roughly 10x higher learning rates than full fine-tuning, and finding the optimal LR is essential before exploring other parameters.
Then rank, then batch size - Here's the logical order:
- Learning rate first - Sweep this for a baseline LoRA configuration (e.g., rank=128, moderate batch size)
- Rank second - Once you know roughly what LR works, you can explore different ranks. The good news: optimal LR is approximately independent of rank due to the 1/r scaling
- Batch size last - The research shows LoRA can be less tolerant of large batch sizes, but both LoRA and FullFT achieve best loss at smaller batches anyway
Practical suggestion: For your SmolLM2-135M experiments, you could start with:
- Rank: 128 or 256 (standard middle ground)
- Batch size: 32-64 (moderate, proven to work well)
- Learning rate: Sweep from ~1e-5 to ~1e-3
Does this approach make sense for your setup? Would you like help designing the specific sweep ranges?
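One way to generate that sweep is a log-spaced grid over the suggested range; a minimal sketch, assuming numpy is available (the endpoints are just the 1e-5 to 1e-3 range mentioned above, not a prescription):

```python
import numpy as np

# Log-spaced learning-rate grid from 1e-5 to 1e-3, half-decade steps
full_ft_lrs = np.logspace(-5, -3, num=5)

# Rule of thumb from the papers: LoRA wants roughly 10x the FullFT LR
lora_lrs = full_ft_lrs * 10

print(full_ft_lrs)  # [1e-05 ~3.2e-05 1e-04 ~3.2e-04 1e-03]
print(lora_lrs)
```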
What do I do with lr scheduler?
Good question! The research you've shared used constant learning rate schedules (no warmup or cooldown) for their experiments to eliminate confounds when comparing LoRA to FullFT.
However, for your actual training runs (not just hyperparameter sweeps), you have a few options:
For sweeps:
- Use constant LR (like the research) - this makes it easier to compare different configurations cleanly
For full training:
- Cosine decay is popular and often works well for fine-tuning
- Linear warmup + cosine decay is a common choice
- Constant can work fine, especially for shorter runs
The "LoRA Without Regret" blog mentioned that the optimal LR might be slightly higher (around 15x vs 10x over FullFT) for very short runs, due to how the B matrix grows from its zero initialization.
What's your training setup - are you doing short runs for sweeps first, or jumping straight into longer training? This might affect which scheduler makes sense to start with.
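To make the scheduler options concrete, here is a small framework-free sketch of a constant schedule versus linear warmup plus cosine decay; `lr_at` is a hypothetical helper, not code from either paper:

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps=0, schedule="constant"):
    """Learning rate at a given optimizer step.

    schedule="constant" matches the papers' sweep setup (no decay);
    "cosine" adds linear warmup followed by cosine decay to zero.
    """
    if warmup_steps and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # linear warmup
    if schedule == "constant":
        return base_lr
    # cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_at(0, 1000, 1e-4, warmup_steps=100, schedule="cosine"))    # 1e-06 (first warmup step)
print(lr_at(1000, 1000, 1e-4, warmup_steps=100, schedule="cosine")) # decayed to ~0
```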
Learning Rate
I want to start learning rate sweep with numina dataset. But let me try with the full finetuning, first.
Good plan! Starting with full fine-tuning will give you a baseline to compare against when you move to LoRA.
For the numina dataset (Math IFT with 859k sequences), what configuration are you thinking for this initial full fine-tuning run?
A few things to consider:
- Learning rate range - For full fine-tuning, you might try something like 1e-6 to 1e-4
- Number of epochs - Will you do 1 epoch or multiple?
- Batch size - What fits in your memory?
What's your starting point?
My plan is to run the sweep on 10% of the training data. I can fit a micro batch size of 4 on my machine, but the batch size can be bigger with gradient accumulation. What did $lll_md use for their experiments regarding those hyperparameters?
Looking at the paper, for their Math IFT (MetaMathQA) experiments, they used:
- Sequence length: 1024
- Global batch size: 768
- Micro batch size per device: 24
- Training duration: Multiple epochs (1, 2, 4, 8, 16)
For full fine-tuning on Math IFT specifically:
- Learning rate: 1e-5
- Optimizer: LionW (not AdamW)
Since you can fit micro batch size of 4, and they used global batch size of 768 on 32 GPUs, you'd need gradient accumulation steps of 192 to match their effective batch size (if on 1 GPU).
However, for your 10% subset experiment, you might want to start smaller. What are you thinking - try to match their batch size, or start with something more modest?
Okay, I will start from a batch size of 512 with a micro batch size of 4, so that's 128 gradient accumulation steps?
Yes, exactly! With:
- Global batch size: 512
- Micro batch size: 4
- You get: 512 ÷ 4 = 128 gradient accumulation steps
That's a reasonable starting point - large enough to be stable, but not as extreme as their 768.
For your learning rate sweep on full fine-tuning with the numina dataset (10% subset), what range are you planning to explore?
How many steps should I do with these settings?
Good question! Let's figure this out together.
You have:
- 10% of numina training data = ~77,400 sequences (10% of 774k)
- Global batch size = 512
So each step processes 512 sequences, which means:
- 1 epoch = 77,400 ÷ 512 ≈ 151 steps
The paper trained their Math IFT models for multiple epochs (1, 2, 4, 8, 16). For an initial learning rate sweep, you probably want enough steps to see the learning curve but not waste compute.
How many epochs are you thinking? One epoch (151 steps) might be enough to identify good vs bad learning rates. Or would you prefer 2 epochs (~302 steps) to see more of the learning dynamics?
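The step arithmetic above can be wrapped in two small helpers (hypothetical, using the numbers from this dialog) for planning future runs:

```python
def steps_per_epoch(n_sequences, global_batch):
    # one optimizer step consumes global_batch sequences
    return n_sequences // global_batch

def grad_accum_steps(global_batch, micro_batch, n_gpus=1):
    # how many micro batches to accumulate per optimizer step
    return global_batch // (micro_batch * n_gpus)

n = 77_400  # ~10% of the numina training set
print(steps_per_epoch(n, 512))   # 151 steps per epoch
print(grad_accum_steps(512, 4))  # 128 accumulation steps on 1 GPU
```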
I am trying a micro batch size of 4 and a total train batch size of 8. Weights & Biases is logging 888 samples per epoch. Does this seem right?
Let me check the math with you:
- Micro batch size: 4
- Total train batch size: 8
- This means gradient accumulation steps = 8 ÷ 4 = 2 steps
If W&B is logging 888 samples per epoch, that means:
- 888 ÷ 8 (batch size) = 111 steps per epoch
This suggests your 10% subset has about 888 sequences total (111 steps × 8 batch size).
Does that match what you expected for 10% of the numina training data? The full training set should be ~774k sequences, so 10% would be ~77k.
Are you perhaps using a much smaller subset, or is there something else going on with how the data is being loaded?
Here is the Weights & Biases logging on the finemath dataset: W&B logging

I used 81 million tokens (1% of the dataset) for those sweeps, except for the longer one. I wanted to run with `max_duration: "819_200_000tok"`, but I ran out of credits.
Evals
Helpful evaluation commands
Example of using lighteval:
`lighteval accelerate "model_name=HuggingFaceTB/SmolLM2-135M" "leaderboard|hellaswag|0|0"`

With vLLM:

`export VLLM_WORKER_MULTIPROC_METHOD=spawn`

`lighteval vllm "model_name=HuggingFaceTB/SmolLM2-135M" "leaderboard|hellaswag|0|0"`
This PR has information about how to use Composer for evals.
I want to write a blog on this project. Help!
Exciting! You've done a lot of great work here. Let's start by thinking about the structure.
Based on everything in this dialog, here are the major topics you could cover:
- Project overview - LoRA learns less and forgets less replication
- Dataset preparation - decontamination with Elasticsearch
- Custom model - why you built SmolLM2 from scratch (unpadding story)
- Training optimizations - torch.compile, sequence packing, Liger kernel
- Hyperparameter sweeps - learning rate, rank, batch size experiments
- Results - learning vs forgetting across LoRA, rsLoRA, DoRA, and full fine-tuning
A few questions to help me guide you:
- Who is your target audience? ML practitioners, researchers, or more general readers?
- What's the main takeaway you want readers to walk away with?
- Do you want one long blog post or a series of shorter ones? There's a LOT of material here - the training optimization journey alone could be its own post.