Try an interactive version of this dialog: Sign up at solve.it.com, click Upload, and pass this URL.

Local Research Group

The goal of this project is to find out whether LoRA and its variants, such as rsLoRA and DoRA, learn less and forget less compared to full fine-tuning. This project is based on the paper LoRA Learns Less and Forgets Less. Local Research Group is a fastai study group.

Setup dialog

Some utilities for the dialogue.

from IPython.display import Markdown, display
from contextkit import read_file
from pathlib import Path
import json
import re

def cell_to_markdown(cell):
    src = ''.join(cell['source'])
    if cell['cell_type'] == 'markdown':
        if '🤖Reply🤖' in src:
            parts = src.split('##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->')
            return f"**👤 User:**\n\n{parts[0].strip()}\n\n**🤖 Assistant:**\n\n{parts[1].strip()}"
        return src
    if cell['cell_type'] == 'code':
        outs = []
        for o in cell.get('outputs', []):
            if o.get('type') == 'stream': outs.append(''.join(o.get('text', [])))
            elif 'data' in o:
                tp = o['data'].get('text/plain', '')
                outs.append(''.join(tp) if isinstance(tp, list) else tp)
            elif 'text' in o: outs.append(''.join(o['text']) if isinstance(o['text'], list) else o['text'])
        outputs = re.sub(r'\x1b\[[0-9;]*m', '', '\n'.join(outs))
        return f"**Input:**\n```python\n{src}\n```" + (f"\n\n**Output:**\n```\n{outputs}\n```" if outputs else "")
    return ""

def read_nb(nb_path): 
    nb = json.loads(read_file(nb_path)) if isinstance(nb_path, (str, Path)) else nb_path
    return '\n\n'.join(cell_to_markdown(c) for c in nb['cells'])

Adding other useful dialogues where I read blogs or papers.

no_regret_md = read_nb('blog/LoRA Without Regret.ipynb')
display(Markdown(no_regret_md[:1000]))
lora_learns_md = read_nb('../paper/LoRa Learns Less and Forgets Less.ipynb')
display(Markdown(lora_learns_md[:1000]))

Input:

import pymupdf4llm
md_text = pymupdf4llm.to_markdown("lora_learns_less.pdf")
print(md_text[:300]) 

Output:

Published in Transactions on Machine Learning Research (08/2024)
## **LoRA Learns Less and Forgets Less**

**Dan Biderman**¹·², **Jacob Portes**², **Jose Javier Gonzalez Ortiz**², **Mansheej Paul**², **Philip Greengard**¹, **Connor Jennings**², **Daniel King**²

👤 User:

I am reading a paper. You can look at it in $md_text. Can you provide an amazing outline for me before diving into the paper to ask questions?

🤖 Assistant:


Great! I've looked at the paper "LoRA Learns Less and Forgets Less" in your md_text variable. Here's an outline to help you navigate it:

Paper Outline

Main Topic: Comparing LoRA (Low-Rank Adaptation) vs. full fine-tuning for large language models

Key Sections:

  1. Introduction & Motivation - Why study LoRA's performance trade-of

Datasets

| repo | dataset | size | split | total_tokens | num_sequences | avg_length | median_length | std_length | min_length | max_length | seqs_over_max | pct_over_max | p25 | p50 | p75 | p95 | p99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| split-avelina-python-edu | avelinapythonedu | full | train | 3,452,655,475 | 6,910,602 | 499.62 | 273 | 750.00 | 6 | 8,192 | 0 | 0 | 143 | 273 | 541 | 1,664 | 3,817 |
| split-avelina-python-edu | avelinapythonedu | full | test | 382,424,047 | 767,845 | 498.05 | 273 | 744.83 | 14 | 8,192 | 0 | 0 | 143 | 273 | 540 | 1,655 | 3,789 |
| split-finemath | finemath | full | train | 8,745,300,721 | 6,029,543 | 1,450.41 | 919 | 1,575.68 | 34 | 8,192 | 0 | 0 | 562 | 919 | 1,652 | 4,855 | 8,192 |
| split-finemath | finemath | full | test | 970,519,958 | 669,950 | 1,448.65 | 919 | 1,573.62 | 38 | 8,192 | 0 | 0 | 562 | 919 | 1,653 | 4,840 | 8,192 |
3_452_655_475 * 0.05
172632773.75
8_745_300_721 * 0.1
874530072.1
87_453_007_210
87453007210.0

Dataset decontamination

To decontaminate the dataset, we use Elasticsearch, following the Tulu paper's approach. We want to make sure our training data does not contain any evaluation set instances.

Read the decontamination readme and tell me about how Elasticsearch works.

I'll read the decontamination readme from the repository you mentioned to explain how they use Elasticsearch for dataset decontamination.

{
  "id": "toolu_01QPPZSpHU8dMy18gMzJSGpK",
  "call": {
    "function": "read_url",
    "arguments": {
      "url": "https://github.com/allenai/open-instruct/tree/main/decontamination"
    }
  },
  "result": "'Skip to content\\n\\n## Navigation Menu\\n\\n[... GitHub navigation, sign-in, and footer chrome; no README content was returned ...]'"
}

Let me try to access the README file directly:

{
  "id": "toolu_01NNugoYVKWb4zmP8wz5gRw1",
  "call": {
    "function": "read_url",
    "arguments": {
      "url": "https://raw.githubusercontent.com/allenai/open-instruct/main/decontamination/README.md"
    }
  },
  "result": "'# Scripts for computing overlap between train and test sets\\n\\nThese scripts are for creating Elasticsearch indices over training datasets, particularly instruction tuning datasets, and querying them with test sets to compute overlap. They can be used for quantifying and analyzing training dataset contamination.\\n\\n## Running Elasticsearch\\n\\nElasticsearch needs to up and running for creating and querying indices. You can run it locally by following the steps [here](https://www.elastic.co/guide/en/elasticsearch/reference/current/run-elasticsearch-locally.html). Make sure to keep track of the password and save it as an environment variable, `ELASTIC_PASSWORD`, e.g.:\\n\\n```bash\\nexport ELASTIC_PASSWORD=[password]\\n```\\n\\n## Indexing\\n\\nYou can index the training sets either as text or as dense vectors. The indexing script assumes that the training dataset is a Huggingface dataset, and has a field that contains prompt-response pairs in a conversational format, e.g. a `messages` field that looks like\\n\\n```json\\n[\\n    {\\n        \"role\": \"user\",\\n        \"content\": \"Write me a poem.\"\\n    },\\n    {\\n        \"role\": \"assistant\",\\n        \"content\": \"Sorry, I cannot help you with that.\"\\n    }\\n]\\n```\\n\\nThe script indexes each turn as a separate Elasticsearch document, and importantly only indexes the messages of one specific role. The assumption is that you would want to index only the prompts for quantifying contamination. 
You can control this behavior using the `--messages_field`, `--query_filter`, and `--query_field` options as follows:\\n\\n```bash\\npython index.py --messages_field messages --query_filter role:user --query_field content\\n```\\n\\nThe setting above looks for the `messages` field in the dataset, finds messages where the `role` is `user` and indexes their `content`.\\n\\n### Indexing multiple datasets\\n\\nYou can index one dataset at a time as follows\\n\\n```bash\\npython index.py --dataset HF_DATASET_NAME\\n```\\n\\nAlternatively, you can pass a training configuration yaml with a `dataset_mixer` field to index all the datasets in the mix.\\n\\n```bash\\npython index.py --dataset_mixer_config config.yaml\\n```\\n\\n### Indexing vector representations\\n\\nBy default, the indexing script indexes the text in the datasets. If you want to perform soft matching, you can change `--index_type`, and specify an embedding model (defaults to [NV-Embed-v2](https://huggingface.co/nvidia/NV-Embed-v2)).\\n\\n```bash\\npython index.py --index_type vector --model nvidia/NV-Embed-v2\\n```\\n\\nThe script assumes you are running this on GPUs and uses all the available devices. You can adjust `--max_batch_tokens` to a suitable value if you run into OOM errors or if you want to use your GPUs more effectively.\\n\\n## Searching\\n\\nThe searching script lets you query one or more `text` or a `vector` indices with a test set. When querying a `text` index, you can perform an ngram match, a full text match, or an embedding-based match of a specified field(s) in the test set. 
The basic usage looks like\\n\\n```bash\\npython search.py --train_dataset_names allenai/tulu-2-sft-mixture allenai/wildchat-1m --dataset tatsu-lab/alpaca_eval --split eval --field instruction --output_dir /path/to/output\\n```\\n\\nThe command above queries the indices corresponding to the two training sets, `allenai/tulu-2-sft-mixture` and `allenai/wildchat-1m` (assuming these were indexed earlier) with the AlpacaEval dataset, particularly the `instruction` field in the `eval` split.\\n\\nThe script will create in the output directory one `jsonl` file per each pair of index and evaluation dataset with instance-level information about the matches, and a TSV file called `contamination_report.tsv` with a table of contamination scores for all the pairs.\\n\\nLike with the indexing script, a dataset mixer configuration can be passed with the `--dataset_mixer_config` option instead of `--train_dataset_names`.\\n\\n### Checking for contamination against the Tulu 3 evaluation suite\\n\\nIf no evaluation dataset is specified using the `--dataset` option, the entire Tulu 3 evaluation suite will be used to query the specified indices.\\n\\n### Matching ngrams\\n\\nText indexes can be queried for ngram matches instead of full field matches (default) as follows\\n\\n```bash\\npython search.py --train_dataset_names TRAIN_DATASET_NAME --ngram_size SIZE [--match_threshold THRESHOLD]\\n```\\n\\nMatching scores are then computed as follows:\\n- For each token in the test instance, all matching training documents are retrieved. A training document is considered a match for a token if it is part of an ngram of the specified `SIZE` in the test instance, that also occurs in the training document.\\n- The single training document that covers the most number of tokens in the test instance is considered the largest match.\\n- If no threshold is specified, the match score for the test instance is the proportion of the matched tokens. 
If a threshold is specified, the score is `0` or `1` depending on the threshold.\\n- The evaluation dataset level match (or contamination) score is the average of instance level match scores.\\n\\n### Embedding-based matching\\n\\nIf the index is created using `--index_type vector`, the same option needs to be specified for searching as well, along with the same `--model MODEL_NAME`. The searching script also assumes you are running this on GPUs.\\n\\nYou can specify a `--match_threshold` here as well, and the behavior is similar to that in ngram matching, except that the match scores here come from embedding similarity.\\n\\n### Decontamination\\n\\nIf you need to remove instances from the training sets that match any of the test instances, just pass a `--decontaminate` option to `search.py`. The output directory will contain one decontaminated `jsonl` file per training dataset. If you pass a `--match_treshold`, only those train instances that have a matching score greater than the threshold with *any* of the test instances will be removed.\\n\\nNote that elasticsearch retrieves a limited number of hits each time you search. You can increase this by requesting a larger number of results by passing a different value to `--search_size` (default is 100). Setting this to a larger number (e.g. 10000) is a good idea if you are decontaminating datasets. Since elasticsearch does not necessarily retrieve all the documents that match, it is not guaranteed that decontamination removes all the matching training instances. You can always check for contamination after decontaminating a dataset to see how effective it was. '"
}

Great! I found the README. Let me explain how Elasticsearch works for dataset decontamination based on what I just read.

Elasticsearch is used here as a search engine to detect overlap between training and test datasets. Here's how it works:

Basic concept: They create searchable indices of the training data, then query those indices with test data to find matches (contamination).

The two-step process:

  1. Indexing - They load training datasets into Elasticsearch, storing either:

    • Raw text (for exact/ngram matching)
    • Dense vector embeddings (for semantic similarity matching)
  2. Searching - They query the indices with test set examples to find overlaps

Key matching strategies available:

  • Full text match (exact field matching)
  • N-gram matching (looks for overlapping sequences of N tokens)
  • Embedding-based matching (semantic similarity using models like NV-Embed-v2)

Does this make sense so far? Would you like me to explain more about any particular aspect - like how the n-gram matching works, or how they calculate contamination scores?
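To make the index-then-search split concrete, here is a toy in-memory stand-in for what Elasticsearch does at scale (pure Python, illustration only; the real system adds analyzers, relevance scoring, and distributed storage):

```python
from collections import defaultdict

def build_index(train_docs):
    """Toy inverted index: map each token to the set of training docs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(train_docs):
        for tok in set(text.split()):
            index[tok].add(doc_id)
    return index

def search(index, query):
    """Return the training docs sharing at least one token with the query."""
    hits = set()
    for tok in query.split():
        hits |= index.get(tok, set())
    return hits

train = ["the cat sat", "dogs bark loudly"]
idx = build_index(train)
print(search(idx, "cat photos"))  # → {0}
```

The decontamination scripts do the same thing in two passes: `index.py` builds the index over the training set, and `search.py` queries it with evaluation instances.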

We follow this approach from the Tulu paper:

We used 8-gram matching for our contamination checks following (Dubey et al., 2024; Singh et al., 2024a). For each token in a test instance, we consider it to match a token in a train instance if the two instances share an 8-gram containing that token, and we consider the test instance itself to have significant overlap with a train instance if more than 50% of the test tokens have 8-gram matches with the same training instance.

Decontamination. 
We consider a training set to be contaminated if any number of its instances overlap with more than 2% of the instances in any of the evaluations in our development and unseen suites. We remove all the training sets that were contaminated with our unseen evaluations. For training sets that were contaminated with our development evaluations, we removed the entire dataset if doing so did not significantly impact the performance of the resulting model; otherwise, we removed the specific instances that match any test instance.  The list of datasets we decontaminated and the versions we released with overlapping samples removed is shown in Table 8. The full list of public datasets that we found to be significantly contaminated with our evaluation sets can be found in Table 37.

Thanks for sharing that! So you're implementing the same approach as the Tulu paper. Let me break down what they're doing:

The 8-gram matching criteria:

  • A token in the test set "matches" if it's part of an 8-gram (8 consecutive tokens) that also appears in a training instance
  • A test instance is considered "significantly overlapping" if >50% of its tokens match the same training instance

The decontamination thresholds:

  • A training dataset is "contaminated" if it overlaps with >2% of instances in ANY evaluation set
  • For unseen eval contamination: remove the entire training dataset
  • For development eval contamination: remove the whole dataset if possible, otherwise just remove the specific overlapping instances

So when you run the Elasticsearch decontamination, you'll want to use:

  • --ngram_size 8
  • --match_threshold 0.5 (for the 50% overlap criterion)

Then you'll need to analyze the contamination report to see which training datasets exceed the 2% threshold for each eval set.

Have you already indexed your training datasets (avelinapythonedu and finemath) in Elasticsearch, or do you need help with that step first?
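The 8-gram criterion above can be sketched in plain Python (whitespace tokenization is a simplification here; the repo tokenizes differently and runs the matching through Elasticsearch):

```python
def ngrams(tokens, n=8):
    """Set of all n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def match_score(test_text, train_text, n=8):
    """Fraction of test tokens that sit inside an n-gram shared with train_text."""
    test_toks = test_text.split()
    train_grams = ngrams(train_text.split(), n)
    matched = set()
    for i in range(len(test_toks) - n + 1):
        if tuple(test_toks[i:i + n]) in train_grams:
            matched.update(range(i, i + n))  # every token in a shared n-gram counts
    return len(matched) / len(test_toks) if test_toks else 0.0

def is_contaminated(test_text, train_text, n=8, threshold=0.5):
    """Tulu criterion: >50% of test tokens share 8-grams with one train instance."""
    return match_score(test_text, train_text, n) > threshold
```

This is the same criterion the `--ngram_size 8 --match_threshold 0.5` flags request from `search.py`.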

Setup tools

Jae L created an awesome repo with all the tools. I couldn't follow it very easily at first, so let's go through it together.

Let's clone git@github.com:zzehli/local-research-decontaminate.git and cd into that directory.

First, we have to set up Docker. I am on an Ubuntu machine (Pop!_OS). This page explains how to install Docker Desktop: https://docs.docker.com/desktop/setup/install/linux/ubuntu/. But I didn't want the Desktop version because my machine has no monitor attached, so I followed the "Install using the apt repository" section here instead: https://docs.docker.com/engine/install/ubuntu/#install-using-the-repository.

sudo systemctl status docker
sudo systemctl start docker

This started Docker. Checking with docker compose version succeeded.

But we cannot use Docker from our regular user due to a permissions issue. Run these commands:

sudo usermod -aG docker galopy  # add user to docker group
newgrp docker                   # activate the docker group membership without log out and in.
docker ps                       # test docker permission

After setting up Docker, we set up Elasticsearch following this guide: https://www.elastic.co/docs/deploy-manage/deploy/self-managed/local-development-installation-quickstart.

When running curl -fsSL https://elastic.co/start-local | sh, it may complain that the elastic-start-local directory already exists. Uninstall the previous installation first; we need to set up a fresh one.

I ran the setup and then the indexing script, but got this error:

galopy@pop-os:~/git/local-research-decontaminate$ curl -fsSL https://elastic.co/start-local | sh -s -- --esonly

  ______ _           _   _
 |  ____| |         | | (_)
 | |__  | | __ _ ___| |_ _  ___
 |  __| | |/ _` / __| __| |/ __|
 | |____| | (_| \__ \ |_| | (__
 |______|_|\__,_|___/\__|_|\___|
-------------------------------------------------
🚀 Run Elasticsearch and Kibana for local testing
-------------------------------------------------

ℹ️  Do not use this script in a production environment

⌛️ Setting up Elasticsearch v9.2.0...

- Generated random passwords
- Created the elastic-start-local folder containing the files:
  - .env, with settings
  - docker-compose.yml, for Docker services
  - start/stop/uninstall commands
- Running docker compose up --wait

[+] Running 3/3
  Network elastic-start-local_default           Created        1.9s
  Volume elastic-start-local_dev-elasticsearch  Created        0.1s
  Container es-local-dev                        Healthy       46.0s

🎉 Congrats, Elasticsearch is installed and running in Docker!
🔌 Elasticsearch API endpoint: http://localhost:9200
🔑 API key: ZGx3X1pKb0JMTWZERS16bUVMLWw6YUprYU11QkhKMzd2Xy1kSVJXYWR3dw==

Learn more at https://github.com/elastic/start-local
(modernbert) galopy@pop-os:~/git/local-research-decontaminate$ export ELASTIC_PASSWORD=ZGx3X1pKb0JMTWZERS16bUVMLWw6YUprYU11QkhKMzd2Xy1kSVJXYWR3dw==
(modernbert) galopy@pop-os:~/git/local-research-decontaminate$ uv run index.py --dataset LocalResearchGroup/split-finemath --messages_field text --subset full --index_type text --parallel --chunk_size 3200 --max_chunk_bytes 10000000 --queue_size 16 --thread_count 16
Processing dataset 1 / 1: LocalResearchGroup/split-finemath
Reading LocalResearchGroup/split-finemath subset full split train
Resolving data files: 100%|██████████████████████████| 118/118 [00:00<00:00, 259.69it/s]
Loading dataset shards: 100%|███████████████████████████| 71/71 [00:19<00:00,  3.62it/s]
Reading text from LocalResearchGroup/split-finemath
6029543it [09:22, 10727.48it/s]
Read LocalResearchGroup/split-finemath for indexing. Has 6029543 instances and 6029543 messages.
6029543
Traceback (most recent call last):
  File "/home/galopy/git/local-research-decontaminate/index.py", line 341, in <module>
    main()
    ~~~~^^
  File "/home/galopy/git/local-research-decontaminate/index.py", line 314, in main
    if not es.indices.exists(index=index_name):
           ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/utils.py", line 415, in wrapped
    return api(*args, **kwargs)
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/indices.py", line 1597, in exists
    return self.perform_request(  # type: ignore[return-value]
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        "HEAD",
        ^^^^^^^
    ...<4 lines>...
        path_parts=__path_parts,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/_base.py", line 422, in perform_request
    return self._client.perform_request(
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        method,
        ^^^^^^^
    ...<5 lines>...
        path_parts=path_parts,
        ^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/_base.py", line 271, in perform_request
    response = self._perform_request(
        method,
    ...<4 lines>...
        otel_span=otel_span,
    )
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/_base.py", line 351, in _perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
        message=message, meta=meta, body=resp_body
    )
elasticsearch.AuthenticationException: AuthenticationException(401, 'None')

The error shows an authentication issue (401). The problem is that what you exported as ELASTIC_PASSWORD is actually an API key, not a password!

Look at the Elasticsearch setup output - it says:

  • 🔑 API key: ZGx3X1pKb0JMTWZERS16bUVMLWw6YUprYU11QkhKMzd2Xy1kSVJXYWR3dw==

But you need to find the actual password. When Elasticsearch starts, it should have generated a password for the elastic user.

Try checking the .env file that was created:

cat elastic-start-local/.env

Look for a line with ELASTIC_PASSWORD= in that file. That's the actual password you need to export.

Alternatively, you might need to configure the indexing script to use the API key instead of password authentication. Does the index.py script have options for API key authentication?

galopy@pop-os:~/git/local-research-decontaminate$ cat elastic-start-local/.env
START_LOCAL_VERSION=0.12.0
ES_LOCAL_VERSION=9.2.0
ES_LOCAL_CONTAINER_NAME=es-local-dev
ES_LOCAL_PASSWORD=cXnaRZIB
ES_LOCAL_PORT=9200
ES_LOCAL_URL=http://localhost:${ES_LOCAL_PORT}
ES_LOCAL_DISK_SPACE_REQUIRED=1gb
ES_LOCAL_JAVA_OPTS="-Xms128m -Xmx2g"
ES_LOCAL_API_KEY=ZGx3X1pKb0JMTWZERS16bUVMLWw6YUprYU11QkhKMzd2Xy1kSVJXYWR3dw==
(modernbert) galopy@pop-os:~/git/local-research-decontaminate$ export ELASTIC_PASSWORD=cXnaRZIB

Perfect! You found the actual password (cXnaRZIB) and exported it correctly.

Now try running your indexing command again:

uv run index.py --dataset LocalResearchGroup/split-finemath --messages_field text --subset full --index_type text --parallel --chunk_size 3200 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16

Does it work now, or do you get a different error?

can I check it first to see if it works before running this command?

Good idea! Yes, you can test the Elasticsearch connection first. Try this simple command to check if authentication is working:

curl -u elastic:$ELASTIC_PASSWORD http://localhost:9200

This should return some JSON with cluster information if everything is set up correctly. What do you get?

galopy@pop-os:~/git/local-research-decontaminate$ curl -u elastic:$ELASTIC_PASSWORD http://localhost:9200
{
  "name" : "be576f909e56",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "WkaOp27OQSGi4Xb2-UjxkQ",
  "version" : {
    "number" : "9.2.0",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "25d88452371273dd27356c98598287b669a03eae",
    "build_date" : "2025-10-21T10:06:21.288851013Z",
    "build_snapshot" : false,
    "lucene_version" : "10.3.1",
    "minimum_wire_compatibility_version" : "8.19.0",
    "minimum_index_compatibility_version" : "8.0.0"
  },
  "tagline" : "You Know, for Search"
}

Excellent! Elasticsearch is working and authenticated properly. You can see it's running version 9.2.0 and responding to your credentials.

Now you're ready to run your indexing command. Go ahead and run it:

uv run index.py --dataset LocalResearchGroup/split-finemath --messages_field text --subset full --index_type text --parallel --chunk_size 3200 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16

Let me know how it goes!

I was dumb and saved the API key as ELASTIC_PASSWORD. Make sure to save the actual password, not the API key.

Then we can get the virtual environment set up. For some reason, I was using Python 3.14 with uv, which was not compatible with PyTorch 2.8. I created a venv with 3.13:

galopy@pop-os:~/git/local-research-decontaminate$ uv venv --python 3.13
Using CPython 3.13.9
Creating virtual environment at: .venv
✔ A virtual environment already exists at `.venv`. Do you want to replace it? · yes
Activate with: source .venv/bin/activate
galopy@pop-os:~/git/local-research-decontaminate$ source .venv/bin/activate
(local-research-decontaminate) galopy@pop-os:~/git/local-research-decontaminate$ uv sync

So I replaced the 3.14 venv with a 3.13 one, activated it, then started installing with uv.

Also install these:

uv run python -m ensurepip --upgrade
uv run spacy download en_core_web_lg

Now we are ready to start!

Index the dataset:

uv run index.py --dataset LocalResearchGroup/split-finemath --messages_field text --subset full --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16

Search and decontaminate:

uv run search.py --train_dataset_names LocalResearchGroup/split-finemath --train_split train --dataset openai/gsm8k --field question --output_dir data/finemath/ --ngram_size 8 --match_threshold 0.5 --decontaminate

It removes contaminated instances, saves the decontaminated set into a parquet file, then pushes it to the Hugging Face Hub.
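
The matching criterion can be illustrated with a minimal sketch. The real search.py queries Elasticsearch and has its own scoring, so everything here except the 8-gram size and the 0.5 threshold (which mirror the flags above) is a hypothetical illustration:

```python
def ngrams(text, n=8):
    "Set of word n-grams appearing in a text."
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def match_score(query, doc, n=8):
    "Fraction of the query's n-grams that also appear in the training doc."
    q = ngrams(query, n)
    return len(q & ngrams(doc, n)) / len(q) if q else 0.0

# Toy example: an eval question appearing almost verbatim in a training doc.
train_doc = ("if a train travels 60 miles per hour for 3 hours "
             "how far does it travel in total")
eval_q = "a train travels 60 miles per hour for 3 hours how far does it travel"
score = match_score(eval_q, train_doc)
contaminated = score >= 0.5  # corresponds to --match_threshold 0.5
print(score, contaminated)   # 1.0 True
```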

Running on the test split:

uv run search.py --train_dataset_names LocalResearchGroup/split-finemath --train_split test --dataset openai/gsm8k --field question --output_dir data/finemath/ --ngram_size 8 --match_threshold 0.5 --decontaminate
galopy@pop-os:~/git/local-research-decontaminate$ uv run search.py --train_dataset_names LocalResearchGroup/split-finemath --train_split train --dataset openai/gsm8k --field question --output_dir data/finemath/ --ngram_size 8 --match_threshold 0.5 --decontaminate
Querying localresearchgroup_split-finemath_train_text for openai/gsm8k.
  0%|                                                             | 0/2638 [00:10<?, ?it/s]
Traceback (most recent call last):
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/connectionpool.py", line 534, in _make_request
    response = conn.getresponse()
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/connection.py", line 565, in getresponse
    httplib_response = super().getresponse()
  File "/home/galopy/.local/share/uv/python/cpython-3.13.9-linux-x86_64-gnu/lib/python3.13/http/client.py", line 1430, in getresponse
    response.begin()
    ~~~~~~~~~~~~~~^^
  File "/home/galopy/.local/share/uv/python/cpython-3.13.9-linux-x86_64-gnu/lib/python3.13/http/client.py", line 331, in begin
    version, status, reason = self._read_status()
                              ~~~~~~~~~~~~~~~~~^^
  File "/home/galopy/.local/share/uv/python/cpython-3.13.9-linux-x86_64-gnu/lib/python3.13/http/client.py", line 292, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
               ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/home/galopy/.local/share/uv/python/cpython-3.13.9-linux-x86_64-gnu/lib/python3.13/socket.py", line 719, in readinto
    return self._sock.recv_into(b)
           ~~~~~~~~~~~~~~~~~~~~^^^
TimeoutError: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elastic_transport/_node/_http_urllib3.py", line 167, in perform_request
    response = self.pool.urlopen(
        method,
    ...<4 lines>...
        **kw,  # type: ignore[arg-type]
    )
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/connectionpool.py", line 841, in urlopen
    retries = retries.increment(
        method, url, error=new_e, _pool=self, _stacktrace=sys.exc_info()[2]
    )
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/util/retry.py", line 449, in increment
    raise reraise(type(error), error, _stacktrace)
          ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/util/util.py", line 39, in reraise
    raise value
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    response = self._make_request(
        conn,
    ...<10 lines>...
        **response_kw,
    )
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/urllib3/connectionpool.py", line 367, in _raise_timeout
    raise ReadTimeoutError(
        self, url, f"Read timed out. (read timeout={timeout_value})"
    ) from err
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=9200): Read timed out. (read timeout=9.99971655599984)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/galopy/git/local-research-decontaminate/search.py", line 413, in <module>
    main()
    ~~~~^^
  File "/home/galopy/git/local-research-decontaminate/search.py", line 313, in main
    match_scores, output_data, train_indices_with_scores = ngram_match(es, index_name, query_dataset, fields, args.ngram_size, args.search_size)
                                                           ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/galopy/git/local-research-decontaminate/search.py", line 114, in ngram_match
    search_output = es.search(
        index=index_name,
    ...<13 lines>...
        }
    )
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/utils.py", line 415, in wrapped
    return api(*args, **kwargs)
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/__init__.py", line 4796, in search
    return self.perform_request(  # type: ignore[return-value]
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        "POST",
        ^^^^^^^
    ...<5 lines>...
        path_parts=__path_parts,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/_base.py", line 271, in perform_request
    response = self._perform_request(
        method,
    ...<4 lines>...
        otel_span=otel_span,
    )
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elasticsearch/_sync/client/_base.py", line 315, in _perform_request
    meta, resp_body = self.transport.perform_request(
                      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        method,
        ^^^^^^^
    ...<8 lines>...
        otel_span=otel_span,
        ^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elastic_transport/_transport.py", line 344, in perform_request
    resp = node.perform_request(
        method,
    ...<3 lines>...
        request_timeout=request_timeout,
    )
  File "/home/galopy/git/local-research-decontaminate/.venv/lib/python3.13/site-packages/elastic_transport/_node/_http_urllib3.py", line 202, in perform_request
    raise err from e
elastic_transport.ConnectionTimeout: Connection timed out

The search is timing out - the n-gram matching is taking longer than the 10-second default timeout. This can happen with large indices like finemath.

Let's check a few things. First, verify the index was created successfully:

curl -u elastic:$ELASTIC_PASSWORD "http://localhost:9200/_cat/indices?v"

This will show all your indices and their sizes. Do you see the finemath index listed?

galopy@pop-os:~/git/local-research-decontaminate$ curl -u elastic:$ELASTIC_PASSWORD http://localhost:9200/_cat/indices?v
health status index                                        uuid                   pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green  open   localresearchgroup_split-finemath_train_text BxJknffNS-CIJkujMh5B0g   1   0    3578414           22     19.2gb         19.2gb       19.2gb

Good! The index exists with 3.5 million documents and is about 19GB. The timeout is happening because searching through that many documents for 8-grams is slow.

You have a few options:

  1. Increase the timeout - The search script might have a timeout parameter, or you could modify the Elasticsearch client timeout setting
  2. Increase Elasticsearch memory - Your .env file shows it's using -Xmx2g (2GB max heap). With a 19GB index, this might be too small
  3. Test with a smaller eval set first - Try a subset to verify everything works before running the full gsm8k

Want to check if search.py has a timeout parameter you can adjust?

uv run search.py --help | grep -i timeout

What do you see?

There is nothing..

OK, so no timeout parameter exposed. You have two main options:

Option 1: Increase Elasticsearch's memory allocation. Edit the .env file and change ES_LOCAL_JAVA_OPTS from -Xmx2g to something like -Xmx8g or -Xmx16g (depending on your machine's RAM), then restart Elasticsearch.

Option 2: Look at the search.py code to see where the Elasticsearch client is initialized and manually increase the timeout there.

Which approach do you want to try first? How much RAM does your machine have available?

where is my .env?

It's in the elastic-start-local directory that was created when you ran the Elasticsearch setup. You saw it earlier when you checked for the password:

cat elastic-start-local/.env

To edit it, you can use:

nano elastic-start-local/.env

or whatever text editor you prefer. Which option do you want to try - increasing the memory, or looking at modifying the search.py timeout?

what does -Xmx2g mean?

-Xmx2g is a Java Virtual Machine (JVM) argument that sets the maximum heap memory that Elasticsearch can use.

  • -Xmx = maximum heap size
  • 2g = 2 gigabytes

So Elasticsearch is currently limited to using at most 2GB of RAM for its operations. With a 19GB index, this is pretty tight and likely why searches are timing out.
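
For example, raising the heap in elastic-start-local/.env might look like this (sizes are illustrative; a common rule of thumb is to keep the heap at or below about half of physical RAM, so the OS page cache keeps room for Lucene's files):

```shell
# elastic-start-local/.env (illustrative values — size to your machine's RAM)
ES_LOCAL_JAVA_OPTS="-Xms4g -Xmx16g"
```

Restart the Elasticsearch container after editing for the new heap size to take effect.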

How much RAM does your machine have total?

Cool! It is working. Increasing it to 16GB solved it, but the run would have taken an hour. I increased it to 64GB, and now it takes less than 30 minutes.

Excellent! That's a huge performance improvement. With 64GB allocated, Elasticsearch can keep much more of the index in memory, which makes those n-gram searches much faster.

While it's running, are you planning to decontaminate other evaluation datasets too, or just gsm8k for now? You mentioned wanting to check against multiple eval sets to follow the Tulu paper's approach (the 2% threshold across all evals).

Uploading the dataset shards: 100%|███████████████████| 71/71 [46:29<00:00, 39.29s/ shards]
Creating parquet from Arrow format: 100%|██████████████| 6030/6030 [06:15<00:00, 16.06ba/s]
        Wrote parquet files to data/finemath/LocalResearchGroup_split-finemath_train_decontaminated
        Removed 64 train instances.
        Kept 100.00% of the original data.

Ok I got that to work. Can we try to do it for the avelina python?

Sure! First, you'll need to index the avelina python dataset. Based on what you did with finemath, the command would be:

uv run index.py --dataset LocalResearchGroup/split-avelina-python-edu --messages_field text --subset full --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16

Does that look right, or do you need to check the dataset structure first to confirm the field name is text?

That looks correct to me. But I think the previous run is still in memory and using my resources. How do I fix this?

You can check what's currently running and stop it if needed. Try:

docker ps

This will show you the running containers. Do you see the Elasticsearch container listed?

 galopy@pop-os:~/git/local-research-decontaminate$ docker ps
CONTAINER ID   IMAGE                                                 COMMAND                  CREATED       STATUS                 PORTS                                NAMES
be576f909e56   docker.elastic.co/elasticsearch/elasticsearch:9.2.0   "/bin/tini -- /usr/l…"   3 hours ago   Up 3 hours (healthy)   127.0.0.1:9200->9200/tcp, 9300/tcp   es-local-dev

Good, Elasticsearch is running. The search process that finished should have freed up resources automatically. You can go ahead and start indexing the avelina python dataset now:

uv run index.py --dataset LocalResearchGroup/split-avelina-python-edu --messages_field text --subset full --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16

Does it start running without issues?

It is done! But something seems off..

full/test-00004-of-00005.parquet: 100%|█████████████████| 127M/127M [00:11<00:00, 10.7MB/s]
Generating train split: 100%|██████████| 6910602/6910602 [01:32<00:00, 74369.97 examples/s]
Generating test split: 100%|█████████████| 767845/767845 [00:09<00:00, 77718.93 examples/s]
Loading dataset shards: 100%|█████████████████████████████| 25/25 [00:00<00:00, 625.10it/s]
Reading text from LocalResearchGroup/split-avelina-python-edu
6910602it [03:23, 33883.18it/s]
Read LocalResearchGroup/split-avelina-python-edu for indexing. Has 6910602 instances and 6910602 messages.
6910602
Created a new text index: localresearchgroup_split-avelina-python-edu_train_text
^C^C^C^C^C^C^C^C ... (repeated many times)

I can't exit!!

The indexing process is stuck! It created the index but seems to be hanging while trying to populate it. Try pressing Ctrl+Z to suspend the process, then kill it:

kill %1

Or if that doesn't work, find the process ID and kill it:

ps aux | grep index.py

Then use kill -9 <PID> with the process ID you find. Can you try that?

I did the kill %1, but I don't think it worked..

galopy@pop-os:~/git/local-research-decontaminate$ ps aux | grep index.py
galopy    100439  0.0  0.0  19028  2300 pts/2    SN+  13:42   0:00 grep --color=auto index.py
[1]+  Exit 143                uv run index.py --dataset LocalResearchGroup/split-avelina-python-edu --messages_field text --subset full --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16

Good news! It actually did work. The Exit 143 message means the process was terminated successfully (143 = 128 + 15, where 15 is SIGTERM). The grep line you see in the ps output is just the grep command itself, not the indexing process.
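
The 128 + signal-number convention is easy to verify with a quick stdlib sketch (a hypothetical check, not part of the project; POSIX-only):

```python
import signal, subprocess, sys

# Start a sleeping child, terminate it with SIGTERM (signal 15), and
# confirm the shell-style exit status would be 128 + 15 = 143.
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
proc.terminate()  # sends SIGTERM
proc.wait()
shell_status = 128 - proc.returncode  # Popen reports -15 for SIGTERM
print(shell_status)  # 143
```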

The indexing seemed to hang after creating the index. Let's check if the index was created and if it has any data:

curl -u elastic:$ELASTIC_PASSWORD "http://localhost:9200/_cat/indices?v"

What do you see?

I see

galopy@pop-os:~/git/local-research-decontaminate$ curl -u elastic:$ELASTIC_PASSWORD "http://localhost:9200/_cat/indices?v"
health status index                                                  uuid                   pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green  open   localresearchgroup_split-avelina-python-edu_train_text ptW6i5OgTrCxc8Unyhyn1A   1   0    3672872           30      5.9gb          5.9gb        5.9gb
green  open   localresearchgroup_split-finemath_train_text           BxJknffNS-CIJkujMh5B0g   1   0    3578414            1     15.9gb         15.9gb       15.9gb

Great! The avelina-python index was actually created successfully with 3.67 million documents (5.9GB). It must have completed but hung at the end when trying to exit cleanly.

Now you can run the search and decontamination:

uv run search.py --train_dataset_names LocalResearchGroup/split-avelina-python-edu --train_split train --dataset openai/gsm8k --field question --output_dir data/avelina-python/ --ngram_size 8 --match_threshold 0.5 --decontaminate

Does it run successfully?

yep let's search now

Great! Now run the search and decontamination command for the avelina python dataset:

uv run search.py --train_dataset_names LocalResearchGroup/split-avelina-python-edu --train_split train --dataset openai/openai_humaneval --field prompt --output_dir data/avelina-python/ --ngram_size 8 --match_threshold 0.5 --decontaminate

Let me know how it goes!

Creating parquet from Arrow format: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 6911/6911 [03:00<00:00, 38.24ba/s]
        Wrote parquet files to data/avelina-python/LocalResearchGroup_split-avelina-python-edu_train_decontaminated
        Removed 20 train instances.
        Kept 100.00% of the original data.

Good. Now we have to do those steps with test split. Can you write the code for them? I need to index then search on those datasets

Sure! Here are the commands for the test splits:

Index finemath test split:

uv run index.py --dataset LocalResearchGroup/split-finemath --messages_field text --subset full --split test --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16

Index avelina-python-edu test split:

uv run index.py --dataset LocalResearchGroup/split-avelina-python-edu --messages_field text --subset full --split test --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16

Search finemath test against gsm8k:

uv run search.py --train_dataset_names LocalResearchGroup/split-finemath --train_split test --dataset openai/gsm8k --field question --output_dir data/finemath/ --ngram_size 8 --match_threshold 0.5 --decontaminate

Search avelina-python-edu test against humaneval:

uv run search.py --train_dataset_names LocalResearchGroup/split-avelina-python-edu --train_split test --dataset openai/openai_humaneval --field prompt --output_dir data/avelina-python/ --ngram_size 8 --match_threshold 0.5 --decontaminate

Run these in order. Let me know when you're done!

README.md: 100%|██████████████████████████████████████████| 894/894 [00:00<00:00, 10.8MB/s]
Creating parquet from Arrow format: 100%|████████████████| 670/670 [00:38<00:00, 17.25ba/s]
        Wrote parquet files to data/finemath/LocalResearchGroup_split-finemath_test_decontaminated
        Removed 12 train instances.
        Kept 100.00% of the original data.
[1]+  Exit 143                uv run index.py --dataset LocalResearchGroup/split-finemath --messages_field text --subset full --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16 --split test
README.md: 100%|██████████████████████████████████████████| 521/521 [00:00<00:00, 6.62MB/s]
Creating parquet from Arrow format: 100%|████████████████| 768/768 [00:20<00:00, 37.94ba/s]
        Wrote parquet files to data/avelina-python/LocalResearchGroup_split-avelina-python-edu_test_decontaminated
        Removed 7 train instances.
        Kept 100.00% of the original data.
[1]+  Exit 143                uv run index.py --dataset LocalResearchGroup/split-avelina-python-edu --messages_field text --subset full --split test --index_type text --parallel --chunk_size 6400 --max_chunk_bytes 10000000 --queue_size 8 --thread_count 16

Training

Modal

Use this command to train peft on modal:

MODAL_GPU=L4 TRAIN_YAML=scripts/train/yamls/pretrain/custom_smollm2-135m_peft.yaml IS_PEFT=true OUTPUT_PRECISION=bf16 modal run --detach custom_model_training.py

Custom model

Why build custom model?

Does Hugging Face support unpadding of sequences for training the model? Do a web search.

Yes, Hugging Face does support unpadding of sequences for training! Here's what I found:

* The current implementation does an unpad operation just before the FlashAttention kernel is executed, converting the input to a concatenated 2D tensor. However, this approach has limitations.

* The current integration of FlashAttention in most libraries including HuggingFace transformers is suboptimal when using sequences of variable lengths in a batch (when there's padding).

More recently, there are better approaches:

  1. ModernBERT (added December 2024): brings improvements including unpadding, to avoid wasting compute on padding tokens.

  2. DataCollatorWithFlattening: concatenates the entire mini batch into a single long sequence and uses separators to distinguish sequences, with no padding added.

The key parameters for unpadding in ModernBERT include indices, cu_seqlens, and max_seqlen which help handle unpadded sequences efficiently.

Would you like more details on how to implement unpadding in your training setup, or are you interested in a specific approach for your LoRA research?

Here is Vishal's blog on using Flash Attention 2. Can you summarize this blog regarding the unpadding and the speed of training?

`read_url({"url": "https://vishalbakshi.github.io/blog/posts/2025-06-30-debug-fa2-llm-foundry/"})`

  • Vishal Bakshi's Blog: "Debugging Flash Attention in LLM-Foundry (and a 20% Slow Down!)" (June 30, 2025). flash_attn_varlen_func in LLM Foundry resulted in a surprising 20% training slowdown; the post details the debugging process that uncovered the cause: significant overhead from the HuggingFace implementation repeatedly un-padding and re-padding the batch at every layer. Two findings: (1) LLM-Foundry, by default when using a HuggingFace LlamaModel, does not use flash_attn_varlen_func and instead uses flash_attn_func — in other words, it doesn't unpad the batch by default; (2) forcing LLM-Foundry to use flash_attn_varlen_func slows down training. The model used is SmolLM2-135M, which uses the now-deprecated LlamaFlashAttention2 module. …

Based on Vishal's blog, here are the key findings about unpadding and training speed:

Unpadding: By default, LLM-Foundry does not use flash_attn_varlen_func (the unpadding version) and instead uses flash_attn_func. To enable unpadding, they modified the data collator to create an attention mask with 0.0s for padding tokens.

Speed Impact - Surprising Result: When using flash_attn_varlen_func, training time slowed down by over 20%. This was unexpected since unpadding should theoretically be faster by avoiding wasted computation on padding tokens.

The Cause: The HuggingFace implementation unpads and re-pads the batch at every layer. This overhead from repeated unpadding/repadding operations more than cancels out the efficiency gains from avoiding attention computation on padding tokens.

Solution: The team plans to implement a custom model that unpads the batch only once at the beginning, rather than at every layer.

Does this match what you were expecting to find? Are you planning to implement a custom model with single unpadding for your LoRA research?

We created a SmolLM2 model from scratch. Why create a custom model from scratch instead of using Hugging Face's? To train faster! Batches are padded with padding tokens, and it is a waste of compute to run the model on them. According to Vishal's blog, forcing LLM-Foundry to use flash_attn_varlen_func slowed training by 20% because the model unpads and re-pads sequences at each attention layer.

Therefore, we created our custom SmolLM2-135M model, which is similar to the Llama2-70B model. It uses RoPE, RMSNorm, LlamaMLP with gated SiLU, and attention with Grouped Query Attention. We modified the model so it only unpads once, at the beginning of the model.
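
As a toy illustration of the Grouped Query Attention part (the head counts here are illustrative, not necessarily SmolLM2's actual config):

```python
# Grouped Query Attention: several query heads share one key/value head,
# shrinking the KV projections and KV cache relative to multi-head attention.
n_heads, n_kv_heads = 9, 3
group_size = n_heads // n_kv_heads  # 3 query heads per KV head

# Which KV head each query head attends with:
kv_head_for_query = [q // group_size for q in range(n_heads)]
print(kv_head_for_query)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```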

Another possible improvement is using more efficient implementations of RoPE, RMSNorm, Cross Entropy Loss, etc. via Liger Kernel.

Unpadding

How unpadding works

Our datasets have sequences without padding tokens. However, the PyTorch dataloader pads them to the same length when creating each batch; normally, that length is the longest sequence in the batch. Sequence lengths vary across each dataset, so every batch has a different size and a variable number of padding tokens. If a batch contains one long sequence alongside very short ones, they are all padded to match the long one.

Therefore, instead of using the batches PyTorch created, we build a special batch: from the batch the PyTorch dataloader created, we strip all the padding tokens and concatenate the sequences into one long sequence. To process this, we use the flash attention varlen function. We also keep track of where each sequence starts and ends, and of the maximum sequence length.

Without padding tokens, we can process sequences faster.
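
The bookkeeping described above can be sketched in plain Python (illustrative only; the real implementation operates on tensors and feeds these values to the varlen flash attention kernel):

```python
PAD = 0
padded_batch = [[5, 6, 7, PAD, PAD],
                [8, 9, PAD, PAD, PAD],
                [1, 2, 3, 4, 5]]

flat, cu_seqlens = [], [0]  # cumulative sequence boundaries
for seq in padded_batch:
    toks = [t for t in seq if t != PAD]
    flat.extend(toks)  # one long concatenated sequence, no padding
    cu_seqlens.append(cu_seqlens[-1] + len(toks))
max_seqlen = max(b - a for a, b in zip(cu_seqlens, cu_seqlens[1:]))

print(flat)        # [5, 6, 7, 8, 9, 1, 2, 3, 4, 5]
print(cu_seqlens)  # [0, 3, 5, 10]; sequence i is flat[cu_seqlens[i]:cu_seqlens[i+1]]
print(max_seqlen)  # 5
```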

Results

I created a PR for unpadding sequences. In the images, red is unpadding with FA2, green is normal SDPA, and grey is unpadding + FA2 with a micro batch size of 8 (the others use 4). Because we remove padding tokens, we can use bigger micro batch sizes. I think we go from about 20k tokens per sec with SDPA to about 30k with unpadding, and to about 50k after increasing the micro batch size to 8. One thing to note: the actual processing speed did not increase — we still count padding tokens as tokens when we calculate throughput/tokens per sec.
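
The caveat about the throughput metric can be made concrete with toy numbers (not values from the actual runs):

```python
# Logged throughput counts every padded position; a corrected metric would
# count only non-pad tokens that the unpadded model actually processes.
PAD = 0
batch = [[5, 6, 7, PAD, PAD], [8, 9, PAD, PAD, PAD], [1, 2, 3, 4, 5]]
step_time = 0.5  # seconds per step, illustrative

logged = sum(len(s) for s in batch) / step_time             # 15 positions -> 30.0
actual = sum(t != PAD for s in batch for t in s) / step_time  # 10 tokens -> 20.0
print(logged, actual)  # 30.0 20.0
```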

I think I can start working on Modernbert SequencePacker next.

pasted_image_38386c4a-166d-4d10-90df-6c3651259196.png

pasted_image_1b3077ad-6a71-40ca-a60c-57dcbffd2baf.png

W&B logging

Can you look at the plots?

Looking at your plots, here are some key observations:

Throughput plots (image 2):

  • The red line (unpadding + FA2) shows ~30k tokens/sec, which is a nice improvement over green (sdpa) at ~20k tokens/sec
  • The grey line (unpadding + FA2 + micro batch 8) reaches ~50k tokens/sec - a substantial gain
  • All three show relatively stable throughput over training steps with some fluctuation

Time plots (image 1):

  • Training time progresses linearly for all three approaches (top row)
  • The time/token and time/token_in_epoch plots show similar patterns, with the grey line (larger micro batch) having lower time per token
  • The spiky plots (time/sample_in_epoch, time/token_in_epoch) show the variation within epochs

You mentioned an important caveat: you're still counting padding tokens in the throughput calculation. This means the red/grey lines' advantage might be even larger than shown, since they're processing fewer actual tokens but the metric includes the padded ones.

Are you planning to adjust the throughput metric to exclude padding tokens so you can see the true speedup? Or would you like help interpreting any specific aspect of these results?

Here is the plot for the memory:

pasted_image_4dff1847-6119-4161-b1af-140efac6216d.png

Isn't it interesting?

Yes, very interesting! The memory plots show some notable patterns:

Peak Reserved Memory: The green line (SDPA) shows higher peak reserved memory (~17-18 GB) compared to the red/pink lines with unpadding (~15 GB). This makes sense since SDPA is processing padded sequences with wasted memory on padding tokens.

Inactive Memory: The green line shows a big jump early on to ~8 GB inactive memory and stays there, while the red lines settle around ~6 GB. This suggests SDPA is allocating more memory that's not actively being used.

Current Reserved/Active Memory: These look fairly similar across all runs (~15-17 GB reserved, ~15 GB active), which suggests the models are using memory efficiently once allocated.

The oscillations: The memory/current_inactive_mem plot shows those spiky oscillations for all runs - this is likely PyTorch's memory allocator doing its job, freeing and reallocating memory as batches are processed.

What's particularly interesting is that despite the grey line (BS=8) processing twice as many samples, it doesn't show dramatically higher memory usage. This confirms that removing padding tokens freed up enough memory to accommodate the larger batch size!

Does this memory behavior match what you expected, or did anything surprise you?

torch.compile

Make GPUs go brrrrrrrrrrrrr

torch.compile has many different parameters. Benjamin suggested trying dynamic, no mode, epilogue_fusion, max_autotune, triton_cudagraphs, and coordinate_descent_tuning. Here are W&B logs for all the different combinations of options:

pasted_image_9477ec78-0bba-42e8-8854-bd443e0f132b.png

Initial take:

  • The best one: dynamic + epilogue_fusion + max_autotune. Adding coordinate_descent_tuning took more time to compile, with no significant speedup.
  • dynamic + max-autotune mode starts very slowly.
  • Adding cudagraphs to dynamic + epilogue_fusion + max_autotune also starts slowly.

Benjamin's interpretation:

  • cudagraphs don't play well with dynamic; that's why the throughput is low.
  • In this model we use unpadding, so the input sequence shapes change. That's why dynamic is good: cudagraphs would have to recapture the graph for every sequence shape.
  • max_autotune autotunes three matmul hyperparameters: m, n, and k, where m and n are block sizes and k is the sub-block size.
    • It might be more important on a bigger GPU.
  • max_autotune results get cached, so Modal might not see the performance gain.

Further investigation:

  • Upon further experimentation, I restarted my computer and reran with dynamic + epilogue_fusion + max_autotune; performance was still very good even after the restart. I should test this on Modal.

How is my analysis on torch.compile? Anything to add?

Your analysis looks solid! Here are a few additional points to consider:

On your observations:

  • Your finding about cudagraph + dynamic makes sense - cudagraphs capture the exact computation graph, so with changing sequence shapes from unpadding, it's constantly invalidating and recapturing the graph. This overhead negates the benefits.
  • The slow start with max_autotune is expected - it's searching through different kernel configurations during those initial steps.

Additional considerations:

  • Since max_autotune caching matters, you'll want to ensure your Modal runs have persistent cache storage between runs. Otherwise, you're paying the tuning cost every time.
  • For the coordinate_descent_tuning experiment: did you run it long enough for the tuning to complete? It might show benefits after more steps once it's fully tuned.
  • Consider tracking compilation time separately from throughput - this would make the tradeoffs clearer (e.g., "10% speedup but 2x longer compile time").

Questions for further investigation:

  • What's your typical sequence length distribution? If it's highly variable, dynamic is crucial. If it's more uniform, you might get away without it.
  • Have you profiled where most compute time is spent? This would tell you if max_autotune's matmul optimizations are hitting the right kernels.

What do you think about testing the caching behavior on Modal next?

If I use dynamic, do I need to add max_autotune in the options?

No, you don't need to use max_autotune with dynamic - they're independent options that serve different purposes:

  • dynamic=True handles varying input shapes (important for your unpadding use case)
  • max_autotune=True searches for optimal kernel implementations for operations like matmul

However, your experiments showed that combining them (dynamic + epilogue_fusion + max_autotune) gave the best throughput. So while not required, they complement each other well.

The question is: does the extra compilation time from max_autotune justify the speedup for your specific training runs? If you're doing long training runs where the upfront cost amortizes, it's worth it. For short experiments, maybe not.

Have you measured how much faster dynamic + max_autotune is compared to just dynamic alone?

The compilation time takes a couple minutes. However, it does not really matter when we are doing a very long run.

This error is happening during the compilation phase when torch.compile is trying to apply epilogue fusion patterns. The issue is that it's encountering a float where it expects a tensor with metadata.

A few questions to help narrow this down:

  1. What PyTorch version are you using on Modal vs your local machine (which worked)?
  2. Did your local runs that worked use the same exact torch.compile options (dynamic=True, epilogue_fusion=True, max_autotune=True)?

In the meantime, here are some things to try:

Quick fix - remove epilogue_fusion: Try just dynamic=True, max_autotune=True without epilogue_fusion. This might bypass the bug while still giving you most of the speedup.

Alternative - suppress the error: As the error suggests, you could set torch._dynamo.config.suppress_errors = True to fall back to eager mode for the problematic parts, but this defeats the purpose of compilation.

Which would you like to try first?

torch.compile saves a bit of memory (about 5% on the previous settings) compared to not compiling.

Sequence Packing

Why sequence packing

Unpadding allowed us to save compute by spending it only on real tokens, not padding tokens. There is a problem though. Say we are training with a micro batch size of 8. Training goes well, but suddenly we get an out-of-memory error in the middle. What happened? Most batches contained only shorter sequences, but occasionally several rare long sequences land in the same batch. Do we have to set the batch size to the lowest safe value just in case we hit this mid-training? NO! To prevent this, we use ModernBERT sequence packing!

The benefit of sequence packing is that we fix the batch shape beforehand and pack sequences into it.

tokens

  • It is possible to specify training duration in tokens using the tok suffix, like 100tok. Training continues until the number of tokens trained exceeds 100.
  • The max_seq_len is used to calculate the sequence length for packing.
out_batch_size=int(batch_size / micro_batch_size)
out_pseq_len=int(micro_batch_size * max_seq_len)

How tokens are calculated: llmfoundry/data/utils.py, get_tokens_per_batch_func

  • calculates the number of tokens by summing the attention_mask.
  • if there is no attention_mask, it uses batch['input_ids'].numel()
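A sketch of that counting logic, reconstructed from the description above (not copied from llmfoundry):

```python
import torch

def tokens_per_batch(batch):
    # Count real tokens via the attention mask when present;
    # otherwise every element of input_ids counts.
    if 'attention_mask' in batch:
        return int(batch['attention_mask'].sum())
    return batch['input_ids'].numel()

batch = {
    'input_ids': torch.ones(2, 4, dtype=torch.long),
    'attention_mask': torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]]),
}
print(tokens_per_batch(batch))  # → 5, padding positions excluded
```

Without the mask, the same batch would count all 8 positions, padding included.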

How does sequence packing work with micro_batch_size and batch_size?

Example 1:

  • device_train_microbatch_size: 2

  • global_train_batch_size: 16

  • max_seq_len: 1024

  • out_batch_size=int(batch_size / micro_batch_size) = 8

  • out_pseq_len=int(micro_batch_size * max_seq_len) = 2 * 1024 = 2048

  • Sequence packer: [8, 2048] → split into 8 microbatches of [1, 2048]

  • Composer microbatching: 8 microbatches ÷ 2 (device_train_microbatch_size) = 4 batches of [2, 2048]

  • Gradient accumulation: 4 times

    • 4 batches × 2 (device_train_microbatch_size) = 8 effective batch size

Example 2:

  • device_train_microbatch_size: 2

  • global_train_batch_size: 16

  • max_seq_len: 2048

  • out_batch_size=int(batch_size / micro_batch_size) = 16 / 2 = 8

  • out_pseq_len=int(micro_batch_size * max_seq_len) = 2 * 2048 = 4096

  • Sequence packer: [8, 4096] → split into 8 microbatches of [1, 4096]

  • Composer microbatching: 8 microbatches ÷ 2 (device_train_microbatch_size) = 4 batch of [2, 4096]

  • Gradient accumulation: 4.
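The shape arithmetic in both examples can be checked with a tiny helper (`packer_shape` is my name for illustration, not an llmfoundry function):

```python
def packer_shape(global_batch_size, micro_batch_size, max_seq_len):
    # Shape of the packed batch: rows of packed sequences,
    # each row micro_batch_size * max_seq_len tokens long.
    out_batch_size = global_batch_size // micro_batch_size
    out_pseq_len = micro_batch_size * max_seq_len
    return out_batch_size, out_pseq_len

print(packer_shape(16, 2, 1024))  # Example 1 → (8, 2048)
print(packer_shape(16, 2, 2048))  # Example 2 → (8, 4096)
print(packer_shape(16, 4, 1024))  # OOM case below → (4, 4096)
```

The last call shows why bumping the microbatch size to 4 is dangerous: the packed rows grow to 4,096 tokens each while also being processed 4 at a time.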

With sequence packing, I got OOM when I used 4 device_microbatch_size, 16 global_train_batch_size, 1024 max_seq_len. Why would it be the case?

Looking at your sequence packing setup, let's trace through what's happening:

With your parameters:

  • device_train_microbatch_size: 4
  • global_train_batch_size: 16
  • max_seq_len: 1024

This gives you:

  • out_batch_size = 16 / 4 = 4
  • out_pseq_len = 4 * 1024 = 4096

So the sequence packer creates [4, 4096], which gets split into 4 microbatches of [1, 4096]. Then Composer processes these as 4 ÷ 4 = 1 batch of [4, 4096].

The problem: You're trying to fit a [4, 4096] tensor on the GPU at once. That's 16,384 tokens per forward pass, which is significantly larger than your working Example 1 which only had [2, 2048] = 4,096 tokens per forward pass.

The memory usage scales with both batch size AND sequence length. Doubling the microbatch size from 2 to 4 quadrupled your memory requirements compared to Example 1.

Does this match what you observed? What microbatch size worked for you without OOM?

Initial run

Here is my first attempt at using sequence packing from ModernBert. W&B Logging

pasted_image_33424ae1-2913-4810-a2cf-7ddac0e80554.png

It was so jagged! I didn't expect it to be like this. I thought it was because some sequences packed better than others: well-packed batches take a longer time and show fewer tokens per second.

  • Benjamin agreed that's what was going on. He suggested using PackingEfficiency callback from ModernBert. He said he used very big batches for ModernBert training and packing efficiency was 99.9%.

I should try with bigger batches. I only used batch size of 8.

Run with improvements

We can improve by compiling.

pasted_image_bc24dded-d721-42f2-8b25-7537b640dd7f.png

pasted_image_a92b972c-91e1-4d61-b31e-f4ad52f57fcc.png

How do these plots look compared to what we had before? Packing efficiency is over 0.99 and throughput is around 80k.

Wow, that's a massive improvement! Let me compare:

Before (unpadding + FA2):

  • Best throughput: ~50k tokens/sec (with micro batch size 8)
  • Lots of variability in the plots

Now (sequence packing + compile):

  • Throughput: ~80k tokens/sec (60% improvement!)
  • Packing efficiency: >0.99 (almost no wasted compute)
  • Much smoother throughput - those dramatic drops are still there but the overall pattern is more stable

The high packing efficiency confirms what Benjamin said - with proper batching, you're utilizing nearly all available compute. The occasional dips in throughput likely correspond to batches that are packed less efficiently, but it's much better than your initial jagged attempt.

One interesting thing: the throughput dips seem periodic. Do they correlate with specific points in your dataset, like when the dataloader moves to a new subset or epoch boundary?

What do you think is causing those remaining periodic drops?

I was using max_seq_len of 1,024, 4 microbatch_size, 40 batch size.

Those are good settings! With batch size 40 and microbatch size 4, you're getting:

  • out_batch_size: 10
  • out_pseq_len: 4,096 tokens

That's a much larger effective batch (10 × 4,096 = 40,960 tokens) compared to your initial attempt with batch size 8. This explains why your packing efficiency jumped to >0.99 - more tokens means better opportunity to pack sequences efficiently.

Are you planning to test even larger batch sizes, or is this hitting your memory limit? Also, have you figured out what's causing those periodic throughput drops?

I was using around 22GB of memory

That's pretty good memory usage! You still have some headroom before hitting typical GPU limits (24GB for consumer cards, 40-80GB for data center GPUs depending on what you're using).

Have you tried pushing the batch size higher to see if you can squeeze out more throughput while staying under your memory limit? Or are you happy with 80k tokens/sec and the 0.99 packing efficiency?

Liger kernel

I worked on liger kernel to see whether I can improve it or not.

Here are the Weights & Biases logs: https://wandb.ai/local-research-group/liger-kernel?nw=nwusergalopyz.

Here is the dialog where I learned about Liger: https://share.solve.it.com/d/3a8b84b6e3a2c44f7179e759284afa6d. Can you summarize this dialog for me? What did I learn and what did I do?

Let me read that dialog for you.

read_url(url='<…>…are.solve.it.com/d/3a8b84b6e3a…')
{
  "id": "toolu_01UkVDC9u1egawV8jB6X84re",
  "call": {
    "function": "read_url",
    "arguments": {
      "url": "https://share.solve.it.com/d/3a8b84b6e3a2c44f7179e759284afa6d"
    }
  },
  "result": "<TRUNCATED>\u2026eractive version of this dialog: Sign up at [solve.it.com](https://solve.it.com), click Upload, and pass this URL. \n\nPrompt: 78\n\nOpen main menu\n\n[](/show_card_?id_=_aee6bc3d&with_input=1&dlg_name=liger&name=liger)\n\nThis is a website for liger kernel: <https://github.com/linkedin/Liger-Kernel.> Can I use liger kernel with cpu?\n\nAssistant: 435\n\nOpen main menu\n\n[](/show_card_?id_=_aee6bc3d&with_input=&dlg_name=liger&name=liger)\n\nCopied!\n```\n{\n  \"id\": \"toolu_01SbrJYbJwrjPRVHbDURUvXZ\",\n  \"call\": {\n    \"function\": \"read_url\",\n    \"arguments\": {\n      \"url\": \"https://github.com/linkedin/Liger-Kernel\"\n    }\n  },\n  \"result\": \"'Skip to content\\\\n\\\\n## Navigation Menu\\\\n\\\\nToggle navigation\\\\n\\\\n[ ](/)\\\\n\\\\n[ Sign in ](/login?return_t<TRUNCATED>\"\n}\n\n```\n\nBased on the GitHub repository information, Liger Kernel is a collection of Triton kernels designed specifically for LLM training, and Triton is the underlying technology.\n\nLooking at the dependencies section, I can see that L\u2026</TRUNCATED>"
}

Here's a summary of what you learned and did in your Liger Kernel dialog:

What you learned about Liger Kernel (paper reading):

  • Core concept: Liger Kernel is a library of Triton kernels that optimize LLM training through operation fusion and input chunking — keeping intermediate data in fast on-chip SRAM instead of slow HBM, reducing memory round-trips
  • GPU memory hierarchy: HBM is large but slow (~2 TB/s), SRAM is tiny (~50MB) but fast — the mismatch is the key bottleneck Liger addresses
  • Specific kernels: RMSNorm, LayerNorm, RoPE, SwiGLU/GeGLU, CrossEntropy, and FusedLinearCrossEntropy (FLCE) — each fusing multiple operations into single kernels
  • Online softmax: Computes softmax in a single pass instead of two, halving memory reads
  • FLCE: Chunks hidden states to avoid materializing enormous logit tensors (e.g., 16.8 GB for Gemma's 256k vocab)
  • Results: ~20% throughput increase, ~60% memory reduction on average; individual kernels show even bigger gains (RMSNorm: 7x faster)
  • Technology landscape: You explored differences between Python, PyTorch, Triton, CUDA, C++, plus alternatives like tinygrad and Mojo
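As a toy illustration of the online softmax idea above (plain NumPy, not Liger's Triton kernel): the running max and the normalizer are maintained in a single pass, with the normalizer rescaled whenever a new max appears.

```python
import numpy as np

def online_softmax(x):
    # Single pass to compute the max m and normalizer d together,
    # instead of one pass for the max and another for the sum.
    m, d = -np.inf, 0.0
    for v in x:
        m_new = max(m, v)
        d = d * np.exp(m - m_new) + np.exp(v - m_new)
        m = m_new
    return np.exp(np.asarray(x) - m) / d

x = [1.0, 2.0, 3.0]
print(np.allclose(online_softmax(x), np.exp(x) / np.exp(x).sum()))  # → True
```

The payoff in a fused kernel is that each input value is read from memory once while building the normalizer, which is where the "halving memory reads" claim comes from.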

What you built/did (code):

  • Created a %%gpu IPython cell magic that runs code on Modal with GPU access
  • Used Modal's Cls pattern with @modal.enter() for persistent state across cells
  • Added matplotlib image capture so plots render locally from remote GPU execution
  • Benchmarked Liger RMSNorm vs HuggingFace RMSNorm — confirmed ~6-10x speedup, with the gap widening at larger sequence lengths
  • Noted the benchmark isn't entirely "fair" since it compares Triton kernels vs PyTorch eager mode — a fairer test would include torch.compile()

Thanks. Can you read the weights and bias logs https://wandb.ai/local-research-group/liger-kernel?nw=nwusergalopyz? And see what I have found?

read_url(url='<…>…ndb.ai/local-research-group/li…')
{
  "id": "toolu_01AWGnKwLigDUNr6vrPftBbG",
  "call": {
    "function": "read_url",
    "arguments": {
      "url": "https://wandb.ai/local-research-group/liger-kernel?nw=nwusergalopyz"
    }
  },
  "result": "No output"
}

Unfortunately, the W&B page didn't return any readable content — it's likely behind authentication or rendered dynamically with JavaScript.

Could you share what you found? Maybe paste a screenshot of the key plots, or tell me which Liger kernels you tested and what the throughput/memory results looked like?

image.png

I can see you ran a lot of experiments! From the runs list, it looks like you systematically tested different combinations of Liger kernels:

  • liger-all: All Liger kernels enabled
  • liger-none: No Liger kernels (baseline)
  • liger-mlp: Only Liger MLP (SwiGLU)
  • liger-rms: Only Liger RMSNorm
  • liger-rms_mlp: RMSNorm + MLP combined
  • liger-rms_fusedcross: RMSNorm + FusedLinearCrossEntropy
  • llama-rms: Using Llama's RMSNorm instead
  • Various 8192 sequence length runs

All with torch.compile(dynamic=True).

The image is a bit small to read the exact numbers clearly. From what I can make out, the throughput differences between runs seem relatively modest, and memory usage looks similar across most configurations.

Could you share the key takeaways? For instance:

  • Did any specific Liger kernel combination give a clear throughput or memory win?
  • Did the 8192 sequence length runs behave differently?

All of the Liger kernel implementations improved tokens per second except for FusedLinearCrossEntropy. Using liger-mlp and liger-rms improved the tps. I could not use Liger RoPE because we are using sequence packing, so I'm just using the ModernBERT RoPE implementation. FusedLinearCrossEntropy actually slowed things down quite a bit: GPU RAM usage went down from 12GB to 8GB, but throughput dropped from 27k to 23k. I thought it would've been fine to just use all the Liger kernels and still get the speedup, but it was slower. I think it's because we are using such a small model.

That's a great finding! Your analysis makes sense:

  • Liger RMSNorm + MLP: Pure wins — faster kernels with no tradeoff
  • FusedLinearCrossEntropy: Classic speed vs memory tradeoff — saves 4GB RAM but costs ~15% throughput

Your intuition about the small model size is likely correct. FLCE's main benefit is avoiding materializing the full logit tensor, which matters most with large vocabularies and large hidden dimensions. With SmolLM2-135M, the logit tensor is small enough that the overhead of chunking and fusing outweighs the memory savings.

That 4GB memory saving from FLCE could still be useful later though — if you ever need to increase batch size or move to a larger model, it might become worth the throughput cost. Did you find that with the 8192 sequence length runs, the results changed at all?

Not really. Small models are not really worth using Liger kernel for, I guess.

Yeah, that's a reasonable conclusion. The fusion and chunking optimizations in Liger are designed to address bottlenecks (memory bandwidth, large logit tensors) that become more pronounced with larger models. At 135M parameters, the overhead of the fused kernels can outweigh the benefits.

The good news is you now know exactly which kernels help (RMSNorm, MLP) and which don't (FLCE, RoPE incompatible with packing) for your setup. When your group scales up to larger models, you'll have a clear picture of what to enable.

What are you working on next?

Hyperparameter sweep

Purpose

The goal is to perform hyperparameter sweeps with the SmolLM2-135M model on the following datasets:

| Dataset | Domain | Number of sequences (train/test) |
|---|---|---|
| avelinapythonedu | Code CPT | 7.68M (6.91M/0.768M) |
| finemath | Math CPT | 6.7M (6.03M/0.67M) |
| glaive | Code IFT | 950k (855k/95k) |
| numina | Math IFT | 859k (774k/86k) |
| tulu | Math IFT | 796k (716k/80k) |
| repo | dataset | size | split | total_tokens | num_sequences | avg_length | median_length | std_length | min_length | max_length | seqs_over_max | pct_over_max | p25 | p50 | p75 | p95 | p99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| split-avelina-python-edu | avelinapythonedu | full | train | 3,452,655,475 | 6,910,602 | 499.62 | 273 | 750.00 | 6 | 8,192 | 0 | 0 | 143 | 273 | 541 | 1,664 | 3,817 |
| split-avelina-python-edu | avelinapythonedu | full | test | 382,424,047 | 767,845 | 498.05 | 273 | 744.83 | 14 | 8,192 | 0 | 0 | 143 | 273 | 540 | 1,655 | 3,789 |
| split-finemath | finemath | full | train | 8,745,300,721 | 6,029,543 | 1,450.41 | 919 | 1,575.68 | 34 | 8,192 | 0 | 0 | 562 | 919 | 1,652 | 4,855 | 8,192 |
| split-finemath | finemath | full | test | 970,519,958 | 669,950 | 1,448.65 | 919 | 1,573.62 | 38 | 8,192 | 0 | 0 | 562 | 919 | 1,653 | 4,840 | 8,192 |

numina averages about 450 tokens per sequence. However, 32,176,880 tokens / 3,000 (ba) / 8 (bs) ≈ 1,340 tokens per sequence with padding when training.

We aim to train with full finetuning, LoRA, rsLoRA, and DoRA with different learning rates, LoRA variant ranks, and batch size.

We want to measure learning vs. forgetting on those techniques using the following benchmarks:

| Purpose | General / Reasoning tasks | Math | Code |
|---|---|---|---|
| Development set | MMLU, MMLU-Pro, HellaSwag, TruthfulQA | GSM8K | HumanEval |
| Unseen set | | MATH, Deepmind Mathematics | BigBench (Python), MBPP |

Targets to keep in mind

  • Source domain (knowledge we’re trying to retain after finetuning)
    • HellaSwag: 42.1
    • WinoGrande: 51.3
    • ARC (average): 43.9
  • Target domain (new knowledge we’re trying to learn with finetuning)
    • GSM8k (math): 1.4
    • HumanEval (code): need to compute for SmolLM2-135M

Context

import pymupdf4llm
lll_md = pymupdf4llm.to_markdown("../paper/lora_learns_less.pdf")
print(lll_md[:300]) 
Published in Transactions on Machine Learning Research (08/2024)
## **LoRA Learns Less and Forgets Less**

**Dan Biderman** [1] _[,]_ [2] **, Jacob Portes** [2] **, Jose Javier Gonzalez Ortiz** [2] **, Mansheej Paul** [2] **, Philip**
**Greengard** [1] **, Connor Jennings** [2] **, Daniel King** [2]

I want to perform hyperparameter sweeps. Should I start with learning rate, LoRA variant ranks, or batch size?

Based on the research context you've provided, I'd recommend starting with learning rate sweeps first. Here's why:

Learning rate should come first because:

  • It has the most direct impact on whether training works at all
  • The optimal LR for LoRA is consistently ~10x higher than FullFT (as shown in both papers)
  • You need to find the optimal LR before you can properly evaluate other hyperparameters
  • The "LoRA Without Regret" blog shows that optimal LR is relatively independent of rank (especially for ranks ≥4)

Suggested order:

  1. Learning rate - sweep this first for a baseline LoRA configuration (e.g., rank=128, batch_size=40)
  2. Rank - once you have a good LR, sweep rank to find the minimum that doesn't hurt performance
  3. Batch size - test last, since the blog showed LoRA can be less tolerant of large batch sizes in some settings

One practical tip from the blog: the optimal LR has a fairly wide peak for LoRA, so you don't need an extremely fine-grained sweep.

What model size and dataset are you planning to start with? This might affect the specific LR range you should explore.


What do I do with lr scheduler?

Good question! The research you've shared used constant learning rate schedules (no warmup or cooldown) for their experiments to eliminate confounds when comparing LoRA to FullFT.

However, for your actual training runs (not just hyperparameter sweeps), you have a few options:

For sweeps:

  • Use constant LR (like the research) - this makes it easier to compare different configurations cleanly

For full training:

  • Cosine decay is popular and often works well for fine-tuning
  • Linear warmup + cosine decay is a common choice
  • Constant can work fine, especially for shorter runs

The "LoRA Without Regret" blog mentioned that optimal LR might be slightly higher (around 15x vs 10x over FullFT) for very short runs due to how B matrix grows from zero initialization.

What's your training setup - are you doing short runs for sweeps first, or jumping straight into longer training? This might affect which scheduler makes sense to start with.

Learning Rate

I want to start learning rate sweep with numina dataset. But let me try with the full finetuning, first.

Good plan! Starting with full fine-tuning will give you a baseline to compare against when you move to LoRA.

For the numina dataset (Math IFT with 859k sequences), what configuration are you thinking for this initial full fine-tuning run?

A few things to consider:

  • Learning rate range - For full fine-tuning, you might try something like 1e-6 to 1e-4
  • Number of epochs - Will you do 1 epoch or multiple?
  • Batch size - What fits in your memory?

What's your starting point?

My plan is to perform on 10% of the training data. I can fit 4 micro batch size in my machine, but batch size can be bigger with gradient accumulation. What did $lll_md use for their experiment regarding those hyperparameters?

Looking at the paper, for their Math IFT (MetaMathQA) experiments, they used:

  • Sequence length: 1024
  • Global batch size: 768
  • Micro batch size per device: 24
  • Training duration: Multiple epochs (1, 2, 4, 8, 16)

For full fine-tuning on Math IFT specifically:

  • Learning rate: 1e-5
  • Optimizer: LionW (not AdamW)

Since you can fit micro batch size of 4, and they used global batch size of 768 on 32 GPUs, you'd need gradient accumulation steps of 192 to match their effective batch size (if on 1 GPU).

However, for your 10% subset experiment, you might want to start smaller. What are you thinking - try to match their batch size, or start with something more modest?

Okay, I will start from batch size of 512 with 4 micro batch, so that's 128 gradient accumulation steps?

Yes, exactly! With:

  • Global batch size: 512
  • Micro batch size: 4
  • You get: 512 ÷ 4 = 128 gradient accumulation steps

That's a reasonable starting point - large enough to be stable, but not as extreme as their 768.

For your learning rate sweep on full fine-tuning with the numina dataset (10% subset), what range are you planning to explore?

How many steps should I do with these settings?

Good question! Let's figure this out together.

You have:

  • 10% of numina training data = ~77,400 sequences (10% of 774k)
  • Global batch size = 512

So each step processes 512 sequences, which means:

  • 1 epoch = 77,400 ÷ 512 ≈ 151 steps

The paper trained their Math IFT models for multiple epochs (1, 2, 4, 8, 16). For an initial learning rate sweep, you probably want enough steps to see the learning curve but not waste compute.

How many epochs are you thinking? One epoch (151 steps) might be enough to identify good vs bad learning rates. Or would you prefer 2 epochs (~302 steps) to see more of the learning dynamics?
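The step arithmetic above as a quick sanity check (plain Python; `train_steps` is a hypothetical helper, not part of the training config):

```python
def train_steps(n_sequences, global_batch_size, micro_batch_size, epochs=1):
    # One optimizer step consumes global_batch_size sequences;
    # gradient accumulation spreads each step over smaller micro batches.
    grad_accum = global_batch_size // micro_batch_size
    steps_per_epoch = n_sequences // global_batch_size
    return steps_per_epoch * epochs, grad_accum

steps, accum = train_steps(77_400, 512, 4)
print(steps, accum)  # → 151 128
```

So one epoch over the 10% numina subset is 151 optimizer steps, each made of 128 accumulated micro batches.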

I am trying 4 microbatch size and 8 total train batch size. Per epoch, Weights & Biases is logging 888 samples. Does this seem right?

Let me check the math with you:

  • Micro batch size: 4
  • Total train batch size: 8
  • This means gradient accumulation steps = 8 ÷ 4 = 2 steps

If W&B is logging 888 samples per epoch, that means:

  • 888 ÷ 8 (batch size) = 111 steps per epoch

This suggests your 10% subset has about 888 sequences total (111 steps × 8 batch size).

Does that match what you expected for 10% of the numina training data? The full training set should be ~774k sequences, so 10% would be ~77k.

Are you perhaps using a much smaller subset, or is there something else going on with how the data is being loaded?

Here is the Weights & Biases logging on the finemath dataset: W&B logging

pasted_image_a6bd5f5c-35e4-481f-a662-383c0695cc41.png

I used 81 million tokens (1% of the dataset) for those sweeps except for the longer one. I wanted to do max_duration: "819_200_000tok", but I ran out of credits.

Evals

Helpful evaluation commands

  • Example of using lighteval: lighteval accelerate "model_name=HuggingFaceTB/SmolLM2-135M" "leaderboard|hellaswag|0|0"

  • With vLLM:

    • export VLLM_WORKER_MULTIPROC_METHOD=spawn
    • lighteval vllm "model_name=HuggingFaceTB/SmolLM2-135M" "leaderboard|hellaswag|0|0"

This PR has information about how to use Composer for evals.

I want to write a blog on this project. Help!

Exciting! You've done a lot of great work here. Let's start by thinking about the structure.

Based on everything in this dialog, here are the major topics you could cover:

  1. Project overview - LoRA learns less and forgets less replication
  2. Dataset preparation - decontamination with Elasticsearch
  3. Custom model - why you built SmolLM2 from scratch (unpadding story)
  4. Training optimizations - torch.compile, sequence packing, Liger kernel
  5. Hyperparameter sweeps - learning rate, rank, batch size experiments
  6. Results - learning vs forgetting across LoRA, rsLoRA, DoRA, and full fine-tuning

A few questions to help me guide you:

  • Who is your target audience? ML practitioners, researchers, or more general readers?
  • What's the main takeaway you want readers to walk away with?
  • Do you want one long blog post or a series of shorter ones? There's a LOT of material here - the training optimization journey alone could be its own post.