Note: 15

You can read urls with &read_url

Prompt: 18

Please read all urls above. Any holes in this plan?

Assistant: 12,040

Copied!

{
  "id": "toolu_01GBASPwZtBJGApsGK7Pvvu7",
  "call": {
    "function": "read_url",
    "arguments": {
      "url": "https://alexzhang13.github.io/blog/2025/rlm/"
    }
  },
  "result": "'[ Alex L. Zhang ](/) Toggle navigation\\n\\n  * [about ](/)\\n  * [blog ](/blog/)\\n  * [publications and pre-prints ](/publications/)\\n  * [projects ](/projects/)\\n  * [academics ](/academics/)\\n  * ctrl k __\\n  * ______\\n\\n# Recursive Language Models\\n\\nWe propose Recursive Language Models (RLMs), an inference strategy where language models can decompose and recursively interact with input context of unbounded length through REPL environments.\\n\\n## tl;dr\\n\\nWe explore language models that **recursively call themselves or other LLMs** before providing a final answer. Our goal is to enable the processing of essentially unbounded input context length and output length and to mitigate degradation \u201ccontext rot\u201d.\\n\\nWe propose **Recursive Language Models** , or **RLM** s, a general inference strategy where language models can decompose and recursively interact with their input context as a variable. We design a specific instantiation of this where GPT-5 or GPT-5-mini is queried in a Python REPL environment that stores the user\u2019s prompt in a variable.\\n\\nWe demonstrate that an **RLM using GPT-5-mini outperforms GPT-5** on a split of the most difficult long-context benchmark we got our hands on (OOLONG ) by more than **double** the number of correct answers, and is **cheaper** per query on average! We also construct a new long-context Deep Research task from BrowseComp-Plus . On it, we observe that RLMs outperform other methods like ReAct + test-time indexing and retrieval over the prompt. Surprisingly, we find that RLMs also do not degrade in performance when given 10M+ tokens at inference time.\\n\\nWe are excited to share these very early results, as well as argue that RLMs will be a powerful paradigm very soon. We think that RLMs trained explicitly to recursively reason are likely to represent the next milestone in **general-purpose inference-time scaling** after CoT-style reasoning models and ReAct-style agent models.\\n\\nWe have a compressed summary in the original tweet: <https://x.com/a1zhang/status/1978469116542337259>\\n\\nWe also now have a minimal implementation for people to build on top of: <https://github.com/alexzhang13/rlm>\\n\\n![Teaser Figure](/assets/img/rlm/teaser.png) **Figure 1.** An example of a recursive language model (RLM) call, which acts as a mapping from text \u2192 text, but is more flexible than a standard language model call and can scale to near-infinite context lengths. An RLM allows a language model to interact with an environment (in this instance, a REPL environment) that stores the (potentially huge) context, where it can recursively sub-query \u201citself\u201d, other LM calls, or other RLM calls, to efficiently parse this context and provide a final response.\\n\\n## Prelude: Why is \u201clong-context\u201d research so unsatisfactory?\\n\\nThere is this well-known but difficult to characterize phenomenon in language models (LMs) known as \u201ccontext rot\u201d. [Anthropic defines context rot](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) as \u201c[when] the number of tokens in the context window increases, the model\u2019s ability to accurately recall information from that context decreases\u201d, but many researchers in the community know this definition doesn\u2019t _fully_ hit the mark. For example, if we look at popular needle-in-the-haystack benchmarks like [RULER](https://arxiv.org/abs/2404.06654), most frontier models actually do extremely well (90%+ on 1-year old models).\\n\\n![Pun kin](/assets/img/rlm/pumpkin.png) _I asked my LM to finish carving the pumpkin joke it started yesterday. It said, \u201cPumpkin? What pumpkin?\u201d \u2014 the context completely rotted._\\n\\nBut [people have noticed](https://x.com/kwindla/status/1962230672082497866) that context rot is this weird thing that happens when your Claude Code history gets bloated, or you chat with ChatGPT for a long time \u2014 it\u2019s almost like, as the conversation goes on, the model gets\u2026dumber? It\u2019s sort of this well-known but hard to describe failure mode that we don\u2019t talk about in our papers because we can\u2019t benchmark it. The natural solution is something along the lines of, \u201cwell maybe if I split the context into two model calls, then combine them in a third model call, I\u2019d avoid this degradation issue\u201d. We take this intuition as the basis for a recursive language model.\\n\\n## **Recursive Language Models (RLMs).**\\n\\nA recursive language model is a thin wrapper around a LM that can spawn (recursive) LM calls for intermediate computation \u2014 from the perspective of the user or programmer, it is the same as a model call. In other words, you query a RLM as an \u201cAPI\u201d like you would a LM, i.e. `rlm.completion(messages)` is a direct replacement for `gpt5.completion(messages)`. We take a **context-centric view** rather than a **problem-centric view** of input decomposition. This framing retains the functional view that we want a system that can answer a particular **query** over some associated **context** :\\n\\n![API](/assets/img/rlm/api.png) **Figure 2.** A recursive language model call replaces a language model call. It provides the user the illusion of near infinite context, while under the hood a language model manages, partitions, and recursively calls itself or another LM over the context accordingly to avoid context rot.\\n\\nUnder the hood, a RLM provides only the **query** to the LM (which we call the **root LM** , or LM with depth=0), and allows this LM to interact with an **environment** , which stores the (potentially huge) **context**.\\n\\nWe choose the **environment** to be a loop where the LM can write to and read the output of cells of a Python REPL Notebook (similar to a Jupyter Notebook environment) that is pre-loaded with the **context** as a variable in memory. The **root LM** has the ability to call a recursive LM (or LM with depth=1) inside the REPL **environment** as if it were a function in code, allowing it to naturally peek at, partition, grep through, and launch recursive sub-queries over the **context**. **Figure 3** shows an example of how the RLM with a REPL **environment** produces a final answer.\\n\\n![API](/assets/img/rlm/repl.png) **Figure 3.** Our instantiation of the RLM framework provides the root LM the ability to analyze the context in a Python notebook environment, and launch recursive LM calls (depth=1) over any string stored in a variable. The LM interacts by outputting code blocks, and it receives a (truncated) version of the output in its context. When it is done, it outputs a final answer with `FINAL(\u2026)` tags or it can choose to use a string in the code execution environment with `FINAL_VAR(\u2026)`.\\n\\nWhen the **root LM** is confident it has an answer, it can either directly output the answer as `FINAL(answer)`, or it can build up an answer using the variables in its REPL environment, and return the string inside that answer as `FINAL_VAR(final_ans_var)`.\\n\\nThis setup yields several benefits that are visible in practice:\\n\\n  1. The context window of the root LM is rarely clogged \u2014 because it never directly sees the entire context, its input context grows slowly.\\n  2. The root LM has the flexibility to view subsets of the context, or naively recurse over chunks of it. For example, if the query is to find a needle-in-the-haystack fact or multi-hop fact, the root LM can use `regex` queries to roughly narrow the context, then launch recursive LM calls over this context. This is particularly useful for arbitrary long context inputs, where indexing a retriever is expensive on the fly!\\n  3. The context can, in theory, be any modality that can be loaded into memory. The root LM has full control to view and transform this data, as well as ask sub-queries to a recursive LM.\\n\\n**Relationship to test-time inference scaling.** We are particularly excited about this view of language models because it offers another axis of scaling test-time compute. The trajectory in which a language model chooses to interact with and recurse over its context is entirely learnable, and can be RL-ified in the same way that reasoning is currently trained for frontier models. Interestingly, it does not directly require training models that can handle huge context lengths because **no single language model call should require handling a huge context**.\\n\\n**RLMs with REPL environments are powerful.** We highlight that the choice of the **environment** is flexible and not fixed to a REPL or code environment, but we argue that it is a good choice. The two key design choices of recursive language models are 1) treating the prompt as a Python variable, which can be processed programmatically in arbitrary REPL flows. This allows the LLM to figure out what to peek at from the long context, at test time, and to scale any decisions it wants to take (e.g., come up with its own scheme for chunking and recursion adaptively) and 2) allowing that REPL environment to make calls back to the LLM (or a smaller LLM), facilitated by the decomposition and versatility from choice (1).\\n\\nWe were excited by the design of CodeAct, and reasoned that adding recursive model calls to this system could result in significantly stronger capabilities \u2014 after all, LM function calls are incredibly powerful. However, we argue that RLMs fundamentally view LM usage and code execution differently than prior works: the **context** here is an object to be understood by the model, and code execution and recursive LM calls are a means of understanding this context efficiently. Lastly, in our experiments we only consider a recursive depth of 1 \u2014 i.e. the root LM can only call LMs, not other RLMs. It is a relatively easy change to allow the REPL environment to call RLMs instead of LMs, but we felt that for most modern \u201clong context\u201d benchmarks, a recursive depth of 1 was sufficient to handle most problems. However, for future work and investigation into RLMs, enabling larger recursive depth will naturally lead to stronger and more interesting systems.\\n\\n**The formal definition (click to expand)** Consider a general setup of a language model $M$ receiving a query $q$ with some associated, potentially long context $C = {[c_1,c_2,\u2026,c_m]}$. The standard approach is to treat $M(q,C)$ like a black box function call, which takes a query and context and returns some `str` output. We retain this frame of view, but define a thin scaffold on top of the model to provide a more **expressive** and **interpretable** function call $RLM_M(q,C)$ with the same input and output spaces. Formally, a recursive language model $RLM_{M}(q, C)$ over an environment $\\\\mathcal{E}$ similarly receives a query $q$ and some associated, potentially long context $C = [c_1,c_2,\u2026,c_m]$ and returns some `str` output. The primary difference is that we provide the model a tool call $RLM_M(\\\\hat{q}, \\\\hat{C})$, which spawns an isolated sub-RLM instance using a new query $\\\\hat{q}$ and a transformed version of the context $\\\\hat{C}$ with its own isolated environment $\\\\hat{\\\\mathcal{E}}$; eventually, the final output of this recursive callee is fed back into the environment of the original caller. The environment $\\\\mathcal{E}$ abstractly determines the control flow of how the language model $M$ is prompted, queried, and handled to provide a final output. In this paper, we specifically explore the use of a Python REPL environment that stores the input context $C$ as a variable in memory. This specific choice of environment enables the language model to **peek at** , **partition** , **transform** , and **map** over the input context and use recursive LMs to answer sub-queries about this context. Unlike prior agentic methods that rigidly define these workflow patterns, RLMs defer these decisions entirely to the language model. Finally, we note that particular choices of environments $\\\\mathcal{E}$ are flexible and are a generalization of a base model call: the simplest possible environment $\\\\mathcal{E}_0$ queries the model $M$ with input query and context $q, C$ and returns the model output as the final answer.\\n\\n## Some early (and very exciting) results!\\n\\nWe\u2019ve been looking around for benchmarks that reflect natural long-context tasks, e.g. long multi-turn Claude Code sessions. We namely were looking to highlight two properties that limit modern frontier models: 1) the context rot phenomenon, where model performance degrades as a function of context length, and 2) the system-level limitations of handling an enormous context.\\n\\nWe found in practice that many long-context benchmarks offer contexts that are not really that long and which were already solvable by the latest generation (or two) of models. In fact, we found some where **models could often answer queries without the context**! We luckily quickly found two benchmarks where modern frontier LLMs struggle to perform well, but we are [actively seeking](https://x.com/lateinteraction/status/1976964409139642716) any other good benchmark recommendations to try.\\n\\n### **Exciting Result #1 \u2014 Dealing with Context Rot.**\\n\\nThe **OOLONG** benchmark is a challenging new benchmark that evaluates long-context reasoning tasks over fine-grained information in context. We were fortunate to have the (anonymous _but not affiliated with us_) authors share the dataset upon request to run our experiments on a split of this benchmark.\\n\\n**Setup.** The `trec_coarse` split consists of 6 different types of queries to answer distributional queries about a giant list of \u201cquestion\u201d entries. For example, one question looks like:\\n\\n`For the following question, only consider the subset of instances that are associated with user IDs 67144, 53321, 38876, 59219, 18145, 64957, 32617, 55177, 91019, 53985, 84171, 82372, 12053, 33813, 82982, 25063, 41219, 90374, 83707, 59594. Among instances associated with these users, how many data points should be classified as label \\'entity\\'? Give your final answer in the form \\'Answer: number\\'.`\\n\\nThe query is followed by ~3000 - 6000 rows of entries with associated user IDs (not necessarily unique) and instances that **are not explicitly labeled** (i.e. the model has to infer the labeling to answer). They look something like this:\\n\\n```\\nDate: Dec 12, 2022 || User: 63685 || Instance: How many years old is Benny Carter ?\\nDate: Dec 30, 2024 || User: 35875 || Instance: What war saw battles at Parrot \\'s Beak and Black Virgin ?\\nDate: Apr 13, 2024 || User: 80726 || Instance: What Metropolis landmark was first introduced in the Superman cartoons of the 1940 \\'s ?\\nDate: Feb 29, 2024 || User: 59320 || Instance: When was Calypso music invented?\\n...\\n\\n```\\n\\nThe score is computed as the number of queries answered correctly by the model, with the caveat that for numerical / counting problems, they use a continuous scoring metric. This benchmark is extremely hard for both frontier models and agents because they have to **semantically** map and associate thousands of pieces of information in a single query, and cannot compute things a-priori! We evaluate the following models / agents:\\n\\n  * **GPT-5.** Given the whole context and query, tell GPT-5 to provide an answer.\\n  * **GPT-5-mini.** Given the whole context and query, tell GPT-5-mini to provide an answer.\\n  * **RLM(GPT-5-mini).** Given the whole context and query, tell RLM(GPT-5-mini) to provide an answer. GPT-5-mini (root LM) can recursively call GPT-5-mini inside its REPL environment.\\n  * **RLM(GPT-5) without sub-calls.** Given the whole context and query, tell RLM(GPT) to provide an answer. GPT-5 (root LM) cannot recursively call GPT-5 inside its REPL environment. This is an ablation for the use of a REPL environment without recursion.\\n  * **ReAct w/ GPT-5 + BM25.** We chunk every lines into its own \u201cdocument\u201d, and gives a ReAct loop access to a BM25 retriever to return 10 lines per search request.\\n\\n**Results.** We focus explicitly on questions with contexts over 128k tokens (~100 queries), and we track both the performance on the benchmark, as well as the overall API cost of each query. In all of the following results (Figure **4a,b**), **the entire input fits in the context window of GPT-5 / GPT-5-mini** \u2014 i.e., incorrect predictions are never due to truncation or context window size limitations:\\n\\n![API](/assets/img/rlm/oolong-132k.png) **Figure 4a.** We report the overall score for each method on the `trec_coarse` dataset of the OOLONG benchmark for queries that have a context length of 132k tokens. We compare performance to GPT-5. RLM(GPT-5-mini) outperforms GPT-5 by over **34 points (~114% increase)** , and is nearly as cheap per query (we found that the median query is cheaper due to some outlier, expensive queries).\\n\\nIt turns out actually that **RLM(GPT-5-_mini_)** outperforms **GPT-5** and **GPT-5-mini** by **> 33%**\u2191 raw score (over double the performance) while maintaining roughly the same total model API cost as **GPT-5** per query! When ablating recursion, we find that RLM performance degrades by ~10%, likely due to many questions requiring the model to answer semantic questions about the data (e.g. label each question). We see in **Figure 4b** that these gains roughly transfer when we double the size of the context to ~263k tokens as well, although with some performance degradation!\\n\\n![API](/assets/img/rlm/oolong-256k.png) **Figure 4b.** We report the overall score for each method on the trec_coarse dataset of the OOLONG benchmark for queries that have a context length of 263k tokens, nearly the limit for GPT-5/GPT-5-mini. We compare performance to GPT-5. RLM(GPT-5-mini) outperforms GPT-5 by over **15 points (~49% increase)** , and is cheaper per query on average.\\n\\nNotably, the performance of **GPT-5-mini** drops while **GPT-5** does not, which indicates that context rot is more severe for GPT-5-mini. We additionally noticed that the performance drop for the RLM approaches occurs for **_counting_** problems, where it makes more errors when the context length increases \u2014 for **GPT-5** , it already got most of these questions incorrect in the 132k context case, which explains why its performance is roughly preserved. Finally, while the **ReAct + GPT-5 + BM25** baseline doesn\u2019t make much sense in this setting, we provide it to show retrieval is difficult here while **RLM** is the more appropriate method.\\n\\nGreat! So we\u2019re making huge progress in solving goal (1), where GPT-5 has _just_ enough context window to fit the 263k case. But what about goal (2), where we may have 1M, 10M, or even 100M tokens in context? _Can we still treat this like a single model call?_\\n\\n### **Exciting Result #2 \u2014 Ridiculously Large Contexts**\\n\\nMy advisor Omar is a [superstar in the world of information retrieval (IR)](https://arxiv.org/abs/2004.12832), so naturally we also wanted to explore whether RLMs scale properly when given thousands (or more!) of documents. OOLONG provides a giant block of text that is difficult to index and therefore difficult to compare to retrieval methods, so we looked into [DeepResearch](https://openai.com/index/introducing-deep-research/)-like benchmarks that evaluate answering queries over documents.\\n\\n**Retrieval over huge offline corpuses.** We initially were interested in [BrowseComp](https://openai.com/index/browsecomp/) , which evaluates agents on multi-hop, web-search queries, where agents have to find the relevant documents online. We later found the [BrowseComp-Plus](https://arxiv.org/abs/2508.06600) benchmark, which pre-downloads all possible relevant documents for all queries in the original benchmark, and just provides a list of ~100K documents (~5k words on average) where the answer to a query is scattered across this list. For benchmarking RLMs, this benchmark is perfect to see if we can just throw ridiculously large amount of context into a single `chat.completion(...)` RLM call instead of building an agent!\\n\\n**Setup.** We explore how scaling the # documents in context affects the performance of various common approaches to dealing with text corpuses, as well as RLMs. Que<TRUNCATED>"
}

Copied!

{
  "id": "toolu_01VmJH9ghVCTkLmApVAPm93h",
  "call": {
    "function": "read_url",
    "arguments": {
      "url": "https://answerdotai.github.io/toolslm/shell.html"
    }
  },
  "result": "'[ ](./index.html) [ toolslm ](./index.html)\\n\\n__\\n\\n  1. [shell source](./shell.html)\\n\\n[ ](./index.html)\\n\\n  * [ toolslm](./index.html)\\n\\n  * [ xml source](./xml.html)\\n\\n  * [ funccall source](./funccall.html)\\n\\n  * [ shell source](./shell.html)\\n\\n  * [ Download helpers](./download.html)\\n\\n  * [ Markdown Hierarchy Parser](./md_hier.html)\\n\\n## On this page\\n\\n  * TerminalInteractiveShell.run_cell\\n  * get_shell\\n\\n  * [__Report an issue](https://github.com/AnswerDotAI/toolslm/issues/new)\\n\\n## Other Formats\\n\\n  * [ __CommonMark](shell.html.md)\\n\\n# shell source\\n\\nExported source\\n\\n```\\nimport ast, time, signal, traceback\\nfrom fastcore.utils import *\\n```\\n\\n__\\n\\n[`get_shell`](https://AnswerDotAI.github.io/toolslm/shell.html#get_shell) is like [`python`](https://AnswerDotAI.github.io/toolslm/funccall.html#python), except it also maintains a stateful interpreter, rather than just running a single line of code. This is implemented using IPython, so that must be installed.\\n\\nExported source\\n\\n```\\nfrom IPython.terminal.interactiveshell import TerminalInteractiveShell\\nfrom IPython.utils.capture import capture_output\\n```\\n\\n __\\n\\n```\\ndef exception2str(ex:Exception)->str:\\n    \"Convert exception `ex` into a string\"\\n    return \\'\\'.join(traceback.format_exception(type(ex), ex, ex.__traceback__))\\n```\\n\\n__\\n\\n```\\ntry: print(1/0)\\nexcept Exception as e: print(exception2str(e))\\n```\\n\\n__\\n\\n```\\nTraceback (most recent call last):\\n  File \"/var/folders/ss/34z569j921v58v8n1n_8z7h40000gn/T/ipykernel_37260/4058275565.py\", line 1, in <module>\\n    try: print(1/0)\\n               ~^~\\nZeroDivisionError: division by zero\\n\\n```\\n\\n* * *\\n\\n[source](https://github.com/AnswerDotAI/toolslm/blob/main/toolslm/shell.py#L19)\\n\\n### TerminalInteractiveShell.run_cell\\n\\n>\\n```\\n>      TerminalInteractiveShell.run_cell (cell, timeout=None)\\n```\\n\\n_Wrapper for original`run_cell` which adds timeout and output capture_\\n\\nExported source\\n\\n```\\nTerminalInteractiveShell.orig_run = TerminalInteractiveShell.run_cell\\n```\\n\\n __\\n\\nExported source\\n\\n```\\n@patch\\ndef run_cell(self:TerminalInteractiveShell, cell, timeout=None):\\n    \"Wrapper for original `run_cell` which adds timeout and output capture\"\\n    if timeout:\\n        def handler(*args): raise TimeoutError()\\n        signal.signal(signal.SIGALRM, handler)\\n        signal.alarm(timeout)\\n    try:\\n        with capture_output() as io: result = self.orig_run(cell)\\n        result.stdout = io.stdout\\n        return result\\n    except TimeoutException as e:\\n        result = self.ExecutionResult(error_before_exec=None, error_in_exec=e)\\n    finally:\\n        if timeout: signal.alarm(0)\\n```\\n\\n__\\n\\n* * *\\n\\n[source](https://github.com/AnswerDotAI/toolslm/blob/main/toolslm/shell.py#L35)\\n\\n### get_shell\\n\\n>\\n```\\n>      get_shell ()\\n```\\n\\n_Get a`TerminalInteractiveShell` with minimal functionality_\\n\\nExported source\\n\\n```\\ndef get_shell()->TerminalInteractiveShell:\\n    \"Get a `TerminalInteractiveShell` with minimal functionality\"\\n    sh = TerminalInteractiveShell()\\n    sh.logger.log_output = sh.history_manager.enabled = False\\n    dh = sh.displayhook\\n    dh.finish_displayhook = dh.write_output_prompt = dh.start_displayhook = lambda: None\\n    dh.write_format_data = lambda format_dict, md_dict=None: None\\n    sh.logstart = sh.automagic = sh.autoindent = False\\n    sh.autocall = 0\\n    sh.system = lambda cmd: None\\n    return sh\\n```\\n\\n __\\n\\n```\\nshell = get_shell()\\n```\\n\\n__\\n\\n```\\nr = shell.run_cell(\\'print(3); 1+1\\')\\nr.result,r.stdout\\n```\\n\\n __\\n\\n```\\n(2, \\'3\\\\n\\')\\n```\\n\\n```\\nr = shell.run_cell(\\'raise Exception(\"blah\")\\')\\nprint(exception2str(r.error_in_exec))\\n```\\n\\n__\\n\\n```\\nTraceback (most recent call last):\\n  File \"/Users/jhoward/miniconda3/lib/python3.12/site-packages/IPython/core/interactiveshell.py\", line 3577, in run_code\\n    exec(code_obj, self.user_global_ns, self.user_ns)\\n  File \"<ipython-input-1-338156281413>\", line 1, in <module>\\n    raise Exception(\"blah\")\\nException: blah\\n\\n```\\n\\n```\\nr = shell.run_cell(\\'import time; time.sleep(10)\\', timeout=1)\\nr.error_in_exec\\n```\\n\\n __\\n\\n```\\nTimeoutError()\\n```\\n\\n  * [__Report an issue](https://github.com/AnswerDotAI/toolslm/issues/new)'"
}

Copied!

{
  "id": "toolu_01FaCKJNR86853D7P6dvuc5H",
  "call": {
    "function": "read_url",
    "arguments": {
      "url": "https://lisette.answer.ai/#tools"
    }
  },
  "result": "'## Tools\\n\\nLisette makes it easy to give LLMs access to Python functions. Just define a function with type hints and a docstring:\\n\\n```\\ndef add_numbers(\\n    a: int,  # First number to add\\n    b: int   # Second number to add  \\n) -> int:\\n    \"Add two numbers together\"\\n    return a + b\\n```\\n\\n __\\n\\nNow pass the function to [`Chat`](https://lisette.answer.ai/core.html#chat) and the model can use it automatically:\\n\\n```\\nchat = Chat(models[0], tools=[add_numbers])\\nres = chat(\"What\\'s 47 + 23? Use the tool.\")\\nres\\n```\\n\\n __\\n\\nThe result of 47 + 23 is 70.\\n\\n  * id: `chatcmpl-xxx`\\n  * model: `claude-sonnet-4-20250514`\\n  * finish_reason: `stop`\\n  * usage: `Usage(completion_tokens=18, prompt_tokens=573, total_tokens=591, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)`\\n\\nIf you want to see all intermediate messages and outputs you can use the `return_all=True` feature.\\n\\n```\\nchat = Chat(models[0], tools=[add_numbers])\\nres = chat(\"What\\'s 47 + 23 + 59? Use the tool.\",max_steps=3,return_all=True)\\ndisplay(*res)\\n```\\n\\n__\\n\\nI\u2019ll help you calculate 47 + 23 + 59 using the add_numbers tool. Since the tool can only add two numbers at a time, I\u2019ll need to do this in two steps.\\n\\n\ud83d\udd27 add_numbers({\u201ca\u201d: 47, \u201cb\u201d: 23})\\n\\n  * id: `chatcmpl-xxx`\\n  * model: `claude-sonnet-4-20250514`\\n  * finish_reason: `tool_calls`\\n  * usage: `Usage(completion_tokens=116, prompt_tokens=433, total_tokens=549, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)`\\n\\n```\\n{\\'tool_call_id\\': \\'toolu_01F9oakoP8ANHkTMD1DyQDi7\\',\\n \\'role\\': \\'tool\\',\\n \\'name\\': \\'add_numbers\\',\\n \\'content\\': \\'70\\'}\\n```\\n\\nNow I\u2019ll add the result (70) to the third number (59):\\n\\n\ud83d\udd27 add_numbers({\u201ca\u201d: 70, \u201cb\u201d: 59})\\n\\n  * id: `chatcmpl-xxx`\\n  * model: `claude-sonnet-4-20250514`\\n  * finish_reason: `tool_calls`\\n  * usage: `Usage(completion_tokens=87, prompt_tokens=562, total_tokens=649, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)`\\n\\n```\\n{\\'tool_call_id\\': \\'toolu_01Cdf3FHJdbx64F8H8ooE1Db\\',\\n \\'role\\': \\'tool\\',\\n \\'name\\': \\'add_numbers\\',\\n \\'content\\': \\'129\\'}\\n```\\n\\nThe answer is **129**.\\n\\nI calculated this by first adding 47 + 23 = 70, then adding 70 + 59 = 129.\\n\\n  * id: `chatcmpl-xxx`\\n  * model: `claude-sonnet-4-20250514`\\n  * finish_reason: `stop`\\n  * usage: `Usage(completion_tokens=41, prompt_tokens=702, total_tokens=743, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None), cache_creation_input_tokens=0, cache_read_input_tokens=0)`\\n\\nIt shows the intermediate tool calls, and the tool results!'"
}

Thanks for sharing these URLs! I've read through them. Your plan looks pretty solid overall. Here are a few things to think about:

Potential considerations:

Context size handling - The RLM paper shows contexts can be huge (10M+ tokens). How will you handle passing very large contexts into the Python REPL? Will you need streaming/chunking strategies?
Tool calling depth - The paper mentions they only use recursive depth of 1 (root LM calls LMs, not other RLMs). Do you want to support deeper recursion? If so, you'll need to think about how to prevent infinite loops and manage the call stack.
Truncation strategy - toolslm's shell captures output, but what happens if the recursive call_lm() returns a very long response? Will you truncate it before returning to the root LM?
Timeout handling - The shell already has timeout support. Should call_lm() also have its own timeout to prevent hanging on slow API calls?
System prompt for the root LM - What instructions will you give the root LM about how to use the REPL and call_lm()? The paper mentions specific prompting strategies.

Do any of these concerns resonate with your use case, or do you have thoughts on how you'd handle them?

Raw: 1,846

"""
Example prompt templates for the RLM REPL Client.
"""

from typing import Dict

DEFAULT_QUERY = "Please read through the context and answer any queries or respond to any instructions contained within it."

# System prompt for the REPL environment with explicit final answer checking
REPL_SYSTEM_PROMPT = """You are tasked with answering a query with associated context. You can access, transform, and analyze this context interactively in a REPL environment that can recursively query sub-LLMs, which you are strongly encouraged to use as much as possible. You will be queried iteratively until you provide a final answer.

The REPL environment is initialized with:
1. A `context` variable that contains extremely important information about your query. You should check the content of the `context` variable to understand what you are working with. Make sure you look through it sufficiently as you answer your query.
2. A `llm_query` function that allows you to query an LLM (that can handle around 500K chars) inside your REPL environment.
3. The ability to use `print()` statements to view the output of your REPL code and continue your reasoning.

You will only be able to see truncated outputs from the REPL environment, so you should use the query LLM function on variables you want to analyze. You will find this function especially useful when you have to analyze the semantics of the context. Use these variables as buffers to build up your final answer.
Make sure to explicitly look through the entire context in REPL before answering your query. An example strategy is to first look at the context and figure out a chunking strategy, then break up the context into smart chunks, and query an LLM per chunk with a particular question and save the answers to a buffer, then query an LLM with all the buffers to produce your final answer.

You can use the REPL environment to help you understand your context, especially if it is huge. Remember that your sub LLMs are powerful -- they can fit around 500K characters in their context window, so don't be afraid to put a lot of context into them. For example, a viable strategy is to feed 10 documents per sub-LLM query. Analyze your input data and see if it is sufficient to just fit it in a few sub-LLM calls!

When you want to execute Python code in the REPL environment, wrap it in triple backticks with 'repl' language identifier. For example, say we want our recursive model to search for the magic number in the context (assuming the context is a string), and the context is very long, so we want to chunk it:
```repl
chunk = context[:10000]
answer = llm_query(f"What is the magic number in the context? Here is the chunk: {{chunk}}")
print(answer)
```

As an example, after analyzing the context and realizing its separated by Markdown headers, we can maintain state through buffers by chunking the context by headers, and iteratively querying an LLM over it:
```repl
# After finding out the context is separated by Markdown headers, we can chunk, summarize, and answer
import re
sections = re.split(r'### (.+)', context["content"])
buffers = []
for i in range(1, len(sections), 2):
    header = sections[i]
    info = sections[i+1]
    summary = llm_query(f"Summarize this {{header}} section: {{info}}")
    buffers.append(f"{{header}}: {{summary}}")
final_answer = llm_query(f"Based on these summaries, answer the original query: {{query}}\\n\\nSummaries:\\n" + "\\n".join(buffers))
```
In the next step, we can return FINAL_VAR(final_answer).

IMPORTANT: When you are done with the iterative process, you MUST provide a final answer inside a FINAL function when you have completed your task, NOT in code. Do not use these tags unless you have completed your task. You have two options:
1. Use FINAL(your final answer here) to provide the answer directly
2. Use FINAL_VAR(variable_name) to return a variable you have created in the REPL environment as your final output

Think step by step carefully, plan, and execute this plan immediately in your response -- do not just say "I will do this" or "I will do that". Output to the REPL environment and recursive LLMs as much as possible. Remember to explicitly answer the original query in your final answer.
"""

def build_system_prompt() -> list[Dict[str, str]]:
    return [
        {
            "role": "system",
            "content": REPL_SYSTEM_PROMPT
        },
    ]


# Prompt at every step to query root LM to make a decision
USER_PROMPT = """Think step-by-step on what to do using the REPL environment (which contains the context) to answer the original query: \"{query}\".\n\nContinue using the REPL environment, which has the `context` variable, and querying sub-LLMs by writing to ```repl``` tags, and determine your answer. Your next action:"""
def next_action_prompt(query: str, iteration: int = 0, final_answer: bool = False) -> Dict[str, str]:
    if final_answer:
        return {"role": "user", "content": "Based on all the information you have, provide a final answer to the user's query."}
    if iteration == 0:
        safeguard = "You have not interacted with the REPL environment or seen your context yet. Your next action should be to look through, don't just provide a final answer yet.\n\n"
        return {"role": "user", "content": safeguard + USER_PROMPT.format(query=query)}
    else:
        return {"role": "user", "content": "The history before is your previous interactions with the REPL environment. " + USER_PROMPT.format(query=query)}

Prompt: 171

re Context length: that is the beauty if this approach and exactly what makes this work. Since the context only lives in the python REPL it can be super long since the LLM can use python to inspect the context, chunk it, search in it with grep etc. There is virtually no restriction on how long a string can be in a python environment (if you have context that is bigger than your RAM, then you have that problem, yes :D).

For the system prompt we will use prompts similar to the ones above I got from the "official repo".

Assistant: 324

Ah, you're absolutely right! I was thinking about it wrong. The context lives in the REPL's memory, not in the LLM's context window. That's elegant.

Looking at your system prompt from the repo, it seems comprehensive. A few quick observations:

The prompt is quite long - Have you tested whether models actually follow all those instructions, or do they sometimes skip steps (like not checking context before answering)?
The "safeguard" on iteration 0 is nice - forces the model to look at context first.
The recursion strategy - The prompt encourages chunking and recursive calls, which aligns with the paper's approach.

One thing I notice: the prompt says sub-LLMs can handle "around 500K chars". Is that hardcoded, or will you make it configurable based on which model you're using for recursive calls?

Also, are you planning to use the same model for root and recursive calls, or mix models (like the paper does with GPT-5 calling GPT-5-mini)?

Code: 7 ()

from lisette import *

Code: 31 ()

c = Chat(model="claude-sonnet-4-5")

c("who are you?")

Output: 472

I'm Claude, an AI assistant made by Anthropic. I'm here to help answer questions, have conversations, analyze information, help with writing and coding, and assist with a wide variety of tasks. Is there something I can help you with today?

id: chatcmpl-d5c0cd9a-f446-47f9-b586-beac45c059eb
model: claude-sonnet-4-5-20250929
finish_reason: stop
usage: Usage(completion_tokens=55, prompt_tokens=11, total_tokens=66, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0)

Note: 6

The python sandbox

Code: 12 ()

from toolslm.shell import get_shell

Code: 7 ()

sh = get_shell()

Code: 13 ()

sh.run_cell("15 + 35")

Output: 129

<ExecutionResult object at 7e6321fa6570, execution_count=1 error_before_exec=None error_in_exec=None info=<ExecutionInfo object at 7e6321fa6120, raw_cell="15 + 35" transformed_cell="15 + 35
" store_history=False silent=False shell_futures=True cell_id=None> result=50>

Code: 2,916 ()

bp = """---
title: "The Not-so Bitter Lesson"
date: 2025-10-20T15:27:52+01:00
draft: false
excerpt: How tinygrad helped me finally understand what Sutton meant
---

Have you heard about The Bitter Lesson? If you are on Twitter and follow some ML/AI people, I bet you have.

If you haven't: It is a [blog post by Richard Sutton](http://www.incompleteideas.net/IncIdeas/BitterLesson.html), a Canadian computer scientist, who wrote one of *the* books about reinforcement learning.

Sutton's core argument is simple: general methods that leverage search and compute outperform domain-specific solutions. By "search," he doesn't mean information retrieval - he means using learning algorithms to explore possible solutions or strategies. His go-to example is computer chess: early engines relied heavily on chess expertise to evaluate positions, while AlphaZero knew only the basic rules and learned by playing itself, eventually surpassing any human player.

This lesson is "bitter" because it suggests our clever insights don't matter. Our domain-specific knowledge gets crushed by what looks like "dumb brute-force search and compute." If you're an engineer, this naturally raises an uncomfortable question: what's the point of us?

## The Click

For a long time, I didn't really understand Sutton's point. What does it mean to solve problems using search and compute? What is the role of engineers then? Are we just supposed to prompt a LLM and hope for the best? That can't scale.

My aha moment came from an unlikely source: [`tinygrad`](https://github.com/tinygrad/tinygrad). It's a Python library similar to PyTorch but much simpler on the surface. I think George Hotz initially built it as a toy project after watching Andrej Karpathy create micrograd, but it evolved into a solid tensor library that runs LLMs efficiently on different hardware through clever optimizations.

George's blog post ["Can tinygrad win?"](https://geohot.github.io//blog/jekyll/update/2025/07/06/can-tinygrad-win.html) made everything click. He lays out his vision for the library: the entire stack of making code run fast on hardware is fundamentally a search problem across multiple layers — searching for better scheduling algorithms, faster GPU kernels, all of it. The task of the library is to "Expose the underlying search problem spanning several orders of magnitude." and then "Apply the state of the art in search. Burn compute.".

That's when I finally understood what Sutton meant. Like in computer chess, raw compute and effective search will beat manually crafted solutions.

But the key insight is this: someone still has to expose that search problem. Someone has to frame it correctly.

In this post, I'll show you why our work as engineers still matters, just differently. The Bitter Lesson isn't bitter at all - it's a blueprint for better engineering. And the pattern of search & compute appears everywhere once you know what to look for.

## The Framework

For me, The Bitter Lesson is all about building effective search systems. Again, not search in the sense of information retrieval, but in the sense of allowing raw compute to explore a space of solutions effectively.

For that we need three core components:
1. An effective problem formulation that exposes the right parameters in a way that different algorithms can dig into it.
2. A way to evaluate possible solutions (evals) and provide feedback to the search mechanism.
3. A way to define constraints that a solution has to satisfy (memory, time, specific form etc.)

Our job as engineers is to build these three components. We design the problem-specific, scalable infrastructure that enables search to work. Then we gtfo and let compute do what it does best.

But don't you worry, there is still place for your precious domain knowledge, even in The Bitter Lesson, but you should use it elsewhere:

Use your domain knowledge to:

* Design problem representations (make search tractable)
* Build evaluations (capture what matters)
* Set constraints (avoid unsafe regions)
* Initialize search (warm start)

And use search to:

* Explore solution space
* Find non-obvious patterns
* Exceed human intuition
* Scale the solution with compute

Sound familiar? We've done this before — moving from punch cards to text-based code was the same kind of shift. And now we're building harnesses and guardrails (and [RL environments](https://www.primeintellect.ai/blog/environments)) that define the bounds of desired solutions rather than crafting the solutions directly.

## The Examples

It's fascinating to see where these themes come up, besides `tinygrad`.

If you follow me on [Twitter](https://twitter.com/rasmus1610), you might know that I'm a huge fanboi of DSPy - a declarative library for "programming not prompting" LLMs. And DSPy embodies the approach described above by separating problem definition from solution strategy. Omar Khattab, the creator of the library, actually gave a talk called ["On Engineering AI Systems that Endure The Bitter Lesson"](https://youtu.be/qdmxApz3EJI?si=P8CFiY1gD1lCMFoG) at this year's AI Engineering conference in San Francisco. Interestingly, his vision of DSPy mirrors George Hotz's vision for tinygrad: provide the right level of abstraction, then let search & compute do its thing.

One key mechanism here is [DSPy's signatures](https://dspy.ai/learn/programming/signatures/). A DSPy signature is like a function signature in the sense that you specify your inputs and the shape of your desired outputs. It gives you the flexibility to go from input to output however you want. Sure, currently this "however" means LLMs, but just because they are the most flexible piece of computing we currently have. But if some other paradigm (or even a newer LLM) comes around the corner, your DSPy modules are very easy to adapt to it.

This search & compute principle is also captured beautifully in a talk I stumbled upon some weeks ago. [Rens Dimmendaal](https://rensdimmendaal.com/) on the [Solveit Discord server](https://solve.it.com) pointed me to [a talk by Raymond Hettinger](https://www.youtube.com/watch?v=_GP9OpZPUYc) aptly titled "Modern solvers: Problems well-defined are problems solved". Just the title captures perfectly what I think The Bitter Lesson is all about.

One of the topics in his talk, SAT solvers, shows how problem formulation may become the critical engineering work. For those who don't know what a [SAT solver](https://en.wikipedia.org/wiki/SAT_solver) is, go and read the Wikipedia page. I'm kidding. It's a piece of software which can solve boolean algebra problems very effectively - and you would be surprised how many problems can be formulated as boolean algebra problems. But - without going into too much technical detail (you can really read the wikipedia page - it's linked above) - it expects the input to be in a very specific logical form, called the [conjunctive normal form (CNF)](https://en.wikipedia.org/wiki/Conjunctive_normal_form). And it turns out that just defining your problem in CNF is often surprisingly tricky. But once you got there, a modern SAT solver gives you the solution very fast, even for very complex problems.

It's also no coincidence that [AI evals](https://maven.com/parlance-labs/evals) have become such a hot topic for building LLM apps recently (h/t [Hamel Husain](https://hamel.dev)). When you're iterating on LLM applications - whether manually or with prompt optimizers - you're essentially searching through a solution space. But unlike a binary classifier where you can rely on simple metrics like accuracy, LLMs have massive output spaces that are much harder to evaluate. Here, building robust evals becomes critical engineering work: you need to define what "good" means for your specific problem, build systems to measure it reliably, and scale that evaluation. This is another instance of building the harness — the infrastructure that lets search & compute work effectively.

But this principle even extends to the physical world. [Proxima Fusion](https://www.youtube.com/watch?v=3Nr__6DDk6I) optimized their stellarator fusion reactor design through numerical simulation—defining the physics constraints, then letting compute search the design space before building the hardware. [AlphaFold](https://alphafold.ebi.ac.uk/) does something similar for proteins: define the structure prediction problem, set up physical constraints, then let search explore conformational space.

## The Conclusion

Look, maybe I'm fooling myself here. Maybe Sutton is right all the way down, and eventually even problem formulation gets automated. Maybe we'll all end up sipping synthetic mojitos in the metaverse while AGI does everything. But we're clearly not there yet - and I don't think we will be anytime soon.

So for now, when I think of The Bitter Lesson, I'm not depressed because we software engineers will become obsolete, but I'm excited because it points to what we should focus on: Building scalable systems that leverage search and compute. It's about moving up an abstraction level, building harnesses instead of solutions and defining problems instead of hardcoding solutions (hardcoding is rarely a good thing it seems).

And in my mind, that sounds just as interesting."""

Code: 19 ()

sh = get_shell()
sh.user_ns['context'] = bp

Code: 12 ()

sh.run_cell(f"context={bp}")

Output: 205

<ExecutionResult object at 7e6320dcab40, execution_count=1 error_before_exec=invalid decimal literal (<ipython-input-1-bbd02197e81e>, line 3) error_in_exec=None info=<ExecutionInfo object at 7e6320dcae40, raw_cell="context=---
title: "The Not-so Bitter Lesson"
date.." transformed_cell="context=---
title: "The Not-so Bitter Lesson"
date.." store_history=False silent=False shell_futures=True cell_id=None> result=None>

Prompt: 25

Oh I guess we have to "sanitize" or at least escape the context string?

Assistant: 259

Yes, exactly! The issue is that the blog post contains characters that Python is interpreting as code syntax (like the dashes in the YAML frontmatter).

You have a few options:

Use triple quotes - Wrap the string in """ or ''':

Copied!

sh.run_cell(f'context = """{bp}"""')

Use repr() - This automatically escapes special characters:

Copied!

sh.run_cell(f'context = {repr(bp)}')

Pass it directly to the namespace - Instead of running code, inject it:

Copied!

sh.user_ns['context'] = bp

The third option is cleanest for your RLM use case since you're not actually executing user code - you're just setting up the environment. Want to try one of these?

Code: 19 ()

sh = get_shell()
sh.user_ns['context'] = bp

Code: 13 ()

sh.run_cell("context[:200]")

Output: 238

<ExecutionResult object at 7e6320e06510, execution_count=1 error_before_exec=None error_in_exec=None info=<ExecutionInfo object at 7e6320e06360, raw_cell="context[:200]" transformed_cell="context[:200]
" store_history=False silent=False shell_futures=True cell_id=None> result='---\ntitle: "The Not-so Bitter Lesson"\ndate: 2025-10-20T15:27:52+01:00\ndraft: false\nexcerpt: How tinygrad helped me finally understand what Sutton meant\n---\n\nHave you heard about The Bitter Lesson? If '>

Prompt: 3

better.

Assistant: 118

Great! So now you have the context loaded into the REPL environment.

Next step: do you want to implement the llm_query() function that the root LM can call recursively? This would be a function in the shell namespace that takes a string and calls your lisette client.

Want to sketch out what that function should look like, or would you like me to suggest an approach?

Code: 73 ()

def llm_query(query: str):
    "call a LLM with the given query and return the answer"
    c = Chat(model="claude-sonnet-4-5")
    return c(query).choices[0].message.content

Code: 18 ()

llm_query("what is the capital of france?")

Output: 22

'The capital of France is Paris.'

Note: 42

This works, but we don't need a stateful chat class for a onetime interaction, so lets refactor that to use completion

Code: 6 ()

import litellm

Code: 16 ()

mk_msg("What is the capital of france?")

Output: 67

{'role': 'user', 'content': 'What is the capital of france?'}

Code: 42 ()

litellm.completion("claude-sonnet-4-5", [mk_msg("What is the capital of france?")])

Output: 418

The capital of France is Paris.

id: chatcmpl-6e7e5aac-495f-4b8f-8202-ca8f0e69c2bc
model: claude-sonnet-4-5-20250929
finish_reason: stop
usage: Usage(completion_tokens=10, prompt_tokens=14, total_tokens=24, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0)

Note: 15

Now we can refactor our llm_query function

Code: 75 ()

def llm_query(query: str, model = "claude-sonnet-4-5"):
    msg = mk_msg(query)
    return litellm.completion(model, [msg])

llm_query("What is the capital of France?")

Output: 412

The capital of France is Paris.

id: chatcmpl-fb3f9f83-02a6-461f-b99a-ac54b3164086
model: claude-sonnet-4-5-20250929
finish_reason: stop
usage: Usage(completion_tokens=10, prompt_tokens=14, total_tokens=24, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0)

Prompt: 24

Can I add this to the user_ns like we did with the context variable?

Assistant: 99

Yes! You can add the function to the shell's namespace the same way:

Copied!

sh.user_ns['llm_query'] = llm_query

Then the root LM will be able to call it from within the REPL. Want to test it by running a cell that calls llm_query()?

Code: 45 ()

sh.user_ns['llm_query'] = llm_query
sh.run_cell("llm_query('what is the capital of france?')")

Output: 588

<ExecutionResult object at 7e6320e7dc40, execution_count=1 error_before_exec=None error_in_exec=None info=<ExecutionInfo object at 7e6320e7dcd0, raw_cell="llm_query('what is the capital of france?')" transformed_cell="llm_query('what is the capital of france?')
" store_history=False silent=False shell_futures=True cell_id=None> result=ModelResponse(id='chatcmpl-e19c7837-40d0-4ddb-ae7c-935e9eac58cb', created=1766070471, model='claude-sonnet-4-5-20250929', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='The capital of France is Paris.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}))], usage=Usage(completion_tokens=10, prompt_tokens=14, total_tokens=24, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0))>

Note: 39

But we don't want to return a ModelResponse object but just the string, so we have to "unpack that"

Code: 87 ()

def llm_query(query: str, model = "claude-sonnet-4-5"):
    msg = mk_msg(query)
    return litellm.completion(model, [msg]).choices[0].message.content

llm_query("What is the capital of France?")

Output: 22

'The capital of France is Paris.'

Code: 45 ()

sh.user_ns['llm_query'] = llm_query
sh.run_cell("llm_query('what is the capital of france?')")

Output: 201

<ExecutionResult object at 7e6320e9c8f0, execution_count=1 error_before_exec=None error_in_exec=None info=<ExecutionInfo object at 7e6320e9c980, raw_cell="llm_query('what is the capital of france?')" transformed_cell="llm_query('what is the capital of france?')
" store_history=False silent=False shell_futures=True cell_id=None> result='The capital of France is Paris.'>

Prompt: 51

Easy peasy. Now we just need to wrap the shell into a tool for the "outer" LLM to use, add a system prompt and we are done.

Assistant: 123

Exactly! You're almost there.

For the tool wrapper, you'll need a function that:

Takes code as input (the repl code blocks from the LLM)
Runs it via sh.run_cell()
Returns the result (probably with some truncation for long outputs)

Do you want to write that function, or would you like me to show you a quick example?

Prompt: 7

Show me a quick example

Assistant: 198

Here's a quick example:

Copied!

def run_repl(code: str) -> str:
    "Execute Python code in the REPL environment"
    result = sh.run_cell(code)
    if result.error_in_exec:
        return f"Error: {result.error_in_exec}"
    output = result.stdout or ""
    if result.result is not None:
        output += str(result.result)
    return output[:5000]  # truncate long outputs

This captures both stdout (from print()) and the result (from expressions). The truncation prevents overwhelming the root LM's context.

Want to test it out, or adjust anything?

Note: 9

`run_repl`

Code: 145 ()

def run_repl(python_code: str) -> str:
    "Execute Python code in the REPL environment"
    result = sh.run_cell(python_code)
    if result.error_in_exec or result.error_before_exec:
        return f"Error: {result.error_in_exec or result.error_before_exec}"
    output = result.stdout or ""
    if result.result is not None:
        output += str(result.result)
    return output[:5000]  # truncate long outputs

Code: 37 ()

run_repl("llm_query(f'what is the title of the blog post? {context[:200]}')")

Output: 40

'The title of the blog post is **"The Not-so Bitter Lesson"**'

Prompt: 25

This also works with the context provided and with the llm query function. Noice

Assistant: 142

Perfect! Now you can pass run_repl as a tool to your outer Chat instance along with the system prompt you showed earlier.

Something like:

Copied!

outer_chat = Chat(model="...", tools=[run_repl])

Then add your system prompt and query, and the root LM should be able to interact with the REPL, inspect the context, and call llm_query() recursively.

Want to wire that up and test the full loop?

Note: 6

The system prompt

Prompt: 144

Please give me a shorter version of the big system prompt above that is fitting for our exact implementation. I think the most important piece of information that should come across is that the LLM has access to a REPL that enables it to call another LM and that has the context preloaded as a string variable and that it should always always consult the context first before giving an answer and that it can play with the context as python. Use whatever you can from the system prompt above.

Assistant: 184

Here's a condensed version:

Copied!

REPL_SYSTEM_PROMPT = """You are answering a query with associated context that is available in your REPL environment.

Your REPL has:
- A `context` variable (string) containing important information about your query
- An `llm_query(query: str)` function to recursively call another LLM

IMPORTANT: Always inspect the `context` variable first before answering. Use Python to explore, chunk, search, or transform it as needed.

Think step-by-step and use the REPL actively to build your answer."""

Does this capture what you need?

Code: 159 ()

REPL_SYSTEM_PROMPT = """You are answering a query with associated context that is available in your REPL environment.

Your REPL has:
- A `context` variable (string) containing important information about your query
- An `llm_query(query: str)` function to recursively call another LLM

IMPORTANT: Always inspect the `context` variable first before answering. Use Python to explore, chunk, search, or transform it as needed.

Think step-by-step and use the REPL actively to build your answer."""

Note: 18

Let's wrap the preperation of the shell in a function

Code: 127 ()

def prep_shell(context, model="claude-sonnet-4-5"):
    def llm_query(query: str):
        msg = mk_msg(query)
        return litellm.completion(model, [msg]).choices[0].message.content

    sh = get_shell()
    sh.user_ns['context'] = context
    sh.user_ns['llm_query'] = llm_query
    return sh

Code: 43 ()

outer_chat = Chat(model="claude-sonnet-4-5", tools=[run_repl], sp=REPL_SYSTEM_PROMPT)

Code: 27 ()

outer_chat("What made the bitter lesson click for the author?", max_steps=5)

Output: 964

Based on the context, here's what made the bitter lesson click for the author:

The author's "aha moment" came from tinygrad and George Hotz's blog post "Can tinygrad win?"

Specifically, the author had been struggling to understand what Sutton's "Bitter Lesson" really meant in practice - what does it mean to solve problems using search and compute? What's the role of engineers if brute-force search beats clever domain knowledge?

The breakthrough came when George Hotz explained his vision for tinygrad: the entire stack of making code run fast on hardware is fundamentally a search problem across multiple layers - searching for better scheduling algorithms, faster GPU kernels, etc. The library's task is to:

"Expose the underlying search problem spanning several orders of magnitude"
"Apply the state of the art in search. Burn compute."

This concrete example made the author realize that someone still has to expose and frame the search problem correctly. Engineers aren't obsolete - their role shifts from crafting manual solutions to building the infrastructure that enables effective search. They design the problem representation, build evaluations, set constraints, and then let compute explore the solution space.

This realization transformed the "bitter" lesson into something positive: it's not that engineering doesn't matter, it's that engineering work matters differently - by creating the frameworks that allow search and compute to work effectively.

id: chatcmpl-6841a8e3-7e39-4511-945b-6e0a1b1f56eb
model: claude-sonnet-4-5-20250929
finish_reason: stop
usage: Usage(completion_tokens=315, prompt_tokens=2594, total_tokens=2909, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0)

Prompt: 46

Is it working now? I think outer_chat definition is wrong or I deleted that message from my dialogue and I am not sure how it was supposed to be

Assistant: 234

It looks like it did work - the LLM gave a correct answer based on the blog post content! But I see the issue you're pointing to:

run_repl uses the global sh variable, but prep_shell creates a new shell and returns it (without updating the global). It only worked because you had already set up sh with the context earlier in your session.

To make it cleaner, you have a couple options:

Call prep_shell and update the global sh:

Copied!

sh = prep_shell(bp)

Or make run_repl a closure inside prep_shell so it captures the right shell instance.

Which approach would you prefer?

Prompt: 19

But read the first notes, we are attempting Recursive Language Models...

Assistant: 153

Ah, you're right! Looking at the RLM paper again - the whole point is that the root LM should not have the context in its prompt. It should only access it through the REPL.

But in your current setup, it looks like the model answered correctly without even calling the run_repl tool. Did it actually use the tool, or did it somehow have access to the context directly?

Check outer_chat.h to see the conversation history - were there any tool calls?

Code: 4 ()

outer_chat.hist

Output: 4,392

[{'role': 'user',
  'content': 'What made the bitter lesson click for the author?'},
 Message(content="I'll check the context to understand what made the bitter lesson click for the author.", role='assistant', tool_calls=[{'index': 1, 'function': {'arguments': '{"python_code": "\\n# First, let\'s examine the context\\nprint(context)\\n"}', 'name': 'run_repl'}, 'id': 'toolu_019DRuPQn1Luww82rBytGWoa', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_019DRuPQn1Luww82rBytGWoa',
  'role': 'tool',
  'name': 'run_repl',
  'content': '---\ntitle: "The Not-so Bitter Lesson"\ndate: 2025-10-20T15:27:52+01:00\ndraft: false\nexcerpt: How tinygrad helped me finally understand what Sutton meant\n---\n\nHave you heard about The Bitter Lesson? If you are on Twitter and follow some ML/AI people, I bet you have.\n\nIf you haven\'t: It is a [blog post by Richard Sutton](http://www.incompleteideas.net/IncIdeas/BitterLesson.html), a Canadian computer scientist, who wrote one of *the* books about reinforcement learning.\n\nSutton\'s core argument is simple: general methods that leverage search and compute outperform domain-specific solutions. By "search," he doesn\'t mean information retrieval - he means using learning algorithms to explore possible solutions or strategies. His go-to example is computer chess: early engines relied heavily on chess expertise to evaluate positions, while AlphaZero knew only the basic rules and learned by playing itself, eventually surpassing any human player.\n\nThis lesson is "bitter" because it suggests our clever insights don\'t matter. Our domain-specific knowledge gets crushed by what looks like "dumb brute-force search and compute." If you\'re an engineer, this naturally raises an uncomfortable question: what\'s the point of us?\n\n## The Click\n\nFor a long time, I didn\'t really understand Sutton\'s point. What does it mean to solve problems using search and compute? What is the role of engineers then? Are we just supposed to prompt a LLM and hope for the best? That can\'t scale.\n\nMy aha moment came from an unlikely source: [`tinygrad`](https://github.com/tinygrad/tinygrad). It\'s a Python library similar to PyTorch but much simpler on the surface. I think George Hotz initially built it as a toy project after watching Andrej Karpathy create micrograd, but it evolved into a solid tensor library that runs LLMs efficiently on different hardware through clever optimizations.\n\nGeorge\'s blog post ["Can tinygrad win?"](https://geohot.github.io//blog/jekyll/update/2025/07/06/can-tinygrad-win.html) made everything click. He lays out his vision for the library: the entire stack of making code run fast on hardware is fundamentally a search problem across multiple layers — searching for better scheduling algorithms, faster GPU kernels, all of it. The task of the library is to "Expose the underlying search problem spanning several orders of magnitude." and then "Apply the state of the art in search. Burn compute.".\n\nThat\'s when I finally understood what Sutton meant. Like in computer chess, raw compute and effective search will beat manually crafted solutions.\n\nBut the key insight is this: someone still has to expose that search problem. Someone has to frame it correctly.\n\nIn this post, I\'ll show you why our work as engineers still matters, just differently. The Bitter Lesson isn\'t bitter at all - it\'s a blueprint for better engineering. And the pattern of search & compute appears everywhere once you know what to look for.\n\n## The Framework\n\nFor me, The Bitter Lesson is all about building effective search systems. Again, not search in the sense of information retrieval, but in the sense of allowing raw compute to explore a space of solutions effectively.\n\nFor that we need three core components:\n1. An effective problem formulation that exposes the right parameters in a way that different algorithms can dig into it.\n2. A way to evaluate possible solutions (evals) and provide feedback to the search mechanism.\n3. A way to define constraints that a solution has to satisfy (memory, time, specific form etc.)\n\nOur job as engineers is to build these three components. We design the problem-specific, scalable infrastructure that enables search to work. Then we gtfo and let compute do what it does best.\n\nBut don\'t you worry, there is still place for your precious domain knowledge, even in The Bitter Lesson, but you should use it elsewhere:\n\nUse your domain knowledge to:\n\n* Design problem representations (make search tractable)\n* Build evaluations (capture what matters)\n* Set constraints (avoid unsafe regions)\n* Initialize search (warm start)\n\nAnd use search to:\n\n* Explore solution space\n* Find non-obvious patterns\n* Exceed human intuition\n* Scale the solution with compute\n\nSound familiar? We\'ve done this before — moving from punch cards to text-based code was the same kind of shift. And now we\'re building harnesses and guardrails (and [RL environments](https://www.primeintellect.ai/blog/environments)) that define the bounds of desired solutions rather than crafting the solutions directly.\n\n## The Examples\n\nIt\'s fascinating to see where these themes come up, besides `tinygrad`.\n\nIf you follow me on [Twitter](https://twitter.com/rasmus1610), you might know that I\'m a huge fanboi of DSPy - a declarative library for "programming not prompting" LLMs. And DSPy embodies the approach described above by separating problem definition from solution strategy. Omar Khattab, the creator of the library, actually gave a talk called ["On Engineering AI Systems that Endure The Bitt'},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Let\'s see if there\'s more context\\nprint(len(context))\\nprint(\\"\\\\n--- Looking for the specific part about what made it click ---\\\\n\\")\\n\\n# Find the section about \\"The Click\\"\\nclick_section_start = context.find(\\"## The Click\\")\\nclick_section_end = context.find(\\"## The Framework\\")\\n\\nif click_section_start != -1 and click_section_end != -1:\\n    click_section = context[click_section_start:click_section_end]\\n    print(click_section)\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01WiLPeggausvwvFGjFi4GUA', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01WiLPeggausvwvFGjFi4GUA',
  'role': 'tool',
  'name': 'run_repl',
  'content': '9245\n\n--- Looking for the specific part about what made it click ---\n\n## The Click\n\nFor a long time, I didn\'t really understand Sutton\'s point. What does it mean to solve problems using search and compute? What is the role of engineers then? Are we just supposed to prompt a LLM and hope for the best? That can\'t scale.\n\nMy aha moment came from an unlikely source: [`tinygrad`](https://github.com/tinygrad/tinygrad). It\'s a Python library similar to PyTorch but much simpler on the surface. I think George Hotz initially built it as a toy project after watching Andrej Karpathy create micrograd, but it evolved into a solid tensor library that runs LLMs efficiently on different hardware through clever optimizations.\n\nGeorge\'s blog post ["Can tinygrad win?"](https://geohot.github.io//blog/jekyll/update/2025/07/06/can-tinygrad-win.html) made everything click. He lays out his vision for the library: the entire stack of making code run fast on hardware is fundamentally a search problem across multiple layers — searching for better scheduling algorithms, faster GPU kernels, all of it. The task of the library is to "Expose the underlying search problem spanning several orders of magnitude." and then "Apply the state of the art in search. Burn compute.".\n\nThat\'s when I finally understood what Sutton meant. Like in computer chess, raw compute and effective search will beat manually crafted solutions.\n\nBut the key insight is this: someone still has to expose that search problem. Someone has to frame it correctly.\n\nIn this post, I\'ll show you why our work as engineers still matters, just differently. The Bitter Lesson isn\'t bitter at all - it\'s a blueprint for better engineering. And the pattern of search & compute appears everywhere once you know what to look for.\n\n\n'},
 Message(content='Based on the context, here\'s what made the bitter lesson click for the author:\n\n**The author\'s "aha moment" came from `tinygrad` and George Hotz\'s blog post "Can tinygrad win?"**\n\nSpecifically, the author had been struggling to understand what Sutton\'s "Bitter Lesson" really meant in practice - what does it mean to solve problems using search and compute? What\'s the role of engineers if brute-force search beats clever domain knowledge?\n\nThe breakthrough came when George Hotz explained his vision for tinygrad: **the entire stack of making code run fast on hardware is fundamentally a search problem** across multiple layers - searching for better scheduling algorithms, faster GPU kernels, etc. The library\'s task is to:\n1. "Expose the underlying search problem spanning several orders of magnitude"\n2. "Apply the state of the art in search. Burn compute."\n\nThis concrete example made the author realize that **someone still has to expose and frame the search problem correctly**. Engineers aren\'t obsolete - their role shifts from crafting manual solutions to building the infrastructure that enables effective search. They design the problem representation, build evaluations, set constraints, and then let compute explore the solution space.\n\nThis realization transformed the "bitter" lesson into something positive: it\'s not that engineering doesn\'t matter, it\'s that engineering work matters *differently* - by creating the frameworks that allow search and compute to work effectively.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None})]

Prompt: 12

Check again with the output of outer_chat

Assistant: 301

Oh wow, it actually worked perfectly! Looking at the history:

First tool call: print(context) - inspected the full context

Second tool call: Used Python to find the "The Click" section specifically:

Copied!

click_section_start = context.find("## The Click")
click_section_end = context.find("## The Framework")
click_section = context[click_section_start:click_section_end]

Final answer: Synthesized a response based on what it found

That's exactly the RLM pattern in action! The root LM:

Never had the context in its prompt directly
Used the REPL to explore and slice the context with Python
Gave an answer based on what it retrieved

Congrats - your basic RLM implementation is working! 🎉

Do you want to test it with a much larger context to see how it handles chunking and llm_query() for recursive calls?

Prompt: 123

This worked, but the context is also not that long. Maybe we need something longer where the LLM has to realy wrangle with the context. It is also quite bold of the LLM to just print out the context :D It could have been a million tokens long. So maybe this is something we need to put in the prompt as well, that is has to be careful with the context length

Assistant: 297

Good point! The model got lucky here. You should definitely add a warning about context length to the system prompt. Something like:

Copied!

REPL_SYSTEM_PROMPT = """You are answering a query with associated context that is available in your REPL environment.

Your REPL has:
- A `context` variable (string) containing important information about your query
- An `llm_query(query: str)` function to recursively call another LLM

IMPORTANT:
- Always inspect the `context` variable first before answering
- The context can be VERY LONG (millions of tokens). Never print it directly!
- Use Python to explore it safely: check length with len(), peek at slices, search with regex, chunk it
- Use `llm_query()` on manageable chunks to analyze semantics

Think step-by-step and use the REPL actively to build your answer."""

Want to test with a longer, more complex context that requires chunking and multiple recursive calls?

Code: 289 ()

REPL_SYSTEM_PROMPT = """You are tasked with answering a query with associated context. You can access, transform, and analyze this context interactively in a REPL environment that can recursively query sub-LLMs, which you are strongly encouraged to use as much as possible. You will be queried iteratively until you provide a final answer.

Your REPL has:
- A `context` variable (string) containing important information about your query
- An `llm_query(query: str)` function to recursively call another LLM

IMPORTANT:
- Always inspect the `context` variable first before answering
- The context can be VERY LONG (millions of tokens). Never print it directly!
- Use Python to explore it safely: check length with len(), peek at slices, search with regex, chunk it
- Use `llm_query()` on manageable chunks to analyze semantics

Think step-by-step and use the REPL actively to build your answer."""

Code: 27 ()

outer_chat("What made the bitter lesson click for the author?", max_steps=10)

Output: 889

Based on the context, here's what made the bitter lesson click for the author:

The author's "aha moment" came from tinygrad and George Hotz's blog post "Can tinygrad win?"

The author had been struggling to understand what Sutton's Bitter Lesson really meant in practice. The breakthrough happened when George Hotz explained his vision for tinygrad: the entire stack of making code run fast on hardware is fundamentally a search problem across multiple layers—searching for better scheduling algorithms, faster GPU kernels, etc.

The key insight from tinygrad's approach was:

"Expose the underlying search problem spanning several orders of magnitude"
"Apply the state of the art in search. Burn compute."

This concrete example made the author realize that someone still has to expose and frame the search problem correctly. The Bitter Lesson doesn't mean engineers are obsolete—it means their role shifts from manually crafting solutions to building the infrastructure that enables effective search. Engineers design the problem representation, build evaluations, set constraints, and then let compute explore the solution space.

This realization transformed the "bitter" lesson into something positive: engineering work still matters, just differently.

id: chatcmpl-2ae80e51-9eb8-49b6-8b08-35017a2ebb8c
model: claude-sonnet-4-5-20250929
finish_reason: stop
usage: Usage(completion_tokens=270, prompt_tokens=4063, total_tokens=4333, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0)

Code: 4 ()

outer_chat.hist

Output: 7,500

[{'role': 'user',
  'content': 'What made the bitter lesson click for the author?'},
 Message(content="I'll check the context to understand what made the bitter lesson click for the author.", role='assistant', tool_calls=[{'index': 1, 'function': {'arguments': '{"python_code": "\\n# First, let\'s examine the context\\nprint(context)\\n"}', 'name': 'run_repl'}, 'id': 'toolu_019DRuPQn1Luww82rBytGWoa', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_019DRuPQn1Luww82rBytGWoa',
  'role': 'tool',
  'name': 'run_repl',
  'content': '---\ntitle: "The Not-so Bitter Lesson"\ndate: 2025-10-20T15:27:52+01:00\ndraft: false\nexcerpt: How tinygrad helped me finally understand what Sutton meant\n---\n\nHave you heard about The Bitter Lesson? If you are on Twitter and follow some ML/AI people, I bet you have.\n\nIf you haven\'t: It is a [blog post by Richard Sutton](http://www.incompleteideas.net/IncIdeas/BitterLesson.html), a Canadian computer scientist, who wrote one of *the* books about reinforcement learning.\n\nSutton\'s core argument is simple: general methods that leverage search and compute outperform domain-specific solutions. By "search," he doesn\'t mean information retrieval - he means using learning algorithms to explore possible solutions or strategies. His go-to example is computer chess: early engines relied heavily on chess expertise to evaluate positions, while AlphaZero knew only the basic rules and learned by playing itself, eventually surpassing any human player.\n\nThis lesson is "bitter" because it suggests our clever insights don\'t matter. Our domain-specific knowledge gets crushed by what looks like "dumb brute-force search and compute." If you\'re an engineer, this naturally raises an uncomfortable question: what\'s the point of us?\n\n## The Click\n\nFor a long time, I didn\'t really understand Sutton\'s point. What does it mean to solve problems using search and compute? What is the role of engineers then? Are we just supposed to prompt a LLM and hope for the best? That can\'t scale.\n\nMy aha moment came from an unlikely source: [`tinygrad`](https://github.com/tinygrad/tinygrad). It\'s a Python library similar to PyTorch but much simpler on the surface. I think George Hotz initially built it as a toy project after watching Andrej Karpathy create micrograd, but it evolved into a solid tensor library that runs LLMs efficiently on different hardware through clever optimizations.\n\nGeorge\'s blog post ["Can tinygrad win?"](https://geohot.github.io//blog/jekyll/update/2025/07/06/can-tinygrad-win.html) made everything click. He lays out his vision for the library: the entire stack of making code run fast on hardware is fundamentally a search problem across multiple layers — searching for better scheduling algorithms, faster GPU kernels, all of it. The task of the library is to "Expose the underlying search problem spanning several orders of magnitude." and then "Apply the state of the art in search. Burn compute.".\n\nThat\'s when I finally understood what Sutton meant. Like in computer chess, raw compute and effective search will beat manually crafted solutions.\n\nBut the key insight is this: someone still has to expose that search problem. Someone has to frame it correctly.\n\nIn this post, I\'ll show you why our work as engineers still matters, just differently. The Bitter Lesson isn\'t bitter at all - it\'s a blueprint for better engineering. And the pattern of search & compute appears everywhere once you know what to look for.\n\n## The Framework\n\nFor me, The Bitter Lesson is all about building effective search systems. Again, not search in the sense of information retrieval, but in the sense of allowing raw compute to explore a space of solutions effectively.\n\nFor that we need three core components:\n1. An effective problem formulation that exposes the right parameters in a way that different algorithms can dig into it.\n2. A way to evaluate possible solutions (evals) and provide feedback to the search mechanism.\n3. A way to define constraints that a solution has to satisfy (memory, time, specific form etc.)\n\nOur job as engineers is to build these three components. We design the problem-specific, scalable infrastructure that enables search to work. Then we gtfo and let compute do what it does best.\n\nBut don\'t you worry, there is still place for your precious domain knowledge, even in The Bitter Lesson, but you should use it elsewhere:\n\nUse your domain knowledge to:\n\n* Design problem representations (make search tractable)\n* Build evaluations (capture what matters)\n* Set constraints (avoid unsafe regions)\n* Initialize search (warm start)\n\nAnd use search to:\n\n* Explore solution space\n* Find non-obvious patterns\n* Exceed human intuition\n* Scale the solution with compute\n\nSound familiar? We\'ve done this before — moving from punch cards to text-based code was the same kind of shift. And now we\'re building harnesses and guardrails (and [RL environments](https://www.primeintellect.ai/blog/environments)) that define the bounds of desired solutions rather than crafting the solutions directly.\n\n## The Examples\n\nIt\'s fascinating to see where these themes come up, besides `tinygrad`.\n\nIf you follow me on [Twitter](https://twitter.com/rasmus1610), you might know that I\'m a huge fanboi of DSPy - a declarative library for "programming not prompting" LLMs. And DSPy embodies the approach described above by separating problem definition from solution strategy. Omar Khattab, the creator of the library, actually gave a talk called ["On Engineering AI Systems that Endure The Bitt'},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Let\'s see if there\'s more context\\nprint(len(context))\\nprint(\\"\\\\n--- Looking for the specific part about what made it click ---\\\\n\\")\\n\\n# Find the section about \\"The Click\\"\\nclick_section_start = context.find(\\"## The Click\\")\\nclick_section_end = context.find(\\"## The Framework\\")\\n\\nif click_section_start != -1 and click_section_end != -1:\\n    click_section = context[click_section_start:click_section_end]\\n    print(click_section)\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01WiLPeggausvwvFGjFi4GUA', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01WiLPeggausvwvFGjFi4GUA',
  'role': 'tool',
  'name': 'run_repl',
  'content': '9245\n\n--- Looking for the specific part about what made it click ---\n\n## The Click\n\nFor a long time, I didn\'t really understand Sutton\'s point. What does it mean to solve problems using search and compute? What is the role of engineers then? Are we just supposed to prompt a LLM and hope for the best? That can\'t scale.\n\nMy aha moment came from an unlikely source: [`tinygrad`](https://github.com/tinygrad/tinygrad). It\'s a Python library similar to PyTorch but much simpler on the surface. I think George Hotz initially built it as a toy project after watching Andrej Karpathy create micrograd, but it evolved into a solid tensor library that runs LLMs efficiently on different hardware through clever optimizations.\n\nGeorge\'s blog post ["Can tinygrad win?"](https://geohot.github.io//blog/jekyll/update/2025/07/06/can-tinygrad-win.html) made everything click. He lays out his vision for the library: the entire stack of making code run fast on hardware is fundamentally a search problem across multiple layers — searching for better scheduling algorithms, faster GPU kernels, all of it. The task of the library is to "Expose the underlying search problem spanning several orders of magnitude." and then "Apply the state of the art in search. Burn compute.".\n\nThat\'s when I finally understood what Sutton meant. Like in computer chess, raw compute and effective search will beat manually crafted solutions.\n\nBut the key insight is this: someone still has to expose that search problem. Someone has to frame it correctly.\n\nIn this post, I\'ll show you why our work as engineers still matters, just differently. The Bitter Lesson isn\'t bitter at all - it\'s a blueprint for better engineering. And the pattern of search & compute appears everywhere once you know what to look for.\n\n\n'},
 Message(content='Based on the context, here\'s what made the bitter lesson click for the author:\n\n**The author\'s "aha moment" came from `tinygrad` and George Hotz\'s blog post "Can tinygrad win?"**\n\nSpecifically, the author had been struggling to understand what Sutton\'s "Bitter Lesson" really meant in practice - what does it mean to solve problems using search and compute? What\'s the role of engineers if brute-force search beats clever domain knowledge?\n\nThe breakthrough came when George Hotz explained his vision for tinygrad: **the entire stack of making code run fast on hardware is fundamentally a search problem** across multiple layers - searching for better scheduling algorithms, faster GPU kernels, etc. The library\'s task is to:\n1. "Expose the underlying search problem spanning several orders of magnitude"\n2. "Apply the state of the art in search. Burn compute."\n\nThis concrete example made the author realize that **someone still has to expose and frame the search problem correctly**. Engineers aren\'t obsolete - their role shifts from crafting manual solutions to building the infrastructure that enables effective search. They design the problem representation, build evaluations, set constraints, and then let compute explore the solution space.\n\nThis realization transformed the "bitter" lesson into something positive: it\'s not that engineering doesn\'t matter, it\'s that engineering work matters *differently* - by creating the frameworks that allow search and compute to work effectively.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'role': 'user',
  'content': 'What made the bitter lesson click for the author?'},
 Message(content="I'll check the context to find what made the bitter lesson click for the author.", role='assistant', tool_calls=[{'index': 1, 'function': {'arguments': '{"python_code": "\\n# Let\'s examine the context to find the answer\\nprint(context[:2000])\\n"}', 'name': 'run_repl'}, 'id': 'toolu_011ar1GQufNM52gUazj2Uunt', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_011ar1GQufNM52gUazj2Uunt',
  'role': 'tool',
  'name': 'run_repl',
  'content': '---\ntitle: "The Not-so Bitter Lesson"\ndate: 2025-10-20T15:27:52+01:00\ndraft: false\nexcerpt: How tinygrad helped me finally understand what Sutton meant\n---\n\nHave you heard about The Bitter Lesson? If you are on Twitter and follow some ML/AI people, I bet you have.\n\nIf you haven\'t: It is a [blog post by Richard Sutton](http://www.incompleteideas.net/IncIdeas/BitterLesson.html), a Canadian computer scientist, who wrote one of *the* books about reinforcement learning.\n\nSutton\'s core argument is simple: general methods that leverage search and compute outperform domain-specific solutions. By "search," he doesn\'t mean information retrieval - he means using learning algorithms to explore possible solutions or strategies. His go-to example is computer chess: early engines relied heavily on chess expertise to evaluate positions, while AlphaZero knew only the basic rules and learned by playing itself, eventually surpassing any human player.\n\nThis lesson is "bitter" because it suggests our clever insights don\'t matter. Our domain-specific knowledge gets crushed by what looks like "dumb brute-force search and compute." If you\'re an engineer, this naturally raises an uncomfortable question: what\'s the point of us?\n\n## The Click\n\nFor a long time, I didn\'t really understand Sutton\'s point. What does it mean to solve problems using search and compute? What is the role of engineers then? Are we just supposed to prompt a LLM and hope for the best? That can\'t scale.\n\nMy aha moment came from an unlikely source: [`tinygrad`](https://github.com/tinygrad/tinygrad). It\'s a Python library similar to PyTorch but much simpler on the surface. I think George Hotz initially built it as a toy project after watching Andrej Karpathy create micrograd, but it evolved into a solid tensor library that runs LLMs efficiently on different hardware through clever optimizations.\n\nGeorge\'s blog post ["Can tinygrad win?"](https://geohot.github.io//blog/jekyll/update/2025/07/06/can-tinygrad-win.html) made eve\n'},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Let\'s get the full \\"Click\\" section\\nclick_start = context.find(\\"## The Click\\")\\nframework_start = context.find(\\"## The Framework\\")\\n\\nclick_section = context[click_start:framework_start]\\nprint(click_section)\\n"}', 'name': 'run_repl'}, 'id': 'toolu_019HEuDVyo8bJsN771xaokuz', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_019HEuDVyo8bJsN771xaokuz',
  'role': 'tool',
  'name': 'run_repl',
  'content': '## The Click\n\nFor a long time, I didn\'t really understand Sutton\'s point. What does it mean to solve problems using search and compute? What is the role of engineers then? Are we just supposed to prompt a LLM and hope for the best? That can\'t scale.\n\nMy aha moment came from an unlikely source: [`tinygrad`](https://github.com/tinygrad/tinygrad). It\'s a Python library similar to PyTorch but much simpler on the surface. I think George Hotz initially built it as a toy project after watching Andrej Karpathy create micrograd, but it evolved into a solid tensor library that runs LLMs efficiently on different hardware through clever optimizations.\n\nGeorge\'s blog post ["Can tinygrad win?"](https://geohot.github.io//blog/jekyll/update/2025/07/06/can-tinygrad-win.html) made everything click. He lays out his vision for the library: the entire stack of making code run fast on hardware is fundamentally a search problem across multiple layers — searching for better scheduling algorithms, faster GPU kernels, all of it. The task of the library is to "Expose the underlying search problem spanning several orders of magnitude." and then "Apply the state of the art in search. Burn compute.".\n\nThat\'s when I finally understood what Sutton meant. Like in computer chess, raw compute and effective search will beat manually crafted solutions.\n\nBut the key insight is this: someone still has to expose that search problem. Someone has to frame it correctly.\n\nIn this post, I\'ll show you why our work as engineers still matters, just differently. The Bitter Lesson isn\'t bitter at all - it\'s a blueprint for better engineering. And the pattern of search & compute appears everywhere once you know what to look for.\n\n\n'},
 Message(content='Based on the context, here\'s what made the bitter lesson click for the author:\n\n**The author\'s "aha moment" came from `tinygrad` and George Hotz\'s blog post "Can tinygrad win?"**\n\nThe author had been struggling to understand what Sutton\'s Bitter Lesson really meant in practice. The breakthrough happened when George Hotz explained his vision for tinygrad: **the entire stack of making code run fast on hardware is fundamentally a search problem** across multiple layers—searching for better scheduling algorithms, faster GPU kernels, etc.\n\nThe key insight from tinygrad\'s approach was:\n1. "Expose the underlying search problem spanning several orders of magnitude"\n2. "Apply the state of the art in search. Burn compute."\n\nThis concrete example made the author realize that **someone still has to expose and frame the search problem correctly**. The Bitter Lesson doesn\'t mean engineers are obsolete—it means their role shifts from manually crafting solutions to building the infrastructure that enables effective search. Engineers design the problem representation, build evaluations, set constraints, and then let compute explore the solution space.\n\nThis realization transformed the "bitter" lesson into something positive: engineering work still matters, just differently.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None})]

Note: 7

Testing with longer context

Prompt: 46

This works somewhat but it doesn't use the LLM to recursively call a LLM. I will test this with a much longer context to see how it does

Assistant: 129

Good idea! With a much longer context, the model should be forced to use llm_query() for semantic analysis rather than just printing chunks directly.

A few things to watch for when you test:

Does it chunk the context intelligently?
Does it actually call llm_query() on those chunks?
How does it aggregate the results from multiple recursive calls?

Let me know what you find with the longer context!

Code: 3 ()

import os

Code: 27 ()

with open("SolveIt_Notes.md") as f:
        notes = f.read()

Code: 6 ()

notes[:200]

Output: 99

'Source: #Gemini_3_Pro #SolveIt #Claudette\nTags: [[ProfitCopilot]] - [[SolveIt]] - [[Claudette]]\n\nShift + Tab inside () of functions opens the menu with explanations\nControl + Space: Changes Keyboard L'

Code: 21 ()

outer_chat("What is Claudette?", max_steps=10)

Output: 589

Based on my search of the context provided, Claudette is not mentioned in this document.

The context is about "The Not-so Bitter Lesson" - a blog post discussing Richard Sutton's Bitter Lesson, tinygrad, DSPy, and how engineers should approach building systems that leverage search and compute rather than manual domain-specific solutions.

I don't have information about Claudette in the available context. If you're asking about something specific related to Claudette, could you provide more context or clarify what you're looking for?

id: chatcmpl-dbad25ce-c9f4-47d1-85d9-939feb0aa279
model: claude-sonnet-4-5-20250929
finish_reason: stop
usage: Usage(completion_tokens=125, prompt_tokens=5812, total_tokens=5937, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0)

Prompt: 58

I created notes variable that has "SolveIt_Notes.md" and want to use that as context, not the "The Not-so Bitter Lesson". How can I do that with outer_chat()?

Assistant: 144

You need to reinitialize the shell with the new context:

Copied!

sh = prep_shell(notes)
outer_chat = Chat(model="claude-sonnet-4-5", tools=[run_repl], sp=REPL_SYSTEM_PROMPT)

Then you can query it with a question about your notes. How long is this context? Check with len(notes) to see if it's big enough to force the model to chunk and use llm_query().

Code: 52 ()

sh = prep_shell(notes)
outer_chat = Chat(model="claude-sonnet-4-5", tools=[run_repl], sp=REPL_SYSTEM_PROMPT)

Code: 21 ()

outer_chat("What is Claudette?", max_steps=10)

Output: 1,165

Perfect! Now I have a clear understanding of what Claudette is. Let me provide you with a comprehensive answer.

What is Claudette?

Claudette is a high-level Python library wrapper for the Anthropic SDK, developed by the same team behind Solveit (Answer.AI). It's designed to make programmatic interactions with Claude AI models (like Opus 4.5 and Sonnet 4.5) significantly more ergonomic and powerful.

Key Characteristics:

Purpose: It provides tools to embed Claude's intelligence directly into application logic, rather than just for interactive development assistance.
Official Website: https://claudette.answer.ai/
Main Use Cases:
- Structured Data Extraction: Using Client.structured() to extract information from unstructured documents and convert them into structured Python objects
- Natural Language Interfaces: Creating NLI for complex applications using simplified tool definitions and Chat.toolloop()
- Agentic Workflows: Building automation and intelligent workflows into applications

How It Differs from Interactive AI:

Solveit Native AI = Interactive development assistant (pair programmer for coding, debugging, learning)
Claudette Library = Programmatic execution tool (for building advanced features into your actual application)

In essence, Claudette is a developer-friendly Python library that makes it easier to integrate Claude AI capabilities into your applications with cleaner, more maintainable code than using the raw Anthropic SDK directly.

id: chatcmpl-d91f6fff-4803-4531-82ff-e541cbfe405d
model: claude-sonnet-4-5-20250929
finish_reason: stop
usage: Usage(completion_tokens=348, prompt_tokens=3333, total_tokens=3681, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0)

Code: 4 ()

outer_chat.hist

Output: 5,998

[{'role': 'user', 'content': 'What is Claudette?'},
 Message(content="I'll help you find out what Claudette is. Let me start by exploring the context to see what information is available.", role='assistant', tool_calls=[{'index': 1, 'function': {'arguments': '{"python_code": "\\n# First, let\'s check if we have context and explore its basic properties\\nprint(f\\"Context length: {len(context)} characters\\")\\nprint(f\\"\\\\nFirst 500 characters of context:\\")\\nprint(context[:500])\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01XmdufV2zy3NQbWn2VvCda4', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01XmdufV2zy3NQbWn2VvCda4',
  'role': 'tool',
  'name': 'run_repl',
  'content': 'Context length: 100818 characters\n\nFirst 500 characters of context:\nSource: #Gemini_3_Pro #SolveIt #Claudette\nTags: [[ProfitCopilot]] - [[SolveIt]] - [[Claudette]]\n\nShift + Tab inside () of functions opens the menu with explanations\nControl + Space: Changes Keyboard Language\nCmd + SHIFT + J: Code\nCmd + SHIFT + K: Note\nCmd + SHIFT + ;: Raw\nCmd + SHIFT + d: Use Thinking (toggle)\noption + }: `\n\t\t\t\tUsed to declare a function: \n\t\t\t\t- & `function_name`     and the AI in Solveit knows it exists and can call it and use as a tool.\n\n\n# Keyboard Shortcuts\n\n### Navigation\n\n\n'},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Let\'s search for \\"Claudette\\" in the context to find relevant information\\nimport re\\n\\n# Find all occurrences of \\"Claudette\\" with some surrounding context\\nmatches = []\\nfor match in re.finditer(r\'.{0,200}Claudette.{0,200}\', context, re.IGNORECASE):\\n    matches.append(match.group())\\n\\nprint(f\\"Found {len(matches)} mentions of \'Claudette\'\\\\n\\")\\nprint(\\"First few mentions:\\")\\nfor i, match in enumerate(matches[:5]):\\n    print(f\\"\\\\n--- Match {i+1} ---\\")\\n    print(match)\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01UmyGqnPjHUzDupyeHkexdd', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01UmyGqnPjHUzDupyeHkexdd',
  'role': 'tool',
  'name': 'run_repl',
  'content': "Found 30 mentions of 'Claudette'\n\nFirst few mentions:\n\n--- Match 1 ---\nSource: #Gemini_3_Pro #SolveIt #Claudette\n\n--- Match 2 ---\nTags: [[ProfitCopilot]] - [[SolveIt]] - [[Claudette]]\n\n--- Match 3 ---\n# Claudette (Anthropic SDK) Integration\n\n--- Match 4 ---\nhttps://claudette.answer.ai/\n\n--- Match 5 ---\nYes, you should absolutely integrate `claudette` into your project.\n"},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Let\'s look for a more detailed section about Claudette\\n# Search for the section that starts with \\"# Claudette\\"\\nclaudette_section_start = context.find(\\"# Claudette (Anthropic SDK) Integration\\")\\nif claudette_section_start != -1:\\n    # Get a substantial chunk after this heading\\n    claudette_section = context[claudette_section_start:claudette_section_start + 2000]\\n    print(claudette_section)\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01Dc7o1yYRMM3mJd4EQnKGdW', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01Dc7o1yYRMM3mJd4EQnKGdW',
  'role': 'tool',
  'name': 'run_repl',
  'content': '# Claudette (Anthropic SDK) Integration\nhttps://claudette.answer.ai/\n\nYes, you should absolutely integrate `claudette` into your project.\n\n`claudette` is a high-level Python library wrapper for the Anthropic SDK, developed by the same team behind Solveit. It is designed to make programmatic interactions with Claude models (like Opus 4.5 and Sonnet 4.5, which you have access to in Solveit) significantly more ergonomic and powerful.\n\nWhile Solveit provides the _environment_ for your development and interactive AI assistance, `claudette` provides the _tools_ to embed Claude\'s intelligence directly into the application logic of your equity modeler.\n\n### Solveit Native AI vs. Claudette: Understanding the Roles\n\nIt is crucial to distinguish between how you interact with AI during development versus how your application uses AI.\n\n1. **Solveit Native AI (The Pair Programmer):**\n    \n    - When you use Solveit\'s built-in Prompt messages, Ghost Text, or AI Modes, you are engaging in **interactive development**. This is ideal for learning Pandas, debugging your financial logic, and iterating rapidly (Phases 1-3 of your roadmap). The platform manages the context of your development session.\n        \n2. **Claudette Library (The Application Logic):**\n    \n    - When you import `claudette` inside a Python code cell, you are using it for **programmatic execution**. This is for building the advanced features of your tool, such as automation, agentic workflows, and data extraction (Phase 4).\n        \n\nYou should not use `claudette`\'s `Chat` class for basic assistance while coding, as Solveit\'s interface already handles this efficiently.\n\n### High-Leverage Use Cases for Claudette in VC Modeling\n\n`claudette` enables features that transform your Pandas model from a complex script into an intelligent analytical tool. Here are the highest-leverage use cases for your project:\n\n#### 1. Structured Data Extraction (The "Term Sheet Parser")\n\nA major bottleneck in VC is converting unstructured \n'},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Let\'s get more of this section to see the complete explanation\\nclaudette_section_start = context.find(\\"# Claudette (Anthropic SDK) Integration\\")\\nif claudette_section_start != -1:\\n    # Get a larger chunk to see more details\\n    claudette_section = context[claudette_section_start:claudette_section_start + 4000]\\n    print(claudette_section)\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01YW59bYqPzSGG6m67hYKWb9', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01YW59bYqPzSGG6m67hYKWb9',
  'role': 'tool',
  'name': 'run_repl',
  'content': '# Claudette (Anthropic SDK) Integration\nhttps://claudette.answer.ai/\n\nYes, you should absolutely integrate `claudette` into your project.\n\n`claudette` is a high-level Python library wrapper for the Anthropic SDK, developed by the same team behind Solveit. It is designed to make programmatic interactions with Claude models (like Opus 4.5 and Sonnet 4.5, which you have access to in Solveit) significantly more ergonomic and powerful.\n\nWhile Solveit provides the _environment_ for your development and interactive AI assistance, `claudette` provides the _tools_ to embed Claude\'s intelligence directly into the application logic of your equity modeler.\n\n### Solveit Native AI vs. Claudette: Understanding the Roles\n\nIt is crucial to distinguish between how you interact with AI during development versus how your application uses AI.\n\n1. **Solveit Native AI (The Pair Programmer):**\n    \n    - When you use Solveit\'s built-in Prompt messages, Ghost Text, or AI Modes, you are engaging in **interactive development**. This is ideal for learning Pandas, debugging your financial logic, and iterating rapidly (Phases 1-3 of your roadmap). The platform manages the context of your development session.\n        \n2. **Claudette Library (The Application Logic):**\n    \n    - When you import `claudette` inside a Python code cell, you are using it for **programmatic execution**. This is for building the advanced features of your tool, such as automation, agentic workflows, and data extraction (Phase 4).\n        \n\nYou should not use `claudette`\'s `Chat` class for basic assistance while coding, as Solveit\'s interface already handles this efficiently.\n\n### High-Leverage Use Cases for Claudette in VC Modeling\n\n`claudette` enables features that transform your Pandas model from a complex script into an intelligent analytical tool. Here are the highest-leverage use cases for your project:\n\n#### 1. Structured Data Extraction (The "Term Sheet Parser")\n\nA major bottleneck in VC is converting unstructured legal documents (Term Sheets, Stock Purchase Agreements) into structured data for modeling.\n\n- **The Feature:** `Client.structured()`\n    \n- **How You Use It:** You define a Python class representing the key economic terms (Valuation, Liquidation Preference, Option Pool Size). You then pass the text of the document to `claudette`, which uses Claude to extract the information and return a structured Python object.\n    \n- **Benefit:** This automates the ingestion of deal terms directly into your Pandas model, reducing manual entry and errors.\n    \n\nPython\n\n```\n# Example conceptual snippet using claudette (to be run in a Solveit Code cell)\nfrom claudette import Client, store_attr\n\n# Initialize the client (Solveit environment handles API keys)\nmodel = \'claude-sonnet-4-20250514\' # Or Opus 4.5\ncli = Client(model)\n\nclass TermSheetEconomics:\n    "Key economic terms from a venture financing term sheet"\n    def __init__(self,\n                 pre_money_valuation_usd:float,\n                 investment_amount_usd:float,\n                 liquidation_preference:str, # e.g., "1x Participating"\n        ):\n        store_attr()\n\n# Assume \'term_sheet_text\' contains the raw text of the document\n# extracted_terms = cli.structured(\n#     f"Extract the key terms from this term sheet: <doc>{term_sheet_text}</doc>",\n#     TermSheetEconomics\n# )\n# Now \'extracted_terms\' is ready for your Pandas model.\n```\n\n#### 2. Natural Language Interface (NLI) for Scenario Analysis\n\nOnce your core Pandas functions (e.g., `calculate_waterfall()`) are robust, you can create a natural language interface for complex "What-If" simulations.\n\n- **The Features:** Simplified Tool Use definitions and `Chat.toolloop()`\n    \n- **How You Use It:** You expose your Pandas functions as "Tools" using `claudette`\'s ergonomic syntax. `Chat.toolloop()` handles the orchestration: Claude interprets a natural language query, determines which tools (your functions) to call and in what order, executes them, and synthesizes the res\n'},
 Message(content="Perfect! Now I have a clear understanding of what Claudette is. Let me provide you with a comprehensive answer.\n\n## What is Claudette?\n\n**Claudette** is a high-level Python library wrapper for the Anthropic SDK, developed by the same team behind Solveit (Answer.AI). It's designed to make programmatic interactions with Claude AI models (like Opus 4.5 and Sonnet 4.5) significantly more ergonomic and powerful.\n\n### Key Characteristics:\n\n1. **Purpose**: It provides tools to embed Claude's intelligence directly into application logic, rather than just for interactive development assistance.\n\n2. **Official Website**: https://claudette.answer.ai/\n\n3. **Main Use Cases**:\n   - **Structured Data Extraction**: Using `Client.structured()` to extract information from unstructured documents and convert them into structured Python objects\n   - **Natural Language Interfaces**: Creating NLI for complex applications using simplified tool definitions and `Chat.toolloop()`\n   - **Agentic Workflows**: Building automation and intelligent workflows into applications\n\n### How It Differs from Interactive AI:\n\n- **Solveit Native AI** = Interactive development assistant (pair programmer for coding, debugging, learning)\n- **Claudette Library** = Programmatic execution tool (for building advanced features into your actual application)\n\nIn essence, Claudette is a developer-friendly Python library that makes it easier to integrate Claude AI capabilities into your applications with cleaner, more maintainable code than using the raw Anthropic SDK directly.", role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None})]

Note: 7

Improving the system prompt

Note: 90

We need to be more clear with how to use the LLM and thus we need to add an examples to the system prompt above how the models can actually use the llm_query tool in the REPL. Let's grab the examples from the original system prompt and improve the system prompt even more

Prompt: 21

Did the outer_chat in the previous message used query_llm function?

Assistant: 147

Check outer_chat.hist to see. Based on the output shown, it looks like it gave a final answer without showing the intermediate steps. But I suspect it still just used Python slicing/regex rather than llm_query().

The 100K character context might still be small enough that the model can handle chunks directly without needing recursive LLM calls. You might need an even larger context, or a task that explicitly requires semantic understanding of multiple chunks that need to be synthesized.

Code: 4 ()

outer_chat.hist

Output: 7,933

[{'role': 'user', 'content': 'What is Claudette?'},
 Message(content="I'll help you find out what Claudette is by exploring the context provided.", role='assistant', tool_calls=[{'index': 1, 'function': {'arguments': '{"python_code": "\\n# First, let\'s check what we have in the context\\nprint(f\\"Context type: {type(context)}\\")\\nprint(f\\"Context length: {len(context)}\\")\\nprint(f\\"First 500 characters:\\\\n{context[:500]}\\")\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01PVSPZAKYJbmus1zdSkXeZt', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01PVSPZAKYJbmus1zdSkXeZt',
  'role': 'tool',
  'name': 'run_repl',
  'content': "Context type: <class 'str'>\nContext length: 100818\nFirst 500 characters:\nSource: #Gemini_3_Pro #SolveIt #Claudette\nTags: [[ProfitCopilot]] - [[SolveIt]] - [[Claudette]]\n\nShift + Tab inside () of functions opens the menu with explanations\nControl + Space: Changes Keyboard Language\nCmd + SHIFT + J: Code\nCmd + SHIFT + K: Note\nCmd + SHIFT + ;: Raw\nCmd + SHIFT + d: Use Thinking (toggle)\noption + }: `\n\t\t\t\tUsed to declare a function: \n\t\t\t\t- & `function_name`     and the AI in Solveit knows it exists and can call it and use as a tool.\n\n\n# Keyboard Shortcuts\n\n### Navigation\n\n\n"},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Let\'s search for mentions of \\"Claudette\\" in the context\\nimport re\\n\\n# Find all occurrences of Claudette with some context around it\\nclaudette_mentions = []\\nlines = context.split(\'\\\\n\')\\nfor i, line in enumerate(lines):\\n    if \'claudette\' in line.lower():\\n        # Get some context around the mention\\n        start = max(0, i-2)\\n        end = min(len(lines), i+3)\\n        claudette_mentions.append({\\n            \'line_num\': i,\\n            \'context\': \'\\\\n\'.join(lines[start:end])\\n        })\\n\\nprint(f\\"Found {len(claudette_mentions)} mentions of Claudette\\")\\nprint(\\"\\\\nFirst 5 mentions:\\")\\nfor i, mention in enumerate(claudette_mentions[:5]):\\n    print(f\\"\\\\n--- Mention {i+1} (line {mention[\'line_num\']}) ---\\")\\n    print(mention[\'context\'])\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01MGTySKK7sdYDXwiZoWYiuS', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01MGTySKK7sdYDXwiZoWYiuS',
  'role': 'tool',
  'name': 'run_repl',
  'content': 'Found 30 mentions of Claudette\n\nFirst 5 mentions:\n\n--- Mention 1 (line 0) ---\nSource: #Gemini_3_Pro #SolveIt #Claudette\nTags: [[ProfitCopilot]] - [[SolveIt]] - [[Claudette]]\n\n\n--- Mention 2 (line 1) ---\nSource: #Gemini_3_Pro #SolveIt #Claudette\nTags: [[ProfitCopilot]] - [[SolveIt]] - [[Claudette]]\n\nShift + Tab inside () of functions opens the menu with explanations\n\n--- Mention 3 (line 494) ---\n    - **Why:** When documenting complex financial formulas (like weighted-average anti-dilution), using LaTeX allows for clear, mathematical notation in your notes.\n\n# Claudette (Anthropic SDK) Integration\nhttps://claudette.answer.ai/\n\n\n--- Mention 4 (line 495) ---\n\n# Claudette (Anthropic SDK) Integration\nhttps://claudette.answer.ai/\n\nYes, you should absolutely integrate `claudette` into your project.\n\n--- Mention 5 (line 497) ---\nhttps://claudette.answer.ai/\n\nYes, you should absolutely integrate `claudette` into your project.\n\n`claudette` is a high-level Python library wrapper for the Anthropic SDK, developed by the same team behind Solveit. It is designed to make programmatic interactions with Claude models (like Opus 4.5 and Sonnet 4.5, which you have access to in Solveit) significantly more ergonomic and powerful.\n'},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Let\'s get a more comprehensive view of the Claudette section\\n# Find where the Claudette section starts and extract it\\nstart_idx = context.find(\\"# Claudette (Anthropic SDK) Integration\\")\\nif start_idx != -1:\\n    # Find the next major section (starting with #)\\n    next_section = context.find(\\"\\\\n# \\", start_idx + 10)\\n    if next_section != -1:\\n        claudette_section = context[start_idx:next_section]\\n    else:\\n        claudette_section = context[start_idx:start_idx+5000]  # Get a good chunk\\n    \\n    print(f\\"Claudette section length: {len(claudette_section)}\\")\\n    print(\\"\\\\n\\" + \\"=\\"*80)\\n    print(claudette_section[:3000])  # Print first 3000 chars\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01MXVGof7i4xorBzSdKpvPK4', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01MXVGof7i4xorBzSdKpvPK4',
  'role': 'tool',
  'name': 'run_repl',
  'content': 'Claudette section length: 2564\n\n================================================================================\n# Claudette (Anthropic SDK) Integration\nhttps://claudette.answer.ai/\n\nYes, you should absolutely integrate `claudette` into your project.\n\n`claudette` is a high-level Python library wrapper for the Anthropic SDK, developed by the same team behind Solveit. It is designed to make programmatic interactions with Claude models (like Opus 4.5 and Sonnet 4.5, which you have access to in Solveit) significantly more ergonomic and powerful.\n\nWhile Solveit provides the _environment_ for your development and interactive AI assistance, `claudette` provides the _tools_ to embed Claude\'s intelligence directly into the application logic of your equity modeler.\n\n### Solveit Native AI vs. Claudette: Understanding the Roles\n\nIt is crucial to distinguish between how you interact with AI during development versus how your application uses AI.\n\n1. **Solveit Native AI (The Pair Programmer):**\n    \n    - When you use Solveit\'s built-in Prompt messages, Ghost Text, or AI Modes, you are engaging in **interactive development**. This is ideal for learning Pandas, debugging your financial logic, and iterating rapidly (Phases 1-3 of your roadmap). The platform manages the context of your development session.\n        \n2. **Claudette Library (The Application Logic):**\n    \n    - When you import `claudette` inside a Python code cell, you are using it for **programmatic execution**. This is for building the advanced features of your tool, such as automation, agentic workflows, and data extraction (Phase 4).\n        \n\nYou should not use `claudette`\'s `Chat` class for basic assistance while coding, as Solveit\'s interface already handles this efficiently.\n\n### High-Leverage Use Cases for Claudette in VC Modeling\n\n`claudette` enables features that transform your Pandas model from a complex script into an intelligent analytical tool. Here are the highest-leverage use cases for your project:\n\n#### 1. Structured Data Extraction (The "Term Sheet Parser")\n\nA major bottleneck in VC is converting unstructured legal documents (Term Sheets, Stock Purchase Agreements) into structured data for modeling.\n\n- **The Feature:** `Client.structured()`\n    \n- **How You Use It:** You define a Python class representing the key economic terms (Valuation, Liquidation Preference, Option Pool Size). You then pass the text of the document to `claudette`, which uses Claude to extract the information and return a structured Python object.\n    \n- **Benefit:** This automates the ingestion of deal terms directly into your Pandas model, reducing manual entry and errors.\n    \n\nPython\n\n```\n'},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Let\'s get the complete Claudette section\\nstart_idx = context.find(\\"# Claudette (Anthropic SDK) Integration\\")\\nif start_idx != -1:\\n    # Find the next major section (starting with #)\\n    next_section = context.find(\\"\\\\n# \\", start_idx + 10)\\n    if next_section != -1:\\n        claudette_section = context[start_idx:next_section]\\n    else:\\n        claudette_section = context[start_idx:]  # Get rest of document\\n    \\n    print(f\\"Full Claudette section:\\\\n\\")\\n    print(claudette_section)\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01Gvz8sJC9myr8pbS7MAUu5F', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01Gvz8sJC9myr8pbS7MAUu5F',
  'role': 'tool',
  'name': 'run_repl',
  'content': 'Full Claudette section:\n\n# Claudette (Anthropic SDK) Integration\nhttps://claudette.answer.ai/\n\nYes, you should absolutely integrate `claudette` into your project.\n\n`claudette` is a high-level Python library wrapper for the Anthropic SDK, developed by the same team behind Solveit. It is designed to make programmatic interactions with Claude models (like Opus 4.5 and Sonnet 4.5, which you have access to in Solveit) significantly more ergonomic and powerful.\n\nWhile Solveit provides the _environment_ for your development and interactive AI assistance, `claudette` provides the _tools_ to embed Claude\'s intelligence directly into the application logic of your equity modeler.\n\n### Solveit Native AI vs. Claudette: Understanding the Roles\n\nIt is crucial to distinguish between how you interact with AI during development versus how your application uses AI.\n\n1. **Solveit Native AI (The Pair Programmer):**\n    \n    - When you use Solveit\'s built-in Prompt messages, Ghost Text, or AI Modes, you are engaging in **interactive development**. This is ideal for learning Pandas, debugging your financial logic, and iterating rapidly (Phases 1-3 of your roadmap). The platform manages the context of your development session.\n        \n2. **Claudette Library (The Application Logic):**\n    \n    - When you import `claudette` inside a Python code cell, you are using it for **programmatic execution**. This is for building the advanced features of your tool, such as automation, agentic workflows, and data extraction (Phase 4).\n        \n\nYou should not use `claudette`\'s `Chat` class for basic assistance while coding, as Solveit\'s interface already handles this efficiently.\n\n### High-Leverage Use Cases for Claudette in VC Modeling\n\n`claudette` enables features that transform your Pandas model from a complex script into an intelligent analytical tool. Here are the highest-leverage use cases for your project:\n\n#### 1. Structured Data Extraction (The "Term Sheet Parser")\n\nA major bottleneck in VC is converting unstructured legal documents (Term Sheets, Stock Purchase Agreements) into structured data for modeling.\n\n- **The Feature:** `Client.structured()`\n    \n- **How You Use It:** You define a Python class representing the key economic terms (Valuation, Liquidation Preference, Option Pool Size). You then pass the text of the document to `claudette`, which uses Claude to extract the information and return a structured Python object.\n    \n- **Benefit:** This automates the ingestion of deal terms directly into your Pandas model, reducing manual entry and errors.\n    \n\nPython\n\n```\n'},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Let me use llm_query to get a comprehensive summary of what Claudette is\\nquery_text = f\\"\\"\\"\\nBased on the following context, please provide a clear and comprehensive answer to: \\"What is Claudette?\\"\\n\\nContext:\\n{claudette_section}\\n\\nPlease explain:\\n1. What Claudette is\\n2. Who developed it\\n3. What it\'s used for\\n4. How it relates to Solveit\\n5. Key features and use cases\\n\\"\\"\\"\\n\\nanswer = llm_query(query_text)\\nprint(answer)\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01CTvS2azyGUzdi18C3VcLAz', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01CTvS2azyGUzdi18C3VcLAz',
  'role': 'tool',
  'name': 'run_repl',
  'content': '# What is Claudette?\n\n## 1. **What Claudette Is**\n\nClaudette is a **high-level Python library wrapper for the Anthropic SDK**. It serves as a more ergonomic and powerful interface for programmatically interacting with Claude AI models (such as Opus 4.5 and Sonnet 4.5) in your Python applications.\n\n## 2. **Who Developed It**\n\nClaudette was developed by **the same team behind Solveit** (specifically mentioned as being from answer.ai, based on the URL https://claudette.answer.ai/).\n\n## 3. **What It\'s Used For**\n\nClaudette is designed to **embed Claude\'s intelligence directly into application logic**. Rather than just chatting with AI during development, it allows you to build AI capabilities into the actual functionality of your software applications through programmatic execution.\n\n## 4. **How It Relates to Solveit**\n\nThere\'s an important distinction between the two:\n\n- **Solveit Native AI**: Provides the *environment* for interactive development and serves as your "pair programmer" through features like Prompt messages, Ghost Text, and AI Modes during coding sessions.\n\n- **Claudette**: Provides the *tools* to build AI-powered features into your actual application logic—it\'s what you import and use within your code to create intelligent functionality.\n\nThink of it this way: Solveit helps you *write code with AI assistance*, while Claudette helps you *write code that uses AI*.\n\n## 5. **Key Features and Use Cases**\n\n### Primary Feature: **Structured Data Extraction**\n\nThe most prominent feature mentioned is `Client.structured()`, which enables:\n\n- **Document Parsing**: Converting unstructured documents (like Term Sheets or Stock Purchase Agreements) into structured data\n- **Automated Data Ingestion**: Defining Python classes to represent key information, then having Claude extract that information from text and return structured Python objects\n- **Error Reduction**: Eliminating manual data entry by automating the extraction of critical information\n\n### Use Case Example: "Term Sheet Parser"\nIn venture capital modeling, you could use Claudette to automatically extract economic terms (valuation, liquidation preferences, option pool sizes) from legal documents and feed them directly into your analytical models—transforming raw documents into actionable data structures.\n'},
 Message(content='Based on my exploration of the context, I can now provide you with a comprehensive answer:\n\n## What is Claudette?\n\n**Claudette** is a high-level Python library wrapper for the Anthropic SDK, designed to make programmatic interactions with Claude AI models more ergonomic and powerful.\n\n### Key Details:\n\n1. **Developer**: Created by the same team behind Solveit (from answer.ai)\n\n2. **Purpose**: Claudette allows you to embed Claude\'s intelligence directly into your application logic, enabling you to build AI-powered features into your software programmatically.\n\n3. **Relationship to Solveit**: \n   - **Solveit** provides the *environment* for interactive AI-assisted development (like a pair programmer)\n   - **Claudette** provides the *tools* to build AI capabilities into your actual application code\n\n4. **Primary Use Cases**:\n   - **Structured Data Extraction**: Using `Client.structured()` to convert unstructured documents into structured Python objects\n   - **Document Parsing**: Automatically extracting information from legal documents, term sheets, or agreements\n   - **Agentic Workflows**: Building intelligent automation into applications\n   - **Data Ingestion**: Reducing manual entry and errors by automating data extraction\n\n5. **Example Application**: In venture capital modeling, Claudette can be used to build a "Term Sheet Parser" that automatically extracts economic terms (valuations, liquidation preferences, option pool sizes) from legal documents and feeds them directly into analytical models.\n\nIn essence, Claudette is a developer tool that makes it easier to integrate Claude AI\'s capabilities into Python applications for advanced, programmatic AI functionality.', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None})]

Prompt: 31

I have ran the history, it seems the context was not long enough to need query_llm function...

Assistant: 291

Actually, looking more closely at the history, it did use llm_query() in the last tool call! It found the Claudette section first with Python, then passed it to llm_query() for semantic summarization.

But you're right that it wasn't strictly necessary - the model could have just read the section directly since it was only ~2500 chars. The recursive call was more of a convenience than a requirement.

To really stress-test the RLM pattern, you'd want:

A context so large that chunks still need semantic compression
A question requiring synthesis across many sections
Something where simple regex/slicing can't find the answer directly

Ideas:

Concatenate multiple large documents together
Ask a question that requires aggregating info scattered throughout
Use a context that's genuinely in the millions of characters

Do you have access to a much larger document, or want to artificially create a larger test context?

Prompt: 1,452

It seemed like it, but the actual tool called was run_repl I believe: Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\n# Let's use llm_query to get a comprehensive answer about Claudette\nquery_result = llm_query(f\"\"\"\nBased on the following context, please provide a comprehensive answer to the question: \"What is Claudette?\"\n\nInclude:\n1. What Claudette is\n2. Who developed it\n3. Its main purpose and use cases\n4. How it relates to Solveit\n5. Key features mentioned\n\nContext:\n{claudette_section}\n\"\"\")\n\nprint(query_result)\n"}', 'name': 'run_repl'}, 'id': 'toolu_01GUALP4Ec2N967u4MscSpQp', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}), {'tool_call_id': 'toolu_01GUALP4Ec2N967u4MscSpQp', 'role': 'tool', 'name': 'run_repl', 'content': '# What is Claudette? - Comprehensive Overview\n\n## 1. What Claudette Is\n\nClaudette is a high-level Python library wrapper for the Anthropic SDK. It serves as an ergonomic interface layer that simplifies and enhances programmatic interactions with Claude AI models (such as Opus 4.5 and Sonnet 4.5). Rather than working directly with the lower-level Anthropic SDK, Claudette provides more developer-friendly abstractions and powerful features.\n\n## 2. Who Developed It\n\nClaudette was developed by the same team behind Solveit (available at https://claudette.answer.ai/). This shared development lineage ensures tight integration between the library and the Solveit platform.\n\n## 3. Main Purpose and Use Cases\n\n### Primary Purpose\nClaudette is designed to embed Claude's intelligence directly into application logic, enabling developers to build AI-powered features programmatically rather than just using AI for interactive assistance.\n\n### Key Use Cases (Specifically for VC/Equity Modeling)\n\nStructured Data Extraction ("Term Sheet Parser"):\n- Converts unstructured legal documents (Term Sheets, Stock Purchase Agreements) into structured data\n- Uses the Client.structured() feature\n- Developers define Python classes representing key economic terms (Valuation, Liquidation Preference, Option Pool Size)\n- Claude extracts information and returns structured Python objects\n- Benefit: Automates deal term ingestion into Pandas models, reducing manual entry and errors\n\nOther applications mentioned:\n- Automation features\n- Agentic workflows\n- Advanced data extraction (Phase 4 development)\n\n## 4. How It Relates to Solveit\n\n### Complementary Roles\n\nSolveit provides:\n- The environment for development\n- Interactive AI assistance through Prompt messages, Ghost Text, and AI Modes\n- A "pair programmer" experience for learning, debugging, and rapid iteration (Phases 1-3)\n- Context management for development sessions\n\nClaudette provides:\n- The tools for programmatic execution\n- Application-level AI integration (imported via import claudette in Python code cells)\n- Advanced feature building capabilities (Phase 4)\n\n### Important Distinction\nYou should not use Claudette's Chat class for basic coding assistance, as Solveit's native interface already handles interactive development efficiently. Claudette is specifically for building AI features into your application.\n\n## 5. Key Features Mentioned\n\n1. Client.structured() method - Enables structured data extraction from unstructured text by defining Python classes and having Claude populate them automatically\n\n2. Chat class - Provides conversational AI capabilities (though not recommended for basic development assistance in Solveit)\n\n3. Programmatic AI integration - Allows embedding Claude intelligence directly into application code\n\n4. Ergonomic API design - Makes interactions with Claude models "significantly more ergonomic and powerful" compared to using the base Anthropic SDK\n\nThe library transforms what would be "a complex script into an intelligent analytical tool" by enabling sophisticated AI-powered features within the application itself.\n'},

Assistant: 241

You're right that the tool called was run_repl - but look at the Python code inside that tool call:

Copied!

query_result = llm_query(f"""
Based on the following context...
""")
print(query_result)

The outer LLM called run_repl, and the code it ran inside the REPL called llm_query(). That's exactly how it's supposed to work! The recursive LLM call happens through the REPL, not as a separate tool.

So the flow was:

Outer LLM → run_repl tool → executes Python code
Python code calls llm_query() → sub-LLM responds
Result returns through REPL output

Your RLM is working correctly! 🎉

Prompt: 381

Explain in detail as I keep doubting llm_query was actually used, I can see references to function_call=None, and name: run_repl etc

Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\n# Let's use llm_query to get a comprehensive answer about Claudette\nquery_result = llm_query(f\"\"\"\nBased on the following context, please provide a comprehensive answer to the question: \"What is Claudette?\"\n\nInclude:\n1. What Claudette is\n2. Who developed it\n3. Its main purpose and use cases\n4. How it relates to Solveit\n5. Key features mentioned\n\nContext:\n{claudette_section}\n\"\"\")\n\nprint(query_result)\n"}', 'name': 'run_repl'}, 'id': 'toolu_01GUALP4Ec2N967u4MscSpQp', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None})

Assistant: 567

Let me break this down step by step:

The tool call structure:

name: 'run_repl' — This is the tool the outer LLM called
function_call=None — This is unrelated, just an older API field
arguments: '{"python_code": "..."}' — This is the Python code passed to run_repl

Now look at what's INSIDE the python_code argument:

Copied!

query_result = llm_query(f"""
Based on the following context, please provide a comprehensive answer...
Context:
{claudette_section}
""")
print(query_result)

This is Python code that the REPL executed. And that code calls llm_query().

The proof it ran:

Look at the tool response that came back:

Copied!

{'tool_call_id': 'toolu_01GUALP4Ec2N967u4MscSpQp',
 'role': 'tool',
 'name': 'run_repl',
 'content': '# What is Claudette? - Comprehensive Overview\n\n## 1. What Claudette Is\n\nClaudette is a **high-level Python library wrapper...'}

That response content is the output of llm_query() — it's a nicely formatted answer from the sub-LLM, not raw text from the context. The sub-LLM interpreted and structured the information.

In summary:

Outer LLM has ONE tool: run_repl
run_repl has access to llm_query() in its namespace
Outer LLM writes Python code that calls llm_query()
run_repl executes that code → llm_query() runs → sub-LLM responds

Does that make more sense?

Prompt: 52

Ok I think I get it know, the Content in that run_repl tool use was the output of the llm_query function that was ran by Python inside the REPL

Assistant: 162

Exactly! You've got it. 🎉

The content '# What is Claudette? - Comprehensive Overview...' is what came back from the sub-LLM via llm_query(). The REPL ran the Python code, which called llm_query(), which hit the Claude API, and the response got captured and returned as the tool output.

So your RLM implementation is working correctly — the recursion happens through Python code execution, not as a separate tool. That's the elegant part of this architecture.

Note: 75

Things to improve

Wrap this up in a class with proper state management
Improve the system prompt after testing it on more complicated tasks. I guess getting inspiration from the other repos that implemented this for the prompts may be valuable as well.

Prompt: 30

What do you think of wrapping this all up in a class? This makes state management easier I guess

Assistant: 180

Yes, a class makes a lot of sense here! Right now you have several pieces that need to stay in sync:

The shell instance
The context loaded into the shell
The llm_query function (with its model config)
The run_repl function (needs reference to the right shell)
The outer Chat instance

A class would keep all of this together and make it easy to create multiple RLM instances with different contexts or models.

Do you want to sketch out the class structure yourself, or would you like me to suggest an outline?

Prompt: 42

Please suggest an outline to the class structure so that I can use the RLM architecture in other dialogues for other purposes as an AI Agent.

Assistant: 858

Here's an outline:

Copied!

class RLM:
    def __init__(self, context: str, model: str = "claude-sonnet-4-5", sub_model: str = None):
        """
        Initialize a Recursive Language Model.
        
        Args:
            context: The context string to load into the REPL
            model: Model for the root LLM
            sub_model: Model for recursive llm_query calls (defaults to same as model)
        """
        self.model = model
        self.sub_model = sub_model or model
        self.context = context
        
        # Set up the shell with context and llm_query
        self.shell = self._setup_shell()
        
        # Create the outer chat with run_repl as a tool
        self.chat = Chat(
            model=self.model,
            tools=[self.run_repl],
            sp=REPL_SYSTEM_PROMPT
        )
    
    def _setup_shell(self):
        """Initialize shell with context and llm_query in namespace."""
        sh = get_shell()
        sh.user_ns['context'] = self.context
        sh.user_ns['llm_query'] = self._make_llm_query()
        return sh
    
    def _make_llm_query(self):
        """Create the llm_query function for the REPL."""
        sub_model = self.sub_model
        def llm_query(query: str) -> str:
            msg = mk_msg(query)
            return litellm.completion(sub_model, [msg]).choices[0].message.content
        return llm_query
    
    def run_repl(self, python_code: str) -> str:
        """Execute Python code in the REPL environment."""
        result = self.shell.run_cell(python_code)
        if result.error_in_exec or result.error_before_exec:
            return f"Error: {result.error_in_exec or result.error_before_exec}"
        output = result.stdout or ""
        if result.result is not None:
            output += str(result.result)
        return output[:5000]
    
    def __call__(self, query: str, max_steps: int = 10):
        """Run a query through the RLM."""
        return self.chat(query, max_steps=max_steps)
    
    @property
    def hist(self):
        """Access conversation history."""
        return self.chat.hist

Usage would be:

Copied!

rlm = RLM(context=my_long_document)
response = rlm("What are the key themes in this document?")

Things you might want to add later:

reset() method to clear chat history but keep context
update_context() to swap in new context
Configurable truncation limit for REPL output
Timeout for run_repl

Want to try implementing this?

Prompt: 486

A friend suggested this, compare with your proposal and give me feedback and recommendations for implementation:

class RLM: def init(self, context: str, root_model = "groq/openai/gpt-oss-120b", recursive_model = None, temperature=0.0): self.context = context self.root_model = root_model self.recursive_model = recursive_model if recursive_model else root_model self.shell = self._prep_shell()

Copied!

    def run_repl(python_code: str) -> str:
        "Execute Python code in the REPL environment"
        result = self.shell.run_cell(python_code)
        if result.error_in_exec or result.error_before_exec:
            return f"Error: {result.error_in_exec or result.error_before_exec}"
        output = result.stdout or ""
        if result.result is not None:
            output += str(result.result)
        return output

    self.chat = Chat(model=self.root_model, tools=[run_repl], sp=REPL_SYSTEM_PROMPT, temp=temperature)


def _prep_shell(self):
    def llm_query(query: str):
        msg = mk_msg(query)
        return litellm.completion(self.recursive_model, [msg]).choices[0].message.content

    sh = get_shell()
    sh.user_ns['context'] = self.context
    sh.user_ns['llm_query'] = llm_query
    return sh

def __call__(self, query: str, max_steps=10):
    return self.chat(query, max_steps=max_steps)

Assistant: 415

Your friend's version is cleaner in a few ways:

Advantages of friend's version:

run_repl as closure inside __init__ — Captures self.shell naturally, avoids needing it as a method
temperature parameter — Good addition I missed
More compact — Less boilerplate

Things to consider adding:

Output truncation — Friend's run_repl doesn't truncate. Long outputs could blow up the root LLM's context. Consider adding return output[:5000] or making it configurable.
hist property — Nice convenience to access self.chat.hist directly, though you can always do rlm.chat.hist
Docstrings — For run_repl and llm_query so the LLM understands them when used as tools

One potential issue: The llm_query function inside _prep_shell references self.recursive_model — this works because of closure, but if you ever wanted to change self.recursive_model after init, the function would still use the new value (which could be good or bad depending on intent).

Overall, your friend's version is solid and simpler. I'd just add the truncation and you're good to go!

Note: 10

wrapping it up in a class

Code: 34 ()

rlm = RLM(context=notes, root_model="openai/gpt-5", temperature=1)

Code: 21 ()

rlm("Is Claudette useful for OpenAI SDK?")

Output: 816

Short answer: not directly.

Claudette is a high-level wrapper around the Anthropic (Claude) SDK. It doesn’t call or wrap the OpenAI SDK, so it won’t help if your stack is OpenAI-only.

How it can still be useful:

If you want to support both Anthropic and OpenAI in the same app, you can use Claudette for Anthropic and the official OpenAI SDK for OpenAI side by side.
If you want one abstraction over multiple providers, consider a meta-layer like LiteLLM, LangChain, LlamaIndex, or OpenRouter instead of Claudette.

Feature mapping (rough equivalents):

Claudette Client.structured() ≈ OpenAI JSON schema/structured responses (response_format with json_schema).
Claudette tool/use and Chat.toolloop() ≈ OpenAI function-calling/tools.

If you’re only using OpenAI, stick with the OpenAI SDK. Use Claudette when you need Anthropic models.

id: chatcmpl-Co9vv9T61EC4mjobzhnaHbn7kD0Oi
model: gpt-5-2025-08-07
finish_reason: stop
usage: Usage(completion_tokens=1173, prompt_tokens=4683, total_tokens=5856, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=960, rejected_prediction_tokens=0, text_tokens=None, image_tokens=None), prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=0, cached_tokens=3968, text_tokens=None, image_tokens=None))

Note: 39

Test with even longer context.

Let's add the carotis guideline to our context to make it even more challenging to find stuff.

Code: 37 ()

with open("SolveIt Lessons.md", "r") as f:
    lessons = f.read()

lessons[:200]

Output: 102

"# Class 1 - How to Solve it With Code Lesson 1 - Intro & Examples\nWelcome and Course Logistics\nWow, look at that, we're so live. Hello. Hey hey, hello everybody.\nSo let's, before we get all official, "

Note: 39

Yeah okay so that is some hefty context :D Let's see how many tokens this would be for GPT-4 using tiktoken

Code: 6 ()

import tiktoken

Note: 45

Great so the context window of gpt oss 120b is around 130k tokens, which means we are way above the usual context window.

Prompt: 24

I think the rlm worked but I didn't receive a response, what happened?

Assistant: 216

The response shows "I found excellent information! Let me search for more specific use cases in the context." — it looks like the model wanted to continue but stopped prematurely.

Check rlm.chat.hist to see the full trace. A few possible causes:

max_steps reached — Maybe it hit the 10 step limit before finishing
Model stopped early — Sometimes models say "let me do X" but then stop instead of actually doing it

Try running rlm.chat.hist to see what tool calls happened, or try calling rlm again with a higher max_steps:

Copied!

rlm(q, max_steps=20)

Code: 6 ()

rlm.chat.hist

Output: 14,346

[{'role': 'user',
  'content': 'How can I share my screen so that Claude can see it? Something about setup_share() and capture_tool() function\n'},
 Message(content="I'll help you understand how to share your screen with Claude. Let me first check the context to see what information is available about these functions.", role='assistant', tool_calls=[{'index': 1, 'function': {'arguments': '{"python_code": "\\n# First, let\'s check what\'s in the context\\nprint(f\\"Context type: {type(context)}\\")\\nprint(f\\"Context length: {len(context) if hasattr(context, \'__len__\') else \'N/A\'}\\")\\n\\n# Let\'s peek at the first 2000 characters to understand the structure\\nif isinstance(context, str):\\n    print(\\"\\\\nFirst 2000 characters of context:\\")\\n    print(context[:2000])\\n"}', 'name': 'run_repl'}, 'id': 'toolu_018sG7NYMS3hXUzxwspa53h4', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_018sG7NYMS3hXUzxwspa53h4',
  'role': 'tool',
  'name': 'run_repl',
  'content': "Context type: <class 'str'>\nContext length: 1774518\n\nFirst 2000 characters of context:\n# Class 1 - How to Solve it With Code Lesson 1 - Intro & Examples\nWelcome and Course Logistics\nWow, look at that, we're so live. Hello. Hey hey, hello everybody.\nSo let's, before we get all official, let's.\nso I can see everybody. My goodness,  what a lot of chats going on. All right,   let's just start with some logistics before we  start talking about Solveit. So hi, everybody.\nLet me switch over here and share my screen  so I've got something to talk about. And  \nwe're using a different presentation thing to  usual. So fingers crossed that everything works  \nOK. All right. Jeremy's a bit quiet.  OK, let's fix that. Sound settings.\nI am quiet. at that. All right,   hopefully I'm better now. Yes, please do  tell us if anything's less than great.\nAll right, so just a request to be honest,  we haven't had this many people using  \nSolveit at the same time. So like when I show  you how to use Solveit, don't everybody do it.  \nYou can just watch and do it afterwards. Just  or if you try and it's super slow, don't ask  \nlots of questions about why it's probably  just, it's really hard to load test, know,  \nsomething like this authentically.  So hopefully it's fine, but... Yeah, don't all start your session  at the same time. We release new  \nversions of SolvIt quite regularly. We  announce on the Discord when we do that.  \nWhen we do make such an announcement, you should  probably upgrade your SolvIt. And to do that,  \nyou'll click the stop button, wait five seconds,  and then click connect, and that will...\nRestart with the new version We don't  do that automatically because we want   to make sure that we don't interrupt you in  the middle of your work All right Discord\nMaybe I'll just switch to sharing my whole screen.\nI close discord. to make sure no one gets distracted.\nSo this is Discord. And so this is  where everything happens. Like any  \npopular community, it will  look overwhelming at first.   And the trick to making it not overwhelming is  to set it up and use \n"},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# The context is very long (1.7M characters). Let me search for setup_share and capture_tool\\nimport re\\n\\n# Search for setup_share\\nsetup_share_matches = re.finditer(r\'.{0,500}setup_share.{0,500}\', context, re.IGNORECASE | re.DOTALL)\\nsetup_share_results = [match.group() for match in setup_share_matches]\\n\\nprint(f\\"Found {len(setup_share_results)} mentions of \'setup_share\'\\")\\nprint(\\"\\\\n\\" + \\"=\\"*80 + \\"\\\\n\\")\\n\\n# Let\'s look at the first few matches\\nfor i, result in enumerate(setup_share_results[:3]):\\n    print(f\\"Match {i+1}:\\")\\n    print(result)\\n    print(\\"\\\\n\\" + \\"-\\"*80 + \\"\\\\n\\")\\n"}', 'name': 'run_repl'}, 'id': 'toolu_017yYBqGq3jeh87nXc5NKTT2', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_017yYBqGq3jeh87nXc5NKTT2',
  'role': 'tool',
  'name': 'run_repl',
  'content': 'Found 3 mentions of \'setup_share\'\n\n================================================================================\n\nMatch 1:\nnimate);\n"""\n\npreview_webgl(js_utils, js_line_drawing, script_two_sin_anim)\n\n@rt\ndef sin():\n  return simple_webgl(js_utils, js_line_drawing, script_two_sin_anim)\n----------\n# Dialog Class 7 - Understanding Source Files & Initial Setup (Zoom recordings high quality videos)\nTry an interactive version of this dialog: Sign up at solve.it.com, click Upload, and pass this URL.\n\nfrom fastcore.utils import *\nfrom dialoghelper import *\nfrom dialoghelper.capture import *\nfrom IPython.display import Image\nsetup_share()\nstart_share()\nSource clips:\n\nCopied!\n$ ls | cat\nGMT20251110-230648_Recording_1920x1080.mp4\nGMT20251110-230648_Recording_as_1728x1080.mp4\nGMT20251110-230648_Recording_avo_640x360.mp4\nGMT20251110-230648_Recording_gallery_1920x1080.mp4\nGMT20251110-230648_Recording_gvo_1280x720.mp4\nGMT20251110-230648_Recording.transcript.vtt\nsi2lesson3.webm\nUnderstanding Source Files & Initial Setup\nI want you to help me learn the basics of creating nice video from source Zoom recordings. We\'ll try to create a hi\n\n--------------------------------------------------------------------------------\n\nMatch 2:\n-V, --version  Print version\nI think you s/aarch64/aarch64 was wrong. I fixed it.\n\n\n\nYou\'re absolutely right! That sed replacement s/aarch64/aarch64/ was redundant - it should have been s/arm64/aarch64/ since macOS on Apple Silicon reports arm64 but the bore releases use aarch64. Good catch and fix!\n\nNow that bore is installed, what would you like to tunnel? Are you looking to expose a service from your Mac to Solveit, or the other way around?\n\ntmux and ssh\nsetup_share()\nstart_share()\nI want you to help me show the basics of using tmux and then ssh forwarding.\n\nUse &capture_tool to see what\'s on my screen at any time. It will show my Terminal. Note that you won\'t see any earlier screenshots taken with this tool in our chat, so be sure to summarize important parts of what you see so you can refer back to it later.\n\nView my screen and tell me what you see. I used tmux a to get here, since I already had 4 panes going. Then tell me how to create a new tmux window\n\n--------------------------------------------------------------------------------\n\nMatch 3:\nt.google.com)\nNavigate to Security\nUnder "How you sign in to Google", find 2-Step Verification (you need this enabled first)\nOnce 2-Step Verification is on, search for "App passwords" or find it in the Security section\nCreate a new App Password — select "Mail" and your device, then Google will generate a 16-character password\nThat 16-character password is what you\'ll use in your SMTP code instead of your regular Gmail password.\n\nHave you already got 2-Step Verification enabled on your account?\n\nsetup_share()\nstart_share()\nUse &capture_tool to see what\'s on my screen at any time. It will show a browser tab I\'m sharing (currently gmail). Note that you won\'t see any earlier screenshots taken with this tool in our chat, so be sure to summarize important parts of what you see so you can refer back to it later.\n\nView my screen and tell me what you see, and what I should do next.\n\n\n\nDetails\nI can see you\'re on your Google Account Security page. Here\'s what I observe:\n\n2-Step Verificatio\n\n--------------------------------------------------------------------------------\n\n'},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Great! Now let\'s search for capture_tool to get more context\\ncapture_tool_matches = re.finditer(r\'.{0,500}capture_tool.{0,500}\', context, re.IGNORECASE | re.DOTALL)\\ncapture_tool_results = [match.group() for match in capture_tool_matches]\\n\\nprint(f\\"Found {len(capture_tool_results)} mentions of \'capture_tool\'\\")\\nprint(\\"\\\\n\\" + \\"=\\"*80 + \\"\\\\n\\")\\n\\n# Let\'s look at the first few matches\\nfor i, result in enumerate(capture_tool_results[:5]):\\n    print(f\\"Match {i+1}:\\")\\n    print(result)\\n    print(\\"\\\\n\\" + \\"-\\"*80 + \\"\\\\n\\")\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01SqPh5YQTgNV7u2gHzgYzEU', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01SqPh5YQTgNV7u2gHzgYzEU',
  'role': 'tool',
  'name': 'run_repl',
  'content': 'Found 3 mentions of \'capture_tool\'\n\n================================================================================\n\nMatch 1:\nordings. We\'ll try to create a high quality video of the just-recording \'How to Solve it With Code\' lesson three that I (Jeremy Howard) made.\n\nI\'ve got a number of recordings as shown above, taken from Zoom\'s web recording with the setting enabled to save separate videos for each view. Search for info about what views are there and the filenames. Also I forgot to hit record for the first few minutes, so I\'ve added a .webm which is the youtube live recording, which we\'ll use for that part.\n\nUse &capture_tool to see what\'s on my screen at any time. It will show my Da Vinci Resolve screen. Note that you won\'t see any earlier screenshots taken with this tool in our chat, so be sure to summarize important parts of what you see so you can refer back to it later. I have the latest v20 of Resolve installed; I\'ve pasted some features from their release notes above in case any turn out to be of interest.\n\nSo basically what I want to do is to skip through the videos, deciding which views/clips I want to use \n\n--------------------------------------------------------------------------------\n\nMatch 2:\n\n\nYou\'re absolutely right! That sed replacement s/aarch64/aarch64/ was redundant - it should have been s/arm64/aarch64/ since macOS on Apple Silicon reports arm64 but the bore releases use aarch64. Good catch and fix!\n\nNow that bore is installed, what would you like to tunnel? Are you looking to expose a service from your Mac to Solveit, or the other way around?\n\ntmux and ssh\nsetup_share()\nstart_share()\nI want you to help me show the basics of using tmux and then ssh forwarding.\n\nUse &capture_tool to see what\'s on my screen at any time. It will show my Terminal. Note that you won\'t see any earlier screenshots taken with this tool in our chat, so be sure to summarize important parts of what you see so you can refer back to it later.\n\nView my screen and tell me what you see. I used tmux a to get here, since I already had 4 panes going. Then tell me how to create a new tmux window so we can run bore in it.\n\n\n\nDetails\nI can see your tmux session with 4 pane\n\n--------------------------------------------------------------------------------\n\nMatch 3:\ny\nUnder "How you sign in to Google", find 2-Step Verification (you need this enabled first)\nOnce 2-Step Verification is on, search for "App passwords" or find it in the Security section\nCreate a new App Password — select "Mail" and your device, then Google will generate a 16-character password\nThat 16-character password is what you\'ll use in your SMTP code instead of your regular Gmail password.\n\nHave you already got 2-Step Verification enabled on your account?\n\nsetup_share()\nstart_share()\nUse &capture_tool to see what\'s on my screen at any time. It will show a browser tab I\'m sharing (currently gmail). Note that you won\'t see any earlier screenshots taken with this tool in our chat, so be sure to summarize important parts of what you see so you can refer back to it later.\n\nView my screen and tell me what you see, and what I should do next.\n\n\n\nDetails\nI can see you\'re on your Google Account Security page. Here\'s what I observe:\n\n2-Step Verification is already enabled (on since Jan\n\n--------------------------------------------------------------------------------\n\n'},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Let me search for more detailed information about these functions, including start_share\\nstart_share_matches = re.finditer(r\'.{0,800}start_share.{0,800}\', context, re.IGNORECASE | re.DOTALL)\\nstart_share_results = [match.group() for match in start_share_matches]\\n\\nprint(f\\"Found {len(start_share_results)} mentions of \'start_share\'\\")\\nprint(\\"\\\\n\\" + \\"=\\"*80 + \\"\\\\n\\")\\n\\n# Let\'s look at a couple matches\\nfor i, result in enumerate(start_share_results[:2]):\\n    print(f\\"Match {i+1}:\\")\\n    print(result)\\n    print(\\"\\\\n\\" + \\"-\\"*80 + \\"\\\\n\\")\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01NKjoCTbjAx5EZXbx2cCwkD', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01NKjoCTbjAx5EZXbx2cCwkD',
  'role': 'tool',
  'name': 'run_repl',
  'content': 'Found 5 mentions of \'start_share\'\n\n================================================================================\n\nMatch 1:\nints2.slice(350),  [1.0, 0.0, 0.0, 1.0]);\n  visiblePoints2 = rotate(visiblePoints2, 2);\n}\n\nlet lastTime = 0;\nfunction animate(currentTime) {\n  if (currentTime - lastTime >= 10) {\n    drawNext();\n    lastTime = currentTime;\n  }\n  requestAnimationFrame(animate);\n}\nrequestAnimationFrame(animate);\n"""\n\npreview_webgl(js_utils, js_line_drawing, script_two_sin_anim)\n\n@rt\ndef sin():\n  return simple_webgl(js_utils, js_line_drawing, script_two_sin_anim)\n----------\n# Dialog Class 7 - Understanding Source Files & Initial Setup (Zoom recordings high quality videos)\nTry an interactive version of this dialog: Sign up at solve.it.com, click Upload, and pass this URL.\n\nfrom fastcore.utils import *\nfrom dialoghelper import *\nfrom dialoghelper.capture import *\nfrom IPython.display import Image\nsetup_share()\nstart_share()\nSource clips:\n\nCopied!\n$ ls | cat\nGMT20251110-230648_Recording_1920x1080.mp4\nGMT20251110-230648_Recording_as_1728x1080.mp4\nGMT20251110-230648_Recording_avo_640x360.mp4\nGMT20251110-230648_Recording_gallery_1920x1080.mp4\nGMT20251110-230648_Recording_gvo_1280x720.mp4\nGMT20251110-230648_Recording.transcript.vtt\nsi2lesson3.webm\nUnderstanding Source Files & Initial Setup\nI want you to help me learn the basics of creating nice video from source Zoom recordings. We\'ll try to create a high quality video of the just-recording \'How to Solve it With Code\' lesson three that I (Jeremy Howard) made.\n\nI\'ve got a number of recordings as shown above, taken from Zoom\'s web recording with the setting enabled to save separate videos for each view. Search for info about what views are there and the filenames\n\n--------------------------------------------------------------------------------\n\nMatch 2:\nserver, bypassing standard NAT connection firewalls.\n\nUsage: bore <COMMAND>\n\nCommands:\n  local   Starts a local proxy to the remote server\n  server  Runs the remote proxy server\n  help    Print this message or the help of the given subcommand(s)\n\nOptions:\n  -h, --help     Print help\n  -V, --version  Print version\nI think you s/aarch64/aarch64 was wrong. I fixed it.\n\n\n\nYou\'re absolutely right! That sed replacement s/aarch64/aarch64/ was redundant - it should have been s/arm64/aarch64/ since macOS on Apple Silicon reports arm64 but the bore releases use aarch64. Good catch and fix!\n\nNow that bore is installed, what would you like to tunnel? Are you looking to expose a service from your Mac to Solveit, or the other way around?\n\ntmux and ssh\nsetup_share()\nstart_share()\nI want you to help me show the basics of using tmux and then ssh forwarding.\n\nUse &capture_tool to see what\'s on my screen at any time. It will show my Terminal. Note that you won\'t see any earlier screenshots taken with this tool in our chat, so be sure to summarize important parts of what you see so you can refer back to it later.\n\nView my screen and tell me what you see. I used tmux a to get here, since I already had 4 panes going. Then tell me how to create a new tmux window so we can run bore in it.\n\n\n\nDetails\nI can see your tmux session with 4 panes:\n\nTop-left: Shows kernel logs and server shutdowns\nTop-right: A file listing showing various files and directories\nBottom-left: Server shutdown logs and application info\nBottom-right: An HTTP server r\n\n--------------------------------------------------------------------------------\n\n'},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Let me search for more comprehensive documentation about these functions\\n# Let\'s look for imports and function definitions\\n\\nimport_matches = re.finditer(r\'.{0,1000}from dialoghelper.capture import.{0,1000}\', context, re.IGNORECASE | re.DOTALL)\\nimport_results = [match.group() for match in import_matches]\\n\\nprint(f\\"Found {len(import_results)} mentions of \'from dialoghelper.capture import\'\\")\\nprint(\\"\\\\n\\" + \\"=\\"*80 + \\"\\\\n\\")\\n\\nif import_results:\\n    print(\\"First match:\\")\\n    print(import_results[0])\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01EZFVBbDvTp2YE1Dwz6rd27', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01EZFVBbDvTp2YE1Dwz6rd27',
  'role': 'tool',
  'name': 'run_repl',
  'content': 'Found 3 mentions of \'from dialoghelper.capture import\'\n\n================================================================================\n\nFirst match:\nints1.map((n,i) => i%2 == 0 ? -n*0.8+0.1 : 0.2-n/2);\nlet visiblePoints1 = points1.slice();  \nlet visiblePoints2 = points2.slice();  \n\nfunction drawNext(){\n  drawLine(visiblePoints1.slice(380),  [0.0, 0.0, 1.0, 1.0]);\n  visiblePoints1 = rotate(visiblePoints1, 2);\n  drawLine(visiblePoints2.slice(350),  [1.0, 0.0, 0.0, 1.0]);\n  visiblePoints2 = rotate(visiblePoints2, 2);\n}\n\nlet lastTime = 0;\nfunction animate(currentTime) {\n  if (currentTime - lastTime >= 10) {\n    drawNext();\n    lastTime = currentTime;\n  }\n  requestAnimationFrame(animate);\n}\nrequestAnimationFrame(animate);\n"""\n\npreview_webgl(js_utils, js_line_drawing, script_two_sin_anim)\n\n@rt\ndef sin():\n  return simple_webgl(js_utils, js_line_drawing, script_two_sin_anim)\n----------\n# Dialog Class 7 - Understanding Source Files & Initial Setup (Zoom recordings high quality videos)\nTry an interactive version of this dialog: Sign up at solve.it.com, click Upload, and pass this URL.\n\nfrom fastcore.utils import *\nfrom dialoghelper import *\nfrom dialoghelper.capture import *\nfrom IPython.display import Image\nsetup_share()\nstart_share()\nSource clips:\n\nCopied!\n$ ls | cat\nGMT20251110-230648_Recording_1920x1080.mp4\nGMT20251110-230648_Recording_as_1728x1080.mp4\nGMT20251110-230648_Recording_avo_640x360.mp4\nGMT20251110-230648_Recording_gallery_1920x1080.mp4\nGMT20251110-230648_Recording_gvo_1280x720.mp4\nGMT20251110-230648_Recording.transcript.vtt\nsi2lesson3.webm\nUnderstanding Source Files & Initial Setup\nI want you to help me learn the basics of creating nice video from source Zoom recordings. We\'ll try to create a high quality video of the just-recording \'How to Solve it With Code\' lesson three that I (Jeremy Howard) made.\n\nI\'ve got a number of recordings as shown above, taken from Zoom\'s web recording with the setting enabled to save separate videos for each view. Search for info about what views are there and the filenames. Also I forgot to hit record for the first few minutes, so I\'ve added a .webm which is the youtube live recording, which we\'ll use for th\n'},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Let me search for more detailed instructions or explanations about screen sharing\\nscreen_share_pattern = re.finditer(r\'.{0,1500}(setup_share|start_share|capture_tool|screen sharing).{0,1500}\', context, re.IGNORECASE | re.DOTALL)\\nscreen_share_results = [match.group() for match in screen_share_pattern]\\n\\nprint(f\\"Found {len(screen_share_results)} relevant matches\\")\\n\\n# Let me create a comprehensive query for the LLM with all the relevant excerpts\\nquery_text = \\"\\"\\"\\nBased on the following excerpts from the context, please explain:\\n1. What is setup_share() and how do you use it?\\n2. What is start_share() and how do you use it?\\n3. What is capture_tool and how do you use it?\\n4. What are the complete steps to share your screen with Claude?\\n\\nHere are the relevant excerpts:\\n\\n\\"\\"\\"\\n\\n# Add unique excerpts\\nseen = set()\\nunique_excerpts = []\\nfor result in screen_share_results[:10]:  # Take first 10 to avoid too much repetition\\n    # Create a simple hash to check uniqueness\\n    snippet = result[:200]\\n    if snippet not in seen:\\n        seen.add(snippet)\\n        unique_excerpts.append(result)\\n\\nquery_text += \\"\\\\n\\\\n---EXCERPT---\\\\n\\\\n\\".join(unique_excerpts[:8])\\n\\nprint(f\\"Prepared query with {len(unique_excerpts[:8])} unique excerpts\\")\\nprint(f\\"Query length: {len(query_text)} characters\\")\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01AWsXie42QPvCSEV4vdT728', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01AWsXie42QPvCSEV4vdT728',
  'role': 'tool',
  'name': 'run_repl',
  'content': 'Found 8 relevant matches\nPrepared query with 8 unique excerpts\nQuery length: 23584 characters\n'},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Now let\'s query the LLM with this information\\nanswer = llm_query(query_text)\\nprint(answer)\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01ATFVPsEQy6axQ4WvBCZ6B4', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01ATFVPsEQy6axQ4WvBCZ6B4',
  'role': 'tool',
  'name': 'run_repl',
  'content': "Based on the excerpts, here are the answers to your questions:\n\n## 1. What is setup_share() and how do you use it?\n\n`setup_share()` is a function from the `dialoghelper.capture` module that initializes screen sharing/capture functionality. \n\n**How to use it:**\n```python\nfrom dialoghelper.capture import *\nsetup_share()\n```\n\nYou call it once at the beginning of your session to set up the screen capture system before you start sharing your screen with Claude.\n\n## 2. What is start_share() and how do you use it?\n\n`start_share()` is a function that actually begins the screen sharing session after setup.\n\n**How to use it:**\n```python\nfrom dialoghelper.capture import *\nsetup_share()\nstart_share()\n```\n\nYou call it after `setup_share()` to start actively sharing your screen. You may need to call it again if you switch to a different window or application that you want Claude to see.\n\n## 3. What is capture_tool and how do you use it?\n\n`capture_tool` (referenced as `&capture_tool` in prompts) is a tool that allows Claude to take screenshots of your shared screen.\n\n**How to use it:**\nYou reference it in your prompts to Claude like this:\n```\nUse &capture_tool to see what's on my screen at any time. It will show my [description of what's on screen].\n```\n\nThen you can ask Claude to view your screen, and it will use the tool to capture and analyze what you're currently showing.\n\n## 4. Complete steps to share your screen with Claude:\n\n1. **Import the necessary module:**\n   ```python\n   from dialoghelper.capture import *\n   ```\n\n2. **Set up screen sharing:**\n   ```python\n   setup_share()\n   ```\n\n3. **Start sharing:**\n   ```python\n   start_share()\n   ```\n\n4. **Tell Claude about the capture tool in your prompt:**\n   ```\n   Use &capture_tool to see what's on my screen at any time. It will show my [Terminal/Browser/Application]. \n   Note that you won't see any earlier screenshots taken with this tool in our chat, so be sure to \n   summarize important parts of what you see so you can refer back to it later.\n   ```\n\n5. **Ask Claude to view your screen when needed:**\n   ```\n   View my screen and tell me what you see.\n   ```\n\n6. **Call `start_share()` again if you switch windows:**\n   ```python\n   start_share()  # Call this after switching to a different window/tab\n   ```\n\n**Important note:** Claude cannot see previous screenshots in the conversation, so it's crucial to ask Claude to summarize important details from each screenshot so it can reference them later without needing to take another screenshot.\n"},
 Message(content="Perfect! I've found the information you need. Here's how to share your screen with Claude:\n\n## Complete Screen Sharing Setup\n\n### Step 1: Import the Required Module\n```python\nfrom dialoghelper.capture import *\n```\n\n### Step 2: Set Up Screen Sharing\n```python\nsetup_share()\n```\nThis initializes the screen capture functionality.\n\n### Step 3: Start Sharing\n```python\nstart_share()\n```\nThis begins actively sharing your screen. **Note:** You may need to call `start_share()` again if you switch to a different window or tab that you want Claude to see.\n\n### Step 4: Tell Claude About the Capture Tool\nInclude this in your prompt to Claude:\n```\nUse &capture_tool to see what's on my screen at any time. It will show my [Terminal/Browser/Application]. \nNote that you won't see any earlier screenshots taken with this tool in our chat, so be sure to \nsummarize important parts of what you see so you can refer back to it later.\n```\n\n### Step 5: Ask Claude to View Your Screen\n```\nView my screen and tell me what you see.\n```\n\n## Important Notes:\n- **Claude cannot see previous screenshots** in the conversation history, so always ask Claude to summarize important details from each screenshot\n- If you switch windows or tabs, call `start_share()` again to update what Claude can see\n- The `&capture_tool` is the actual tool Claude uses to take screenshots when you ask it to view your screen\n\n## Example Usage:\n```python\nfrom dialoghelper.capture import *\nsetup_share()\nstart_share()\n```\n\nThen in your message to Claude:\n```\nUse &capture_tool to see what's on my screen at any time. It will show my Terminal. \nView my screen and tell me what you see.\n```", role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None})]

Prompt: 30

Here is the history, what happened? It seems I ran out of tool uses and received no summary

Assistant: 415

I see what happened! Look at this message near the end:

Copied!

{'role': 'assistant',
 'content': 'You have no more tool uses. Please summarize your findings...'}

The model hit max_steps=10 and lisette injected a message telling it to summarize. But then the model responded with:

Copied!

'I found excellent information! Let me search for more specific use cases in the context.'

Instead of summarizing, it said "let me search more" — which it couldn't do! That's a model behavior issue.

Two fixes:

Increase max_steps — The context is huge (455k tokens), so 10 steps wasn't enough:
Copied!
```
rlm(q, max_steps=20)
```

Improve system prompt — Add something like:

Copied!

When you've gathered enough information or are running low on steps, synthesize your findings into a final answer. Do not say "let me search more" — provide what you have.

The good news: the model did find excellent content about investment manager use cases (look at the tool responses about "Programmable Equity Modeler", due diligence, target audience for VCs, etc.). It just failed to summarize it.

Want to try again with higher max_steps?

Code: 280 ()

q = """Give me a detailed and verbose overview of what is SolveIt as a methodology, framework, tool, platform and its capabilities. Especially in the context of a VC MedTech Fund Manager interested in using it for due diligence, for research, for processing pitch decks or investment decks of startups, for determining if startups align with its investment thesis and is a good fit (check size, stage, regulatory, addressable market, reimbursement of revenue path, clinical utility, etc), as well as drafting or analyzing the term-sheet and other financial details during an investment round (Cap Table, Term Sheet, Preferred shares conditions and implications). The output should be a markdown document that can be used as context for other LLMs in thinking mode to further analyze the workflow, methodology and best way to implement it.
"""

rlm = RLM(context="\n\n".join([lessons, notes]))
rlm(q,max_steps=30)

Output: 2,125

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 5
      1 q = """Give me a detailed and verbose overview of what is SolveIt as a methodology, framework, tool, platform and its capabilities. Especially in the context of a VC MedTech Fund Manager interested in using it for due diligence, for research, for processing pitch decks or investment decks of startups, for determining if startups align with its investment thesis and is a good fit (check size, stage, regulatory, addressable market, reimbursement of revenue path, clinical utility, etc), as well as drafting or analyzing the term-sheet and other financial details during an investment round (Cap Table, Term Sheet, Preferred shares conditions and implications). The output should be a markdown document that can be used as context for other LLMs in thinking mode to further analyze the workflow, methodology and best way to implement it.
      2 """
      4 rlm = RLM(context="\n\n".join([lessons, notes]))
----> 5 rlm(q,max_steps=30)

Cell In[1], line 32, in RLM.__call__(self, query, max_steps)
     31 def __call__(self, query: str, max_steps=10):
---> 32     return self.chat(query, max_steps=max_steps)

File /usr/local/lib/python3.12/site-packages/lisette/core.py:343, in Chat.__call__(self, msg, prefill, temp, think, search, stream, max_steps, final_prompt, return_all, **kwargs)
    341 if stream: return result_gen              # streaming
    342 elif return_all: return list(result_gen)  # toolloop behavior
--> 343 else: return last(result_gen)

File /usr/local/lib/python3.12/site-packages/fastcore/basics.py:749, in last(x, f, negate, **kwargs)
    747 if f: x = filter_ex(x, f=f, negate=negate, gen=True, **kwargs)
    748 res = None
--> 749 for res in x: pass
    750 return res

File /usr/local/lib/python3.12/site-packages/lisette/core.py:324, in Chat._call(self, msg, prefill, temp, think, search, stream, max_steps, step, final_prompt, tool_choice, **kwargs)
    322 if step>=max_steps-1: prompt,tool_choice,search = final_prompt,'none',False
    323 else: prompt = None
--> 324 yield from self._call(
    325     prompt, prefill, temp, think, search, stream, max_steps, step+1,
    326     final_prompt, tool_choice, **kwargs)

File /usr/local/lib/python3.12/site-packages/lisette/core.py:324, in Chat._call(self, msg, prefill, temp, think, search, stream, max_steps, step, final_prompt, tool_choice, **kwargs)
    322 if step>=max_steps-1: prompt,tool_choice,search = final_prompt,'none',False
    323 else: prompt = None
--> 324 yield from self._call(
    325     prompt, prefill, temp, think, search, stream, max_steps, step+1,
    326     final_prompt, tool_choice, **kwargs)

    [... skipping similar frames: Chat._call at line 324 (12 times)]

File /usr/local/lib/python3.12/site-packages/lisette/core.py:324, in Chat._call(self, msg, prefill, temp, think, search, stream, max_steps, step, final_prompt, tool_choice, **kwargs)
    322 if step>=max_steps-1: prompt,tool_choice,search = final_prompt,'none',False
    323 else: prompt = None
--> 324 yield from self._call(
    325     prompt, prefill, temp, think, search, stream, max_steps, step+1,
    326     final_prompt, tool_choice, **kwargs)

File /usr/local/lib/python3.12/site-packages/lisette/core.py:319, in Chat._call(self, msg, prefill, temp, think, search, stream, max_steps, step, final_prompt, tool_choice, **kwargs)
    316 yield res
    318 if tcs := m.tool_calls:
--> 319     tool_results=[_lite_call_func(tc, ns=self.ns) for tc in tcs]
    320     self.hist+=tool_results
    321     for r in tool_results: yield r

File /usr/local/lib/python3.12/site-packages/lisette/core.py:212, in _lite_call_func(tc, ns, raise_on_err)
    210 try: fargs = json.loads(tc.function.arguments)
    211 except Exception as e: raise ValueError(f"Failed to parse function arguments: {tc.function.arguments}") from e
--> 212 res = call_func(tc.function.name, fargs,ns=ns)
    213 if isinstance(res, ToolResponse): res = res.content
    214 else: res = str(res)

File /usr/local/lib/python3.12/site-packages/toolslm/funccall.py:217, in call_func(fc_name, fc_inputs, ns, raise_on_err)
    215 try: return func(**inps)
    216 except Exception as e:
--> 217     if raise_on_err: raise e from None
    218     else: return traceback.format_exc()

File /usr/local/lib/python3.12/site-packages/toolslm/funccall.py:215, in call_func(fc_name, fc_inputs, ns, raise_on_err)
    213 inps = {re.sub(r'\W', '', k):v for k,v in fc_inputs.items()}
    214 inps = _coerce_inputs(func, inps)
--> 215 try: return func(**inps)
    216 except Exception as e:
    217     if raise_on_err: raise e from None

TypeError: RLM.__init__.<locals>.run_repl() missing 1 required positional argument: 'python_code'

Prompt: 4

What happened?

Assistant: 417

The error suggests that when lisette tried to call run_repl, the python_code argument wasn't being passed correctly.

This might be an issue with how lisette introspects closure functions defined inside __init__.

Try making run_repl a method of the class instead of a closure:

Copied!

class RLM:
    def __init__(self, context: str, root_model="claude-sonnet-4-5", recursive_model=None, temperature=0.0):
        self.context = context
        self.root_model = root_model
        self.recursive_model = recursive_model if recursive_model else root_model
        self.shell = self._prep_shell()
        self.chat = Chat(model=self.root_model, tools=[self.run_repl], sp=REPL_SYSTEM_PROMPT, temp=temperature)

    def run_repl(self, python_code: str) -> str:
        "Execute Python code in the REPL environment"
        result = self.shell.run_cell(python_code)
        if result.error_in_exec or result.error_before_exec:
            return f"Error: {result.error_in_exec or result.error_before_exec}"
        output = result.stdout or ""
        if result.result is not None:
            output += str(result.result)
        return output

    # ... rest of the class

Does that fix it?

Code: 280 ()

q = """Give me a detailed and verbose overview of what is SolveIt as a methodology, framework, tool, platform and its capabilities. Especially in the context of a VC MedTech Fund Manager interested in using it for due diligence, for research, for processing pitch decks or investment decks of startups, for determining if startups align with its investment thesis and is a good fit (check size, stage, regulatory, addressable market, reimbursement of revenue path, clinical utility, etc), as well as drafting or analyzing the term-sheet and other financial details during an investment round (Cap Table, Term Sheet, Preferred shares conditions and implications). The output should be a markdown document that can be used as context for other LLMs in thinking mode to further analyze the workflow, methodology and best way to implement it.
"""

rlm = RLM(context="\n\n".join([lessons, notes]))
rlm(q,max_steps=15)

Output: 12,007

Based on my comprehensive analysis of the SolveIt course transcripts, I can now provide you with a detailed overview:

SolveIt: Comprehensive Overview for VC MedTech Fund Managers

Executive Summary

SolveIt is a cloud-based, AI-augmented development platform and methodology specifically designed for human-agent collaborative work. It combines an interactive coding environment (Linux VPS + Jupyter-like interface + AI assistance + version control) with a rigorous problem-solving methodology inspired by George Pólya's "How to Solve It" and literate programming principles.

For VC MedTech Fund Managers, SolveIt represents a computational workbench for venture economics and due diligence, enabling transparent, programmable financial modeling and analysis that goes beyond Excel spreadsheets and opaque SaaS platforms.

1. What is SolveIt?

As a Platform

Cloud-based Linux VPS with persistent storage and full terminal access
Integrated development environment combining:
- Code cells (Python execution)
- Prompt cells (AI queries)
- Note cells (Markdown documentation)
Built-in AI assistance (Claude integration) with "thinking mode"
Git version control and one-click publishing/sharing
FastHTML integration for instant web app deployment via public URLs
Secrets management for API keys and sensitive data

As a Methodology

The "SolveIt Method" emphasizes:

Incremental Building: Write 1-2 lines of code at a time
Immediate Feedback: Execute and inspect output after every step
Iterative Abstraction: Explore → understand → package into reusable functions
Learning-Oriented: Focus on understanding, not just completing tasks
Anti-Outsourcing: Keep humans in control, AI as assistant not replacement

As a Framework

Dialogue-based workflow: Conversations combining code, AI queries, and documentation
Tool-calling architecture: Python functions become AI-accessible tools via simple decorators
Claudette library: Thin wrapper over Anthropic API for agent orchestration
Context management: Pin, hide, and organize messages for long-term projects

2. Core Capabilities

Development Features

Python-first environment with excellent introspection (dot completion, __dict__ inspection)
Pre-installed data science libraries (Pandas, NumPy, SciPy, Matplotlib)
Keyboard-centric workflow with extensive shortcuts
Cell manipulation: Split, merge, copy, duplicate, reorder
Symbol Browser (Ctrl+Shift+V): Inspect variables and DataFrames
Non-linear editing: Modify previous messages and re-execute
Module export: Extract working code into reusable libraries

AI Integration

Native AI assistance: Built-in Claude integration for development help
Thinking mode (Cmd+Shift+D): Extended reasoning for complex problems
Tool use: Expose Python functions as AI-accessible tools with &function_name syntax
Claudette for programmatic AI: Structured extraction, tool loops, extended thinking
Prompt caching: Cost reduction for repeated document analysis

Document Processing

PDF to Markdown: DataLab's marker tool via Replicate API
Web scraping: Jina Reader for HTML content
ArXiv integration: Direct LaTeX source downloads
Context Kit: Built-in PDF reader
Multi-format support: DOCX via Pandoc, various document types

3. VC MedTech Fund Manager Use Cases

A. Due Diligence Workflows

1. Pitch Deck & Investment Memo Analysis

Document ingestion: Convert PDFs to markdown for AI analysis
Structured extraction: Parse key metrics (market size, revenue projections, team credentials)
Competitive analysis: Cross-reference claims against market data
Technical feasibility: Rapid prototyping to validate technical claims
Regulatory assessment: Extract and analyze FDA/regulatory pathways

2. Financial Modeling & Term Sheet Analysis

Cap Table Modeling:

Multi-round dilution analysis using Pandas DataFrames
Price per share (PPS) calculations for pre/post-money valuations
Employee stock option pool (ESOP) impact modeling
Ownership percentage tracking across financing rounds
Share class management (Common, Preferred Series A/B/C)

Term Sheet Economics:

Liquidation preferences (non-participating, participating, capped)
Anti-dilution provisions (full-ratchet, weighted-average)
Conversion mechanics and thresholds
Seniority structures in liquidation events

Waterfall Analysis:

Distribution modeling using Pandas groupby operations
Payout curve visualization
Scenario analysis for different exit valuations
Monte Carlo simulations for probabilistic outcomes

3. MedTech-Specific Due Diligence

Regulatory Path Analysis:

FDA approval pathway assessment (510(k), PMA, De Novo)
Clinical trial design evaluation
Regulatory timeline and cost modeling
Reimbursement pathway analysis (CPT codes, coverage decisions)

Clinical Utility Assessment:

Evidence quality evaluation
Clinical endpoint analysis
Health economics modeling (QALY, cost-effectiveness)
Comparative effectiveness research

Market Analysis:

Addressable market sizing (TAM/SAM/SOM)
Reimbursement landscape mapping
Competitive positioning
Revenue path modeling (direct sales, licensing, partnerships)

4. Technical Due Diligence

AI/ML maturity assessment: MLOps practices, data strategy, model validation
Architecture review: Scalability, technical debt, infrastructure
Data quality evaluation: Dataset size, diversity, labeling quality
IP analysis: Patent landscape, freedom to operate

B. Investment Decision Support

1. Automated Term Sheet Parsing

Copied!

# Using Claudette structured extraction
class TermSheetEconomics:
    pre_money_valuation_usd: float
    investment_amount_usd: float
    liquidation_preference: str
    option_pool_percent: float
    anti_dilution: str

extracted = cli.structured(term_sheet_text, TermSheetEconomics)
# Automatically populate financial models

2. Natural Language Scenario Analysis

Conversational interface for complex "what-if" queries
Multi-step analysis orchestrated by AI agents
Example: "Model a $10M Series B at $100M pre-money with 1.5x participating preferred. Show Series A returns at $500M vs $300M exit."

3. Investment Thesis Alignment

Automated Screening:

Check size: Revenue thresholds, funding stage
Stage alignment: Seed, Series A/B/C criteria
Regulatory fit: FDA pathway compatibility
Market size: Addressable market requirements
Reimbursement: Coverage pathway viability
Clinical utility: Evidence standards

Scoring System:

Programmatic evaluation against investment criteria
Weighted scoring across multiple dimensions
Automated flagging of misalignments
Comparative analysis across pipeline

C. Portfolio Management

1. Cap Table Management

OCF compliance: Open Cap Table Format integration
API integrations: Carta, Pulley, AngelList connectors
Automated updates: Delta synchronization with systems of record
Portfolio-wide analysis: Aggregated ownership, dilution tracking

2. Performance Monitoring

Automated data pipelines from portfolio companies
KPI tracking (ARR, burn rate, runway, clinical milestones)
Quarterly LP report generation
Alert systems for key metrics

3. Follow-on Decision Support

Pro-rata participation modeling
Dilution impact analysis
Portfolio construction optimization
Risk-adjusted return projections

4. Technical Architecture for VC Workflows

Data Stack

Core Libraries:

Pandas: Cap table DataFrames, financial modeling, data wrangling
NumPy: Vectorized calculations, Monte Carlo simulations
SciPy: Statistical analysis, optimization
Matplotlib/Plotly: Payout curves, scenario visualizations

Integration Layer:

OCF (Open Cap Table Format): Industry-standard cap table data format
RESTful APIs: Carta, Pulley, AngelList, portfolio company systems
Document parsers: PDF, DOCX, HTML extraction
Database: Vector databases for research knowledge bases

AI Agent Architecture

Claudette Components:

Client.structured(): Structured data extraction from documents
Chat.toolloop(): Multi-step agentic workflows
Extended thinking: Complex reasoning for novel deal structures
Prompt caching: Cost optimization for repeated document analysis

Tool Use Pattern:

Copied!

# Define tools as Python functions
def calculate_dilution(pre_money: float, investment: float, option_pool: float):
    """Calculate post-money ownership percentages"""
    # Implementation
    
# Expose to AI with &calculate_dilution in prompts
# AI orchestrates multi-step analysis automatically

Deployment Options

SolveIt native: Persistent environment, integrated workflow
FastHTML web apps: Public URLs for stakeholder sharing
API endpoints: Integration with existing VC tools
Jupyter exports: Standard notebook format for portability

5. Implementation Roadmap for VC Fund Managers

Phase 1: Foundations (Weeks 1-4)

Technical Skills:

Python fundamentals (PyFi course)
NumPy basics for numerical operations
SolveIt platform mastery (keyboard shortcuts, workflow)

Domain Knowledge:

Venture Deals Chapters 1-5 (basics)
Founder's Pocket Guide: Cap Tables

Deliverable: Single-round ownership calculator

Phase 2: Cap Table Modeling (Weeks 5-9)

Technical Skills:

Pandas DataFrames and hierarchical indexing
Data wrangling and transformation
Git version control

Domain Knowledge:

Stock options and ESOP mechanics
Preferred shares structures
Multi-round dilution dynamics

Deliverable: Multi-round dilution modeler with option pool

Phase 3: Venture Economics (Weeks 10-15)

Technical Skills:

Pandas groupby operations for waterfall analysis
Conditional logic for complex terms
Visualization (Matplotlib, Plotly)

Domain Knowledge:

Venture Deals Chapters 6-8 (Economics - CRITICAL)
Liquidation preferences deep dive
Term sheet negotiation dynamics

Deliverable: Waterfall analyzer with conversion mechanics

Phase 4: AI Integration (Weeks 16-20)

Technical Skills:

Claudette structured extraction
Tool use and agent orchestration
FastHTML for web interfaces

Domain Knowledge:

Term sheet interaction effects
Scenario analysis frameworks

Deliverables:

Automated term sheet parser
Natural language scenario simulator
Monte Carlo exit modeling engine

Phase 5: MedTech Specialization (Weeks 21+)

Technical Skills:

Machine learning fundamentals (PyFi Applied ML)
MLOps and production systems (MLE Blueprint)
Healthcare AI applications

Domain Knowledge:

FDA regulatory pathways
Clinical trial design
Reimbursement landscape
Health economics

Deliverable: MedTech Due Diligence Assistant

6. Key Differentiators vs. Existing Tools

vs. Excel

✅ Transparency: Explicit, auditable logic vs. hidden formulas
✅ Version control: Git tracking vs. file versioning chaos
✅ Scalability: Portfolio-wide analysis vs. single-file limitations
✅ Automation: Programmatic workflows vs. manual updates
✅ Advanced analytics: Monte Carlo, ML integration vs. basic functions

vs. Carta/Pulley (Systems of Record)

✅ Custom analysis: Proprietary models vs. standardized reports
✅ Flexibility: Any calculation vs. platform constraints
✅ Integration: Multi-source data vs. single system
✅ Intelligence layer: Advanced analytics on top of SOR
⚠️ Not a replacement: Complementary, not competitive

vs. Tactyc/Other VC Tools

✅ White box: Full transparency vs. black box calculations
✅ Customization: Unlimited vs. predefined scenarios
✅ Learning: Build understanding vs. abstract complexity
✅ Proprietary alpha: Unique analytical advantages
⚠️ Higher learning curve: Requires programming skills

vs. ChatGPT/Claude/Copilot

✅ Integrated execution: Code + AI + results in one environment
✅ Learning-first: Guidance vs. complete solutions
✅ Persistent context: Long-term projects vs. ephemeral chats
✅ Tool orchestration: Multi-step workflows vs. single responses
✅ Version control: Git integration vs. lost conversations

7. Best Practices for VC Fund Managers

Workflow Principles

Start small: Single company before portfolio-wide
Validate incrementally: Test against known results (Carta outputs, Tactyc models)
Document assumptions: Markdown notes for every model decision
Version control everything: Git commits for audit trail
Build reusable functions: Extract patterns into libraries

Data Management

OCF compliance: Design for interoperability from day one
Secure storage: Use SolveIt's persistent VPS for sensitive data
API-first: Automate data ingestion where possible
Backup strategy: Regular exports to external storage
Access control: Manage secrets and API keys properly

AI Integration Strategy

Structured extraction first: Automate data entry from documents
Tool use for analysis: Natural language interface for scenarios
Extended thinking for complexity: Novel deal structures, edge cases
Prompt caching for efficiency: Large documents, repeated analysis
Human-in-the-loop: Always verify AI outputs, especially financial calculations

Team Collaboration

Shared dialogues: One-click publishing for team review
Module libraries: Extract reusable code for team use
Documentation: Literate programming style for transparency
Training: Invest in team SolveIt proficiency
Dual interface: Power users (notebooks) + stakeholders (FastHTML dashboards)

8. Commercial Considerations

Positioning Strategy

"Intelligence Layer for Venture Economics"

Sits on top of systems of record (Carta, Pulley)
Provides advanced analytics beyond standard platforms
Targets data-driven VCs and sophisticated analysts
Differentiates on transparency, customization, and analytical power

Target Market

Primary: Emerging VC managers (sub-$100M AUM)
Secondary: Established funds with analytical teams
Tertiary: Corporate venture arms, family offices
MedTech focus: Regulatory/clinical analysis differentiation

Monetization Options

Personal use: Free/low-cost SolveIt subscription
Team licensing: Shared libraries and workflows
SaaS product: FastHTML dashboards with Python backend
Consulting: Custom model development for funds
Open-source core + premium features: Community + commercial

Competitive Moats

Methodology: SolveIt method creates better models
Transparency: White-box approach builds trust
Customization: Unlimited analytical flexibility
Integration: OCF compliance + API ecosystem
Domain expertise: MedTech-specific frameworks

9. Risk Mitigation

Technical Risks

Accuracy: Rigorous testing against known results
Scalability: Start simple, add complexity incrementally
Maintenance: Version control and documentation
Security: Proper secrets management, access control

Operational Risks

Learning curve: Invest in training, start with simple models
Platform dependency: Export to standard formats (Jupyter, OCF)
Team adoption: Dual interface strategy (notebooks + GUI)
Regulatory: Ensure audit trail and transparency

Financial Risks

Build vs. buy: Start personal use, commercialize if validated
Time investment: Phased approach, deliver value incrementally
Market fit: Validate with pilot users before full build

10. Success Metrics

Personal Capability

Time to analyze pitch deck: <2 hours (vs. days)
Term sheet modeling accuracy: 100% match with Carta
Scenario analysis speed: Minutes (vs. hours in Excel)
Portfolio monitoring: Automated (vs. manual quarterly)

Team Impact

Due diligence cycle time: 50%

id: chatcmpl-c34015c0-fc62-4f84-89c8-23b0f84cad43
model: claude-sonnet-4-5-20250929
finish_reason: length
usage: Usage(completion_tokens=4096, prompt_tokens=25539, total_tokens=29635, completion_tokens_details=None, prompt_tokens_details=PromptTokensDetailsWrapper(audio_tokens=None, cached_tokens=0, text_tokens=None, image_tokens=None, cache_creation_tokens=0, cache_creation_token_details=CacheCreationTokenDetails(ephemeral_5m_input_tokens=0, ephemeral_1h_input_tokens=0)), cache_creation_input_tokens=0, cache_read_input_tokens=0)

Note: 4,411

SolveIt: Comprehensive Overview for VC MedTech Fund Managers Executive Summary SolveIt is a cloud-based, AI-augmented development platform and methodology specifically designed for human-agent collaborative work. It combines an interactive coding environment (Linux VPS + Jupyter-like interface + AI assistance + version control) with a rigorous problem-solving methodology inspired by George Pólya's "How to Solve It" and literate programming principles.

For VC MedTech Fund Managers, SolveIt represents a computational workbench for venture economics and due diligence, enabling transparent, programmable financial modeling and analysis that goes beyond Excel spreadsheets and opaque SaaS platforms.

What is SolveIt? As a Platform Cloud-based Linux VPS with persistent storage and full terminal access Integrated development environment combining: Code cells (Python execution) Prompt cells (AI queries) Note cells (Markdown documentation) Built-in AI assistance (Claude integration) with "thinking mode" Git version control and one-click publishing/sharing FastHTML integration for instant web app deployment via public URLs Secrets management for API keys and sensitive data As a Methodology The "SolveIt Method" emphasizes:

Incremental Building: Write 1-2 lines of code at a time Immediate Feedback: Execute and inspect output after every step Iterative Abstraction: Explore → understand → package into reusable functions Learning-Oriented: Focus on understanding, not just completing tasks Anti-Outsourcing: Keep humans in control, AI as assistant not replacement As a Framework Dialogue-based workflow: Conversations combining code, AI queries, and documentation Tool-calling architecture: Python functions become AI-accessible tools via simple decorators Claudette library: Thin wrapper over Anthropic API for agent orchestration Context management: Pin, hide, and organize messages for long-term projects 2. Core Capabilities Development Features Python-first environment with excellent introspection (dot completion, dict inspection) Pre-installed data science libraries (Pandas, NumPy, SciPy, Matplotlib) Keyboard-centric workflow with extensive shortcuts Cell manipulation: Split, merge, copy, duplicate, reorder Symbol Browser (Ctrl+Shift+V): Inspect variables and DataFrames Non-linear editing: Modify previous messages and re-execute Module export: Extract working code into reusable libraries AI Integration Native AI assistance: Built-in Claude integration for development help Thinking mode (Cmd+Shift+D): Extended reasoning for complex problems Tool use: Expose Python functions as AI-accessible tools with &function_name syntax Claudette for programmatic AI: Structured extraction, tool loops, extended thinking Prompt caching: Cost reduction for repeated document analysis Document Processing PDF to Markdown: DataLab's marker tool via Replicate API Web scraping: Jina Reader for HTML content ArXiv integration: Direct LaTeX source downloads Context Kit: Built-in PDF reader Multi-format support: DOCX via Pandoc, various document types 3. VC MedTech Fund Manager Use Cases A. Due Diligence Workflows

Pitch Deck & Investment Memo Analysis Document ingestion: Convert PDFs to markdown for AI analysis Structured extraction: Parse key metrics (market size, revenue projections, team credentials) Competitive analysis: Cross-reference claims against market data Technical feasibility: Rapid prototyping to validate technical claims Regulatory assessment: Extract and analyze FDA/regulatory pathways
Financial Modeling & Term Sheet Analysis Cap Table Modeling:

Multi-round dilution analysis using Pandas DataFrames Price per share (PPS) calculations for pre/post-money valuations Employee stock option pool (ESOP) impact modeling Ownership percentage tracking across financing rounds Share class management (Common, Preferred Series A/B/C) Term Sheet Economics:

Liquidation preferences (non-participating, participating, capped) Anti-dilution provisions (full-ratchet, weighted-average) Conversion mechanics and thresholds Seniority structures in liquidation events Waterfall Analysis:

Distribution modeling using Pandas groupby operations Payout curve visualization Scenario analysis for different exit valuations Monte Carlo simulations for probabilistic outcomes 3. MedTech-Specific Due Diligence Regulatory Path Analysis:

FDA approval pathway assessment (510(k), PMA, De Novo) Clinical trial design evaluation Regulatory timeline and cost modeling Reimbursement pathway analysis (CPT codes, coverage decisions) Clinical Utility Assessment:

Evidence quality evaluation Clinical endpoint analysis Health economics modeling (QALY, cost-effectiveness) Comparative effectiveness research Market Analysis:

Addressable market sizing (TAM/SAM/SOM) Reimbursement landscape mapping Competitive positioning Revenue path modeling (direct sales, licensing, partnerships) 4. Technical Due Diligence AI/ML maturity assessment: MLOps practices, data strategy, model validation Architecture review: Scalability, technical debt, infrastructure Data quality evaluation: Dataset size, diversity, labeling quality IP analysis: Patent landscape, freedom to operate B. Investment Decision Support

Automated Term Sheet Parsing Copied!

Using Claudette structured extraction

class TermSheetEconomics: pre_money_valuation_usd: float investment_amount_usd: float liquidation_preference: str option_pool_percent: float anti_dilution: str

extracted = cli.structured(term_sheet_text, TermSheetEconomics)

Automatically populate financial models

Natural Language Scenario Analysis Conversational interface for complex "what-if" queries Multi-step analysis orchestrated by AI agents Example: "Model a $10M Series B at $100M pre-money with 1.5x participating preferred. Show Series A returns at $500M vs $300M exit."
Investment Thesis Alignment Automated Screening:

Check size: Revenue thresholds, funding stage Stage alignment: Seed, Series A/B/C criteria Regulatory fit: FDA pathway compatibility Market size: Addressable market requirements Reimbursement: Coverage pathway viability Clinical utility: Evidence standards Scoring System:

Programmatic evaluation against investment criteria Weighted scoring across multiple dimensions Automated flagging of misalignments Comparative analysis across pipeline C. Portfolio Management

Cap Table Management OCF compliance: Open Cap Table Format integration API integrations: Carta, Pulley, AngelList connectors Automated updates: Delta synchronization with systems of record Portfolio-wide analysis: Aggregated ownership, dilution tracking
Performance Monitoring Automated data pipelines from portfolio companies KPI tracking (ARR, burn rate, runway, clinical milestones) Quarterly LP report generation Alert systems for key metrics
Follow-on Decision Support Pro-rata participation modeling Dilution impact analysis Portfolio construction optimization Risk-adjusted return projections
Technical Architecture for VC Workflows Data Stack Core Libraries:

Pandas: Cap table DataFrames, financial modeling, data wrangling NumPy: Vectorized calculations, Monte Carlo simulations SciPy: Statistical analysis, optimization Matplotlib/Plotly: Payout curves, scenario visualizations Integration Layer:

OCF (Open Cap Table Format): Industry-standard cap table data format RESTful APIs: Carta, Pulley, AngelList, portfolio company systems Document parsers: PDF, DOCX, HTML extraction Database: Vector databases for research knowledge bases AI Agent Architecture Claudette Components:

Client.structured(): Structured data extraction from documents Chat.toolloop(): Multi-step agentic workflows Extended thinking: Complex reasoning for novel deal structures Prompt caching: Cost optimization for repeated document analysis Tool Use Pattern:

Copied!

Define tools as Python functions

def calculate_dilution(pre_money: float, investment: float, option_pool: float): """Calculate post-money ownership percentages""" # Implementation

Copied!

Expose to AI with &calculate_dilution in prompts

AI orchestrates multi-step analysis automatically

Deployment Options SolveIt native: Persistent environment, integrated workflow FastHTML web apps: Public URLs for stakeholder sharing API endpoints: Integration with existing VC tools Jupyter exports: Standard notebook format for portability 5. Implementation Roadmap for VC Fund Managers Phase 1: Foundations (Weeks 1-4) Technical Skills:

Python fundamentals (PyFi course) NumPy basics for numerical operations SolveIt platform mastery (keyboard shortcuts, workflow) Domain Knowledge:

Venture Deals Chapters 1-5 (basics) Founder's Pocket Guide: Cap Tables Deliverable: Single-round ownership calculator

Phase 2: Cap Table Modeling (Weeks 5-9) Technical Skills:

Pandas DataFrames and hierarchical indexing Data wrangling and transformation Git version control Domain Knowledge:

Stock options and ESOP mechanics Preferred shares structures Multi-round dilution dynamics Deliverable: Multi-round dilution modeler with option pool

Phase 3: Venture Economics (Weeks 10-15) Technical Skills:

Pandas groupby operations for waterfall analysis Conditional logic for complex terms Visualization (Matplotlib, Plotly) Domain Knowledge:

Venture Deals Chapters 6-8 (Economics - CRITICAL) Liquidation preferences deep dive Term sheet negotiation dynamics Deliverable: Waterfall analyzer with conversion mechanics

Phase 4: AI Integration (Weeks 16-20) Technical Skills:

Claudette structured extraction Tool use and agent orchestration FastHTML for web interfaces Domain Knowledge:

Term sheet interaction effects Scenario analysis frameworks Deliverables:

Automated term sheet parser Natural language scenario simulator Monte Carlo exit modeling engine Phase 5: MedTech Specialization (Weeks 21+) Technical Skills:

Machine learning fundamentals (PyFi Applied ML) MLOps and production systems (MLE Blueprint) Healthcare AI applications Domain Knowledge:

FDA regulatory pathways Clinical trial design Reimbursement landscape Health economics Deliverable: MedTech Due Diligence Assistant

Key Differentiators vs. Existing Tools vs. Excel ✅ Transparency: Explicit, auditable logic vs. hidden formulas ✅ Version control: Git tracking vs. file versioning chaos ✅ Scalability: Portfolio-wide analysis vs. single-file limitations ✅ Automation: Programmatic workflows vs. manual updates ✅ Advanced analytics: Monte Carlo, ML integration vs. basic functions

vs. Carta/Pulley (Systems of Record) ✅ Custom analysis: Proprietary models vs. standardized reports ✅ Flexibility: Any calculation vs. platform constraints ✅ Integration: Multi-source data vs. single system ✅ Intelligence layer: Advanced analytics on top of SOR ⚠️ Not a replacement: Complementary, not competitive

vs. Tactyc/Other VC Tools ✅ White box: Full transparency vs. black box calculations ✅ Customization: Unlimited vs. predefined scenarios ✅ Learning: Build understanding vs. abstract complexity ✅ Proprietary alpha: Unique analytical advantages ⚠️ Higher learning curve: Requires programming skills

vs. ChatGPT/Claude/Copilot ✅ Integrated execution: Code + AI + results in one environment ✅ Learning-first: Guidance vs. complete solutions ✅ Persistent context: Long-term projects vs. ephemeral chats ✅ Tool orchestration: Multi-step workflows vs. single responses ✅ Version control: Git integration vs. lost conversations

Best Practices for VC Fund Managers Workflow Principles Start small: Single company before portfolio-wide Validate incrementally: Test against known results (Carta outputs, Tactyc models) Document assumptions: Markdown notes for every model decision Version control everything: Git commits for audit trail Build reusable functions: Extract patterns into libraries Data Management OCF compliance: Design for interoperability from day one Secure storage: Use SolveIt's persistent VPS for sensitive data API-first: Automate data ingestion where possible Backup strategy: Regular exports to external storage Access control: Manage secrets and API keys properly AI Integration Strategy Structured extraction first: Automate data entry from documents Tool use for analysis: Natural language interface for scenarios Extended thinking for complexity: Novel deal structures, edge cases Prompt caching for efficiency: Large documents, repeated analysis Human-in-the-loop: Always verify AI outputs, especially financial calculations Team Collaboration Shared dialogues: One-click publishing for team review Module libraries: Extract reusable code for team use Documentation: Literate programming style for transparency Training: Invest in team SolveIt proficiency Dual interface: Power users (notebooks) + stakeholders (FastHTML dashboards)
Commercial Considerations Positioning Strategy "Intelligence Layer for Venture Economics"

Sits on top of systems of record (Carta, Pulley) Provides advanced analytics beyond standard platforms Targets data-driven VCs and sophisticated analysts Differentiates on transparency, customization, and analytical power Target Market Primary: Emerging VC managers (sub-$100M AUM) Secondary: Established funds with analytical teams Tertiary: Corporate venture arms, family offices MedTech focus: Regulatory/clinical analysis differentiation Monetization Options Personal use: Free/low-cost SolveIt subscription Team licensing: Shared libraries and workflows SaaS product: FastHTML dashboards with Python backend Consulting: Custom model development for funds Open-source core + premium features: Community + commercial Competitive Moats Methodology: SolveIt method creates better models Transparency: White-box approach builds trust Customization: Unlimited analytical flexibility Integration: OCF compliance + API ecosystem Domain expertise: MedTech-specific frameworks 9. Risk Mitigation Technical Risks Accuracy: Rigorous testing against known results Scalability: Start simple, add complexity incrementally Maintenance: Version control and documentation Security: Proper secrets management, access control Operational Risks Learning curve: Invest in training, start with simple models Platform dependency: Export to standard formats (Jupyter, OCF) Team adoption: Dual interface strategy (notebooks + GUI) Regulatory: Ensure audit trail and transparency Financial Risks Build vs. buy: Start personal use, commercialize if validated Time investment: Phased approach, deliver value incrementally Market fit: Validate with pilot users before full build 10. Success Metrics Personal Capability Time to analyze pitch deck: <2 hours (vs. days) Term sheet modeling accuracy: 100% match with Carta Scenario analysis speed: Minutes (vs. hours in Excel) Portfolio monitoring: Automated (vs. manual quarterly) Team Impact Due diligence cycle time: 50%

Prompt: 10

What happened, it worked before!

Assistant: 196

The error shows RLM.__init__.<locals>.run_repl() — meaning it's still using the closure version from the old class definition.

Did you re-run the cell that defines the RLM class after making the change to convert run_repl to a method? If you edited it but didn't execute the cell, Python is still using the old class definition.

Try:

Re-run the cell with the updated RLM class (with run_repl as a method, not a closure)
Then create a new rlm = RLM(...) instance
Then run your query

Prompt: 25

Look at the past code, I think we ran that. Not sure what you mean

Assistant: 174

You're right, looking back at the code it was working before with the same class structure.

Looking at the traceback more closely: skipping similar frames: Chat._call at line 324 (19 times) — it made ~19 successful tool calls before failing!

So it's not that the class is broken — something went wrong on a specific tool call. Maybe the model sent malformed arguments on that particular call.

Can you check rlm.chat.hist[-5:] to see the last few messages before the crash? That might show us what arguments the model tried to pass.

Code: 10 ()

rlm.chat.hist[-5:]

Output: 4,216

[Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Now let me compile all the information I\'ve gathered into a comprehensive markdown document\\n# Let me also search for any additional features I might have missed\\n\\n# Search for key features\\nfeatures_to_check = [\\n    \\"FastHTML\\",\\n    \\"nbdev\\",\\n    \\"Claudette\\",\\n    \\"lisette\\",\\n    \\"tool\\",\\n    \\"agent\\",\\n    \\"export\\",\\n    \\"publish\\",\\n    \\"version\\"\\n]\\n\\nprint(\\"Checking for additional key features...\\")\\nprint(\\"=\\"*80 + \\"\\\\n\\")\\n\\nfeature_summary = {}\\nfor feature in features_to_check:\\n    count = len(re.findall(feature, context, re.IGNORECASE))\\n    feature_summary[feature] = count\\n    print(f\\"{feature}: {count} mentions\\")\\n"}', 'name': 'run_repl'}, 'id': 'toolu_011JGDbkSHobnDGQDVbgX3sZ', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_011JGDbkSHobnDGQDVbgX3sZ',
  'role': 'tool',
  'name': 'run_repl',
  'content': 'Checking for additional key features...\n================================================================================\n\nFastHTML: 72 mentions\nnbdev: 22 mentions\nClaudette: 79 mentions\nlisette: 15 mentions\ntool: 1066 mentions\nagent: 104 mentions\nexport: 97 mentions\npublish: 33 mentions\nversion: 240 mentions\n'},
 Message(content=None, role='assistant', tool_calls=[{'index': 0, 'function': {'arguments': '{"python_code": "\\n# Let me get information about Claudette and lisette which seem important\\n\\nquery8 = \\"\\"\\"\\nBased on this content, please explain:\\n\\n1. What is Claudette and how does it integrate with SolveIt?\\n2. What is lisette and its capabilities?\\n3. How do these tools enable structured data extraction and tool use?\\n4. Examples of using these for document processing and analysis\\n\\nHere\'s the context:\\n\\n\\"\\"\\" + context[550000:700000]\\n\\nresponse8 = llm_query(query8)\\nprint(\\"\\\\nResponse about Claudette and lisette:\\")\\nprint(\\"=\\"*80)\\nprint(response8)\\n"}', 'name': 'run_repl'}, 'id': 'toolu_01JwVZT3en1jqSMLiSpjXRpj', 'type': 'function'}], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None}),
 {'tool_call_id': 'toolu_01JwVZT3en1jqSMLiSpjXRpj',
  'role': 'tool',
  'name': 'run_repl',
  'content': '\nResponse about Claudette and lisette:\n================================================================================\nBased on the provided content, I\'ll explain these tools and their capabilities:\n\n## 1. What is Claudette and How Does it Integrate with SolveIt?\n\n**Claudette** isn\'t directly mentioned in the content, but based on context, it appears to be a client library for interacting with Claude (Anthropic\'s AI). The content shows **Claudet** (likely a typo for Claudette) being used within SolveIt for chat functionality.\n\n**Integration aspects:**\n- Used alongside other AI clients (OpenAI SDK via **Cozette**)\n- Handles Claude API interactions for the chat interface\n- Supports different models (e.g., "claude-haiku-4-5")\n\n## 2. What is Lisette and Its Capabilities?\n\n**Lisette** is a library for structured LLM interactions that appears in the code:\n\n```python\nfrom lisette import *\nsysp = "You are a chat model..."\nchat = Chat(model=\'claude-haiku-4-5\', sp=sysp)\nr = chat("Hello")\n```\n\n**Key capabilities:**\n- **Structured Chat Management**: Creates `Chat` objects with system prompts\n- **Model Abstraction**: Works with different models (Claude, GPT, etc.)\n- **Response Handling**: Provides structured response objects\n- **History Management**: Tracks conversation context (`chat.h`)\n- **Content Extraction**: Helper functions like `cts()` to extract content from responses\n\n## 3. How These Tools Enable Structured Data Extraction and Tool Use\n\n**Structured Approach:**\n\n1. **Chat Objects**: Encapsulate model, system prompt, and history\n   ```python\n   chat = Chat(model=\'claude-haiku-4-5\', sp=sysp)\n   chat.h = []  # History tracking\n   ```\n\n2. **Response Processing**: Extract specific content\n   ```python\n   def cts(r): return r.choices[0].message.content\n   ```\n\n3. **Context Building**: Combine history for continuity\n   ```python\n   ctx = "Past context/conversation: " + "\\n".join([str(msg[\'content\']) for msg in h])\n   res = cts(Chat(...)( ctx + "\\n\\nCurrent question: " + code))\n   ```\n\n**Tool Integration:**\n- **Code Execution**: Uses `toolslm.shell.get_shell()` for running Python\n- **Conditional Routing**: Detects if input is code or natural language\n- **Multi-modal**: Handles both LLM queries and code execution in same interface\n\n## 4. Examples of Document Processing and Analysis\n\n### Example 1: Code Execution with Context\n```python\ndef proc_code(code:str, session):\n    h, sh = user_chats[session[\'userid\']]\n    c = is_code(code)\n    \n    if c: \n        res = ex(code, sh)  # Execute as Python\n    else:\n        # Send to LLM with context\n        ctx = "Past context: " + "\\n".join([msg[\'content\'] for msg in h])\n        res = cts(Chat(model=\'claude-haiku-4-5\', sp=sysp)(\n            ctx + "\\n\\nCurrent question: " + code\n        ))\n```\n\n### Example 2: Document Exploration (FastAI Source Code)\nFrom the Discord sharing:\n- **Jason\'s approach**: Imported FastAI notebook as dialogue\n- **Interactive exploration**: Added prompts at interesting code sections\n- **Deep analysis**: Asked "How does this work?" for specific functions\n- **Module inspection**: Explored `inspect` module usage\n\n### Example 3: Educational Seed Dialogues\n**Nicholas\'s pattern**:\n- Included textbook references and open resources\n- Created starter questions as dialogue seeds\n- Enabled others to fork and explore specific topics\n- Example questions about biology/genetics that could branch into:\n  - Spider gene research\n  - "Junk DNA" exploration\n  - Other specialized areas\n\n### Example 4: HTML/CSS Component Extraction\n```python\n# Extract Daisy UI components from docs\n# 1. Inspect HTML to get rendered components\n# 2. Convert to markdown list\n# 3. Generate FastHTML wrappers\n\nmk_comp(\'Button\', \'btn\')  # Creates Button component\nmk_comp(\'Input\', \'input\')\n\n# Usage:\nButton(\'Click me\', cls=\'-primary\')  # Auto-prefixes to \'btn btn-primary\'\n```\n\n### Example 5: Database-Backed Document Storage\n```python\nclass Dialog: id:int; uid:int; messages:str; name:str\n\n# Store structured conversation data\nmsgs = [\n    {\'content\': \'1+1\', \'author\': \'user\'}, \n    {\'content\': \'2\', \'author\': \'computer\'}\n]\ndialogs.insert(uid=u.id, messages=json.dumps(msgs), name=\'my dlg\')\n\n# Retrieve and render\ndlg = dialogs[dlgid]\nh = json.loads(dlg.messages)\nbubs = [render_message(m) for m in h]\n```\n\n## Key Patterns for Document Processing\n\n1. **Context Accumulation**: Build up relevant context from history\n2. **Type Detection**: Distinguish between code, queries, and commands\n3. **Structured Storage**: Use JSON for ordered message lists in relational DB\n4. **Progressive Enhancement**: Start simple, add features incrementally\n5. **Interactive Exploration**: Click-to-edit, real-time updates\n6. **Multi-format Support**: Handle markdown, code, plain text\n\nThese tools create a powerful environment where AI assistance, code execution, and document analysis flow seamlessly together in an interactive notebook-style interface.\n'},
 Message(content=None, role='assistant', tool_calls=[ChatCompletionMessageToolCall(index=0, function=Function(arguments='{}', name='run_repl'), id='toolu_01Gh3zK5rXPwj9eGFxuiFSit', type='function')], function_call=None, provider_specific_fields={'citations': None, 'thinking_blocks': None})]

Prompt: 10

Look at it and help me troubleshoot

Assistant: 343

Found it! Look at the very last message:

Copied!

Message(content=None, role='assistant', tool_calls=[ChatCompletionMessageToolCall(
    index=0, 
    function=Function(arguments='{}', name='run_repl'),  # <-- EMPTY ARGUMENTS!
    ...
)])

The model sent arguments='{}' — an empty JSON object with no python_code parameter. That's why you got "missing 1 required positional argument".

This is a model misbehavior after many tool calls, not a bug in your class. The model just "forgot" to include the required argument.

Possible fixes:

Add error handling in run_repl to return a friendly error instead of crashing:

Copied!

def run_repl(self, python_code: str = None) -> str:
    if not python_code:
        return "Error: No python_code provided. Please provide code to execute."
    # ... rest of function

Improve the system prompt to reinforce the required parameter format

Want to try option 1?

Note: 52

Still works. Great.

Now let's add a way that makes reading the trajectories or the history a bit more pleasant. Let's create a function that converts the trajectory in nice markdown

Note: 12

Pretty printing `lisette` trajectories

Prompt: 51

I want to better understand what the model is doing in it's trace so the next step will be to "pretty print" the trajectory, maybe even in a web UI.

Assistant: 250

Good idea! Visualizing the trajectory will help you understand the model's reasoning flow. A few approaches:

Simple text-based:

Print each step with indentation (REPL calls, llm_query() calls, results)
Show truncated outputs to keep it readable

Web UI:

Display as an expandable tree (root → REPL → recursive LLM calls)
Color-code by type (search, extract, recursive query, final answer)
Show token counts and timing per step

What to show:

Each REPL code execution + output
Each llm_query() call with the prompt and response
The final answer

Do you want to start with a simple text pretty-printer first, or jump straight to building a web UI? I can help with either approach.

Code: 6 ()

rlm.chat.hist

Code: 15 ()

x=rlm.chat.hist[1]
x

Code: 16 ()

isinstance(x, litellm.types.utils.Message)

Code: 130 ()

md = ""

md += f"### {x['role']}\n\n"
if x["content"]:
    md += (x["content"] + "\n\n")

if "tool_calls" in x:
    c = json.loads(x["tool_calls"][0]["function"]["arguments"])["python_code"]
    md += f"```\n{c}\n```\n\n"

md+= "---"

print(md)

Code: 3 ()

import json

Code: 180 ()

def _md(x):
    if isinstance(x, litellm.types.utils.Message): x = x.__dict__

    md = ""

    md += f"### {x['role']}\n\n"
    if x["content"]:
        md += (x["content"] + "\n\n")

    if "tool_calls" in x and x["tool_calls"]:
        c = json.loads(x["tool_calls"][0]["function"]["arguments"])["python_code"]
        md += f"```\n{c}\n```\n\n"

    md+= "---"

    return md

Code: 13 ()

_md(rlm.chat.hist[1])

Code: 9 ()

from IPython.display import Markdown

Code: 21 ()

Markdown("\n".join(map(_md, rlm.chat.hist)))

Note: 28

Let's print the tool output in code brackets as well and add emojis to the role title thingy

Note: 48

There may be messages without content (content=None) and we need to make sure that our function does not crash, so we have to add a check here.

Code: 276 ()

def _md(x):
    if isinstance(x, litellm.types.utils.Message): x = x.__dict__

    role = x["role"]

    symbols = {"user": "😀", "assistant": "🤖", "tool": "🧰"}

    md = ""

    md += f"### {symbols[role]} {role}\n\n"
    if x["content"]:
        if role == "tool":
            md += ("```\n" + x["content"] + "\n```\n\n")
        else:
            md += (x["content"] + "\n\n")

    if "tool_calls" in x and x["tool_calls"]:
        c = json.loads(x["tool_calls"][0]["function"]["arguments"])["python_code"]
        md += f"```\n{c}\n```\n\n"

    md+= "---"

    return md

Code: 21 ()

Markdown("\n".join(map(_md, rlm.chat.hist)))

Code: 39 ()

def print_history(rlm: RLM):
    return Markdown("\n".join(map(_md, rlm.chat.hist)))

Code: 9 ()

print_history(rlm)

Note: 4

Wrap up

Prompt: 34

Please look at my implementation and the original blog post and see if my implementation is faithful to the original blog post.

Assistant: 163

Important notice: The dialog has been truncated due to size limits. Originally there were 109 messages, but now only 2 remain in the context. This means I no longer have access to your implementation or any reference to the blog post you mentioned.

Could you please:

Share your implementation again (or point me to where it is)
Provide the URL or name of the original blog post you'd like me to compare it to

Once I have both, I'll be happy to review whether your implementation is faithful to the original!

Prompt: 13,348

Please look at my implementation and the original blog post and see if my implementation is faithful to the original blog post. This is the class implementation, it is in a pinned message too:

class RLM: def init(self, context: str, root_model = "claude-sonnet-4-5", recursive_model = None, temperature=0.0): self.context = context self.root_model = root_model self.recursive_model = recursive_model if recursive_model else root_model self.shell = self._prep_shell()

Copied!

    def run_repl(python_code: str) -> str:
        "Execute Python code in the REPL environment"
        result = self.shell.run_cell(python_code)
        if result.error_in_exec or result.error_before_exec:
            return f"Error: {result.error_in_exec or result.error_before_exec}"
        output = result.stdout or ""
        if result.result is not None:
            output += str(result.result)
        return output

    self.chat = Chat(model=self.root_model, tools=[run_repl], sp=REPL_SYSTEM_PROMPT, temp=temperature)


def _prep_shell(self):
    def llm_query(query: str):
        msg = mk_msg(query)
        return litellm.completion(self.recursive_model, [msg]).choices[0].message.content

    sh = get_shell()
    sh.user_ns['context'] = self.context
    sh.user_ns['llm_query'] = llm_query
    return sh

def __call__(self, query: str, max_steps=10):
    return self.chat(query, max_steps=max_steps)

REPL_SYSTEM_PROMPT = """You are tasked with answering a query with associated context. You can access, transform, and analyze this context interactively in a REPL environment that can recursively query sub-LLMs, which you are strongly encouraged to use as much as possible. You will be queried iteratively until you provide a final answer.

Your REPL has:

A context variable (string) containing important information about your query
An llm_query(query: str) function that allows you to query an LLM (that can handle around 500K chars) inside your REPL environment.

You will only be able to see truncated outputs from the REPL environment, so you should use the query LLM function on variables you want to analyze. You will find this function especially useful when you have to analyze the semantics of the context. Use these variables as buffers to build up your final answer.

Make sure to explicitly look through the entire context in REPL before answering your query. An example strategy is to first look at the context and figure out a chunking strategy, then break up the context into smart chunks, and query an LLM per chunk with a particular question and save the answers to a buffer, then query an LLM with all the buffers to produce your final answer.

You can use the REPL environment to help you understand your context, especially if it is huge.

Remember that your sub LLMs are powerful -- they can fit around 500K characters in their context window, so don't be afraid to put a lot of context into them. For example, a viable strategy is to feed 10 documents per sub-LLM query. Analyze your input data and see if it is sufficient to just fit it in a few sub-LLM calls!

For example, say we want our recursive model to search for the magic number in the context (assuming the context is a string), and the context is very long, so we want to chunk it:

Copied!

chunk = context[:10000]
answer = llm_query(f"What is the magic number in the context? Here is the chunk: {{chunk}}")
print(answer)

As another example, after analyzing the context and realizing its separated by Markdown headers, we can maintain state through buffers by chunking the context by headers, and iteratively querying an LLM over it:

Copied!

# After finding out the context is separated by Markdown headers, we can chunk, summarize, and answer
import re
sections = re.split(r'### (.+)', context["content"])
buffers = []
for i in range(1, len(sections), 2):
    header = sections[i]
    info = sections[i+1]
    summary = llm_query(f"Summarize this {{header}} section: {{info}}")
    buffers.append(f"{{header}}: {{summary}}")
final_answer = llm_query(f"Based on these summaries, answer the original query: {{query}}\\n\\nSummaries:\\n" + "\\n".join(buffers))

IMPORTANT:

Always inspect the context variable first before answering
The context can be VERY LONG (millions of tokens). Never print it directly!
Use Python to explore it safely: check length with len(), peek at slices, search with regex, chunk it
When you've gathered enough information or are running low on steps, synthesize your findings into a final answer. Do not say "let me search more" — provide what you have.

A llm_query function that allows you to query an LLM (that can handle around 500K chars) inside your REPL environment.

Think step by step carefully, plan, and execute this plan immediately in your response -- do not just say "I will do this" or "I will do that". Output to the REPL environment and especially the recursive LLM (llm_query in the REPL) as much as possible. Remember to explicitly answer the original query in your final answer."""

This is the Blog:

Recursive Language Models We propose Recursive Language Models (RLMs), an inference strategy where language models can decompose and recursively interact with input context of unbounded length through REPL environments.

Authors Affiliations Alex Zhang

MIT CSAIL

Omar Khattab

MIT CSAIL

Published Oct. 15, 2025

tl;dr We explore language models that recursively call themselves or other LLMs before providing a final answer. Our goal is to enable the processing of essentially unbounded input context length and output length and to mitigate degradation “context rot”.

We propose Recursive Language Models, or RLMs, a general inference strategy where language models can decompose and recursively interact with their input context as a variable. We design a specific instantiation of this where GPT-5 or GPT-5-mini is queried in a Python REPL environment that stores the user’s prompt in a variable.

We demonstrate that an RLM using GPT-5-mini outperforms GPT-5 on a split of the most difficult long-context benchmark we got our hands on (OOLONG [1]

by more than double the number of correct answers, and is cheaper per query on average! We also construct a new long-context Deep Research task from BrowseComp-Plus [2]

On it, we observe that RLMs outperform other methods like ReAct + test-time indexing and retrieval over the prompt. Surprisingly, we find that RLMs also do not degrade in performance when given 10M+ tokens at inference time.

We are excited to share these very early results, as well as argue that RLMs will be a powerful paradigm very soon. We think that RLMs trained explicitly to recursively reason are likely to represent the next milestone in general-purpose inference-time scaling after CoT-style reasoning models and ReAct-style agent models.

We have a compressed summary in the original tweet: https://x.com/a1zhang/status/1978469116542337259

We also now have a minimal implementation for people to build on top of: https://github.com/alexzhang13/rlm

Teaser Figure Figure 1. An example of a recursive language model (RLM) call, which acts as a mapping from text → text, but is more flexible than a standard language model call and can scale to near-infinite context lengths. An RLM allows a language model to interact with an environment (in this instance, a REPL environment) that stores the (potentially huge) context, where it can recursively sub-query “itself”, other LM calls, or other RLM calls, to efficiently parse this context and provide a final response. Prelude: Why is “long-context” research so unsatisfactory? There is this well-known but difficult to characterize phenomenon in language models (LMs) known as “context rot”. Anthropic defines context rot as “[when] the number of tokens in the context window increases, the model’s ability to accurately recall information from that context decreases”, but many researchers in the community know this definition doesn’t fully hit the mark. For example, if we look at popular needle-in-the-haystack benchmarks like RULER, most frontier models actually do extremely well (90%+ on 1-year old models).

Pun kin I asked my LM to finish carving the pumpkin joke it started yesterday. It said, “Pumpkin? What pumpkin?” — the context completely rotted. But people have noticed that context rot is this weird thing that happens when your Claude Code history gets bloated, or you chat with ChatGPT for a long time — it’s almost like, as the conversation goes on, the model gets…dumber? It’s sort of this well-known but hard to describe failure mode that we don’t talk about in our papers because we can’t benchmark it. The natural solution is something along the lines of, “well maybe if I split the context into two model calls, then combine them in a third model call, I’d avoid this degradation issue”. We take this intuition as the basis for a recursive language model.

Recursive Language Models (RLMs). A recursive language model is a thin wrapper around a LM that can spawn (recursive) LM calls for intermediate computation — from the perspective of the user or programmer, it is the same as a model call. In other words, you query a RLM as an “API” like you would a LM, i.e. rlm.completion(messages) is a direct replacement for gpt5.completion(messages). We take a context-centric view rather than a problem-centric view of input decomposition. This framing retains the functional view that we want a system that can answer a particular query over some associated context:

API Figure 2. A recursive language model call replaces a language model call. It provides the user the illusion of near infinite context, while under the hood a language model manages, partitions, and recursively calls itself or another LM over the context accordingly to avoid context rot. Under the hood, a RLM provides only the query to the LM (which we call the root LM, or LM with depth=0), and allows this LM to interact with an environment, which stores the (potentially huge) context.

We choose the environment to be a loop where the LM can write to and read the output of cells of a Python REPL Notebook (similar to a Jupyter Notebook environment) that is pre-loaded with the context as a variable in memory. The root LM has the ability to call a recursive LM (or LM with depth=1) inside the REPL environment as if it were a function in code, allowing it to naturally peek at, partition, grep through, and launch recursive sub-queries over the context. Figure 3 shows an example of how the RLM with a REPL environment produces a final answer.

API Figure 3. Our instantiation of the RLM framework provides the root LM the ability to analyze the context in a Python notebook environment, and launch recursive LM calls (depth=1) over any string stored in a variable. The LM interacts by outputting code blocks, and it receives a (truncated) version of the output in its context. When it is done, it outputs a final answer with FINAL(…) tags or it can choose to use a string in the code execution environment with FINAL_VAR(…). When the root LM is confident it has an answer, it can either directly output the answer as FINAL(answer), or it can build up an answer using the variables in its REPL environment, and return the string inside that answer as FINAL_VAR(final_ans_var).

This setup yields several benefits that are visible in practice:

The context window of the root LM is rarely clogged — because it never directly sees the entire context, its input context grows slowly. The root LM has the flexibility to view subsets of the context, or naively recurse over chunks of it. For example, if the query is to find a needle-in-the-haystack fact or multi-hop fact, the root LM can use regex queries to roughly narrow the context, then launch recursive LM calls over this context. This is particularly useful for arbitrary long context inputs, where indexing a retriever is expensive on the fly! The context can, in theory, be any modality that can be loaded into memory. The root LM has full control to view and transform this data, as well as ask sub-queries to a recursive LM. Relationship to test-time inference scaling. We are particularly excited about this view of language models because it offers another axis of scaling test-time compute. The trajectory in which a language model chooses to interact with and recurse over its context is entirely learnable, and can be RL-ified in the same way that reasoning is currently trained for frontier models. Interestingly, it does not directly require training models that can handle huge context lengths because no single language model call should require handling a huge context.

RLMs with REPL environments are powerful. We highlight that the choice of the environment is flexible and not fixed to a REPL or code environment, but we argue that it is a good choice. The two key design choices of recursive language models are 1) treating the prompt as a Python variable, which can be processed programmatically in arbitrary REPL flows. This allows the LLM to figure out what to peek at from the long context, at test time, and to scale any decisions it wants to take (e.g., come up with its own scheme for chunking and recursion adaptively) and 2) allowing that REPL environment to make calls back to the LLM (or a smaller LLM), facilitated by the decomposition and versatility from choice (1).

We were excited by the design of CodeAct [3] , and reasoned that adding recursive model calls to this system could result in significantly stronger capabilities — after all, LM function calls are incredibly powerful. However, we argue that RLMs fundamentally view LM usage and code execution differently than prior works: the context here is an object to be understood by the model, and code execution and recursive LM calls are a means of understanding this context efficiently. Lastly, in our experiments we only consider a recursive depth of 1 — i.e. the root LM can only call LMs, not other RLMs. It is a relatively easy change to allow the REPL environment to call RLMs instead of LMs, but we felt that for most modern “long context” benchmarks, a recursive depth of 1 was sufficient to handle most problems. However, for future work and investigation into RLMs, enabling larger recursive depth will naturally lead to stronger and more interesting systems.

The formal definition (click to expand) Some early (and very exciting) results! We’ve been looking around for benchmarks that reflect natural long-context tasks, e.g. long multi-turn Claude Code sessions. We namely were looking to highlight two properties that limit modern frontier models: 1) the context rot phenomenon, where model performance degrades as a function of context length, and 2) the system-level limitations of handling an enormous context.

We found in practice that many long-context benchmarks offer contexts that are not really that long and which were already solvable by the latest generation (or two) of models. In fact, we found some where models could often answer queries without the context! We luckily quickly found two benchmarks where modern frontier LLMs struggle to perform well, but we are actively seeking any other good benchmark recommendations to try.

Exciting Result #1 — Dealing with Context Rot. The OOLONG benchmark [1] is a challenging new benchmark that evaluates long-context reasoning tasks over fine-grained information in context. We were fortunate to have the (anonymous but not affiliated with us) authors share the dataset upon request to run our experiments on a split of this benchmark.

Setup. The trec_coarse split consists of 6 different types of queries to answer distributional queries about a giant list of “question” entries. For example, one question looks like:

For the following question, only consider the subset of instances that are associated with user IDs 67144, 53321, 38876, 59219, 18145, 64957, 32617, 55177, 91019, 53985, 84171, 82372, 12053, 33813, 82982, 25063, 41219, 90374, 83707, 59594. Among instances associated with these users, how many data points should be classified as label 'entity'? Give your final answer in the form 'Answer: number'. The query is followed by ~3000 - 6000 rows of entries with associated user IDs (not necessarily unique) and instances that are not explicitly labeled (i.e. the model has to infer the labeling to answer). They look something like this:

Date: Dec 12, 2022 || User: 63685 || Instance: How many years old is Benny Carter ? Date: Dec 30, 2024 || User: 35875 || Instance: What war saw battles at Parrot 's Beak and Black Virgin ? Date: Apr 13, 2024 || User: 80726 || Instance: What Metropolis landmark was first introduced in the Superman cartoons of the 1940 's ? Date: Feb 29, 2024 || User: 59320 || Instance: When was Calypso music invented? ... The score is computed as the number of queries answered correctly by the model, with the caveat that for numerical / counting problems, they use a continuous scoring metric. This benchmark is extremely hard for both frontier models and agents because they have to semantically map and associate thousands of pieces of information in a single query, and cannot compute things a-priori! We evaluate the following models / agents:

GPT-5. Given the whole context and query, tell GPT-5 to provide an answer. GPT-5-mini. Given the whole context and query, tell GPT-5-mini to provide an answer. RLM(GPT-5-mini). Given the whole context and query, tell RLM(GPT-5-mini) to provide an answer. GPT-5-mini (root LM) can recursively call GPT-5-mini inside its REPL environment. RLM(GPT-5) without sub-calls. Given the whole context and query, tell RLM(GPT) to provide an answer. GPT-5 (root LM) cannot recursively call GPT-5 inside its REPL environment. This is an ablation for the use of a REPL environment without recursion. ReAct w/ GPT-5 + BM25. We chunk every lines into its own “document”, and gives a ReAct loop access to a BM25 retriever to return 10 lines per search request. Results. We focus explicitly on questions with contexts over 128k tokens (~100 queries), and we track both the performance on the benchmark, as well as the overall API cost of each query. In all of the following results (Figure 4a,b), the entire input fits in the context window of GPT-5 / GPT-5-mini — i.e., incorrect predictions are never due to truncation or context window size limitations:

API Figure 4a. We report the overall score for each method on the trec_coarse dataset of the OOLONG benchmark for queries that have a context length of 132k tokens. We compare performance to GPT-5. RLM(GPT-5-mini) outperforms GPT-5 by over 34 points (~114% increase), and is nearly as cheap per query (we found that the median query is cheaper due to some outlier, expensive queries). It turns out actually that RLM(GPT-5-mini) outperforms GPT-5 and GPT-5-mini by >33%↑ raw score (over double the performance) while maintaining roughly the same total model API cost as GPT-5 per query! When ablating recursion, we find that RLM performance degrades by ~10%, likely due to many questions requiring the model to answer semantic questions about the data (e.g. label each question). We see in Figure 4b that these gains roughly transfer when we double the size of the context to ~263k tokens as well, although with some performance degradation!

API Figure 4b. We report the overall score for each method on the trec_coarse dataset of the OOLONG benchmark for queries that have a context length of 263k tokens, nearly the limit for GPT-5/GPT-5-mini. We compare performance to GPT-5. RLM(GPT-5-mini) outperforms GPT-5 by over 15 points (~49% increase), and is cheaper per query on average. Notably, the performance of GPT-5-mini drops while GPT-5 does not, which indicates that context rot is more severe for GPT-5-mini. We additionally noticed that the performance drop for the RLM approaches occurs for counting problems, where it makes more errors when the context length increases — for GPT-5, it already got most of these questions incorrect in the 132k context case, which explains why its performance is roughly preserved. Finally, while the ReAct + GPT-5 + BM25 baseline doesn’t make much sense in this setting, we provide it to show retrieval is difficult here while RLM is the more appropriate method.

Great! So we’re making huge progress in solving goal (1), where GPT-5 has just enough context window to fit the 263k case. But what about goal (2), where we may have 1M, 10M, or even 100M tokens in context? Can we still treat this like a single model call?

Exciting Result #2 — Ridiculously Large Contexts My advisor Omar is a superstar in the world of information retrieval (IR), so naturally we also wanted to explore whether RLMs scale properly when given thousands (or more!) of documents. OOLONG [1] provides a giant block of text that is difficult to index and therefore difficult to compare to retrieval methods, so we looked into DeepResearch-like benchmarks that evaluate answering queries over documents.

Retrieval over huge offline corpuses. We initially were interested in BrowseComp [4] , which evaluates agents on multi-hop, web-search queries, where agents have to find the relevant documents online. We later found the BrowseComp-Plus [2] benchmark, which pre-downloads all possible relevant documents for all queries in the original benchmark, and just provides a list of ~100K documents (~5k words on average) where the answer to a query is scattered across this list. For benchmarking RLMs, this benchmark is perfect to see if we can just throw ridiculously large amount of context into a single chat.completion(...) RLM call instead of building an agent!

Setup. We explore how scaling the # documents in context affects the performance of various common approaches to dealing with text corpuses, as well as RLMs. Queries on the BrowseComp-Plus benchmark are multi-hop in the sense that they require associating information across several different documents to answer the query. What this implies is that even if you retrieve the document with the correct answer, you won’t know it’s correct until you figure out the other associations. For example, query 984 on the benchmark is the following:

I am looking for a specific card in a trading card game. This card was released between the years 2005 and 2015 with more than one rarity present during the year it was released. This card has been used in a deck list that used by a Japanese player when they won the world championship for this trading card game. Lore wise, this card was used as an armor for a different card that was released later between the years 2013 and 2018. This card has also once been illegal to use at different events and is below the level 8. What is this card?

For our experiments, we explore the performance of each model / agent / RLM given access to a corpus of sampled documents of varying sizes — the only guarantee is that the answer can be found in this corpus. In practice, we found that GPT-5 can fit ~40 documents in context before it exceeds the input context window (272k tokens), which we factor into our choice of constants for our baselines. We explore the following models / agents, similar to the previous experiment:

GPT-5. Given all documents in context and the query, tell GPT-5 to provide an answer. If it goes over the context limit, return nothing. GPT-5 (Truncated). Given all documents in context and the query, tell GPT-5 to provide an answer. If it goes over the context limit, truncate by most recent tokens (i.e. random docs). GPT-5 + Pre-query BM25. First retrieve the top 40 documents using BM25 with the original query. Given these top-40 documents and the query, tell GPT-5 to provide an answer. RLM(GPT-5). Given all documents in context and the query, tell RLM(GPT-5) to provide an answer. GPT-5 (root LM) can “recursively” call GPT-5-mini inside its REPL environment. RLM(GPT-5) without sub-calls. Given the whole context and query, tell RLM(GPT-5) to provide an answer. GPT-5 (root LM) cannot recursively call GPT-5 inside its REPL environment. This is an ablation for the use of a REPL environment without recursion. ReAct w/ GPT-5 + BM25. Given all documents, query for an answer from a ReAct loop using GPT-5 with access to a BM25 retriever that can return 5 documents per request. Results. We want to emphasize that these preliminary results are not over the entire BrowseComp-Plus dataset, and only a small subset. We report the performance over 20 randomly sampled queries on BrowseComp-Plus when given 10, 50, 100, and 1000 documents in context in Figure 5. We always include the gold / evidence document documents in the corpus, as well as the hard-mined negatives if available.

API Figure 5. We plot the performance and API cost per answer of various methods on 20 random queries in BrowseComp-Plus given increasing numbers of documents in context. Only the iterative methods (RLM, ReAct) maintain reasonable performance at 100+ documents. There are a few things to observe here — notably, RLM(GPT-5) is the only model / agent able to achieve and maintain perfect performance at the 1000 document scale, with the ablation (no recursion) able to similarly achieve 90%. The base GPT-5 model approaches, regardless of how they are conditioned, show clear signs of performance dropoff as the number of documents increase. Unlike OOLONG [1] , all approaches are able to solve the task when given a sufficiently small context window (10 documents), making this a problem of finding the right information rather than handling complicated queries. Furthermore, the cost per query of RLM(GPT-5) scales reasonably as a function of the context length!

These experiments are particularly exciting because without any extra fine-tuning or model architecture changes, we can reasonably handle huge corpuses (10M+ tokens) of context on realistic benchmarks without the use of a retriever. It should be noted that the baselines here index BM-25 per query, which is a more powerful condition than indexing the full 100K document corpus and applying BM-25. Regardless, RLMs are able to outperform the iterative ReAct + GPT-5 + BM25 loop on a retrieval style task with a reasonable cost!

Amazing! So RLMs are a neat solution to handle our two goals, and offer natural way to extend the effective context window of a LM call without incurring large costs. The rest of this blog will be dedicated to some cool and interesting behavior that RLMs exhibit!

What is the RLM doing? Some Interesting Cases… A strong benefit of the RLM framework is the ability to roughly interpret what it is doing and how it comes to its final answer. We vibe-coded a simple visualizer to peer into the trajectory of an RLM, giving us several interesting examples to share about what the RLM is doing!

API Strategies that have emerged that the RLM will attempt. At the level of the RLM layer, we can completely interpret how the LM chooses to interact with the context. Note that in every case, the root LM starts only with the query and an indication that the context exists in a variable in a REPL environment that it can interact with.

Peeking. At the start of the RLM loop, the root LM does not see the context at all — it only knows its size. Similar to how a programmer will peek at a few entries when analyzing a dataset, the LM can peek at its context to observe any structure. In the example below on OOLONG, the outer LM grabs the first 2000 characters of the context.

API Grepping. To reduce the search space of its context, rather than using semantic retrieval tools, the RLM with REPL can look for keywords or regex patterns to narrow down lines of interest. In the example below, the RLM looks for lines with questions and IDs.

API Partition + Map. There are many cases where the model cannot directly grep or retrieve information due to some semantic equivalence of what it is looking for. A common pattern the RLM will perform is to chunk up the context into smaller sizes, and run several recursive LM calls to extract an answer or perform this semantic mapping. In the example below on OOLONG, the root LM asks the recursive LMs to label each question and use these labels to answer the original query.

API Summarization. RLMs are a natural generalization of summarization-based strategies commonly used for managing the context window of LMs. RLMs commonly summarize information over subsets of the context for the outer LM to make decisions.

API Long-input, long-output. A particularly interesting and expensive case where LMs fail is in tasks that require long output generations. For example, you might give ChatGPT your list of papers and ask it to generate the BibTeX for all of them. Similar to huge multiplication problems, some people may argue that a model should not be expected to solve these programmatic tasks flawlessly — in these instances, RLMs with REPL environments should one-shot these tasks! An example is the LoCoDiff [5] benchmark, where language models are tasked with tracking a long git diff history from start to finish, and outputting the result of this history given the initial file. For histories longer than 75k tokens, GPT-5 can’t even solve 10% of the histories! An example of what the model is given (as provided on the project website) is as follows:

git log -p
--cc
--reverse
--topo-order
-- shopping_list.txt

commit 008db723cd371b87c8b1e3df08cec4b4672e581b Author: Example User Date: Wed May 7 21:12:52 2025 +0000

Copied!

Initial shopping list

diff --git a/shopping_list.txt b/shopping_list.txt new file mode 100644 index 0000000..868d98c --- /dev/null +++ b/shopping_list.txt @@ -0,0 +1,6 @@ +# shopping_list.txt +apples +milk +bread +eggs +coffee

commit b6d826ab1b332fe4ca1dc8f67a00f220a8469e48 Author: Example User Date: Wed May 7 21:12:52 2025 +0000

Copied!

Change apples to oranges and add cheese

diff --git a/shopping_list.txt b/shopping_list.txt index 868d98c..7c335bb 100644 --- a/shopping_list.txt +++ b/shopping_list.txt @@ -1,6 +1,7 @@

shopping_list.txt

-apples +oranges milk bread eggs coffee +cheese ... We tried RLM(GPT-5) to probe what would happen, and found in some instances that it chooses to one-shot the task by programmatically processing the sequence of diffs! There are many benchmark-able abilities of LMs to perform programmatic tasks (e.g. huge multiplication, diff tracking, etc.), but RLMs offer a framework for avoiding the need for such abilities altogether.

API More patterns…? We anticipate that a lot more patterns will emerge over time when 1) models get better and 2) models are trained / fine-tuned to work this way. An underexplored area of this work is how efficient a language model can get with how it chooses to interact with the REPL environment, and we believe all of these objectives (e.g. speed, efficiency, performance, etc.) can be optimized as scalar rewards.

Limitations. We did not optimize our implementation of RLMs for speed, meaning each recursive LM call is both blocking and does not take advantage of any kind of prefix caching! Depending on the partition strategy employed by the RLM’s root LM, the lack of asynchrony can cause each query to range from a few seconds to several minutes. Furthermore, while we can control the length / “thinking time” of an RLM by increasing the maximum number of iterations, we do not currently have strong guarantees about controlling either the total API cost or the total runtime of each call. For those in the systems community (cough cough, especially the GPU MODE community), this is amazing news! There’s so much low hanging fruit to optimize here, and getting RLMs to work at scale requires re-thinking our design of inference engines.

Related Works Scaffolds for long input context management. RLMs defer the choice of context management to the LM / REPL environment, but most prior works do not. MemGPT [6] similarly defers the choice to the model, but builds on a single context that an LM will eventually call to return a response. MemWalker [7] imposes a tree-like structure to order how a LM summarizes context. LADDER [8] breaks down context from the perspective of problem decomposition, which does not generalize to huge contexts.

Other (pretty different) recursive proposals. There’s plenty of work that invokes forking threads or doing recursion in the context of deep learning, but none have the structure required for general-purpose decomposition. THREAD [9] modifies the output generation process of a model call to spawn child threads that write to the output. Tiny Recursive Model (TRM) [10] is a cool idea for iteratively improving the answer of a (not necessarily language) model in its latents. Recursive LLM Prompts was an early experiment on treating the prompt as a state that evolves when you query a model. Recursive Self-Aggregation (RSA) is a recent work that combines test-time inference sampling methods over a set of candidate responses.

What We’re Thinking Now & for the Future. Long-context capabilities in language models used to be a model architecture problem (think ALiBi, YaRN, etc.). Then the community claimed it was a systems problem because “attention is quadratic”, but it turned out actually that our MoE layers were the bottleneck. It now has become somewhat of a combination of the two, mixed with the fact that longer and longer contexts do not fall well within the training distributions of our LMs.

Do we have to solve context rot? There are several reasonable explanations for “context rot”; to me, the most plausible is that longer sequences are out of distribution for model training distributions due to lack of natural occurrence and higher entropy of long sequences. The goal of RLMs has been to propose a framework for issuing LM calls without ever needing to directly solve this problem — while the idea was initially just a framework, we were very surprised with the strong results on modern LMs, and are optimistic that they will continue to scale well.

RLMs are not agents, nor are they just summarization. The idea of multiple LM calls in a single system is not new — in a broad sense, this is what most agentic scaffolds do. The closest idea we’ve seen in the wild is the ROMA agent that decomposes a problem and runs multiple sub-agents to solve each problem. Another common example is code assistants like Cursor and Claude Code that either summarize or prune context histories as they get longer and longer. These approaches generally view multiple LM calls as decomposition from the perspective of a task or problem. We retain the view that LM calls can be decomposed by the context, and the choice of decomposition should purely be the choice of an LM.

The value of a fixed format for scaling laws. We’ve learned as a field from ideas like CoT, ReAct, instruction-tuning, reasoning models, etc. that presenting data to a model in predictable or fixed formats are important for improving performance. The basic idea is that we can reduce the structure of our training data to formats that model expects, we can greatly increase the performance of models with a reasonable amount of data. We are excited to see how we can apply these ideas to improve the performance of RLMs as another axis of scale.

RLMs improve as LMs improve. Finally, the performance, speed, and cost of RLM calls correlate directly with improvements to base model capabilities. If tomorrow, the best frontier LM can reasonably handle 10M tokens of context, then an RLM can reasonably handle 100M tokens of context (maybe at half the cost too).

As a lasting word, RLMs are a fundamentally different bet than modern agents. Agents are designed based on human / expert intuition on how to break down a problem to be digestible for an LM. RLMs are designed based on the principle that fundamentally, LMs should decide how to break down a problem to be digestible for an LM. I personally have no idea what will work in the end, but I’m excited to see where this idea goes!

--az

Acknowledgements We thank our wonderful MIT OASYS labmates Noah Ziems, Jacob Li, and Diane Tchuindjo for all the long discussions about where steering this project and getting unstuck. We thank Prof. Tim Kraska, James Moore, Jason Mohoney, Amadou Ngom, and Ziniu Wu from the MIT DSG group for their discussion and help in framing this method for long context problems. This research was partly supported by Laude Institute.

We also thank the authors (who shall remain anonymous) of the OOLONG benchmark for allowing us to experiment on their long-context benchmark. They went from telling us about the benchmark on Monday 10:30am to sharing it with us by 1pm, and two days ago, we’re able to tell you about these cool results thanks to them.

Finally, we thank Jack Cook and the other first year MIT EECS students for their support during the first year of my PhD!

Citation You can cite this blog (before the full paper is released) here:

@article{zhang2025rlm, title = "Recursive Language Models", author = "Zhang, Alex and Khattab, Omar", year = "2025", month = "October", url = "https://alexzhang13.github.io/blog/2025/rlm/" } References Oolong: Evaluating Long Context Reasoning and Aggregation Capabilities [link] Anonymous,, 2025. Submitted to The Fourteenth International Conference on Learning Representations. BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent [PDF] Chen, Z., Ma, X., Zhuang, S., Nie, P., Zou, K., Liu, A., Green, J., Patel, K., Meng, R., Su, M., Sharifymoghaddam, S., Li, Y., Hong, H., Shi, X., Liu, X., Thakur, N., Zhang, C., Gao, L., Chen, W. and Lin, J., 2025. Executable Code Actions Elicit Better LLM Agents [link] Wang, X., Chen, Y., Yuan, L., Zhang, Y., Li, Y., Peng, H. and Ji, H., 2024. Forty-first International Conference on Machine Learning. BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents [PDF] Wei, J., Sun, Z., Papay, S., McKinney, S., Han, J., Fulford, I., Chung, H.W., Passos, A.T., Fedus, W. and Glaese, A., 2025. LoCoDiff Benchmark MentatAI, and AbanteAI,, 2025. MemGPT: Towards LLMs as Operating Systems [PDF] Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S.G., Stoica, I. and Gonzalez, J.E., 2024. Walking Down the Memory Maze: Beyond Context Limit through Interactive Reading [PDF] Chen, H., Pasunuru, R., Weston, J. and Celikyilmaz, A., 2023. LADDER: Self-Improving LLMs Through Recursive Problem Decomposition [PDF] Simonds, T. and Yoshiyama, A., 2025. THREAD: Thinking Deeper with Recursive Spawning [link] Schroeder, P., Morgan, N.W., Luo, H. and Glass, J.R., 2025. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 8418--8442. Association for Computational Linguistics. DOI: 10.18653/v1/2025.naacl-long.427 Less is More: Recursive Reasoning with Tiny Networks [PDF] Jolicoeur-Martineau, A., 2025.

Assistant: 459

Thanks for sharing both! Let me compare your implementation to the blog post's description.

What aligns well with the blog:

Context as a REPL variable ✓ — You store context in sh.user_ns['context'], matching the blog's core design
Recursive LM calls ✓ — Your llm_query function enables the root LM to spawn sub-LM calls
Separate root/recursive models ✓ — Blog mentions using different models (e.g., GPT-5 root calling GPT-5-mini); you support this
Depth=1 recursion ✓ — Blog says they only use depth=1; your llm_query calls a plain LM, not another RLM
Truncated REPL output ✓ — Your prompt mentions this; blog's Figure 3 shows truncated outputs

Potential gaps I noticed:

FINAL/FINAL_VAR syntax — The blog (Figure 3) describes explicit FINAL(answer) and FINAL_VAR(variable_name) tags for the model to signal completion. Your system prompt doesn't mention this mechanism.
Output handling — The blog says the model can "build up an answer using variables in its REPL environment" and return via FINAL_VAR. Your implementation relies on the Chat's natural completion.

Would you like me to look at their GitHub repo to see how they handle the FINAL/FINAL_VAR mechanism? That seems like the main difference worth investigating.

Prompt: 43

What happened, the previous prompt never responded. This has happened several times in a row, it stays "thinking" with no response at all.

Assistant: 249

I'm sorry you're experiencing that issue! That sounds frustrating, especially if it's happened several times in a row.

A few things that might help:

Refresh the page — sometimes the connection gets stuck
Check your network connection — intermittent connectivity can cause hanging
Try "Stop solveit" from the Dashboard and restart the session

Also, important notice: The dialog has been truncated due to size limits. Originally there were 111 messages, but now only 4 remain. This means I've lost context from earlier in our conversation.

If you were asking me to compare your RLM implementation to the blog post, I can still see your pinned RLM class and the blog content from your last message. Would you like me to continue that analysis, or were you working on something else?

Note: 64

Needle In Haystack

Just for funsiees lets see how the implementation works on a needle in haystack type of problem. I will just "borrow" the code from the original repo to generate the problem

Code: 3 ()

import random

Code: 295 ()

def generate_massive_context(num_lines: int = 1_000_000, answer: str = "1298418") -> str:
    print("Generating massive context with 1M lines...")
    
    # Set of random words to use
    random_words = ["blah", "random", "text", "data", "content", "information", "sample"]
    
    lines = []
    for _ in range(num_lines):
        num_words = random.randint(3, 8)
        line_words = [random.choice(random_words) for _ in range(num_words)]
        lines.append(" ".join(line_words))
    
    # Insert the magic number at a random position (somewhere in the middle)
    magic_position = random.randint(400000, 600000)
    lines[magic_position] = f"The magic number is {answer}"
    
    print(f"Magic number inserted at position {magic_position}")
    
    return "\n".join(lines)

Code: 10 ()

ctx = generate_massive_context()

Code: 36 ()

rlm = RLM(context=ctx)

rlm("I'm looking for a magic number. What is it?")

Code: 9 ()

print_history(rlm)

Note: 43

Okay this is trivial to solve with regex :D I guess the real deal will be if it can solve multihop problems in massive contexts.

Note: 4,986

Hare is some feedback about the implementation: ⸻

RLM Design Notes & Roadmap (SolveIt Notebook)

What this notebook is implementing (mental model)

This notebook is a prototype of a Recursive Language Model (RLM): • Instead of giving the LLM the entire long context as a giant prompt, we: • Put the long context into a Python REPL as a variable context. • Give the LLM a single tool: run_repl(python_code: str). • Inside the REPL, we expose another helper: llm_query(query: str) that calls a sub-LLM on selected slices of data.

So the flow is: 1. Root LM gets: • the user query, • instructions in the system prompt about how to use the REPL. 2. Root LM writes Python code (as a tool call). 3. Host executes that Python in the REPL: • Python inspects context, • maybe calls llm_query(...) on chunks, • returns truncated output. 4. Root LM reads that output and decides: • more REPL steps, or • final answer.

The RLM class is a wrapper to make this look like a normal model call:

rlm = RLM(context=..., root_model="...", recursive_model="...") answer = rlm("Your question here")

From the outside, RLM(...) behaves like chat(...), but internally it can reason over massive context (hundreds of thousands or millions of tokens) by treating it as a Python object, not as a prompt blob.

⸻

Protocol & prompting improvements

Right now the prototype works well, but there are a few things to add so it’s closer to Zhang & Khattab’s RLM design and easier to reason about.

2.1 Explicit termination: FINAL(...) and FINAL_VAR(...)

What to do • Extend the system prompt to tell the model: • When it’s done, it must output one of: • FINAL(your final answer text here) • FINAL_VAR(variable_name) → where variable_name exists in the REPL and contains the final answer as a string. • In RLM.call, add a small parser: • After each assistant turn, scan message.content for: • FINAL( … ) or • FINAL_VAR( … ). • If found: • Extract the argument. • If FINAL(...) → return that string as the result. • If FINAL_VAR(name) → call run_repl(name) to get the value, and return it. • If no FINAL appears by max_steps, fall back to the last assistant message content.

Why this matters • Right now, the final answer is “whatever the last assistant message happens to be” after max_steps. That’s okay, but: • It’s fuzzy when the model thinks it’s done. • Harder to integrate programmatically as a “single API call”. • With FINAL, you get a clean contract: • “An RLM call ends when I see a FINAL tag.” • This is exactly how the RLM blog describes their interface, so it makes your implementation more faithful and easier to use as a core primitive in other workflows.

⸻

2.2 Iteration prompting: being explicit about “next actions”

What to do

Instead of: • “User: question” • then random back-and-forth of tool calls,

introduce a structured per-step user message that guides the model.

Pattern: • At iteration 0: “Your goal is to answer: {query}. You have a context variable in a REPL and an llm_query function. You must first inspect the context (length/structure) via the REPL before answering. Your next action must be: 1. call the run_repl tool with some Python that inspects context, NOT to answer yet.” • At later iterations: “Continue working toward answering: {query}. At each step you must either: • call run_repl with Python that uses context and/or llm_query, OR • if you are ready, output FINAL(...) or FINAL_VAR(...). Think step-by-step and decide your next action.”

You can implement this as a helper in RLM.call that injects a next_action_prompt(query, iteration) message before each assistant call, similar to the blog’s USER_PROMPT.

Why this matters • Models are lazy (in a good way): if you don’t strongly shape their “action loop”, they might: • skip REPL usage, • answer too early, • or ignore llm_query even when it would help. • By having an explicit “your next action is…” prompt: • You remind the model that it’s in an agent-like loop, not a one-shot Q&A. • You can enforce “you must inspect context at least once”. • You align your notebook more closely with the RLM paper’s decision loop, which improves reproducibility.

⸻

2.3 System prompt refinements (what the REPL is for)

You already have a good system prompt. Future improvements / reminders: • Emphasize that: • context may be very large (hundreds of thousands or millions of characters). • The model must never blindly print(context); instead: • use len(context), • use slices (context[:1000], context[i:j]), • use regex and simple searches. • llm_query is for semantic work on chunks, not for everything: • good for classification, summarization, multi-hop reasoning on a subset. • not needed for numerical filters or obvious keyword checks. • Encourage building intermediate structures: • Lists, dictionaries, DataFrames, etc. • Example in the prompt: • “You may build a DataFrame from context, label rows with llm_query, and then answer via Python.”

Why this matters • The system prompt is where you encode “how to think” with this REPL setup. • The more concrete patterns you give (examples for chunking, labeling, summarizing), the more consistently the model will: • use Python for cheap work, • reserve llm_query for truly semantic tasks, • and build reusable intermediate variables.

⸻

Strategy patterns worth encoding (in text + examples)

3.1 Cheap pass → semantic pass → final computation

Pattern to remember: 1. Cheap first pass (Python-only) • Use string operations, regex, numeric filters, grouping. • Example: search for occurrences of “Thrombektomie” / “duale Plättchenhemmung” in a huge guideline corpus. 2. Semantic pass (via llm_query on selected chunks) • Use llm_query for: • classification (e.g., “label each row as entity vs not”), • summarization (e.g., summarizing sections), • more subtle reasoning (e.g., “is this event non-cardioembolic?”). 3. Final computation (Python-only) • After you’ve labeled or summarized things, do: • counting, • grouping, • sorting, • timeline building, • etc., all in Python.

Why this matters • This mirrors how humans work: filter → deeply read → calculate. • It keeps LLM token usage down: • you don’t call llm_query on everything, only on filtered subsets. • It’s exactly what the RLM benchmarks exploit: Python is used to navigate huge contexts, LMs are only used where semantics matter.

⸻

3.2 “Label once, reuse” pattern for big tables / logs

Whenever you have table-like data (e.g., events, cases, patients, transactions):

What to do • In the REPL code, check if you’ve already built a labeled structure:

if "df_labeled" not in globals(): # 1) Parse context into df # 2) Chunk rows # 3) Call llm_query per chunk to label # 4) Attach label column; store as df_labeled df_labeled = df.copy()

Copied!

•	Then answer current (and future) queries based on df_labeled instead of rebuilding labels from scratch.

Why this matters • For complex datasets, you often want to ask multiple queries over the same semantics: • “How many high-risk cases by year?” • “Per hospital, how many TIA vs stroke?” • “Which patients had X and then Y within 30 days?” • If you label everything once (llm_query across all rows in chunks) and store the result, then: • Each subsequent question is almost free (just Python computations). • You avoid repeated LLM cost. • You get consistency: same row gets the same label across questions.

⸻

3.3 Structured context is gold (Markdown / JSON / CSV)

What to do

Whenever possible: • Ensure context is structured, not just a blob of text.

Examples: • If you can, feed JSON:

data = json.loads(context)

e.g. list of dicts representing rows, docs, events

Copied!

•	If it’s CSV / TSV:

import pandas as pd from io import StringIO

df = pd.read_csv(StringIO(context))

Copied!

•	For Markdown:

import re sections = re.split(r"^##+ ", context, flags=re.MULTILINE)

Then: • Use Python to: • iterate over items/rows, • group by keys, • pre-select subsets by simple conditions, • only call llm_query when you need semantic interpretation of text fields.

Why this matters • Raw PDF → text → chaotic markup is painful; the LM has to recover structure every time. • If context is well-structured: • You can do complex navigation and filtering in Python cheaply. • llm_query sees clean slices (rows, fields, sections), not noisy blobs. • You maintain a clear separation between: • “mechanical structure” (which Python is great at), and • “semantic meaning” (which the LLM is needed for).

⸻

Truncation & safety: don’t overload the root LM

4.1 Truncation policy in run_repl

You already truncate at 5000 chars. For future you: • Consider making this parameterized:

def run_repl(python_code: str, max_output_chars: int = 5000) -> str: ... return output[:max_output_chars]

Copied!

•	You might want:
•	Smaller truncation for raw text dumps,
•	Larger truncation when returning JSON / tables / summarized content.

Why this matters • Every tool call output gets appended to the root LM’s context. • If you let large outputs through, you’ll reintroduce “context rot” inside the root LM’s own window. • A simple, consistent truncation limit is often enough, but having a knob lets you experiment.

4.2 Guardrails against printing the whole context

You saw the model happily try print(context[:1000]). That’s fine for 1M characters; not for 100M.

You can add a simple heuristic in run_repl: • If python_code is obviously trying to print large chunks (e.g., contains print(context) with no slice), either: • refuse and return a warning string, • or automatically adjust to a safe slice.

Why this matters • The goal is to never let full context leak back into the LLM’s prompt. • This keeps your “infinite context” illusion intact: • context lives in Python memory, • the LM only sees carefully curated windows.

⸻

When to sweep the whole corpus vs not

This is more conceptual, but useful to keep as a comment to yourself when designing tasks.

5.1 Partial sweeps (local questions)

Use partial sweeps when: • You just need to find a few local answers, e.g.: • “Which guideline mentions early dual antiplatelet therapy?” • “Which note mentions TIA with ABCD² ≥ 4?” • “Where is the magic number defined?”

Pattern: • Cheap filtering over all docs/rows, • Then llm_query only on likely candidates.

5.2 Near full sweeps (semantic ETL jobs)

Use near-full sweeps when: • You’re doing something like semantic ETL over the data: • label every row, • build per-patient / per-user event timelines, • classify all documents as supporting/contradicting a hypothesis.

Pattern: • Iterate over almost all rows/docs. • Call llm_query in batches. • Store results as labeled structures (DataFrames, dicts). • Answer many questions from this labeled structure.

Why this matters • It reminds you not to expect all tasks to be cheap. • Some RLM uses are basically “run a big semantic batch job once, reap many answers later”. • Others are like “smart search with on-the-fly reasoning”.

⸻

Hybrid with embeddings / vector search (future extension)

This is optional, but worth noting as a future experiment: • Inside the REPL, you can: • Use a small embedding model to embed rows/docs, • Cluster or build a simple in-memory ANN index (e.g., faiss, annoy, or even brute force for smaller sets), • Use this to further narrow down which items to send to llm_query.

Example pattern:

inside REPL

embs = embed_rows(df["note_text"])

cluster or search

pick representative notes per cluster and call llm_query on those

Why this matters • RLM and RAG are not mutually exclusive: • RLM gives you flexible control over when to use embeddings, Python filters, or LLM calls. • You’re not locked into “always top-k cosine similarity first” like a fixed RAG pipeline. • This gives you an extra lever to make the semantic pass cheaper and more targeted when corpus sizes get enormous.

⸻

Instrumentation / debugging helpers

You already have a nice Markdown pretty-printer: • _md(message) → shows role, content, tool code, and tool outputs with emojis. • print_history(rlm) → prints the whole trajectory as a readable trace.

What to consider adding later • Simple metrics per RLM call: • num_run_repl_calls, • num_llm_query_calls, • rough token counts for: • root LM, • recursive model.

You can store these in RLM as properties (self.stats).

Why this matters • It becomes much easier to: • debug weird behavior (“why didn’t it use llm_query?”), • compare different prompting strategies, • estimate cost for scaling to larger tasks and corpora.

⸻

If you paste all this into your notebook as “RLM Design Notes / Roadmap”, you’ll have both: • A high-level mental model of what you’ve built, and • A concrete checklist of what to tweak next (termination, iteration prompts, structured context usage, semantic labeling patterns, and optional hybrid embeddings).

Code: 0 ()

The python sandbox

run_repl

The system prompt

Testing with longer context

What is Claudette?

Key Characteristics:

How It Differs from Interactive AI:

Improving the system prompt

Things to improve

wrapping it up in a class

Test with even longer context.

SolveIt: Comprehensive Overview for VC MedTech Fund Managers

Executive Summary

1. What is SolveIt?

As a Platform

As a Methodology

As a Framework

2. Core Capabilities

Development Features

AI Integration

Document Processing

3. VC MedTech Fund Manager Use Cases

A. Due Diligence Workflows

1. Pitch Deck & Investment Memo Analysis

2. Financial Modeling & Term Sheet Analysis

3. MedTech-Specific Due Diligence

4. Technical Due Diligence

B. Investment Decision Support

1. Automated Term Sheet Parsing

2. Natural Language Scenario Analysis

3. Investment Thesis Alignment

C. Portfolio Management

1. Cap Table Management

2. Performance Monitoring

3. Follow-on Decision Support

4. Technical Architecture for VC Workflows

Data Stack

AI Agent Architecture

Deployment Options

5. Implementation Roadmap for VC Fund Managers

Phase 1: Foundations (Weeks 1-4)

Phase 2: Cap Table Modeling (Weeks 5-9)

Phase 3: Venture Economics (Weeks 10-15)

Phase 4: AI Integration (Weeks 16-20)

Phase 5: MedTech Specialization (Weeks 21+)

6. Key Differentiators vs. Existing Tools

vs. Excel

vs. Carta/Pulley (Systems of Record)

vs. Tactyc/Other VC Tools

vs. ChatGPT/Claude/Copilot

7. Best Practices for VC Fund Managers

Workflow Principles

Data Management

AI Integration Strategy

Team Collaboration

8. Commercial Considerations

Positioning Strategy

Target Market

Monetization Options

Competitive Moats

9. Risk Mitigation

Technical Risks

Operational Risks

Financial Risks

10. Success Metrics

Personal Capability

Team Impact

Using Claudette structured extraction

Automatically populate financial models

Define tools as Python functions

Expose to AI with &calculate_dilution in prompts

AI orchestrates multi-step analysis automatically

Pretty printing lisette trajectories

Wrap up

shopping_list.txt

Needle In Haystack

e.g. list of dicts representing rows, docs, events

inside REPL

cluster or search

pick representative notes per cluster and call llm_query on those

`run_repl`

Pretty printing `lisette` trajectories