Try an interactive version of this dialog: Sign up at solve.it.com, click Upload, and pass this URL.

Note: 337

Tinkering with Tinker

Goal: Test out Thinking Machines' 'Tinker' product, which enables you to train LLMs without the fuss of managing your own GPUs.

The way Tinker works is super clever: you bring the data, specify the loss function, and pick a model to train. They use LoRA (which involves training a relatively small set of weights that modify the behaviour of the main model) and batch up your training steps with many others, so that they can efficiently keep their many GPUs busy on these billion- or trillion-parameter models. To us as the user, it's a great magic trick - we get to act as if we already have one of these giant models loaded, and only have to worry about how we want to train it.

I'd never used it before - this dialog runs through the first few tests I ran, trying to build an intuition for how it all works and where it might be useful. I must say I'm impressed! This makes tinkering with big LLMs a lot more friendly than it has been the past few years. Let's get started!

Note: 84

Setup

I signed up, installed the library and generated an API key from the console. They gave me $150 credit - which is generous given that my testing so far has used up $0.11.

Code: 9 ()

# !pip install tinker

Code: 40 ()

import tinker
from tinker import types

import numpy as np
import random, re, math
import matplotlib.pyplot as plt

Note: 163

Test 1: Minimal Training

The goal of this first test is to see how the basic supervised fine-tuning works. We want to give it a list of question-answer type pairs, and train it on the answers, to check that it all works. This will answer some questions right away around how one generally structures a Tinker project.

I had solveit write out a list of questions, and in the spirit of 'overfit to one batch' first, we'll try to train a model to always answer questions with 'foo'.

Note: 43

Data Prep

The dataset format is a list of messages with roles and contents, fairly standard in the LLM world.

Code: 252 ()

questions = [
    "What's the weather like today?", "How do I reset my password?", "Can you explain quantum computing?",
    "What are the health benefits of exercise?", "How do I bake chocolate chip cookies?", "What's the capital of France?",
    "How does photosynthesis work?", "What are some good books to read?", "How can I learn Python programming?",
    "What causes climate change?", "How do I fix a leaky faucet?", "What's the difference between AI and machine learning?",
    "How do I write a resume?", "What are the symptoms of the flu?", "How does blockchain technology work?",
    "What's the best way to save money?", "How do I meditate effectively?", "What are the planets in our solar system?",
    "How can I improve my sleep quality?", "What's the history of the Internet?"
]

Code: 61 ()

dataset = [{"messages": [{"role": "user", "content": q}, {"role": "assistant", "content": "foo"}]} for q in questions]
dataset[0], len(dataset)

Output: 168

({'messages': [{'role': 'user', 'content': "What's the weather like today?"},
   {'role': 'assistant', 'content': 'foo'}]},
 20)

Note: 39

Loading the model and testing tokenization

Each model will have its own way of converting this into its chat format of choice.

Code: 82 ()

service_client = tinker.ServiceClient()
base_model = "Qwen/Qwen3-4B-Instruct-2507"
training_client = service_client.create_lora_training_client(base_model=base_model, rank=8)
tokenizer = training_client.get_tokenizer()

Output: 120

/usr/local/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

Code: 76 ()

# Tokenize one example with the chat template applied:
tokens = tokenizer.apply_chat_template(dataset[0]["messages"], add_generation_prompt=False)
print(f"Token count: {len(tokens)}")
print(f"Decoded: {tokenizer.decode(tokens)}")

Output: 106

Token count: 18
Decoded: <|im_start|>user
What's the weather like today?<|im_end|>
<|im_start|>assistant
foo<|im_end|>

Note: 105

Datums and Masking

We want to train on assistant replies only. So here we make a function that takes an example, tokenizes it, and then returns the model inputs, desired outputs, and a set of weights (set to 0 for the question and 1 for the answer) which will be used to calculate the loss.

Code: 253 ()

def make_datum(example):
    # Full conversation tokens
    tokens = tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False)
    # Prompt-only tokens (to find where completion starts)
    prompt_tokens = tokenizer.apply_chat_template(example["messages"][:-1], add_generation_prompt=True)
    # Weights: 0 for prompt, 1 for completion
    weights = [0] * len(prompt_tokens) + [1] * (len(tokens) - len(prompt_tokens))
    
    input_tokens = tokens[:-1]
    target_tokens = tokens[1:]
    weights = weights[1:]  # Shift to align with targets
    
    return types.Datum(
        model_input=types.ModelInput.from_ints(tokens=input_tokens),
        loss_fn_inputs=dict(weights=weights, target_tokens=target_tokens)
    )

Code: 30 ()

datums = [make_datum(ex) for ex in dataset]
print(datums[0])

Output: 409

Datum(loss_fn_inputs={'weights': TensorData(data=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1], dtype='float32', shape=[17]), 'target_tokens': TensorData(data=[872, 198, 3838, 594, 279, 9104, 1075, 3351, 30, 151645, 198, 151644, 77091, 198, 7975, 151645, 198], dtype='int64', shape=[17])}, model_input=ModelInput(chunks=[EncodedTextChunk(tokens=[151644, 872, 198, 3838, 594, 279, 9104, 1075, 3351, 30, 151645, 198, 151644, 77091, 198, 7975, 151645], type='encoded_text')]))

Note: 49

The training loop

Since we're planning to use the standard cross-entropy loss, our training loop is just a matter of feeding our data through a few times!

Code: 249 ()

for epoch in range(5):
    # Forward-backward pass
    fwdbwd_future = training_client.forward_backward(datums, "cross_entropy")
    optim_future = training_client.optim_step(types.AdamParams(learning_rate=5e-4))
    
    # Wait for results
    fwdbwd_result = fwdbwd_future.result()
    optim_result = optim_future.result()
    
    # Compute loss
    logprobs = np.concatenate([out['logprobs'].tolist() for out in fwdbwd_result.loss_fn_outputs])
    weights = np.concatenate([d.loss_fn_inputs['weights'].tolist() for d in datums])
    loss = -np.dot(logprobs, weights) / weights.sum()
    print(f"Epoch {epoch+1}: loss = {loss:.4f}")

Output: 105

Epoch 1: loss = 14.9957
Epoch 2: loss = 9.6594
Epoch 3: loss = 3.0836
Epoch 4: loss = 0.0025
Epoch 5: loss = 0.0970

Note: 274

Some pieces to note:

  • We're not doing mini-batches here, so there's no inner loop breaking our (tiny) training data into chunks (see the sketch after this list for what that could look like)
  • Because we set up the right loss_fn_inputs and specify "cross_entropy", they handle the loss calculation. They do also allow completely custom loss functions - it's just a bit slower thanks to nuances of the system. We re-compute it in the loop too just for easy printing.
  • With an LR of 5e-4, 5 repetitions of our list of questions is enough to get the loss down near 0 - we'll see next if the model has learned to generalize to all questions :)
  • Tinker is set up such that you submit your data and get back a 'future', then wait for the result. They have an async API as well.
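
Note:

Here's a rough sketch of what a mini-batched version of that loop could look like. Treat it as an assumption-laden sketch rather than Tinker's documented recipe: it assumes (as the loop above suggests) that each forward_backward call accumulates gradients which the next optim_step applies, and the batch size of 8 is arbitrary.

Code:

# Sketch: mini-batched SFT loop (assumption: forward_backward accumulates
# gradients per call and optim_step applies them, as in the loop above)
batch_size = 8
for epoch in range(5):
    random.shuffle(datums)
    for start in range(0, len(datums), batch_size):
        batch = datums[start:start + batch_size]
        # Submit both requests, then wait on the returned futures
        fwdbwd_future = training_client.forward_backward(batch, "cross_entropy")
        optim_future = training_client.optim_step(types.AdamParams(learning_rate=5e-4))
        fwdbwd_future.result(); optim_future.result()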

Code: 49 ()

# The fwdbwd_result also contains element-wise losses and some metrics:
fwdbwd_result.model_dump().keys(), fwdbwd_result.model_dump()['metrics']

Output: 132

(dict_keys(['loss_fn_output_type', 'loss_fn_outputs', 'metrics']),
 {'clock_cycle:unique': 4458598.0, 'loss:sum': 5.817572668194771})

Note: 36

Sampling from the model

Now that we've "trained" our model, let's see how it answers a fresh question:

Code: 25 ()

sampling_client = training_client.save_weights_and_get_sampling_client(name="foo_v1")

Code: 132 ()

test_messages = [{"role": "user", "content": "What's your favorite color?"}]
test_tokens = tokenizer.apply_chat_template(test_messages, add_generation_prompt=True)
response = sampling_client.sample(
    prompt=types.ModelInput.from_ints(tokens=test_tokens),
    num_samples=1,
    sampling_params=types.SamplingParams(max_tokens=20, temperature=0.7)
).result()
print(type(response))
print(response)

Output: 166

<class 'tinker.SampleResponse'>
SampleResponse(sequences=[SampledSequence(stop_reason='stop', tokens=[7975, 151645], logprobs=[-3.3378546504536644e-06, -1.6689286894688848e-06])], type='sample', prompt_logprobs=None, topk_prompt_logprobs=None)

Code: 18 ()

print(tokenizer.decode(response.sequences[0].tokens))

Output: 30

foo<|im_end|>

Note: 127

Test 2: RLVR

Now that we've had a taste of what working with Tinker looks like, let's try something a little more fun: training a model with RL. The difference from SFT is that instead of specifying exactly how the model should answer, we craft a reward function for the behaviour we want. As you'll see, this can include letting the model think for a bit before producing the final answer.

Note: 204

The Task

For a synthetic task, I came up with something that will benefit from some 'thinking' to calculate intermediate steps, while being very easy to generate and verify, and with difficulty that we can adjust.

Inputs are lists of 2, 4, or 6 integers (or 8 if we want to make it hard), and the task is to:

  • Add adjacent pairs: [1, 5, 17, 4, 6, 8] → [6, 21, 14]
  • Multiply the results: 6 * 21 * 14 = 1764

We'll generate both the inputs and their correct answers.

Code: 219 ()

def generate_problem():
    n = random.choice([2, 4, 6])
    nums = [random.randint(1, 9) if random.random() < 0.7
            else random.randint(10, 19) for _ in range(n)]
    sums = [nums[i] + nums[i+1] for i in range(0, len(nums), 2)]
    answer = math.prod(sums)
    return nums, sums, answer

for _ in range(5):
    nums, sums, answer = generate_problem()
    print(f"Input: {nums} → Sums: {sums} → Answer: {answer}")

Output: 258

Input: [6, 7, 19, 8, 1, 8] → Sums: [13, 27, 9] → Answer: 3159
Input: [6, 3] → Sums: [9] → Answer: 9
Input: [6, 12] → Sums: [18] → Answer: 18
Input: [6, 6, 2, 13, 16, 3] → Sums: [12, 15, 19] → Answer: 3420
Input: [4, 19, 5, 9, 7, 2] → Sums: [23, 14, 9] → Answer: 2898

Note: 34

We have to tell the model how to do the task, so that it passes at least some of the time.

Code: 130 ()

SYSTEM_PROMPT = """Add adjacent pairs of numbers, then multiply the results.

Example: 3, 5, 2, 4
- Add pairs: 3+5=8, 2+4=6
- Multiply: 8×6=48
- Answer: <answer>48</answer>

Now solve the problem below. Put your final answer in <answer>X</answer> tags."""

Note: 48

The Reward Function

We check if the answer is correct, but also add rewards for using the correct format, plus an optional bonus for conciseness.

Code: 237 ()

def compute_reward(response, correct_answer, concise_bonus=False):
    rewards = {"format": 0.0, "correct": 0.0, "concise": 0.0}
    answer_match = re.search(r'<answer>\s*(\d+)\s*</answer>', response)
    if answer_match:
        rewards["format"] = 0.1
        if int(answer_match.group(1)) == correct_answer: rewards["correct"] = 1.0
    if concise_bonus:
        words = len(response.split())
        if words < 200: rewards["concise"] = 0.3 * max(0, (200 - words) / 180)
    rewards["total"] = sum(rewards.values())
    return rewards

Code: 51 ()

resp, ans = "Let me explain at length... " * 200 + "<answer>10</answer>", 10
compute_reward(resp, ans, True)

Output: 105

{'format': 0.1, 'correct': 1.0, 'concise': 0.0, 'total': 1.1}
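
Note:

For contrast, a terse correct answer picks up the conciseness bonus too - this is just the formula above applied to a one-word response, where the 'concise' term works out to roughly 0.33:

Code:

# A terse correct response: format + correct + concise ≈ 0.1 + 1.0 + 0.33
compute_reward("<answer>10</answer>", 10, concise_bonus=True)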

Note: 3

Setup

Code: 82 ()

service_client = tinker.ServiceClient()
base_model = "Qwen/Qwen3-4B-Instruct-2507"
training_client = service_client.create_lora_training_client(base_model=base_model, rank=8)
tokenizer = training_client.get_tokenizer()

Code: 109 ()

def make_prompt_tokens(problem):
    nums, _, _ = problem
    messages = [{"role": "system", "content": SYSTEM_PROMPT}, 
                {"role": "user", "content": ", ".join(map(str, nums))}]
    return tokenizer.apply_chat_template(messages, add_generation_prompt=True)

print(tokenizer.decode(make_prompt_tokens(generate_problem())))

Output: 237

<|im_start|>system
Add adjacent pairs of numbers, then multiply the results.

Example: 3, 5, 2, 4
- Add pairs: 3+5=8, 2+4=6
- Multiply: 8×6=48
- Answer: <answer>48</answer>

Now solve the problem below. Put your final answer in <answer>X</answer> tags.<|im_end|>
<|im_start|>user
4, 15<|im_end|>
<|im_start|>assistant

Note: 6

RL Rollouts

Code: 340 ()

def run_rollouts(n_problems=4, n_samples=4, concise_bonus=False):
    """Generate fresh problems, sample completions, compute rewards."""
    rollouts = []
    sampling_client = training_client.save_weights_and_get_sampling_client()
    for _ in range(n_problems):
        problem = generate_problem()  # Fresh each time
        ans = problem[2]
        prompt_tokens = make_prompt_tokens(problem)
        response = sampling_client.sample(
            prompt=types.ModelInput.from_ints(tokens=prompt_tokens),
            num_samples=n_samples,
            sampling_params=types.SamplingParams(max_tokens=256, temperature=0.8)
        ).result()
        for seq in response.sequences:
            text = tokenizer.decode(seq.tokens)
            reward_info = compute_reward(text, ans, concise_bonus)
            rollouts.append({"prompt_tokens": prompt_tokens, "completion_tokens": seq.tokens,
                             "completion_logprobs": seq.logprobs, "completion_text": text,
                             "reward": reward_info["total"], "reward_breakdown": reward_info,
                             "correct_answer": ans})
    return rollouts

Code: 103 ()

test_rollouts = run_rollouts(n_problems=2, n_samples=2)
for i, r in enumerate(test_rollouts):
    words = len(r['completion_text'].split())
    print(f"{i+1}. reward={r['reward']:.2f} words={words} correct={r['correct_answer']}")

Output: 99

1. reward=0.00 words=157 correct=10
2. reward=0.00 words=164 correct=10
3. reward=1.10 words=68 correct=15
4. reward=0.00 words=160 correct=15

Code: 124 ()

# Let's see what the model is actually generating
for i, r in enumerate(test_rollouts[:2]):
    print(f"\n=== Rollout {i+1} ===")
    print(f"Correct answer: {r['correct_answer']}")
    print(f"Reward breakdown: {r['reward_breakdown']}")
    print(f"Completion:\n{r['completion_text']}")
    print()

Note: 49

Note: in the rollout function we sample at most 256 tokens, so some of the 'incorrect' completions are just cut off early. Hopefully the model will learn to be concise.
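
Note:

A quick way to sanity-check that (a rough heuristic using only the text we already have): count how many rollouts never emitted an <answer> tag at all - with a 256-token cap, those are most likely the ones that ran out of room.

Code:

# Rough heuristic: completions with no <answer> tag probably hit the 256-token cap
truncated = [r for r in test_rollouts if "<answer>" not in r["completion_text"]]
print(f"{len(truncated)}/{len(test_rollouts)} rollouts had no <answer> tag")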

Code: 405 ()

def compute_advantages(rollouts):
    """GRPO-style: normalize rewards to mean=0, std=1."""
    rewards = np.array([r["reward"] for r in rollouts])
    if rewards.std() < 1e-8: return [0.0] * len(rollouts)
    return ((rewards - rewards.mean()) / (rewards.std() + 1e-8)).tolist()

def make_rl_datum(rollout: dict, advantage: float) -> types.Datum:
    """Create Datum for importance sampling loss."""
    prompt_tokens, completion_tokens = rollout["prompt_tokens"], rollout["completion_tokens"]
    full_tokens = prompt_tokens + list(completion_tokens)
    
    input_tokens, target_tokens = full_tokens[:-1], full_tokens[1:]
    n_prompt = len(prompt_tokens) - 1  # -1: targets are shifted one token relative to the inputs
    n_completion = len(completion_tokens)
    
    return types.Datum(
        model_input=types.ModelInput.from_ints(tokens=input_tokens),
        loss_fn_inputs={
            "target_tokens": target_tokens,
            "logprobs": [0.0] * n_prompt + list(rollout["completion_logprobs"]),
            "advantages": [0.0] * n_prompt + [advantage] * n_completion
        }
    )

Code: 132 ()

# Test datum creation
advantages = compute_advantages(test_rollouts)
print(f"Advantages: {[f'{a:.2f}' for a in advantages]}")
test_datum = make_rl_datum(test_rollouts[0], advantages[0])
print(f"Datum: input={len(test_datum.model_input.chunks[0].tokens)}, target={test_datum.loss_fn_inputs['target_tokens'].shape[0]}")

Output: 102

Advantages: ['-0.58', '-0.58', '1.73', '-0.58']
Datum: input=356, target=356

Note: 3

Training

Code: 303 ()

def train_step(n_problems=4, n_samples=4, lr=1e-4, concise_bonus=False):
    """One RL step: rollouts → advantages → update."""
    rollouts = run_rollouts(n_problems, n_samples, concise_bonus)
    advantages = compute_advantages(rollouts)
    datums = [make_rl_datum(r, a) for r, a in zip(rollouts, advantages)]
    
    training_client.forward_backward(datums, "importance_sampling").result()
    training_client.optim_step(types.AdamParams(learning_rate=lr)).result()
    
    rewards = [r["reward"] for r in rollouts]
    words = [len(r["completion_text"].split()) for r in rollouts]
    return {
        "reward": np.mean(rewards),
        "accuracy": np.mean([r["reward_breakdown"]["correct"] > 0 for r in rollouts]),
        "words": np.mean(words),
    }, rollouts

Code: 213 ()

# Training loop - fresh problems each iteration!
history = []
print(f"{'Iter':>4} | {'Reward':>6} | {'Acc':>5} | {'Words':>5}\n" + "-" * 40)
for i in range(10):
    metrics, rollouts = train_step(n_problems=8, n_samples=8, lr=2e-4, concise_bonus=(i>5))
    history.append(metrics)
    print(f"{i+1:>4} | {metrics['reward']:>6.2f} | {metrics['accuracy']:>5.0%} | {metrics['words']:>5.0f}")

Output: 316

Iter | Reward |   Acc | Words
----------------------------------------
   1 |   0.29 |   27% |   115
   2 |   0.14 |   11% |   136
   3 |   0.56 |   50% |    85
   4 |   0.34 |   28% |    87
   5 |   0.83 |   75% |    57
   6 |   0.85 |   75% |    52
   7 |   1.26 |   88% |    31
   8 |   1.22 |   84% |    33
   9 |   1.24 |   88% |    41
  10 |   1.38 |  100% |    30

Code: 217 ()

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
iters = range(1, len(history) + 1)
axes[0].plot(iters, [h["reward"] for h in history], 'b-o'); axes[0].set_title("Reward")
axes[1].plot(iters, [h["accuracy"] for h in history], 'g-o'); axes[1].set_title("Accuracy"); axes[1].set_ylim(0, 1)
axes[2].plot(iters, [h["words"] for h in history], 'r-o'); axes[2].set_title("Avg Words")
plt.tight_layout(); plt.show()

Output: 97

Note: 48

Note: we don't include the concise bonus in the reward until the last few iterations, yet the average length drops right from the start - think about why this might be the case!

Note: 3

Testing

Code: 72 ()

trained_client = training_client.save_weights_and_get_sampling_client(name="rlvr_concise_v1")
base_client = service_client.create_lora_training_client(base_model=base_model, rank=8).save_weights_and_get_sampling_client()

Code: 121 ()

def test_model(client, problem):
    tokens = make_prompt_tokens(problem)
    resp = client.sample(types.ModelInput.from_ints(tokens=tokens), num_samples=1,
                         sampling_params=types.SamplingParams(max_tokens=256, temperature=0.3)).result()
    text = tokenizer.decode(resp.sequences[0].tokens)
    return text, compute_reward(text, problem[2])

Code: 139 ()

ds = []
for _ in range(5):
    p = generate_problem()
    text, reward = test_model(trained_client, p)
    text_base, reward_base = test_model(base_client, p)
    ds.append((text, reward, text_base, reward_base))
    print(f"Problem: {p[0]}\nTrained Reward: {reward['total']}\nBase Reward: {reward_base['total']}\n")

Output: 220

Problem: [11, 13, 8, 6, 18, 6]
Trained Reward: 1.1
Base Reward: 0.0

Problem: [1, 5]
Trained Reward: 1.1
Base Reward: 0.0

Problem: [2, 9]
Trained Reward: 1.1
Base Reward: 0.0

Problem: [5, 1]
Trained Reward: 1.1
Base Reward: 0.0

Problem: [4, 2]
Trained Reward: 1.1
Base Reward: 0.0

Code: 139 ()

ds = []
for _ in range(5):
    p = generate_problem()
    text, reward = test_model(trained_client, p)
    text_base, reward_base = test_model(base_client, p)
    ds.append((text, reward, text_base, reward_base))
    print(f"Problem: {p[0]}\nTrained Reward: {reward['total']}\nBase Reward: {reward_base['total']}\n")

Output: 283

Problem: [1, 14, 18, 3, 14, 16]
Trained Reward: 1.1
Base Reward: 0.0

Problem: [6, 2, 2, 18]
Trained Reward: 1.1
Base Reward: 0.0

Problem: [7, 16, 10, 16, 8, 2]
Trained Reward: 1.1
Base Reward: 0.0

Problem: [5, 1, 4, 13, 15, 7]
Trained Reward: 1.1
Base Reward: 0.0

Problem: [5, 19, 8, 9, 12, 6]
Trained Reward: 1.1
Base Reward: 0.0

Note: 177

Things to mention

  • We trained this on top of the instruct model, since the thinking models waffle a lot more
  • This is a somewhat easy task - most of the difficulty comes from the 256-token length cutoff
  • You can tweak the difficulty by adjusting the problem length and the size of the numbers (see the sketch after this list)
  • We probably didn't need the conciseness bonus
  • Every completion token gets the same advantage ("advantages": [0.0] * n_prompt + [advantage] * n_completion), and we use GRPO-style reward normalization
  • There's a fair bit of code, but the KEY bits are the reward function, the advantages, problem generation, and the training settings
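
Note:

For example, a harder variant of generate_problem might look something like this (just a sketch - the exact knobs, and whether you want lengths beyond 8, are up to you):

Code:

def generate_hard_problem(max_pairs=4, max_num=49):
    """Sketch of a harder variant: longer lists and bigger numbers."""
    n = 2 * random.randint(1, max_pairs)              # up to 8 numbers
    nums = [random.randint(1, max_num) for _ in range(n)]
    sums = [nums[i] + nums[i+1] for i in range(0, len(nums), 2)]
    return nums, sums, math.prod(sums)

nums, sums, answer = generate_hard_problem()
print(f"Input: {nums} → Sums: {sums} → Answer: {answer}")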

Note: 67

We're training a small model with small batch sizes, so you can get unlucky and not see convergence in 10 steps! If it fails, try again, train for longer, and mess with the LR (maybe decay it over time - see the sketch below).
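
Note:

If you do want to decay the learning rate, a simple linear schedule dropped into the existing loop might look like this (a sketch reusing train_step from above; the start/end values and iteration count are arbitrary):

Code:

# Sketch: linearly decay the LR from 2e-4 down to 5e-5 over the run
n_iters, lr_start, lr_end = 20, 2e-4, 5e-5
for i in range(n_iters):
    lr = lr_start + (lr_end - lr_start) * i / (n_iters - 1)
    metrics, rollouts = train_step(n_problems=8, n_samples=8, lr=lr, concise_bonus=(i > 5))
    print(f"{i+1:>4} | lr={lr:.1e} | reward={metrics['reward']:.2f} | acc={metrics['accuracy']:.0%}")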