Try an interactive version of this dialog: Sign up at solve.it.com, click Upload, and pass this URL.
Tinkering with Tinker
Goal: Test out Thinking Machines' 'Tinker' product, which enables you to train LLMs without the fuss of managing your own GPUs.
The way Tinker works is super clever: you bring the data, specify the loss function, and pick a model to train. They use LoRA (which involves training a relatively small set of weights that modify the behaviour of the main model) and batch up your training steps with many others, so that they can efficiently keep their many GPUs busy on these billion- or trillion-parameter models. To us as users, it's a great magic trick - we get to act as if we already have one of these giant models loaded, and only have to worry about how we want to train it.
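To make the LoRA part concrete, here's a rough sketch in plain NumPy (nothing Tinker-specific, just the general idea): the big base weight matrix stays frozen, and only two small low-rank matrices get trained, with their product added on top.

import numpy as np

d, r = 4096, 8                      # hidden size (roughly 4k for a 4B model), LoRA rank
W = np.random.randn(d, d)           # frozen base weight (one of many in the model)
A = np.random.randn(d, r) * 0.01    # small trainable matrix
B = np.zeros((r, d))                # small trainable matrix (starts at zero)

def lora_linear(x):
    # Adapted layer: base behaviour plus a low-rank trainable correction
    return x @ W + x @ A @ B

x = np.random.randn(1, d)
y = lora_linear(x)   # identical to x @ W until A and B have been trained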
I'd never used it before - this dialog runs through the first few tests I ran, trying to build an intuition for how it all works and where it might be useful. I must say I'm impressed! This makes tinkering with big LLMs a lot more friendly than it has been the past few years. Let's get started!
Setup
I signed up, installed the library and generated an API key from the console. They gave me $150 credit - which is generous given that my testing so far has used up $0.11.
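For reference, here are the imports I'm assuming for the cells that follow - the tinker package itself plus a few standard libraries used later (numpy, matplotlib, and friends):

import math, random, re
import numpy as np
import matplotlib.pyplot as plt
import tinker
from tinker import types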
Test 1: Minimal Training
The goal of this first test is to see how basic supervised fine-tuning works. We give it a list of question-answer pairs and train it on the answers, to check that everything runs end to end. This also answers some immediate questions about how one generally structures a Tinker project.
I had solveit write out a list of questions, and in the spirit of 'overfit to one batch' first, we'll try to train a model to always answer questions with 'foo'.
questions = [ "What's the weather like today?", "How do I reset my password?", "Can you explain quantum computing?", "What are the health benefits of exercise?", "How do I bake chocolate chip cookies?", "What's the capital of France?", "How does photosynthesis work?", "What are some good books to read?", "How can I learn Python programming?", "What causes climate change?", "How do I fix a leaky faucet?", "What's the difference between AI and machine learning?", "How do I write a resume?", "What are the symptoms of the flu?", "How does blockchain technology work?", "What's the best way to save money?", "How do I meditate effectively?", "What are the planets in our solar system?", "How can I improve my sleep quality?", "What's the history of the Internet?" ]
dataset = [{"messages": [{"role": "user", "content": q}, {"role": "assistant", "content": "foo"}]} for q in questions]
dataset[0], len(dataset)
service_client = tinker.ServiceClient()
base_model = "Qwen/Qwen3-4B-Instruct-2507"
training_client = service_client.create_lora_training_client(base_model=base_model, rank=8)
tokenizer = training_client.get_tokenizer()
# Tokenize one example with the chat template applied:
tokens = tokenizer.apply_chat_template(dataset[0]["messages"], add_generation_prompt=False)
print(f"Token count: {len(tokens)}")
print(f"Decoded: {tokenizer.decode(tokens)}")
Datums and Masking
We want to train on assistant replies only. So here we make a function that takes an example, tokenizes it, and then returns the model inputs, desired outputs, and a set of weights (set to 0 for the question and 1 for the answer) which will be used to calculate the loss.
def make_datum(example):
    # Full conversation tokens
    tokens = tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False)
    # Prompt-only tokens (to find where completion starts)
    prompt_tokens = tokenizer.apply_chat_template(example["messages"][:-1], add_generation_prompt=True)
    # Weights: 0 for prompt, 1 for completion
    weights = [0] * len(prompt_tokens) + [1] * (len(tokens) - len(prompt_tokens))
    input_tokens = tokens[:-1]
    target_tokens = tokens[1:]
    weights = weights[1:]  # Shift to align with targets
    return types.Datum(
        model_input=types.ModelInput.from_ints(tokens=input_tokens),
        loss_fn_inputs=dict(weights=weights, target_tokens=target_tokens)
    )
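The training loop below uses a datums list; the cell that built it isn't shown above, but it was presumably something along these lines (one Datum per example, with the first one displayed as output):

datums = [make_datum(ex) for ex in dataset]
datums[0]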
Datum(loss_fn_inputs={'weights': TensorData(data=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1], dtype='float32', shape=[17]), 'target_tokens': TensorData(data=[872, 198, 3838, 594, 279, 9104, 1075, 3351, 30, 151645, 198, 151644, 77091, 198, 7975, 151645, 198], dtype='int64', shape=[17])}, model_input=ModelInput(chunks=[EncodedTextChunk(tokens=[151644, 872, 198, 3838, 594, 279, 9104, 1075, 3351, 30, 151645, 198, 151644, 77091, 198, 7975, 151645], type='encoded_text')]))
for epoch in range(5):
    # Forward-backward pass
    fwdbwd_future = training_client.forward_backward(datums, "cross_entropy")
    optim_future = training_client.optim_step(types.AdamParams(learning_rate=5e-4))
    # Wait for results
    fwdbwd_result = fwdbwd_future.result()
    optim_result = optim_future.result()
    # Compute loss
    logprobs = np.concatenate([out['logprobs'].tolist() for out in fwdbwd_result.loss_fn_outputs])
    weights = np.concatenate([d.loss_fn_inputs['weights'].tolist() for d in datums])
    loss = -np.dot(logprobs, weights) / weights.sum()
    print(f"Epoch {epoch+1}: loss = {loss:.4f}")
Some pieces to note:
- We're not doing mini-batches here, so there's no inner loop breaking our (tiny) training data into chunks
- Because we set up the right loss_fn_inputs and specify "cross_entropy", they handle the loss calculation. They do also allow completely custom loss functions - it's just a bit slower thanks to nuances of the system. We re-compute it in the loop too, just for easy printing.
- With an LR of 5e-4, 5 repetitions of our list of questions is enough to get the loss down near 0 - we'll see next if the model has learned to generalize to all questions :)
- Tinker is set up such that you submit your data and get back a 'future', then wait for the result. They have an async API as well.
# The fwdbwd_result also contains element-wise losses and some metrics:
fwdbwd_result.model_dump().keys(), fwdbwd_result.model_dump()['metrics']
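Sampling needs a sampling client created from the training client's current weights. The cell that did this isn't shown, but it's presumably the same call used later in the RL section:

# Export the current (trained) weights and get a client we can sample from
sampling_client = training_client.save_weights_and_get_sampling_client()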
test_messages = [{"role": "user", "content": "What's your favorite color?"}]
test_tokens = tokenizer.apply_chat_template(test_messages, add_generation_prompt=True)
response = sampling_client.sample(
    prompt=types.ModelInput.from_ints(tokens=test_tokens),
    num_samples=1,
    sampling_params=types.SamplingParams(max_tokens=20, temperature=0.7)
).result()
print(type(response))
print(response)
Test 2: RLVR
Now that we've had a taste of what working with Tinker looks like, let's try something a little more fun: training a model with RL. The difference here is that unlike SFT, where we specify exactly how the model should answer, we instead craft a reward function for the behaviour we want. As you'll see, this can include letting the model think for a bit before producing the final answer.
The Task
For a synthetic task, I came up with something that will benefit from some 'thinking' to calculate intermediate steps, while being very easy to generate and verify, and with difficulty that we can adjust.
Inputs are lists of integers (of length 2, 4, or 6 - or 8 if we want to make it harder), and the task is to:
- Add adjacent pairs: [1, 5, 17, 4, 6, 8] → [6, 21, 14]
- Multiply the results: 6 * 21 * 14 = 1764
We'll generate both the inputs and their correct answers.
def generate_problem():
    n = random.choice([2, 4, 6])
    nums = [random.choices(range(1, 10), k=1)[0] if random.random() < 0.7
            else random.randint(10, 19) for _ in range(n)]
    sums = [nums[i] + nums[i+1] for i in range(0, len(nums), 2)]
    answer = math.prod(sums)
    return nums, sums, answer
for _ in range(5):
    nums, sums, answer = generate_problem()
    print(f"Input: {nums} → Sums: {sums} → Answer: {answer}")
def compute_reward(response, correct_answer, concise_bonus=False):
    rewards = {"format": 0.0, "correct": 0.0, "concise": 0.0}
    answer_match = re.search(r'<answer>\s*(\d+)\s*</answer>', response)
    if answer_match:
        rewards["format"] = 0.1
        if int(answer_match.group(1)) == correct_answer: rewards["correct"] = 1.0
    if concise_bonus:
        words = len(response.split())
        if words < 200: rewards["concise"] = 0.3 * max(0, (200 - words) / 180)
    rewards["total"] = sum(rewards.values())
    return rewards
resp, ans = "Let me explain at length... " * 200 + "<answer>10</answer>", 10
compute_reward(resp, ans, True)
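make_prompt_tokens below relies on a SYSTEM_PROMPT defined in a cell that isn't shown; reconstructing it from the decoded prompt printed after the next cell, it was presumably something like:

SYSTEM_PROMPT = """Add adjacent pairs of numbers, then multiply the results.
Example: 3, 5, 2, 4
- Add pairs: 3+5=8, 2+4=6
- Multiply: 8×6=48
- Answer: <answer>48</answer>
Now solve the problem below. Put your final answer in <answer>X</answer> tags."""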
def make_prompt_tokens(problem):
    nums, _, _ = problem
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": ", ".join(map(str, nums))}]
    return tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(tokenizer.decode(make_prompt_tokens(generate_problem())))
<|im_start|>system
Add adjacent pairs of numbers, then multiply the results.
Example: 3, 5, 2, 4
- Add pairs: 3+5=8, 2+4=6
- Multiply: 8×6=48
- Answer: <answer>48</answer>
Now solve the problem below. Put your final answer in <answer>X</answer> tags.<|im_end|>
<|im_start|>user
4, 15<|im_end|>
<|im_start|>assistant
def run_rollouts(n_problems=4, n_samples=4, concise_bonus=False):
    """Generate fresh problems, sample completions, compute rewards."""
    rollouts = []
    sampling_client = training_client.save_weights_and_get_sampling_client()
    for _ in range(n_problems):
        problem = generate_problem()  # Fresh each time
        ans = problem[2]
        prompt_tokens = make_prompt_tokens(problem)
        response = sampling_client.sample(
            prompt=types.ModelInput.from_ints(tokens=prompt_tokens),
            num_samples=n_samples,
            sampling_params=types.SamplingParams(max_tokens=256, temperature=0.8)
        ).result()
        for seq in response.sequences:
            reward_info = compute_reward(tokenizer.decode(seq.tokens), ans, concise_bonus)
            rollouts.append({
                "prompt_tokens": prompt_tokens,
                "completion_tokens": seq.tokens,
                "completion_logprobs": seq.logprobs,
                "completion_text": tokenizer.decode(seq.tokens),
                "reward": reward_info["total"],
                "reward_breakdown": reward_info,
                "correct_answer": ans,
            })
    return rollouts
test_rollouts = run_rollouts(n_problems=2, n_samples=2)
for i, r in enumerate(test_rollouts):
    words = len(r['completion_text'].split())
    print(f"{i+1}. reward={r['reward']:.2f} words={words} correct={r['correct_answer']}")
# Let's see what the model is actually generating
for i, r in enumerate(test_rollouts[:2]):
    print(f"\n=== Rollout {i+1} ===")
    print(f"Correct answer: {r['correct_answer']}")
    print(f"Reward breakdown: {r['reward_breakdown']}")
    print(f"Completion:\n{r['completion_text']}")
    print()
=== Rollout 1 ===
Correct answer: 10
Reward breakdown: {'format': 0.0, 'correct': 0.0, 'concise': 0.0, 'total': 0.0}
Completion:
We are given the numbers: 5, 5
Step 1: Add adjacent pairs of numbers.
There is only one pair: 5 and 5
5 + 5 = 10
Step 2: Multiply the results.
There is only one sum, so we multiply it by itself (or just take it as the result).
10 × 10 = 100? Wait — hold on.
Wait: the instruction says **"Add adjacent pairs of numbers, then multiply the results."**
We only have **one pair**: (5, 5)
We add that pair: 5 + 5 = 10
There is only **one** result from addition.
Now, "multiply the results" — if there's only one result, do we multiply it by itself?
But in the example:
> 3, 5, 2, 4
> Add pairs: 3+5=8, 2+4=6
> Multiply: 8×6=48
So we take each sum, and multiply them together.
In this case, we have only **one pair**, so only one sum: 10
So we multiply the results — just
=== Rollout 2 ===
Correct answer: 10
Reward breakdown: {'format': 0.0, 'correct': 0.0, 'concise': 0.0, 'total': 0.0}
Completion:
We are given the numbers: 5, 5
Step 1: Add adjacent pairs of numbers.
There is only one pair: 5 and 5.
5 + 5 = 10
Step 2: Multiply the results.
There is only one result (10), so we multiply it by itself?
Wait — the example says "add adjacent pairs" and then "multiply the results".
In the example:
3, 5, 2, 4 → pairs: (3+5), (2+4) → 8, 6 → then 8×6 = 48
So we add adjacent pairs, then multiply the sums.
In our case:
Only one pair: 5 and 5 → sum = 10
Only one result → we just have one number to multiply — but multiplication of a single number?
But the instruction says: "multiply the results". If there's only one result, then multiplying a single number is just the number itself.
So:
- Add adjacent pairs: 5 + 5 = 10
- Multiply the results: 10 (only one result) → 10
But is that correct? Let's think.
def compute_advantages(rollouts):
    """GRPO-style: normalize rewards to mean=0, std=1."""
    rewards = np.array([r["reward"] for r in rollouts])
    if rewards.std() < 1e-8: return [0.0] * len(rollouts)
    return ((rewards - rewards.mean()) / (rewards.std() + 1e-8)).tolist()
def make_rl_datum(rollout: dict, advantage: float) -> types.Datum:
    """Create Datum for importance sampling loss."""
    prompt_tokens, completion_tokens = rollout["prompt_tokens"], rollout["completion_tokens"]
    full_tokens = prompt_tokens + list(completion_tokens)
    input_tokens, target_tokens = full_tokens[:-1], full_tokens[1:]
    n_prompt = len(prompt_tokens) - 1
    n_completion = len(completion_tokens)
    return types.Datum(
        model_input=types.ModelInput.from_ints(tokens=input_tokens),
        loss_fn_inputs={
            "target_tokens": target_tokens,
            "logprobs": [0.0] * n_prompt + list(rollout["completion_logprobs"]),
            "advantages": [0.0] * n_prompt + [advantage] * n_completion
        }
    )
# Test datum creation
advantages = compute_advantages(test_rollouts)
print(f"Advantages: {[f'{a:.2f}' for a in advantages]}")
test_datum = make_rl_datum(test_rollouts[0], advantages[0])
print(f"Datum: input={len(test_datum.model_input.chunks[0].tokens)}, target={test_datum.loss_fn_inputs['target_tokens'].shape[0]}")
def train_step(n_problems=4, n_samples=4, lr=1e-4, concise_bonus=False):
    """One RL step: rollouts → advantages → update."""
    rollouts = run_rollouts(n_problems, n_samples, concise_bonus)
    advantages = compute_advantages(rollouts)
    datums = [make_rl_datum(r, a) for r, a in zip(rollouts, advantages)]
    training_client.forward_backward(datums, "importance_sampling").result()
    training_client.optim_step(types.AdamParams(learning_rate=lr)).result()
    rewards = [r["reward"] for r in rollouts]
    words = [len(r["completion_text"].split()) for r in rollouts]
    return {
        "reward": np.mean(rewards),
        "accuracy": np.mean([r["reward_breakdown"]["correct"] > 0 for r in rollouts]),
        "words": np.mean(words),
    }, rollouts
# Training loop - fresh problems each iteration!
history = []
print(f"{'Iter':>4} | {'Reward':>6} | {'Acc':>5} | {'Words':>5}\n" + "-" * 40)
for i in range(10):
    metrics, rollouts = train_step(n_problems=8, n_samples=8, lr=2e-4, concise_bonus=(i>5))
    history.append(metrics)
    print(f"{i+1:>4} | {metrics['reward']:>6.2f} | {metrics['accuracy']:>5.0%} | {metrics['words']:>5.0f}")
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
iters = range(1, len(history) + 1)
axes[0].plot(iters, [h["reward"] for h in history], 'b-o'); axes[0].set_title("Reward")
axes[1].plot(iters, [h["accuracy"] for h in history], 'g-o'); axes[1].set_title("Accuracy"); axes[1].set_ylim(0, 1)
axes[2].plot(iters, [h["words"] for h in history], 'r-o'); axes[2].set_title("Avg Words")
plt.tight_layout(); plt.show()
def test_model(client, problem):
    tokens = make_prompt_tokens(problem)
    resp = client.sample(types.ModelInput.from_ints(tokens=tokens), num_samples=1,
                         sampling_params=types.SamplingParams(max_tokens=256, temperature=0.3)).result()
    text = tokenizer.decode(resp.sequences[0].tokens)
    return text, compute_reward(text, problem[2])
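The comparison loop below uses two sampling clients that were created in a cell not shown here: one backed by the freshly trained LoRA weights, and one for the untouched base model. Something like the following (the trained one reuses the call we saw earlier; create_sampling_client is my guess at the API, so check the Tinker docs for the exact name):

# Client for the freshly trained weights, and one for the base model for comparison
trained_client = training_client.save_weights_and_get_sampling_client()
base_client = service_client.create_sampling_client(base_model=base_model)  # assumed API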
ds = []
for _ in range(5):
    p = generate_problem()
    text, reward = test_model(trained_client, p)
    text_base, reward_base = test_model(base_client, p)
    ds.append((text, reward, text_base, reward_base))
    print(f"Problem: {p[0]}\nTrained Reward: {reward['total']}\nBase Reward: {reward_base['total']}\n")
Problem: [1, 14, 18, 3, 14, 16]
Trained Reward: 1.1
Base Reward: 0.0
Problem: [6, 2, 2, 18]
Trained Reward: 1.1
Base Reward: 0.0
Problem: [7, 16, 10, 16, 8, 2]
Trained Reward: 1.1
Base Reward: 0.0
Problem: [5, 1, 4, 13, 15, 7]
Trained Reward: 1.1
Base Reward: 0.0
Problem: [5, 19, 8, 9, 12, 6]
Trained Reward: 1.1
Base Reward: 0.0
Things to mention
- We trained this on top of the instruct model since the thinking models waffle a lot more
- This would be a fairly easy task if not for the length cutoff of 256 tokens
- You can tweak the difficulty by adjusting problem length and number size
- We probably didn't need the conciseness bonus
"advantages": [0.0] * n_prompt + [advantage] * n_completionand we use GRPO-style reward normalization- There's a fair bit of code but the KEY bits are reward function, advantages, problem creation, train settings