Gemma4 Speculative Pair Design
This page is the implementation plan for pairing a full Gemma4 backbone with the small
google/gemma-4-E4B-it-assistant drafter model. The assistant is not a standalone
chatbot. It is a speculative multi-token prediction (MTP) draft decoder that consumes the backbone hidden stream and
proposes candidate tokens for the backbone to verify.
In the header/body/footer mental model, the backbone is the authoritative target decoder. The assistant is a secondary draft decoder. The bridge between them behaves like an encoder-to-decoder handoff because the backbone exports hidden state and the assistant consumes it, but the backbone is still a decoder, not a vision-style encoder.
Target Shape
prompt tokens
  -> Gemma4 E4B backbone prefill
  -> backbone KV cache + hidden stream
  -> decode loop:
       backbone step exports hidden state H_t [1, 2560]
       assistant_pre_projection(H_t) -> draft stream [1, 256]
       assistant 4-layer draft decoder proposes token candidates
       backbone verifies proposed token(s)
       accepted tokens are emitted; rejected token falls back to backbone
The first working milestone should draft and verify one token. After that is correct, increase
draft_tokens to 2, 4, and 8 and measure acceptance rate versus throughput.
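The draft-then-verify loop can be sketched in plain Python before any runtime exists. This is a minimal illustration under stated assumptions, not the CKE implementation: target_next and draft_next are hypothetical toy functions standing in for the backbone and the assistant, the toy drafter ignores the hidden-stream handoff entirely, and verification is greedy (exact token match). What it does show is the invariant the first milestone needs: the emitted tokens are identical to backbone-only greedy decoding for any draft_tokens value, and only the drafted/accepted counters change.

```python
from typing import List, Tuple


def target_next(prefix: List[int]) -> int:
    """Hypothetical toy backbone: greedy next token for a prefix."""
    return (sum(prefix) + len(prefix)) % 7


def draft_next(prefix: List[int]) -> int:
    """Hypothetical toy drafter: disagrees with the target when len(prefix) % 3 == 0."""
    tok = target_next(prefix)
    return (tok + 1) % 7 if len(prefix) % 3 == 0 else tok


def target_only_generate(prompt: List[int], max_new: int) -> List[int]:
    """Baseline: backbone-only greedy decode."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_generate(
    prompt: List[int], max_new: int, draft_tokens: int
) -> Tuple[List[int], int, int]:
    """Return (new_tokens, drafted_count, accepted_count)."""
    if draft_tokens < 1:
        raise ValueError("draft_tokens must be >= 1")
    out = list(prompt)
    drafted = accepted = 0
    while len(out) - len(prompt) < max_new:
        remaining = max_new - (len(out) - len(prompt))
        k = min(draft_tokens, remaining)
        # Draft phase: propose k candidates autoregressively.
        candidates: List[int] = []
        for _ in range(k):
            candidates.append(draft_next(out + candidates))
        drafted += k
        # Verify phase: accept candidates in order until the first mismatch.
        for cand in candidates:
            expected = target_next(out)
            if cand == expected:
                out.append(cand)
                accepted += 1
            else:
                out.append(expected)  # fall back to the backbone's token
                break
    return out[len(prompt):], drafted, accepted
```

Sweeping draft_tokens over 1, 2, 4, and 8 with this sketch changes only the counters, which is the behavior the real runtime should reproduce before throughput is measured.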
Composite Template
The clean CKE implementation should be a composite circuit, not two unrelated model folders joined through ad hoc Python. A future template can be named:
version/v8/templates/gemma4_speculative_pair.json
The template should declare three blocks:
| Block | Role | Existing Template |
|---|---|---|
| target | Authoritative Gemma4 backbone that owns final correctness. | gemma4.json |
| draft | Small assistant/MTP model that proposes candidate tokens. | gemma4_assistant.json |
| verify | Accept/reject loop that compares draft tokens against target logits. | New runtime loop, not a neural layer. |
Weight Namespaces
The first implementation should keep the two weight files separate for auditability:
/tmp/ck-gemma4-e4b-runtime/weights.bump
/tmp/ck-gemma4-assistant-runtime/weights.bump
The composite manifest can refer to both namespaces:
{
  "model": "gemma4_speculative_pair",
  "target": {
    "template": "gemma4",
    "weights": "target.weights.bump",
    "hidden_export": "target_hidden_stream",
    "hidden_size": 2560
  },
  "draft": {
    "template": "gemma4_assistant",
    "weights": "assistant.weights.bump",
    "assistant_role": "mtp_drafter",
    "backbone_hidden_size": 2560,
    "hidden_size": 256
  },
  "bridge": {
    "source": "target_hidden_stream",
    "dest": "draft.backbone_stream",
    "shape": ["T", 2560]
  }
}
Later, the compiler can merge both into one BUMP file with target and assistant regions. That is an optimization, not a requirement for the first correct runtime.
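Because the manifest repeats the 2560 width in three places, a small consistency check can catch a mismatched bridge before compilation. The sketch below is an assumption about how such a check might look, not an existing CKE tool: check_bridge is a hypothetical helper, and it only uses the field names shown in the manifest above.

```python
import json

MANIFEST_TEXT = """
{
  "model": "gemma4_speculative_pair",
  "target": {"template": "gemma4", "weights": "target.weights.bump",
             "hidden_export": "target_hidden_stream", "hidden_size": 2560},
  "draft": {"template": "gemma4_assistant", "weights": "assistant.weights.bump",
            "assistant_role": "mtp_drafter", "backbone_hidden_size": 2560,
            "hidden_size": 256},
  "bridge": {"source": "target_hidden_stream", "dest": "draft.backbone_stream",
             "shape": ["T", 2560]}
}
"""


def check_bridge(manifest: dict) -> list:
    """Hypothetical helper: return a list of bridge consistency problems."""
    target = manifest["target"]
    draft = manifest["draft"]
    bridge = manifest["bridge"]
    width = bridge["shape"][-1]  # last dimension is the hidden width
    errors = []
    if bridge["source"] != target["hidden_export"]:
        errors.append("bridge.source does not match target.hidden_export")
    if not bridge["dest"].startswith("draft."):
        errors.append("bridge.dest is not in the draft namespace")
    if width != target["hidden_size"]:
        errors.append("bridge width does not match target.hidden_size")
    if width != draft["backbone_hidden_size"]:
        errors.append("bridge width does not match draft.backbone_hidden_size")
    return errors


if __name__ == "__main__":
    print(check_bridge(json.loads(MANIFEST_TEXT)))
```

An empty list means the bridge source, destination namespace, and width all agree; each string in a non-empty list names one mismatch.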
IR Contract
The composite lowering should make the bridge explicit:
target.decode_step
  outputs:
    target_logits
    target_hidden_stream
    target_kv_cache
draft.decode_step
  inputs:
    draft_token_ids
    draft.backbone_stream <- target_hidden_stream
  outputs:
    draft_logits
    draft_candidate_tokens
verify.step
  inputs:
    draft_candidate_tokens
    target_logits
  outputs:
    accepted_tokens
    rejected_token
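As a reference for verify.step, the following Python sketch shows one possible greedy reading of the contract. It assumes that target_logits holds one row per candidate position, each computed with the earlier candidates already in context, and that acceptance means an exact match with the target argmax. The fallback_token output is an addition for illustration and is not part of the contract above; a sampling-based acceptance rule would look different.

```python
from typing import Dict, List, Optional, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest logit (first index wins on ties)."""
    return max(range(len(row)), key=lambda i: row[i])


def verify_step(
    draft_candidate_tokens: Sequence[int],
    target_logits: Sequence[Sequence[float]],
) -> Dict[str, object]:
    """Greedy verification: accept candidates until the first mismatch."""
    if len(target_logits) < len(draft_candidate_tokens):
        raise ValueError("need one row of target logits per candidate")
    accepted: List[int] = []
    rejected: Optional[int] = None
    fallback: Optional[int] = None  # hypothetical extra output, see lead-in
    for cand, row in zip(draft_candidate_tokens, target_logits):
        target_tok = argmax(row)
        if cand != target_tok:
            rejected = cand
            fallback = target_tok
            break
        accepted.append(cand)
    return {
        "accepted_tokens": accepted,
        "rejected_token": rejected,
        "fallback_token": fallback,
    }
```

When every candidate matches, rejected_token and fallback_token are None and the whole draft is emitted.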
The generated C should expose explicit functions instead of hiding the protocol:
int ck_gemma4_pair_prefill(...);
int ck_gemma4_pair_decode_one(...);
int ck_gemma4_pair_draft_tokens(...);
int ck_gemma4_pair_verify_tokens(...);
int ck_gemma4_pair_generate_speculative(...);
Memory Plan
Required buffers should be visible in layout_decode.json:
- target_main_stream for Gemma4 backbone hidden state.
- target_kv_cache for the authoritative target cache.
- draft_backbone_stream for assistant input, shape [T, 2560].
- draft_main_stream for assistant hidden state, shape [T, 256].
- draft_kv_cache for the assistant cache.
- draft_logits and target_logits.
- accepted_tokens, candidate_tokens, and verifier scratch.
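A layout check along the following lines could confirm that the buffers listed above are present. The schema of layout_decode.json is not specified on this page, so the structure assumed here (a top-level "buffers" list of objects with "name" and optional "shape") is purely hypothetical, as is the check_layout helper. The verifier scratch buffer is omitted because its name is not given.

```python
from typing import Dict, List

# Buffer names taken from the memory plan; verifier scratch is left out
# because the plan does not name it.
REQUIRED_BUFFERS = [
    "target_main_stream",
    "target_kv_cache",
    "draft_backbone_stream",
    "draft_main_stream",
    "draft_kv_cache",
    "draft_logits",
    "target_logits",
    "accepted_tokens",
    "candidate_tokens",
]

# Last-dimension widths stated in the memory plan.
EXPECTED_WIDTH: Dict[str, int] = {
    "draft_backbone_stream": 2560,
    "draft_main_stream": 256,
}


def check_layout(layout: dict) -> List[str]:
    """Hypothetical check against an assumed layout_decode.json schema."""
    buffers = {b["name"]: b for b in layout.get("buffers", [])}
    problems = [f"missing buffer: {n}" for n in REQUIRED_BUFFERS if n not in buffers]
    for name, width in EXPECTED_WIDTH.items():
        shape = buffers.get(name, {}).get("shape")
        if shape is not None and shape[-1] != width:
            problems.append(f"{name}: expected width {width}, got {shape[-1]}")
    return problems
```

If the real layout file uses different keys, only the dictionary lookups need to change; the list of required names comes directly from the plan.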
Xeon Validation Plan
On the high-memory Xeon box, build both runtimes first:
.venv/bin/python version/v8/scripts/ck_run_v8.py run \
hf://unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-Q4_K_M.gguf \
--run /tmp/ck-gemma4-e4b-runtime \
--context-len 2048 \
--force-convert --force-compile \
--prompt 'Give me a detailed example of C code.' \
--chat-template gemma4 \
--max-tokens 64 \
--temperature 0.0

.venv/bin/python version/v8/scripts/ck_run_v8.py run \
hf://google/gemma-4-E4B-it-assistant \
--run /tmp/ck-gemma4-assistant-runtime \
--context-len 2048 \
--force-convert --force-compile \
--generate-only \
--chat-template none \
--allow-raw-prompt
Then validate in this order:
- Backbone-only coherent decode.
- Assistant-only compile and load smoke. Do not expect standalone chat.
- Backbone decode exports target_hidden_stream with shape [1, 2560].
- Assistant consumes that stream and produces draft logits.
- One-token draft and target verification match the Python reference protocol.
- Multi-token speculative loop reports drafted, accepted, rejected, and acceptance rate.
Success Metrics
The first useful dashboard should compare:
- backbone_only_tok_per_sec
- speculative_tok_per_sec
- draft_tokens_per_batch
- accepted_tokens / drafted_tokens
- target_verify_ms
- assistant_draft_ms
A low acceptance rate is still useful data. It suggests that the bridge, tokenizer, positional policy, or hidden-state handoff is wrong. A high acceptance rate with no speedup suggests that the protocol is correct but the verifier is still too serial.
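The two headline numbers, acceptance rate and speedup, can be derived from raw counters as in this sketch. The RunCounters type and its field names are assumptions made for illustration and are not an existing CKE structure. Speedup is defined here as speculative throughput divided by backbone-only throughput.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RunCounters:
    """Hypothetical counters collected from one speculative run."""
    drafted_tokens: int
    accepted_tokens: int
    emitted_tokens: int
    speculative_seconds: float
    backbone_only_tok_per_sec: float


def summarize(c: RunCounters) -> Dict[str, float]:
    """Derive acceptance rate, speculative throughput, and speedup."""
    acceptance = c.accepted_tokens / c.drafted_tokens if c.drafted_tokens else 0.0
    spec_tps = (
        c.emitted_tokens / c.speculative_seconds if c.speculative_seconds > 0 else 0.0
    )
    speedup = (
        spec_tps / c.backbone_only_tok_per_sec
        if c.backbone_only_tok_per_sec > 0
        else 0.0
    )
    return {
        "acceptance_rate": acceptance,
        "speculative_tok_per_sec": spec_tps,
        "speedup": speedup,
    }
```

Reading the two values together separates the failure modes described above: a low acceptance_rate points at the handoff, while a high acceptance_rate with speedup near or below 1.0 points at verification cost.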