Gemma4 Speculative Pair Design

This page is the implementation plan for pairing a full Gemma4 backbone with the small google/gemma-4-E4B-it-assistant drafter model. The assistant is not a standalone chatbot. It is a speculative, multi-token-prediction (MTP) draft decoder: it consumes the backbone's hidden stream and proposes candidate tokens, which the backbone then verifies.

Analogy
In the header/body/footer mental model, the backbone is the authoritative target decoder. The assistant is a secondary draft decoder. The bridge between them behaves like an encoder-to-decoder handoff because the backbone exports hidden state and the assistant consumes it, but the backbone is still a decoder, not a vision-style encoder.

Target Shape

prompt tokens
  -> Gemma4 E4B backbone prefill
  -> backbone KV cache + hidden stream
  -> decode loop:
       backbone step exports hidden state H_t [1, 2560]
       assistant_pre_projection(H_t) -> draft stream [1, 256]
       assistant 4-layer draft decoder proposes token candidates
       backbone verifies proposed token(s)
       accepted tokens are emitted; rejected token falls back to backbone

The first working milestone should draft and verify one token. After that is correct, increase draft_tokens to 2, 4, and 8 and measure acceptance rate versus throughput.
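
As a minimal sketch of the loop above, the following toy uses two deterministic stand-in functions in place of the real models. Everything here is an assumption made for illustration: `target_next`, `draft_next`, and `generate_speculative` are hypothetical names, the hidden-state bridge is omitted, and verification is greedy (matching temperature 0.0). It shows only the property the milestone needs, namely that speculative output equals backbone-only output for any draft_tokens value.

```python
def target_next(tokens):
    """Toy stand-in for the backbone: a deterministic greedy next token."""
    return (tokens[-1] * 31 + len(tokens)) % 97


def draft_next(tokens):
    """Toy stand-in for the assistant: agrees with the target unless the
    target's token is divisible by 5, in which case it guesses wrong."""
    guess = target_next(tokens)
    return guess if guess % 5 else (guess + 1) % 97


def generate_target_only(prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def generate_speculative(prompt, max_new_tokens, draft_tokens=1):
    tokens = list(prompt)
    stats = {"drafted": 0, "accepted": 0, "rejected": 0}
    final_len = len(prompt) + max_new_tokens
    while len(tokens) < final_len:
        # Draft phase: the assistant proposes up to draft_tokens candidates.
        k = min(draft_tokens, final_len - len(tokens))
        ctx = list(tokens)
        proposed = []
        for _ in range(k):
            tok = draft_next(ctx)
            proposed.append(tok)
            ctx.append(tok)
        stats["drafted"] += k
        # Verify phase: accept the longest prefix the target agrees with.
        for tok in proposed:
            expected = target_next(tokens)
            if tok == expected:
                tokens.append(tok)
                stats["accepted"] += 1
            else:
                # Fall back to the target's token and discard later drafts.
                tokens.append(expected)
                stats["rejected"] += 1
                break
    return tokens[len(prompt):], stats
```

In this toy the verifier calls the target once per candidate, so it shows correctness but not speedup. A real verifier would presumably score all candidates in one batched target step.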

Composite Template

The clean CKE implementation should be a composite circuit, not two unrelated model folders joined through ad hoc Python. A future template can be named:

version/v8/templates/gemma4_speculative_pair.json

The template should declare three blocks:

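| Block  | Role                                                                  | Existing Template                     |
|--------|-----------------------------------------------------------------------|---------------------------------------|
| target | Authoritative Gemma4 backbone that owns final correctness.            | gemma4.json                           |
| draft  | Small assistant/MTP model that proposes candidate tokens.             | gemma4_assistant.json                 |
| verify | Accept/reject loop that compares draft tokens against target logits.  | New runtime loop, not a neural layer. |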

Weight Namespaces

The first implementation should keep the two weight files separate for auditability:

/tmp/ck-gemma4-e4b-runtime/weights.bump
/tmp/ck-gemma4-assistant-runtime/weights.bump

The composite manifest can refer to both namespaces:

{
  "model": "gemma4_speculative_pair",
  "target": {
    "template": "gemma4",
    "weights": "target.weights.bump",
    "hidden_export": "target_hidden_stream",
    "hidden_size": 2560
  },
  "draft": {
    "template": "gemma4_assistant",
    "weights": "assistant.weights.bump",
    "assistant_role": "mtp_drafter",
    "backbone_hidden_size": 2560,
    "hidden_size": 256
  },
  "bridge": {
    "source": "target_hidden_stream",
    "dest": "draft.backbone_stream",
    "shape": ["T", 2560]
  }
}
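
One way to catch wiring mistakes early is to check the manifest's bridge fields against each other before any lowering happens. The sketch below is illustrative only: `check_bridge` is a hypothetical helper, not an existing CKE function, and the rules it enforces are inferred from the example manifest above rather than from a published schema.

```python
import json

MANIFEST = json.loads("""
{
  "model": "gemma4_speculative_pair",
  "target": {"template": "gemma4", "weights": "target.weights.bump",
             "hidden_export": "target_hidden_stream", "hidden_size": 2560},
  "draft": {"template": "gemma4_assistant", "weights": "assistant.weights.bump",
            "assistant_role": "mtp_drafter",
            "backbone_hidden_size": 2560, "hidden_size": 256},
  "bridge": {"source": "target_hidden_stream",
             "dest": "draft.backbone_stream", "shape": ["T", 2560]}
}
""")


def check_bridge(manifest):
    """Return a list of human-readable wiring errors (empty if consistent)."""
    target = manifest["target"]
    draft = manifest["draft"]
    bridge = manifest["bridge"]
    errors = []
    if bridge["source"] != target["hidden_export"]:
        errors.append("bridge.source does not name target.hidden_export")
    if not bridge["dest"].startswith("draft."):
        errors.append("bridge.dest is not in the draft namespace")
    if bridge["shape"][-1] != target["hidden_size"]:
        errors.append("bridge width differs from target.hidden_size")
    if draft["backbone_hidden_size"] != target["hidden_size"]:
        errors.append("draft.backbone_hidden_size differs from target.hidden_size")
    return errors
```

Running `check_bridge(MANIFEST)` on the manifest as written returns an empty list. Changing either 2560 value on one side only produces a width error.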

Later, the compiler can merge both into one BUMP file with target and assistant regions. That is an optimization, not a requirement for the first correct runtime.

IR Contract

The composite lowering should make the bridge explicit:

target.decode_step
  outputs:
    target_logits
    target_hidden_stream
    target_kv_cache

draft.decode_step
  inputs:
    draft_token_ids
    draft.backbone_stream <- target_hidden_stream
  outputs:
    draft_logits
    draft_candidate_tokens

verify.step
  inputs:
    draft_candidate_tokens
    target_logits
  outputs:
    accepted_tokens
    rejected_token
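
The verify.step contract can be sketched as a pure function, assuming greedy verification (a draft token is accepted only if it equals the argmax of the target logits at that position). This is an illustration under stated assumptions, not the runtime's actual code: `verify_step` and `VerifyResult` are hypothetical names, `rejected_token` is read here as the draft token that failed, and `target_token` is an extra field added for clarity that the contract above does not list.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class VerifyResult:
    accepted_tokens: List[int]
    rejected_token: Optional[int]  # first draft token the target disagreed with
    target_token: Optional[int]    # assumption: the target's replacement token


def argmax(row: Sequence[float]) -> int:
    """Index of the largest logit; ties resolve to the lowest index."""
    return max(range(len(row)), key=row.__getitem__)


def verify_step(draft_candidate_tokens: Sequence[int],
                target_logits: Sequence[Sequence[float]]) -> VerifyResult:
    """target_logits[i] holds the target's logits for the position that
    draft_candidate_tokens[i] would fill."""
    accepted: List[int] = []
    for candidate, row in zip(draft_candidate_tokens, target_logits):
        best = argmax(row)
        if candidate != best:
            # Stop at the first mismatch; later candidates are discarded.
            return VerifyResult(accepted, candidate, best)
        accepted.append(candidate)
    return VerifyResult(accepted, None, None)
```

Sampling-based acceptance (temperature above 0) would need a probabilistic accept rule instead of the exact-match test used here.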

The generated C should expose explicit functions instead of hiding the protocol:

int ck_gemma4_pair_prefill(...);
int ck_gemma4_pair_decode_one(...);
int ck_gemma4_pair_draft_tokens(...);
int ck_gemma4_pair_verify_tokens(...);
int ck_gemma4_pair_generate_speculative(...);

Memory Plan

Required buffers should be visible in layout_decode.json:
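
As a rough sizing sketch for the two bridge buffers named on this page, the snippet below computes byte sizes and aligned offsets. It makes several assumptions: fp32 storage, 64-byte alignment, and the buffer name `draft_stream` for the [1, 256] projection output. It does not read or reproduce the real layout_decode.json schema, and `Buffer` and `plan_offsets` are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Buffer:
    name: str
    shape: Tuple[int, ...]
    dtype_bytes: int = 4  # assumption: fp32 elements

    @property
    def nbytes(self) -> int:
        total = self.dtype_bytes
        for dim in self.shape:
            total *= dim
        return total


def plan_offsets(buffers: List[Buffer], align: int = 64):
    """Pack buffers back to back, rounding each start up to `align` bytes.
    Returns ({name: (offset, nbytes)}, total_bytes)."""
    plan: Dict[str, Tuple[int, int]] = {}
    offset = 0
    for buf in buffers:
        offset = (offset + align - 1) // align * align
        plan[buf.name] = (offset, buf.nbytes)
        offset += buf.nbytes
    return plan, offset


BRIDGE_BUFFERS = [
    Buffer("target_hidden_stream", (1, 2560)),  # H_t exported by the backbone
    Buffer("draft_stream", (1, 256)),           # output of assistant_pre_projection
]
```

Under these assumptions the two bridge buffers are small next to the KV cache, so the handoff itself is unlikely to dominate the decode memory plan.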

Xeon Validation Plan

On the high-memory Xeon box, build both runtimes first:

.venv/bin/python version/v8/scripts/ck_run_v8.py run \
  hf://unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-Q4_K_M.gguf \
  --run /tmp/ck-gemma4-e4b-runtime \
  --context-len 2048 \
  --force-convert --force-compile \
  --prompt 'Give me a detailed example of C code.' \
  --chat-template gemma4 \
  --max-tokens 64 \
  --temperature 0.0

.venv/bin/python version/v8/scripts/ck_run_v8.py run \
  hf://google/gemma-4-E4B-it-assistant \
  --run /tmp/ck-gemma4-assistant-runtime \
  --context-len 2048 \
  --force-convert --force-compile \
  --generate-only \
  --chat-template none \
  --allow-raw-prompt

Then validate in this order:

  1. Backbone-only coherent decode.
  2. Assistant-only compile and load smoke. Do not expect standalone chat.
  3. Backbone decode exports target_hidden_stream with shape [1, 2560].
  4. Assistant consumes that stream and produces draft logits.
  5. One-token draft and target verification match the Python reference protocol.
  6. Multi-token speculative loop reports drafted, accepted, rejected, and acceptance rate.

Success Metrics

The first useful dashboard should compare:

A low acceptance rate is still useful data. It points to a fault in the bridge, tokenizer, positional policy, or hidden-state handoff. A high acceptance rate with no speedup means the protocol is correct but the verifier is still too serial.
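
The two failure readings above can be encoded as a small triage helper. This is a sketch under assumptions: `SpecRun` and `diagnose` are hypothetical names, the counters mirror the drafted and accepted counts reported by the speculative loop, and the 0.5 acceptance threshold is an arbitrary placeholder rather than a measured value.

```python
from dataclasses import dataclass


@dataclass
class SpecRun:
    draft_tokens: int  # draft length used for this run (1, 2, 4, 8, ...)
    drafted: int       # candidates proposed by the assistant
    accepted: int      # candidates the backbone agreed with
    emitted: int       # tokens actually produced
    seconds: float     # wall-clock decode time

    @property
    def acceptance_rate(self) -> float:
        return self.accepted / self.drafted if self.drafted else 0.0

    @property
    def tokens_per_second(self) -> float:
        return self.emitted / self.seconds if self.seconds > 0 else 0.0


def diagnose(run: SpecRun, baseline_tps: float,
             min_acceptance: float = 0.5) -> str:
    """Classify a run against a backbone-only throughput baseline."""
    if run.acceptance_rate < min_acceptance:
        # Suspect the bridge, tokenizer, positional policy, or handoff.
        return "low_acceptance"
    if run.tokens_per_second <= baseline_tps:
        # Protocol looks right, but verification is not saving time.
        return "no_speedup"
    return "ok"
```

Comparing `diagnose` results across draft_tokens values of 1, 2, 4, and 8 against the same backbone-only baseline gives the acceptance-versus-throughput view described earlier.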
