
In Google DeepMind's Gemma technical paper (https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf), the 7B Gemma model specs are given as d_model = 3072, num_heads = 16 and head_size = 256. These don't seem consistent: 16 * 256 = 4096, not 3072. Since the model dimension is distributed across the heads, I think the following should hold:

num_heads * head_size = d_model

This is also how it is described in the original Transformer paper, "Attention Is All You Need", where each head has dimension d_k = d_model / h (64 = 512 / 8 for the base model).
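To make the mismatch concrete, here is a small sanity check in Python using the figures from the Gemma report and the base model from "Attention Is All You Need" (just illustrative arithmetic, not code from either paper):

```python
# Compare num_heads * head_size against d_model for two configurations
configs = {
    # "Attention Is All You Need" base model: d_model = 512, h = 8, d_k = 64
    "transformer-base": {"d_model": 512, "num_heads": 8, "head_size": 64},
    # Gemma 7B as listed in the technical report
    "gemma-7b": {"d_model": 3072, "num_heads": 16, "head_size": 256},
}

for name, c in configs.items():
    attn_dim = c["num_heads"] * c["head_size"]
    status = "OK" if attn_dim == c["d_model"] else "MISMATCH"
    print(f"{name}: {c['num_heads']} * {c['head_size']} = {attn_dim}, "
          f"d_model = {c['d_model']} -> {status}")
```

Running this prints OK for the original Transformer base model and MISMATCH (4096 vs. 3072) for the 7B Gemma specs, which is exactly the inconsistency described above.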

The equation does hold for the Gemma 2B specs given in the same paper. Am I missing something with the 7B Gemma specs, or does the paper have an error?


1 Answer


From the Annotated Transformer (https://nlp.seas.harvard.edu/annotated-transformer/#full-model): in the position-wise feed-forward network, d_model = 512 and the inner layer has dimensionality d_ff = 2048.
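For reference, that feed-forward block is just two linear layers, so d_ff is an inner width that is independent of d_model. A minimal PyTorch sketch following the structure used in the Annotated Transformer:

```python
import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """Two-layer FFN: d_model -> d_ff -> d_model."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)   # expand to the inner dimension d_ff
        self.w_2 = nn.Linear(d_ff, d_model)   # project back down to d_model
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_2(self.dropout(torch.relu(self.w_1(x))))

# Input and output keep shape (batch, seq_len, d_model); d_ff only appears inside the block
x = torch.randn(2, 10, 512)
print(PositionwiseFeedForward()(x).shape)  # torch.Size([2, 10, 512])
```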

It's hard to be certain, but perhaps the Gemma paper is using d_ff (or some other hidden dimension) for the model size. I've always had the same understanding as you: num_heads * head_size = d_model, regardless of the feed-forward hidden dimension. This also appears to agree with the hidden_size parameter on Hugging Face for Gemma 7B:

• hidden_size (int, optional, defaults to 3072) — Dimension of the hidden representations.

https://huggingface.co/docs/transformers/en/model_doc/gemma
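If you want to check those defaults yourself, something like the following should work (assuming a transformers version with Gemma support; hidden_size, num_attention_heads and head_dim are the attribute names I believe GemmaConfig uses, so treat this as a sketch):

```python
from transformers import GemmaConfig

# Default GemmaConfig values should match the 7B numbers from the docs page above
config = GemmaConfig()
print(config.hidden_size)            # expected: 3072
print(config.num_attention_heads)    # expected: 16
print(config.head_dim)               # expected: 256
print(config.num_attention_heads * config.head_dim)  # 4096, not equal to hidden_size
```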

