* Updated Albert model Card
* Update docs/source/en/model_doc/albert.md
added the quotes in <hfoption id="Pipeline">
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/albert.md
updated checkpoints
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/albert.md
changed !Tips description
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/albert.md
updated text
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/albert.md
updated transformer-cli implementation
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/albert.md
changed text
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/albert.md
removed repeated description
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update albert.md
removed lines
* Update albert.md
updated pipeline code
* Update albert.md
updated auto model code, removed quantization as model size is not large, removed the attention visualizer part
* Update docs/source/en/model_doc/albert.md
updated notes
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update albert.md
reduced a repeating point in notes
* Update docs/source/en/model_doc/albert.md
updated transformer-CLI
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/albert.md
removed extra notes
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
[ALBERT](https://huggingface.co/papers/1909.11942) is designed to address the GPU/TPU memory limitations, long training times, and unexpected model degradation that come with scaling [BERT](./bert). It adds two parameter-reduction techniques. The first, factorized embedding parameterization, splits the larger vocabulary embedding matrix into two smaller matrices so you can grow the hidden size without adding many more parameters. The second, cross-layer parameter sharing, allows layers to share parameters, which keeps the number of learnable parameters lower.
ALBERT uses absolute position embeddings (like BERT), so inputs are padded on the right rather than the left. The embedding size is 128, compared to 768 in BERT, and ALBERT can process a maximum of 512 tokens at a time.
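To make the two parameter-reduction techniques concrete, the sketch below counts embedding parameters with and without factorization and reuses one encoder layer across the whole stack. It is a simplified illustration using assumed ALBERT-base sizes (`V=30000`, `E=128`, `H=768`), not the actual ALBERT modeling code.

```py
import torch.nn as nn

vocab_size, embed_size, hidden_size, num_layers = 30000, 128, 768, 12

# Factorized embedding parameterization: V x E + E x H parameters instead of V x H.
token_embeddings = nn.Embedding(vocab_size, embed_size)
embedding_projection = nn.Linear(embed_size, hidden_size, bias=False)

factorized = vocab_size * embed_size + embed_size * hidden_size  # ~3.9M parameters
unfactorized = vocab_size * hidden_size                          # ~23M parameters
print(f"{factorized:,} vs {unfactorized:,} embedding parameters")

# Cross-layer parameter sharing: the same layer weights are reused at every depth,
# so the parameter count does not grow with num_layers (the compute still runs 12 times).
shared_layer = nn.TransformerEncoderLayer(d_model=hidden_size, nhead=12, batch_first=True)

def encode(input_ids):
    # Project the small factorized embeddings up to the hidden size,
    # then apply the single shared layer num_layers times.
    hidden_states = embedding_projection(token_embeddings(input_ids))
    for _ in range(num_layers):
        hidden_states = shared_layer(hidden_states)
    return hidden_states
```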
You can find all the original ALBERT checkpoints under the [ALBERT community](https://huggingface.co/albert) organization.
> [!TIP]
> Click on the ALBERT models in the right sidebar for more examples of how to apply ALBERT to different language tasks.
The example below demonstrates how to predict the `[MASK]` token with [`Pipeline`], [`AutoModel`], and from the command line.
<hfoptions id="usage">
<hfoption id="Pipeline">
```py
import torch
from transformers import pipeline

pipeline = pipeline(
    task="fill-mask",
    model="albert-base-v2",
    torch_dtype=torch.float16,
    device=0
)
pipeline("Plants create [MASK] through a process known as photosynthesis.", top_k=5)
```
</hfoption>
<hfoption id="AutoModel">
```py
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
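
# Minimal sketch of the AutoModel path (assumed completion of this example):
# load the checkpoint, run a fill-mask forward pass, and decode the top prediction.
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
model = AutoModelForMaskedLM.from_pretrained(
    "albert-base-v2",
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer(
    "Plants create [MASK] through a process known as photosynthesis.",
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and decode the highest-scoring token.
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))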
```
</hfoption>
</hfoptions>
## Notes
- Inputs should be padded on the right because ALBERT, like BERT, uses absolute position embeddings.
- The embedding size `E` is different from the hidden size `H` because the embeddings are context independent (one embedding vector represents one token) while the hidden states are context dependent (one hidden state represents a sequence of tokens). The embedding matrix is also large because it is `V x E`, where `V` is the vocabulary size. As a result, it's more logical to have `H >> E`, and keeping `E < H` reduces the number of parameters.
- ALBERT uses repeating layers that are split into groups sharing parameters, which results in a small memory footprint. The computational cost stays similar to a BERT-like architecture with the same number of hidden layers, though, because the model still iterates through the same number of (repeating) layers.
- Next sentence prediction is replaced by a sentence order prediction objective. The inputs are two consecutive sentences A and B, fed either as A followed by B or as B followed by A, and the model must predict whether they were swapped.
- The `head_mask` argument is ignored for every attention implementation other than `"eager"`. To make `head_mask` take effect, load the model with `attn_implementation="eager"`, for example `AlbertModel.from_pretrained("albert-base-v2", attn_implementation="eager")`.
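These sizes are exposed on the model config, so you can verify them for a given checkpoint. A quick check, assuming the same `albert-base-v2` checkpoint used in the examples above:

```py
from transformers import AutoConfig

config = AutoConfig.from_pretrained("albert-base-v2")
print(config.embedding_size, config.hidden_size)           # 128 768 -> E << H
print(config.num_hidden_layers, config.num_hidden_groups)  # 12 1 -> all layers share one parameter group
```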