Skip to content

bigscience-workshop/multilingual-modeling

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 

Repository files navigation

README

Notes

This repository is no longer actively maintained. This repo was created when BLOOM+1 paper was written, where we had to engineered the adapter modules due to the new BLOOM architecture.

But now, adapters for BLOOM models are readily available (see peft), and language adaptation of these models (i.e., training of LLMs on monolingual corpora of a particular language) can be done by following official documentations such as peft-blog using the same pretraining objective, next-token-prediction.


This repository contains code for performing language adaptation of multilingual pretrained large language model BLOOM-{560m,1b1,1b7,3b,7b1} to new unseen languages. Please refer to our ACL 2023 paper BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting.

Our implementations support the following features:

  • finetuning new tokenizers and embedding layers to support new script of unseen languages.
  • different embedding stategies where we replace the entire embedding by training from scratch, reinitialize embedding layers but initialize seen vocabulary, or extend the embedding layer to support new tokens.
  • more than 15 language adaptation strategies for pretrained BLOOM model, including continued-pretraining and parameter-efficient finetuning such as BitFit (Zaken et al., 2021), (IA)^3 (Liu et al., 2022), LoRA (Hu et al., 2021), MAD-X (Pfeiffer et al., 2020), composible sparse finetuning (Ansell et al., 2022), etc.
  • different evaluation settings:
    • supervised fine-tuning or cross-lingual transfer: task-finetuning with (English) task adapters on the following tasks: WikiANN (NER tagging), XLSum (abstractive summarization) and XNLI (natural language inference). This is an artefact that is used for preliminary experiments of our BLOOM+1 work.
    • zero-shot prompting on adapted language models, which is carried out on our BLOOM+1 paper. This is done with forked and modified EleutherAI's lm-eval-harness library. See branch bigscience-lm-adapt.

Installation

  1. Install the packages from composable-sft. This is used for composable-SFT finetuning.
  2. Install the packages from rational_activations. You would need to follow the [Other CUDA/PyTorch] section for installation. This is used for adaptable-adapters.
  3. Install the packages from this repo using pip install -r requirements.txt.

If encounter error with the import transformer, uninstall transformers using the command pip uninstall transformers and rerun step 3 to reinstall transformers supported by adapter-transformers library.

Experimental Setup (Language Adaptation)

Tokenizer and Tokenization of Dataset

Run tokenized4clm_sampled.py to train the tokenizer on the subset of OSCAR dataset.

  • lang: language name (e.g., "de", "th")
  • model: original tokenizer (e.g., "bigscience/bloom-1b3")
  • tokenizer_dir: path directory to save the tokenizer. The tokenizer will be saved as tok_${model}_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_{replace/extend}
  • cache_dir (default is "~/.cache/huggingface/transformers"): cache directory for downloading the OSCAR dataset and GPT2 tokenizer.
  • vocab_size: vocab size of the tokenizer
  • sample_size: the amount of samples to use to train the tokenizer (randomly selected)
  • tok_strategy: extend, replace or overlap-replace
cache_dir=... output_dir=... lang=... # language sample_size=... # training sample size vocab_size=... # vocab size of tokenizer tok_strategy=... # extend, replace, overlap-replace bigs_model="bigscience/bloom-1b3" tokenizer_dir="${output_dir}/tok_$(basename $bigs_model)_${lang}_oscar_${sample_size}samples_${vocab_size}vocab_${tok_strategy}" python ./scripts/lang_adapt/tokenized4clm_sampled.py \ --lang $lang \ --model $bigs_model \ --tokenizer_dir $tokenizer_dir \ --hf_cache_dir $cache_dir \ --vocab_size $vocab_size \ --sample_size $sample_size \ --tok_strategy $tok_strategy 

Language Adaptation

Run madx_run_clm.py to finetune language model on a new language.

  • LANG: language name (e.g., "de", "th") on OSCAR
  • DATA_SAMPLES: training sample size
  • VOCAB_SIZE: vocab size of the tokenizer
  • BIGS_MODEL: bigscience model
  • ADPT_STRATEGY: language adaptation strategy
    • "emb": train only embedding
    • "continual-pretrain": continued pretraining of the entire BLOOM model
    • "emb-then-adpt": train embedding then Pfeiffer adapter later (sequential training)
    • "pfeiffer", "pfeiffer+inv": Pfeiffer adapters in transformers block. (Houlsby et al., 2019) Without or with invertible adapters in embedding layer. This is also known as MAD-X (Pfeiffer et al., 2020).
    • "lora": LoRA adapters in transformers block (Hu et al., 2021)
    • "aa": adaptable adapters (Moosavi et al., 2022)
    • "ia3", "ia3+inv": (IA)^3 adapters in transformers block. Without or with invertible adapters in embedding layer. (Liu et al., 2022)
    • "prefix_tuning", "prefix_tuning_flat": Prefix tuning in input space, whether using MLP layers to initialize (without flat) or directly initialize tokens (with flat) as prefix tokens. (Li & Liang, 2021)
    • "prompt-tuning": Prompt-tuning in transformer blocks (Lester et al., 2021)
    • "sft": Composable sparse finetuning. (Ansell et al., 2022)
    • "bitfit", "bitfit+inv": Finetuning bias layers. Without or with invertible adapters in embedding layer. (Zaken et al., 2021)
    • "fish": Finetuning FISH masks. (Sung et al., 2021)
    • "compacter", "compacterpp": Compacter or compacter++ adapters in transformer blocks. (Mahabadi et al., 2021)
  • EMBD_SRATEGY: embedding strategy. Either "replace" (replace the embedding layer entirely), "overlap-replace" (replace but initialize seen vocab with pretrained embedding), or "extend" (freeze seen vocab embeddings and add trainable embeddings for unseen vocab)
  • TOK_STRATEGY: tokenization strategy (either "replace" (for embedding strategy of "replace" and "overlap-replace") or "extend")
  • tokenizer_dir: saved tokenizer directory (used in the tokenization script above)
  • cache_dir: (as above)
  • output_dir: directory to save adapted model
  • logging_dir: directory to log loss curves to tensorboard
  • MAX_STEPS: training steps
  • EVAL_STEPS: number of training steps between two evaluations
  • SAVE_STEPS: number of training steps between saving the checkpoints.
LANG=... # language DATA_SAMPLES=... # training sample size VOCAB_SIZE=... # vocab size of newly trained tokenizer BIGS_MODEL="bigscience/bloom-1b3" ADPT_STRATEGY="emb" # language adaptation strategy (train only embedding for now) EMBD_SRATEGY=... # either "replace", "overlap-replace", or "extend" TOK_STRATEGY=... # either "replace" (for embedding strategy of "replace" and "overlap-replace") or "extend" tokenizer_dir=... # as above tokenizer_dir="${tokenizer_dir}/tok_${BIGS_MODEL##*/}_${LANG}_oscar_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${TOK_STRATEGY}" cache_dir=... # as above output_dir=... # directory to save adapted model output_dir="${output_dir}/${BIGS_MODEL##*/}_${LANG}_${ADPT_STRATEGY}_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${EMBD_SRATEGY}" logging_dir=... # directory to log loss curves to tensorboard logging_dir="${logging_dir}/${BIGS_MODEL##*/}_${LANG}_${ADPT_STRATEGY}_${DATA_SAMPLES}samples_${VOCAB_SIZE}vocab_${EMBD_SRATEGY}" mkdir -p $output_dir mkdir -p $logging_dir MAX_STEPS=50000 EVAL_STEPS=5000 SAVE_STEPS=5000 python ./scripts/lang_adapt/madx_run_clm.py \ --seed 0 \ --fp16 \ --model_name_or_path $BIGS_MODEL \ --tokenizer_name $tokenizer_dir \ --dataset_name oscar \ --cache_dir $cache_dir \ --dataset_config_name "unshuffled_deduplicated_${LANG}" \ --logging_dir $logging_dir \ --report_to "tensorboard" \ --learning_rate 0.001 \ --do_train \ --do_eval \ --output_dir $output_dir \ --preprocessing_num_workers 8 \ --overwrite_output_dir \ --per_device_train_batch_size 2 \ --gradient_accumulation_steps 4 \ --per_device_eval_batch_size 2 \ --eval_accumulation_steps 4 \ --eval_steps $EVAL_STEPS \ --evaluation_strategy "steps" \ --max_eval_samples 5000 \ --save_steps $SAVE_STEPS \ --save_strategy "steps" \ --max_train_samples $DATA_SAMPLES \ --max_steps $MAX_STEPS \ --logging_steps 1000 \ --lang_adapt_strategies $ADPT_STRATEGY \ --embedding_strategies $EMBD_SRATEGY \ --load_best_model_at_end \ --gradient_checkpointing \ --fp16 

BLOOM+1 Reproduction: See ./scripts/lang_adapt/example_scripts/run_clm_ru_madx_560m.sh to reproduce language adapation of BLOOM-560m models to Russian in our BLOOM+1 paper.

Language Adaptation with DeepSpeed

  1. Replace python ./scripts/lang_adapt/madx_run_clm.py with deepspeed --num_gpus=8 --master_port 60000.
  2. Pass deepspeed config file argument --deepspeed "/home/zhengxinyong/multilingual-modeling/scripts/lang_adapt/ds_config_zero2.json"

See example file at ./scripts/lang_adapt/example_scripts/run_clm_ru_madx_7b1_deepspeed.sh, which adapts BLOOM-7b1 model on Google Cloud 8 A100 GPUs.

Experimental Setup (Evaluation)

Zero-Shot Prompting

Prompt the adapted language model in a zero-shot fashion without any finetuning. You'll need to git clone https://github.com/yongzx/lm-evaluation-harness/tree/bigscience-lm-adapt to be able to run the experiments.

Here shows the evaluation code for XNLI zero-shot prompting. You can find it in lm-evaluation-harness/examples/.

For BLOOM+1, the tasks used are:

Baseline or Model-Based (BitFit, FISH Mask, etc.)

python3 lm-evaluation-harness/main.py \ --model bigscience \ --model_args tokenizer="bigscience/bloom-560m",pretrained="ZYONG2/saved_models/bloom-560m_de_bitfit_100000samples_-1vocab_original-frozen" \ --tasks xnli_de 

Using Adapters (MAD-X, Pfeiffer, IA3, LoRA, etc.)

python3 m-evaluation-harness/main.py \ --model bigscience \ --model_args tokenizer="bigscience/bloom-560m",pretrained="bigscience/bloom-560m",adapter_ckpt_folder="ZYONG2/saved_models/bloom-560m_de_ia3_100000samples_-1vocab_original-frozen/oscar_ia3_de" \ --tasks xnli_de 

Supervised Finetuning or Cross-Lingual Transfer (Only used for preliminary experiments with BLOOM is released)

OUTPUT_DIR=... # where you want to save checkpoints at LANG="de" CACHE_DIR=... # cache dir for saving/loading HF models and XNLI datasets. LR=1e-5 MODEL_NAME="ZYONG2/bigscience/tr5b-1B3-multilingual-alpha-checkpoints" # previous version of BLOOM pre-release TOKENIZER_NAME="ZYONG2/processed/011/oscar-de-tokenizer" # language adapters checkpoint folder MADX_LANG_ADAPTER_NAME=".../oscar_de" # we finetune task adapters for XNLI FT_STRATEGIES="task_adapters" mkdir -p $OUTPUT_DIR python adapters_xnli_de.py \ $OUTPUT_DIR \ --lang $LANG \ --cache_dir $CACHE_DIR \ --num_train_epochs 2 \ --learning_rate $LR \ --per_device_train_batch_size 8 \ --gradient_accumulation_steps 4 \ --pretrained_model $MODEL_NAME \ --tokenizer $TOKENIZER_NAME \ --do_train \ --do_eval_after_train \ --madx_lang_adapter $MADX_LANG_ADAPTER_NAME \ --finetune_strategies $FT_STRATEGIES \ --zero_shot 

Remove --zero_shot for supervised finetuning setting.

See example scripts in ./scripts/eval/task_ftscripts_xnli/. train_xnli_zero_shot.sh is the batch script for XNLI finetuning, and run_eval_xnli_zero_shot.sh is for evaluating trained XNLI task adapters.

Citation

@inproceedings{yong-etal-2023-bloom, title = "{BLOOM}+1: Adding Language Support to {BLOOM} for Zero-Shot Prompting", author = "Yong, Zheng Xin and Schoelkopf, Hailey and Muennighoff, Niklas and Aji, Alham Fikri and Adelani, David Ifeoluwa and Almubarak, Khalid and Bari, M Saiful and Sutawika, Lintang and Kasai, Jungo and Baruwa, Ahmed and Winata, Genta and Biderman, Stella and Raff, Edward and Radev, Dragomir and Nikoulina, Vassilina", editor = "Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki", booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)", month = jul, year = "2023", address = "Toronto, Canada", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2023.acl-long.653", doi = "10.18653/v1/2023.acl-long.653", pages = "11682--11703", } 

About

BLOOM+1: Adapting BLOOM model to support a new unseen language

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 5