Description
Bug description
Marian 1.12 (65bf82ffce52f4854295d8b98482534f176d494e) runs into this error for target factored data:
[2024-04-18 08:40:14] Error: Cublas Error: 7 - /marian/src/tensors/gpu/prod.cpp:698: cublasLtAffineTyped(ltHandle, opB, opA, n, m, k, &alpha, B->data<T>(), ldb, A->data<T>(), lda, &beta, C->data<T>(), ldc, bias->data<T>(), workspace->data<T>(), workspaceSizeBytes, do_relu, stream)
[2024-04-18 08:40:14] Error: Aborted from void marian::gpu::affineTyped(marian::Tensor, marian::Ptr<marian::Allocator>, const Tensor&, const Tensor&, const Tensor&, bool, bool, T, T, bool) [with T = float; marian::Tensor = IntrusivePtr<marian::TensorBase>; marian::Ptr<marian::Allocator> = std::shared_ptr<marian::Allocator>] in /marian/src/tensors/gpu/prod.cpp:698
How to reproduce
Run Marian 1.12 compiled against CUDA 11+ with target factors.
I am trying to train Marian models from scratch using factored data. Training succeeds with source factors only, but training with target factors (alone or together with source factors) fails the cuBLAS check.
I compile 65bf82ffce52f4854295d8b98482534f176d494e in a Docker container and have tried several combinations of CUDA, NVIDIA driver, and Marian versions on Ubuntu 22.04, 20.04, and 18.04.
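For readers triaging this: the number in `Cublas Error: 7` is a raw `cublasStatus_t` value. The sketch below maps it to the enum names from the cuBLAS header; `decode_cublas_error` is a hypothetical helper written for this note, not part of Marian. Status 7 is `CUBLAS_STATUS_INVALID_VALUE`, which suggests the library rejected an argument of the call rather than hitting a hardware or allocation failure. It does not say which argument was rejected.

```python
import re

# Numeric values of cublasStatus_t as declared in the cuBLAS headers.
CUBLAS_STATUS = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
    16: "CUBLAS_STATUS_LICENSE_ERROR",
}


def decode_cublas_error(log_line):
    """Hypothetical helper: extract N from a 'Cublas Error: N' log line
    and return (N, enum name), or None if the line has no such marker."""
    match = re.search(r"Cublas Error:\s*(\d+)", log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, CUBLAS_STATUS.get(code, "unknown status")


line = "[2024-04-18 08:40:14] Error: Cublas Error: 7 - /marian/src/tensors/gpu/prod.cpp:698"
print(decode_cublas_error(line))
```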
Variants that were tried:
marian 1.12 | cuda 12.3.1 | nvidia 525.85.12 or 550.54.14 | ubuntu 22.04 -> fails
marian 1.12 | cuda 11.8 | nvidia 525.85.12 or 550.54.14 | ubuntu 22.04 -> fails
marian 1.11 | cuda 12.2.0 | nvidia 525.85.12 | ubuntu 20.04 -> fails
marian 1.11 | cuda 11.8 | nvidia 525.85.12 | ubuntu 20.04 -> fails
marian 1.11 | cuda 10.2 | nvidia 525.85.12 or 550.54.14 | ubuntu 18.04 -> works
Context
Marian output
+ /marian/marian --tempdir marian-tmp -c config.yml --devices 0 1 2 3 --type transformer --valid-freq 10 --save-freq 10 --early-stopping 3 --after-epochs 500 -w 6000
[2024-04-18 08:40:13] [marian] Marian v1.12.0 65bf82ff 2023-02-21 09:56:29 -0800
[2024-04-18 08:40:13] [marian] Running on 25b1c50316d0 as process 33 with command line:
[2024-04-18 08:40:13] [marian] /marian/marian --tempdir marian-tmp -c config.yml --devices 0 1 2 3 --type transformer --valid-freq 10 --save-freq 10 --early-stopping 3 --after-epochs 500 -w 6000
[2024-04-18 08:40:13] [config] after: 0e
[2024-04-18 08:40:13] [config] after-batches: 0
[2024-04-18 08:40:13] [config] after-epochs: 500
[2024-04-18 08:40:13] [config] all-caps-every: 0
[2024-04-18 08:40:13] [config] allow-unk: false
[2024-04-18 08:40:13] [config] authors: false
[2024-04-18 08:40:13] [config] beam-size: 6
[2024-04-18 08:40:13] [config] bert-class-symbol: "[CLS]"
[2024-04-18 08:40:13] [config] bert-mask-symbol: "[MASK]"
[2024-04-18 08:40:13] [config] bert-masking-fraction: 0.15
[2024-04-18 08:40:13] [config] bert-sep-symbol: "[SEP]"
[2024-04-18 08:40:13] [config] bert-train-type-embeddings: true
[2024-04-18 08:40:13] [config] bert-type-vocab-size: 2
[2024-04-18 08:40:13] [config] build-info: ""
[2024-04-18 08:40:13] [config] check-gradient-nan: false
[2024-04-18 08:40:13] [config] check-nan: false
[2024-04-18 08:40:13] [config] cite: false
[2024-04-18 08:40:13] [config] clip-norm: 5
[2024-04-18 08:40:13] [config] cost-scaling:
[2024-04-18 08:40:13] [config] []
[2024-04-18 08:40:13] [config] cost-type: ce-sum
[2024-04-18 08:40:13] [config] cpu-threads: 0
[2024-04-18 08:40:13] [config] data-threads: 8
[2024-04-18 08:40:13] [config] data-weighting: ""
[2024-04-18 08:40:13] [config] data-weighting-type: sentence
[2024-04-18 08:40:13] [config] dec-cell: ssru
[2024-04-18 08:40:13] [config] dec-cell-base-depth: 2
[2024-04-18 08:40:13] [config] dec-cell-high-depth: 1
[2024-04-18 08:40:13] [config] dec-depth: 6
[2024-04-18 08:40:13] [config] devices:
[2024-04-18 08:40:13] [config] - 0
[2024-04-18 08:40:13] [config] - 1
[2024-04-18 08:40:13] [config] - 2
[2024-04-18 08:40:13] [config] - 3
[2024-04-18 08:40:13] [config] dim-emb: 512
[2024-04-18 08:40:13] [config] dim-rnn: 1024
[2024-04-18 08:40:13] [config] dim-vocabs:
[2024-04-18 08:40:13] [config] - 0
[2024-04-18 08:40:13] [config] - 0
[2024-04-18 08:40:13] [config] disp-first: 0
[2024-04-18 08:40:13] [config] disp-freq: 500
[2024-04-18 08:40:13] [config] disp-label-counts: true
[2024-04-18 08:40:13] [config] dropout-rnn: 0
[2024-04-18 08:40:13] [config] dropout-src: 0
[2024-04-18 08:40:13] [config] dropout-trg: 0
[2024-04-18 08:40:13] [config] dump-config: ""
[2024-04-18 08:40:13] [config] dynamic-gradient-scaling:
[2024-04-18 08:40:13] [config] []
[2024-04-18 08:40:13] [config] early-stopping: 3
[2024-04-18 08:40:13] [config] early-stopping-on: first
[2024-04-18 08:40:13] [config] embedding-fix-src: false
[2024-04-18 08:40:13] [config] embedding-fix-trg: false
[2024-04-18 08:40:13] [config] embedding-normalization: false
[2024-04-18 08:40:13] [config] embedding-vectors:
[2024-04-18 08:40:13] [config] []
[2024-04-18 08:40:13] [config] enc-cell: gru
[2024-04-18 08:40:13] [config] enc-cell-depth: 1
[2024-04-18 08:40:13] [config] enc-depth: 6
[2024-04-18 08:40:13] [config] enc-type: bidirectional
[2024-04-18 08:40:13] [config] english-title-case-every: 0
[2024-04-18 08:40:13] [config] exponential-smoothing: 0.0001
[2024-04-18 08:40:13] [config] factor-weight: 1
[2024-04-18 08:40:13] [config] factors-combine: sum
[2024-04-18 08:40:13] [config] factors-dim-emb: 0
[2024-04-18 08:40:13] [config] gradient-checkpointing: false
[2024-04-18 08:40:13] [config] gradient-norm-average-window: 100
[2024-04-18 08:40:13] [config] guided-alignment: data/train.tok.tc.clean.bpe.en.en-de.align
[2024-04-18 08:40:13] [config] guided-alignment-cost: ce
[2024-04-18 08:40:13] [config] guided-alignment-weight: 0.1
[2024-04-18 08:40:13] [config] ignore-model-config: false
[2024-04-18 08:40:13] [config] input-types:
[2024-04-18 08:40:13] [config] []
[2024-04-18 08:40:13] [config] interpolate-env-vars: false
[2024-04-18 08:40:13] [config] keep-best: true
[2024-04-18 08:40:13] [config] label-smoothing: 0.1
[2024-04-18 08:40:13] [config] layer-normalization: false
[2024-04-18 08:40:13] [config] learn-rate: 0.0003
[2024-04-18 08:40:13] [config] lemma-dependency: ""
[2024-04-18 08:40:13] [config] lemma-dim-emb: 0
[2024-04-18 08:40:13] [config] log: ""
[2024-04-18 08:40:13] [config] log-level: info
[2024-04-18 08:40:13] [config] log-time-zone: ""
[2024-04-18 08:40:13] [config] logical-epoch:
[2024-04-18 08:40:13] [config] - 1e
[2024-04-18 08:40:13] [config] - 0
[2024-04-18 08:40:13] [config] lr-decay: 0
[2024-04-18 08:40:13] [config] lr-decay-freq: 50000
[2024-04-18 08:40:13] [config] lr-decay-inv-sqrt:
[2024-04-18 08:40:13] [config] - 16000
[2024-04-18 08:40:13] [config] lr-decay-repeat-warmup: false
[2024-04-18 08:40:13] [config] lr-decay-reset-optimizer: false
[2024-04-18 08:40:13] [config] lr-decay-start:
[2024-04-18 08:40:13] [config] - 10
[2024-04-18 08:40:13] [config] - 1
[2024-04-18 08:40:13] [config] lr-decay-strategy: epoch+stalled
[2024-04-18 08:40:13] [config] lr-report: true
[2024-04-18 08:40:13] [config] lr-warmup: 16000
[2024-04-18 08:40:13] [config] lr-warmup-at-reload: false
[2024-04-18 08:40:13] [config] lr-warmup-cycle: false
[2024-04-18 08:40:13] [config] lr-warmup-start-rate: 0
[2024-04-18 08:40:13] [config] max-length: 100
[2024-04-18 08:40:13] [config] max-length-crop: false
[2024-04-18 08:40:13] [config] max-length-factor: 3
[2024-04-18 08:40:13] [config] maxi-batch: 1000
[2024-04-18 08:40:13] [config] maxi-batch-sort: trg
[2024-04-18 08:40:13] [config] mini-batch: 64
[2024-04-18 08:40:13] [config] mini-batch-fit: true
[2024-04-18 08:40:13] [config] mini-batch-fit-step: 10
[2024-04-18 08:40:13] [config] mini-batch-round-up: true
[2024-04-18 08:40:13] [config] mini-batch-track-lr: false
[2024-04-18 08:40:13] [config] mini-batch-warmup: 0
[2024-04-18 08:40:13] [config] mini-batch-words: 0
[2024-04-18 08:40:13] [config] mini-batch-words-ref: 0
[2024-04-18 08:40:13] [config] model: /data/training/model/model.npz
[2024-04-18 08:40:13] [config] multi-loss-type: sum
[2024-04-18 08:40:13] [config] n-best: false
[2024-04-18 08:40:13] [config] no-nccl: false
[2024-04-18 08:40:13] [config] no-reload: false
[2024-04-18 08:40:13] [config] no-restore-corpus: false
[2024-04-18 08:40:13] [config] normalize: 0.6
[2024-04-18 08:40:13] [config] normalize-gradient: false
[2024-04-18 08:40:13] [config] num-devices: 0
[2024-04-18 08:40:13] [config] optimizer: adam
[2024-04-18 08:40:13] [config] optimizer-delay: 1
[2024-04-18 08:40:13] [config] optimizer-params:
[2024-04-18 08:40:13] [config] - 0.9
[2024-04-18 08:40:13] [config] - 0.98
[2024-04-18 08:40:13] [config] - 1e-09
[2024-04-18 08:40:13] [config] output-omit-bias: false
[2024-04-18 08:40:13] [config] overwrite: false
[2024-04-18 08:40:13] [config] precision:
[2024-04-18 08:40:13] [config] - float32
[2024-04-18 08:40:13] [config] - float32
[2024-04-18 08:40:13] [config] pretrained-model: ""
[2024-04-18 08:40:13] [config] quantize-biases: false
[2024-04-18 08:40:13] [config] quantize-bits: 0
[2024-04-18 08:40:13] [config] quantize-log-based: false
[2024-04-18 08:40:13] [config] quantize-optimization-steps: 0
[2024-04-18 08:40:13] [config] quiet: false
[2024-04-18 08:40:13] [config] quiet-translation: true
[2024-04-18 08:40:13] [config] relative-paths: false
[2024-04-18 08:40:13] [config] right-left: false
[2024-04-18 08:40:13] [config] save-freq: 10
[2024-04-18 08:40:13] [config] seed: 1111
[2024-04-18 08:40:13] [config] sharding: global
[2024-04-18 08:40:13] [config] shuffle: data
[2024-04-18 08:40:13] [config] shuffle-in-ram: false
[2024-04-18 08:40:13] [config] sigterm: save-and-exit
[2024-04-18 08:40:13] [config] skip: false
[2024-04-18 08:40:13] [config] sqlite: ""
[2024-04-18 08:40:13] [config] sqlite-drop: false
[2024-04-18 08:40:13] [config] sync-freq: 200u
[2024-04-18 08:40:13] [config] sync-sgd: true
[2024-04-18 08:40:13] [config] tempdir: marian-tmp
[2024-04-18 08:40:13] [config] tied-embeddings: true
[2024-04-18 08:40:13] [config] tied-embeddings-all: false
[2024-04-18 08:40:13] [config] tied-embeddings-src: false
[2024-04-18 08:40:13] [config] train-embedder-rank:
[2024-04-18 08:40:13] [config] []
[2024-04-18 08:40:13] [config] train-sets:
[2024-04-18 08:40:13] [config] - /data/training/data/train.tok.tc.clean.bpe.en
[2024-04-18 08:40:13] [config] - /data/training/data/train.tok.tc.factorized.clean.bpe.de
[2024-04-18 08:40:13] [config] transformer-aan-activation: swish
[2024-04-18 08:40:13] [config] transformer-aan-depth: 2
[2024-04-18 08:40:13] [config] transformer-aan-nogate: false
[2024-04-18 08:40:13] [config] transformer-decoder-autoreg: rnn
[2024-04-18 08:40:13] [config] transformer-decoder-dim-ffn: 0
[2024-04-18 08:40:13] [config] transformer-decoder-ffn-depth: 0
[2024-04-18 08:40:13] [config] transformer-depth-scaling: false
[2024-04-18 08:40:13] [config] transformer-dim-aan: 2048
[2024-04-18 08:40:13] [config] transformer-dim-ffn: 2048
[2024-04-18 08:40:13] [config] transformer-dropout: 0.1
[2024-04-18 08:40:13] [config] transformer-dropout-attention: 0
[2024-04-18 08:40:13] [config] transformer-dropout-ffn: 0
[2024-04-18 08:40:13] [config] transformer-ffn-activation: swish
[2024-04-18 08:40:13] [config] transformer-ffn-depth: 2
[2024-04-18 08:40:13] [config] transformer-guided-alignment-layer: last
[2024-04-18 08:40:13] [config] transformer-heads: 8
[2024-04-18 08:40:13] [config] transformer-no-projection: false
[2024-04-18 08:40:13] [config] transformer-pool: false
[2024-04-18 08:40:13] [config] transformer-postprocess: dan
[2024-04-18 08:40:13] [config] transformer-postprocess-emb: d
[2024-04-18 08:40:13] [config] transformer-postprocess-top: ""
[2024-04-18 08:40:13] [config] transformer-preprocess: ""
[2024-04-18 08:40:13] [config] transformer-rnn-projection: false
[2024-04-18 08:40:13] [config] transformer-tied-layers:
[2024-04-18 08:40:13] [config] []
[2024-04-18 08:40:13] [config] transformer-train-position-embeddings: false
[2024-04-18 08:40:13] [config] tsv: false
[2024-04-18 08:40:13] [config] tsv-fields: 0
[2024-04-18 08:40:13] [config] type: transformer
[2024-04-18 08:40:13] [config] ulr: false
[2024-04-18 08:40:13] [config] ulr-dim-emb: 0
[2024-04-18 08:40:13] [config] ulr-dropout: 0
[2024-04-18 08:40:13] [config] ulr-keys-vectors: ""
[2024-04-18 08:40:13] [config] ulr-query-vectors: ""
[2024-04-18 08:40:13] [config] ulr-softmax-temperature: 1
[2024-04-18 08:40:13] [config] ulr-trainable-transformation: false
[2024-04-18 08:40:13] [config] unlikelihood-loss: false
[2024-04-18 08:40:13] [config] valid-freq: 10
[2024-04-18 08:40:13] [config] valid-log: /data/training/valid.log
[2024-04-18 08:40:13] [config] valid-max-length: 1000
[2024-04-18 08:40:13] [config] valid-metrics:
[2024-04-18 08:40:13] [config] - cross-entropy
[2024-04-18 08:40:13] [config] - perplexity
[2024-04-18 08:40:13] [config] - bleu
[2024-04-18 08:40:13] [config] - translation
[2024-04-18 08:40:13] [config] valid-mini-batch: 64
[2024-04-18 08:40:13] [config] valid-reset-all: false
[2024-04-18 08:40:13] [config] valid-reset-stalled: false
[2024-04-18 08:40:13] [config] valid-script-args:
[2024-04-18 08:40:13] [config] []
[2024-04-18 08:40:13] [config] valid-script-path: /data/training/validate.sh
[2024-04-18 08:40:13] [config] valid-sets:
[2024-04-18 08:40:13] [config] - /data/training/data/dev.tok.tc.bpe.en
[2024-04-18 08:40:13] [config] - /data/training/data/dev.tok.tc.factorized.bpe.de
[2024-04-18 08:40:13] [config] valid-translation-output: ""
[2024-04-18 08:40:13] [config] vocabs:
[2024-04-18 08:40:13] [config] - /data/training/data/train.tok.tc.clean.bpe.en.yml
[2024-04-18 08:40:13] [config] - /data/training/data/train.tok.tc.factorized.clean.bpe.de.fsv
[2024-04-18 08:40:13] [config] word-penalty: 0
[2024-04-18 08:40:13] [config] word-scores: false
[2024-04-18 08:40:13] [config] workspace: 6000
[2024-04-18 08:40:13] [config] Model is being created with Marian v1.12.0 65bf82ff 2023-02-21 09:56:29 -0800
[2024-04-18 08:40:13] Using synchronous SGD
[2024-04-18 08:40:13] [comm] Compiled without MPI support. Running as a single process on 25b1c50316d0
[2024-04-18 08:40:13] Synced seed 1111
[2024-04-18 08:40:13] [data] Loading vocabulary from JSON/Yaml file /data/training/data/train.tok.tc.clean.bpe.en.yml
[2024-04-18 08:40:13] [data] Setting vocabulary size for input 0 to 484
[2024-04-18 08:40:13] [vocab] Loading vocab spec file /data/training/data/train.tok.tc.factorized.clean.bpe.de.fsv
[2024-04-18 08:40:13] [vocab] Factor group '(lemma)' has 493 members
[2024-04-18 08:40:13] [vocab] Factor group '|C' has 4 members
[2024-04-18 08:40:13] [vocab] Factored-embedding map read with total/unique of 984/497 factors from 493 example words (in space of 2,470)
[2024-04-18 08:40:13] [vocab] Expanding all valid vocab entries out of 2,470...
[2024-04-18 08:40:13] [vocab] Completed, total 1966 valid combinations
[2024-04-18 08:40:13] [data] Setting vocabulary size for input 1 to 1,966
[2024-04-18 08:40:13] [data] Using word alignments from file data/train.tok.tc.clean.bpe.en.en-de.align
[2024-04-18 08:40:13] [batching] Collecting statistics for batch fitting with step size 10
[2024-04-18 08:40:13] [memory] Extending reserved space to 6016 MB (device gpu0)
[2024-04-18 08:40:14] [memory] Extending reserved space to 6016 MB (device gpu1)
[2024-04-18 08:40:14] [memory] Extending reserved space to 6016 MB (device gpu2)
[2024-04-18 08:40:14] [memory] Extending reserved space to 6016 MB (device gpu3)
[2024-04-18 08:40:14] [comm] Using NCCL 2.8.3 for GPU communication
[2024-04-18 08:40:14] [comm] Using global sharding
[2024-04-18 08:40:14] [comm] NCCLCommunicators constructed successfully
[2024-04-18 08:40:14] [training] Using 4 GPUs
[2024-04-18 08:40:14] [vocab] Reusing existing vocabulary object in memory (vocab size 1966)
[2024-04-18 08:40:14] [embedding] Factored embeddings enabled
[2024-04-18 08:40:14] [embedding] Factored outputs enabled
[2024-04-18 08:40:14] [logits] Applying loss function for 2 factor(s)
[2024-04-18 08:40:14] [memory] Reserving 158 MB, device gpu0
[2024-04-18 08:40:14] [gpu] 16-bit TensorCores enabled for float32 matrix operations
[2024-04-18 08:40:14] Error: Cublas Error: 7 - /marian/src/tensors/gpu/prod.cpp:698: cublasLtAffineTyped(ltHandle, opB, opA, n, m, k, &alpha, B->data<T>(), ldb, A->data<T>(), lda, &beta, C->data<T>(), ldc, bias->data<T>(), workspace->data<T>(), workspaceSizeBytes, do_relu, stream)
[2024-04-18 08:40:14] Error: Aborted from void marian::gpu::affineTyped(marian::Tensor, marian::Ptr<marian::Allocator>, const Tensor&, const Tensor&, const Tensor&, bool, bool, T, T, bool) [with T = float; marian::Tensor = IntrusivePtr<marian::TensorBase>; marian::Ptr<marian::Allocator> = std::shared_ptr<marian::Allocator>] in /marian/src/tensors/gpu/prod.cpp:698
[CALL STACK]
[0x564280173ac4] + 0xa54ac4
[0x56428016d4a8] + 0xa4e4a8
[0x56427fbedf07] + 0x4cef07
[0x56427fca3a96] + 0x584a96
[0x56427fb6302b] + 0x44402b
[0x56427fe6c21c] + 0x74d21c
[0x56427fe534c8] + 0x7344c8
[0x56427f99261a] + 0x27361a
[0x56427f8b778b] + 0x19878b
[0x7f13eb991d90] + 0x29d90
[0x7f13eb991e40] __libc_start_main + 0x80
[0x56427f8b0b55] + 0x191b55
./train.sh: line 29: 33 Aborted (core dumped) /marian/marian --tempdir marian-tmp -c config.yml --devices 0 1 2 3 --type transformer --valid-freq 10 --save-freq 10 --early-stopping 3 --after-epochs 500 -w 6000
marian version (in the docker environment)
root@f52169769fca:/marian# marian --version
v1.12.0 65bf82ff 2023-02-21 09:56:29 -0800
nvidia-smi output
host system 1
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
host system 2
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
failing marian 1.12 cuda 12.3 docker container on host 1
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.3 |
working marian 1.11 cuda 10.2 docker container on host 1
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
failing marian 1.12 cuda 12.3 docker container on host 2
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
working marian 1.11 cuda 10.2 docker container on host 2
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
I notice that the CUDA version reported by nvidia-smi seems to be whichever is higher, the host system's or the Docker container's, but all containers have been built to run with the CUDA version bundled in them.
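To make the nvidia-smi comparison above easier to repeat, here is a minimal sketch that parses the header lines quoted in this report. `parse_smi_header` and `version_tuple` are hypothetical helpers. The interpretation is an assumption, not something confirmed here: as far as I know, the `CUDA Version` field reflects the CUDA driver API level of the driver library visible to the process, not the toolkit a binary was compiled against, so it is not a reliable indicator of which CUDA Marian was built with.

```python
import re

# Matches the header line that nvidia-smi prints, as quoted in this report.
HEADER_RE = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>[\d.]+)\s+"
    r"Driver Version:\s+(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s+(?P<cuda>[\d.]+)"
)


def parse_smi_header(line):
    """Hypothetical helper: return the smi, driver, and cuda version strings."""
    match = HEADER_RE.search(line)
    if match is None:
        raise ValueError("not an nvidia-smi header line: %r" % line)
    return match.groupdict()


def version_tuple(version):
    """Turn '12.3' into (12, 3) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Header lines copied from this report (host system 1 and the failing container on it).
host = parse_smi_header("| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |")
container = parse_smi_header("| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.3 |")

# Same kernel driver, but the container reports a higher CUDA level.
same_driver = host["driver"] == container["driver"]
container_is_newer = version_tuple(container["cuda"]) > version_tuple(host["cuda"])
print(same_driver, container_is_newer)
```

Inside each container, `nvcc --version` (where the toolkit is installed) is a more direct check of the toolkit version than the nvidia-smi header.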