
Running into Cublas Error: 7 for target factors for marian 1.12 #1023

@LauritzBrandt19116

Description


Bug description

Marian 1.12 (65bf82ffce52f4854295d8b98482534f176d494e) runs into the following error when training with target-side factored data:

```
[2024-04-18 08:40:14] Error: Cublas Error: 7 - /marian/src/tensors/gpu/prod.cpp:698: cublasLtAffineTyped(ltHandle, opB, opA, n, m, k, &alpha, B->data<T>(), ldb, A->data<T>(), lda, &beta, C->data<T>(), ldc, bias->data<T>(), workspace->data<T>(), workspaceSizeBytes, do_relu, stream)
[2024-04-18 08:40:14] Error: Aborted from void marian::gpu::affineTyped(marian::Tensor, marian::Ptr<marian::Allocator>, const Tensor&, const Tensor&, const Tensor&, bool, bool, T, T, bool) [with T = float; marian::Tensor = IntrusivePtr<marian::TensorBase>; marian::Ptr<marian::Allocator> = std::shared_ptr<marian::Allocator>] in /marian/src/tensors/gpu/prod.cpp:698
```
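For context, cuBLAS status 7 is `CUBLAS_STATUS_INVALID_VALUE`, i.e. an unsupported or invalid parameter (such as a bad dimension or leading-dimension argument) reached the matmul call. A small lookup of the `cublasStatus_t` values from cuBLAS's `cublas_api.h` decodes it:

```python
# cublasStatus_t values as defined in cuBLAS's cublas_api.h.
# Status 7 is the one reported by the cublasLtAffineTyped call above.
CUBLAS_STATUS = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
    16: "CUBLAS_STATUS_LICENSE_ERROR",
}

print(CUBLAS_STATUS[7])  # CUBLAS_STATUS_INVALID_VALUE
```

So this is not an out-of-memory or driver failure; one of the arguments passed to the cublasLt affine product is being rejected as invalid.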

How to reproduce

Run marian 1.12 compiled against CUDA 11+ with target factors.

I am trying to train Marian models from scratch using factored data. Training succeeds with source-side factors only, but any run that uses target-side factors (alone or combined with source factors) fails the cuBLAS check.

I compile 65bf82ffce52f4854295d8b98482534f176d494e in a Docker container and have tried a range of CUDA, NVIDIA driver, and Marian versions on Ubuntu 18.04, 20.04, and 22.04.
Variants that were tried:

| Marian | CUDA   | NVIDIA driver           | Ubuntu | Result |
|--------|--------|-------------------------|--------|--------|
| 1.12   | 12.3.1 | 525.85.12 or 550.54.14  | 22.04  | fails  |
| 1.12   | 11.8   | 525.85.12 or 550.54.14  | 22.04  | fails  |
| 1.11   | 12.2.0 | 525.85.12               | 20.04  | fails  |
| 1.11   | 11.8   | 525.85.12               | 20.04  | fails  |
| 1.11   | 10.2   | 525.85.12 or 550.54.14  | 18.04  | works  |

Context

Marian output

```
+ /marian/marian --tempdir marian-tmp -c config.yml --devices 0 1 2 3 --type transformer --valid-freq 10 --save-freq 10 --early-stopping 3 --after-epochs 500 -w 6000
[2024-04-18 08:40:13] [marian] Marian v1.12.0 65bf82ff 2023-02-21 09:56:29 -0800
[2024-04-18 08:40:13] [marian] Running on 25b1c50316d0 as process 33 with command line:
[2024-04-18 08:40:13] [marian] /marian/marian --tempdir marian-tmp -c config.yml --devices 0 1 2 3 --type transformer --valid-freq 10 --save-freq 10 --early-stopping 3 --after-epochs 500 -w 6000
[2024-04-18 08:40:13] [config] after: 0e
[2024-04-18 08:40:13] [config] after-batches: 0
[2024-04-18 08:40:13] [config] after-epochs: 500
[2024-04-18 08:40:13] [config] all-caps-every: 0
[2024-04-18 08:40:13] [config] allow-unk: false
[2024-04-18 08:40:13] [config] authors: false
[2024-04-18 08:40:13] [config] beam-size: 6
[2024-04-18 08:40:13] [config] bert-class-symbol: "[CLS]"
[2024-04-18 08:40:13] [config] bert-mask-symbol: "[MASK]"
[2024-04-18 08:40:13] [config] bert-masking-fraction: 0.15
[2024-04-18 08:40:13] [config] bert-sep-symbol: "[SEP]"
[2024-04-18 08:40:13] [config] bert-train-type-embeddings: true
[2024-04-18 08:40:13] [config] bert-type-vocab-size: 2
[2024-04-18 08:40:13] [config] build-info: ""
[2024-04-18 08:40:13] [config] check-gradient-nan: false
[2024-04-18 08:40:13] [config] check-nan: false
[2024-04-18 08:40:13] [config] cite: false
[2024-04-18 08:40:13] [config] clip-norm: 5
[2024-04-18 08:40:13] [config] cost-scaling:
[2024-04-18 08:40:13] [config] []
[2024-04-18 08:40:13] [config] cost-type: ce-sum
[2024-04-18 08:40:13] [config] cpu-threads: 0
[2024-04-18 08:40:13] [config] data-threads: 8
[2024-04-18 08:40:13] [config] data-weighting: ""
[2024-04-18 08:40:13] [config] data-weighting-type: sentence
[2024-04-18 08:40:13] [config] dec-cell: ssru
[2024-04-18 08:40:13] [config] dec-cell-base-depth: 2
[2024-04-18 08:40:13] [config] dec-cell-high-depth: 1
[2024-04-18 08:40:13] [config] dec-depth: 6
[2024-04-18 08:40:13] [config] devices:
[2024-04-18 08:40:13] [config] - 0
[2024-04-18 08:40:13] [config] - 1
[2024-04-18 08:40:13] [config] - 2
[2024-04-18 08:40:13] [config] - 3
[2024-04-18 08:40:13] [config] dim-emb: 512
[2024-04-18 08:40:13] [config] dim-rnn: 1024
[2024-04-18 08:40:13] [config] dim-vocabs:
[2024-04-18 08:40:13] [config] - 0
[2024-04-18 08:40:13] [config] - 0
[2024-04-18 08:40:13] [config] disp-first: 0
[2024-04-18 08:40:13] [config] disp-freq: 500
[2024-04-18 08:40:13] [config] disp-label-counts: true
[2024-04-18 08:40:13] [config] dropout-rnn: 0
[2024-04-18 08:40:13] [config] dropout-src: 0
[2024-04-18 08:40:13] [config] dropout-trg: 0
[2024-04-18 08:40:13] [config] dump-config: ""
[2024-04-18 08:40:13] [config] dynamic-gradient-scaling:
[2024-04-18 08:40:13] [config] []
[2024-04-18 08:40:13] [config] early-stopping: 3
[2024-04-18 08:40:13] [config] early-stopping-on: first
[2024-04-18 08:40:13] [config] embedding-fix-src: false
[2024-04-18 08:40:13] [config] embedding-fix-trg: false
[2024-04-18 08:40:13] [config] embedding-normalization: false
[2024-04-18 08:40:13] [config] embedding-vectors:
[2024-04-18 08:40:13] [config] []
[2024-04-18 08:40:13] [config] enc-cell: gru
[2024-04-18 08:40:13] [config] enc-cell-depth: 1
[2024-04-18 08:40:13] [config] enc-depth: 6
[2024-04-18 08:40:13] [config] enc-type: bidirectional
[2024-04-18 08:40:13] [config] english-title-case-every: 0
[2024-04-18 08:40:13] [config] exponential-smoothing: 0.0001
[2024-04-18 08:40:13] [config] factor-weight: 1
[2024-04-18 08:40:13] [config] factors-combine: sum
[2024-04-18 08:40:13] [config] factors-dim-emb: 0
[2024-04-18 08:40:13] [config] gradient-checkpointing: false
[2024-04-18 08:40:13] [config] gradient-norm-average-window: 100
[2024-04-18 08:40:13] [config] guided-alignment: data/train.tok.tc.clean.bpe.en.en-de.align
[2024-04-18 08:40:13] [config] guided-alignment-cost: ce
[2024-04-18 08:40:13] [config] guided-alignment-weight: 0.1
[2024-04-18 08:40:13] [config] ignore-model-config: false
[2024-04-18 08:40:13] [config] input-types:
[2024-04-18 08:40:13] [config] []
[2024-04-18 08:40:13] [config] interpolate-env-vars: false
[2024-04-18 08:40:13] [config] keep-best: true
[2024-04-18 08:40:13] [config] label-smoothing: 0.1
[2024-04-18 08:40:13] [config] layer-normalization: false
[2024-04-18 08:40:13] [config] learn-rate: 0.0003
[2024-04-18 08:40:13] [config] lemma-dependency: ""
[2024-04-18 08:40:13] [config] lemma-dim-emb: 0
[2024-04-18 08:40:13] [config] log: ""
[2024-04-18 08:40:13] [config] log-level: info
[2024-04-18 08:40:13] [config] log-time-zone: ""
[2024-04-18 08:40:13] [config] logical-epoch:
[2024-04-18 08:40:13] [config] - 1e
[2024-04-18 08:40:13] [config] - 0
[2024-04-18 08:40:13] [config] lr-decay: 0
[2024-04-18 08:40:13] [config] lr-decay-freq: 50000
[2024-04-18 08:40:13] [config] lr-decay-inv-sqrt:
[2024-04-18 08:40:13] [config] - 16000
[2024-04-18 08:40:13] [config] lr-decay-repeat-warmup: false
[2024-04-18 08:40:13] [config] lr-decay-reset-optimizer: false
[2024-04-18 08:40:13] [config] lr-decay-start:
[2024-04-18 08:40:13] [config] - 10
[2024-04-18 08:40:13] [config] - 1
[2024-04-18 08:40:13] [config] lr-decay-strategy: epoch+stalled
[2024-04-18 08:40:13] [config] lr-report: true
[2024-04-18 08:40:13] [config] lr-warmup: 16000
[2024-04-18 08:40:13] [config] lr-warmup-at-reload: false
[2024-04-18 08:40:13] [config] lr-warmup-cycle: false
[2024-04-18 08:40:13] [config] lr-warmup-start-rate: 0
[2024-04-18 08:40:13] [config] max-length: 100
[2024-04-18 08:40:13] [config] max-length-crop: false
[2024-04-18 08:40:13] [config] max-length-factor: 3
[2024-04-18 08:40:13] [config] maxi-batch: 1000
[2024-04-18 08:40:13] [config] maxi-batch-sort: trg
[2024-04-18 08:40:13] [config] mini-batch: 64
[2024-04-18 08:40:13] [config] mini-batch-fit: true
[2024-04-18 08:40:13] [config] mini-batch-fit-step: 10
[2024-04-18 08:40:13] [config] mini-batch-round-up: true
[2024-04-18 08:40:13] [config] mini-batch-track-lr: false
[2024-04-18 08:40:13] [config] mini-batch-warmup: 0
[2024-04-18 08:40:13] [config] mini-batch-words: 0
[2024-04-18 08:40:13] [config] mini-batch-words-ref: 0
[2024-04-18 08:40:13] [config] model: /data/training/model/model.npz
[2024-04-18 08:40:13] [config] multi-loss-type: sum
[2024-04-18 08:40:13] [config] n-best: false
[2024-04-18 08:40:13] [config] no-nccl: false
[2024-04-18 08:40:13] [config] no-reload: false
[2024-04-18 08:40:13] [config] no-restore-corpus: false
[2024-04-18 08:40:13] [config] normalize: 0.6
[2024-04-18 08:40:13] [config] normalize-gradient: false
[2024-04-18 08:40:13] [config] num-devices: 0
[2024-04-18 08:40:13] [config] optimizer: adam
[2024-04-18 08:40:13] [config] optimizer-delay: 1
[2024-04-18 08:40:13] [config] optimizer-params:
[2024-04-18 08:40:13] [config] - 0.9
[2024-04-18 08:40:13] [config] - 0.98
[2024-04-18 08:40:13] [config] - 1e-09
[2024-04-18 08:40:13] [config] output-omit-bias: false
[2024-04-18 08:40:13] [config] overwrite: false
[2024-04-18 08:40:13] [config] precision:
[2024-04-18 08:40:13] [config] - float32
[2024-04-18 08:40:13] [config] - float32
[2024-04-18 08:40:13] [config] pretrained-model: ""
[2024-04-18 08:40:13] [config] quantize-biases: false
[2024-04-18 08:40:13] [config] quantize-bits: 0
[2024-04-18 08:40:13] [config] quantize-log-based: false
[2024-04-18 08:40:13] [config] quantize-optimization-steps: 0
[2024-04-18 08:40:13] [config] quiet: false
[2024-04-18 08:40:13] [config] quiet-translation: true
[2024-04-18 08:40:13] [config] relative-paths: false
[2024-04-18 08:40:13] [config] right-left: false
[2024-04-18 08:40:13] [config] save-freq: 10
[2024-04-18 08:40:13] [config] seed: 1111
[2024-04-18 08:40:13] [config] sharding: global
[2024-04-18 08:40:13] [config] shuffle: data
[2024-04-18 08:40:13] [config] shuffle-in-ram: false
[2024-04-18 08:40:13] [config] sigterm: save-and-exit
[2024-04-18 08:40:13] [config] skip: false
[2024-04-18 08:40:13] [config] sqlite: ""
[2024-04-18 08:40:13] [config] sqlite-drop: false
[2024-04-18 08:40:13] [config] sync-freq: 200u
[2024-04-18 08:40:13] [config] sync-sgd: true
[2024-04-18 08:40:13] [config] tempdir: marian-tmp
[2024-04-18 08:40:13] [config] tied-embeddings: true
[2024-04-18 08:40:13] [config] tied-embeddings-all: false
[2024-04-18 08:40:13] [config] tied-embeddings-src: false
[2024-04-18 08:40:13] [config] train-embedder-rank:
[2024-04-18 08:40:13] [config] []
[2024-04-18 08:40:13] [config] train-sets:
[2024-04-18 08:40:13] [config] - /data/training/data/train.tok.tc.clean.bpe.en
[2024-04-18 08:40:13] [config] - /data/training/data/train.tok.tc.factorized.clean.bpe.de
[2024-04-18 08:40:13] [config] transformer-aan-activation: swish
[2024-04-18 08:40:13] [config] transformer-aan-depth: 2
[2024-04-18 08:40:13] [config] transformer-aan-nogate: false
[2024-04-18 08:40:13] [config] transformer-decoder-autoreg: rnn
[2024-04-18 08:40:13] [config] transformer-decoder-dim-ffn: 0
[2024-04-18 08:40:13] [config] transformer-decoder-ffn-depth: 0
[2024-04-18 08:40:13] [config] transformer-depth-scaling: false
[2024-04-18 08:40:13] [config] transformer-dim-aan: 2048
[2024-04-18 08:40:13] [config] transformer-dim-ffn: 2048
[2024-04-18 08:40:13] [config] transformer-dropout: 0.1
[2024-04-18 08:40:13] [config] transformer-dropout-attention: 0
[2024-04-18 08:40:13] [config] transformer-dropout-ffn: 0
[2024-04-18 08:40:13] [config] transformer-ffn-activation: swish
[2024-04-18 08:40:13] [config] transformer-ffn-depth: 2
[2024-04-18 08:40:13] [config] transformer-guided-alignment-layer: last
[2024-04-18 08:40:13] [config] transformer-heads: 8
[2024-04-18 08:40:13] [config] transformer-no-projection: false
[2024-04-18 08:40:13] [config] transformer-pool: false
[2024-04-18 08:40:13] [config] transformer-postprocess: dan
[2024-04-18 08:40:13] [config] transformer-postprocess-emb: d
[2024-04-18 08:40:13] [config] transformer-postprocess-top: ""
[2024-04-18 08:40:13] [config] transformer-preprocess: ""
[2024-04-18 08:40:13] [config] transformer-rnn-projection: false
[2024-04-18 08:40:13] [config] transformer-tied-layers:
[2024-04-18 08:40:13] [config] []
[2024-04-18 08:40:13] [config] transformer-train-position-embeddings: false
[2024-04-18 08:40:13] [config] tsv: false
[2024-04-18 08:40:13] [config] tsv-fields: 0
[2024-04-18 08:40:13] [config] type: transformer
[2024-04-18 08:40:13] [config] ulr: false
[2024-04-18 08:40:13] [config] ulr-dim-emb: 0
[2024-04-18 08:40:13] [config] ulr-dropout: 0
[2024-04-18 08:40:13] [config] ulr-keys-vectors: ""
[2024-04-18 08:40:13] [config] ulr-query-vectors: ""
[2024-04-18 08:40:13] [config] ulr-softmax-temperature: 1
[2024-04-18 08:40:13] [config] ulr-trainable-transformation: false
[2024-04-18 08:40:13] [config] unlikelihood-loss: false
[2024-04-18 08:40:13] [config] valid-freq: 10
[2024-04-18 08:40:13] [config] valid-log: /data/training/valid.log
[2024-04-18 08:40:13] [config] valid-max-length: 1000
[2024-04-18 08:40:13] [config] valid-metrics:
[2024-04-18 08:40:13] [config] - cross-entropy
[2024-04-18 08:40:13] [config] - perplexity
[2024-04-18 08:40:13] [config] - bleu
[2024-04-18 08:40:13] [config] - translation
[2024-04-18 08:40:13] [config] valid-mini-batch: 64
[2024-04-18 08:40:13] [config] valid-reset-all: false
[2024-04-18 08:40:13] [config] valid-reset-stalled: false
[2024-04-18 08:40:13] [config] valid-script-args:
[2024-04-18 08:40:13] [config] []
[2024-04-18 08:40:13] [config] valid-script-path: /data/training/validate.sh
[2024-04-18 08:40:13] [config] valid-sets:
[2024-04-18 08:40:13] [config] - /data/training/data/dev.tok.tc.bpe.en
[2024-04-18 08:40:13] [config] - /data/training/data/dev.tok.tc.factorized.bpe.de
[2024-04-18 08:40:13] [config] valid-translation-output: ""
[2024-04-18 08:40:13] [config] vocabs:
[2024-04-18 08:40:13] [config] - /data/training/data/train.tok.tc.clean.bpe.en.yml
[2024-04-18 08:40:13] [config] - /data/training/data/train.tok.tc.factorized.clean.bpe.de.fsv
[2024-04-18 08:40:13] [config] word-penalty: 0
[2024-04-18 08:40:13] [config] word-scores: false
[2024-04-18 08:40:13] [config] workspace: 6000
[2024-04-18 08:40:13] Model is being created with Marian v1.12.0 65bf82ff 2023-02-21 09:56:29 -0800
[2024-04-18 08:40:13] Using synchronous SGD
[2024-04-18 08:40:13] [comm] Compiled without MPI support. Running as a single process on 25b1c50316d0
[2024-04-18 08:40:13] Synced seed 1111
[2024-04-18 08:40:13] [data] Loading vocabulary from JSON/Yaml file /data/training/data/train.tok.tc.clean.bpe.en.yml
[2024-04-18 08:40:13] [data] Setting vocabulary size for input 0 to 484
[2024-04-18 08:40:13] [vocab] Loading vocab spec file /data/training/data/train.tok.tc.factorized.clean.bpe.de.fsv
[2024-04-18 08:40:13] [vocab] Factor group '(lemma)' has 493 members
[2024-04-18 08:40:13] [vocab] Factor group '|C' has 4 members
[2024-04-18 08:40:13] [vocab] Factored-embedding map read with total/unique of 984/497 factors from 493 example words (in space of 2,470)
[2024-04-18 08:40:13] [vocab] Expanding all valid vocab entries out of 2,470...
[2024-04-18 08:40:13] [vocab] Completed, total 1966 valid combinations
[2024-04-18 08:40:13] [data] Setting vocabulary size for input 1 to 1,966
[2024-04-18 08:40:13] [data] Using word alignments from file data/train.tok.tc.clean.bpe.en.en-de.align
[2024-04-18 08:40:13] [batching] Collecting statistics for batch fitting with step size 10
[2024-04-18 08:40:13] [memory] Extending reserved space to 6016 MB (device gpu0)
[2024-04-18 08:40:14] [memory] Extending reserved space to 6016 MB (device gpu1)
[2024-04-18 08:40:14] [memory] Extending reserved space to 6016 MB (device gpu2)
[2024-04-18 08:40:14] [memory] Extending reserved space to 6016 MB (device gpu3)
[2024-04-18 08:40:14] [comm] Using NCCL 2.8.3 for GPU communication
[2024-04-18 08:40:14] [comm] Using global sharding
[2024-04-18 08:40:14] [comm] NCCLCommunicators constructed successfully
[2024-04-18 08:40:14] [training] Using 4 GPUs
[2024-04-18 08:40:14] [vocab] Reusing existing vocabulary object in memory (vocab size 1966)
[2024-04-18 08:40:14] [embedding] Factored embeddings enabled
[2024-04-18 08:40:14] [embedding] Factored outputs enabled
[2024-04-18 08:40:14] [logits] Applying loss function for 2 factor(s)
[2024-04-18 08:40:14] [memory] Reserving 158 MB, device gpu0
[2024-04-18 08:40:14] [gpu] 16-bit TensorCores enabled for float32 matrix operations
[2024-04-18 08:40:14] Error: Cublas Error: 7 - /marian/src/tensors/gpu/prod.cpp:698: cublasLtAffineTyped(ltHandle, opB, opA, n, m, k, &alpha, B->data<T>(), ldb, A->data<T>(), lda, &beta, C->data<T>(), ldc, bias->data<T>(), workspace->data<T>(), workspaceSizeBytes, do_relu, stream)
[2024-04-18 08:40:14] Error: Aborted from void marian::gpu::affineTyped(marian::Tensor, marian::Ptr<marian::Allocator>, const Tensor&, const Tensor&, const Tensor&, bool, bool, T, T, bool) [with T = float; marian::Tensor = IntrusivePtr<marian::TensorBase>; marian::Ptr<marian::Allocator> = std::shared_ptr<marian::Allocator>] in /marian/src/tensors/gpu/prod.cpp:698
[CALL STACK]
[0x564280173ac4] + 0xa54ac4
[0x56428016d4a8] + 0xa4e4a8
[0x56427fbedf07] + 0x4cef07
[0x56427fca3a96] + 0x584a96
[0x56427fb6302b] + 0x44402b
[0x56427fe6c21c] + 0x74d21c
[0x56427fe534c8] + 0x7344c8
[0x56427f99261a] + 0x27361a
[0x56427f8b778b] + 0x19878b
[0x7f13eb991d90] + 0x29d90
[0x7f13eb991e40] __libc_start_main + 0x80
[0x56427f8b0b55] + 0x191b55
./train.sh: line 29: 33 Aborted (core dumped) /marian/marian --tempdir marian-tmp -c config.yml --devices 0 1 2 3 --type transformer --valid-freq 10 --save-freq 10 --early-stopping 3 --after-epochs 500 -w 6000
```
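A side note on the factored-vocab numbers in the log above: the reported expansion space of 2,470 appears to be consistent with each factor group contributing its members plus one "absent" option. This is an inference from the log, not from Marian's source:

```python
# Numbers taken from the [vocab] log lines above.
lemma_members = 493  # "Factor group '(lemma)' has 493 members"
cap_members = 4      # "Factor group '|C' has 4 members"

# Assumption: the expansion space is (members + 1) per group,
# i.e. each group's factors plus an "absent" slot.
expansion_space = (lemma_members + 1) * (cap_members + 1)
print(expansion_space)  # 2470, matching "in space of 2,470"
```

Of that space, only 1,966 combinations are valid, which becomes the target vocabulary size for input 1 ("Setting vocabulary size for input 1 to 1,966"), so the vocab setup itself looks sane before the crash.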

marian version (in the docker environment)

```
root@f52169769fca:/marian# marian --version
v1.12.0 65bf82ff 2023-02-21 09:56:29 -0800
```

nvidia-smi output

host system 1

```
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
```

host system 2

```
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
```

failing marian 1.12 cuda 12.3 docker container on host 1

```
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.3 |
```

working marian 1.11 cuda 10.2 docker container on host 1

```
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
```

failing marian 1.12 cuda 12.3 docker container on host 2

```
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
```

working marian 1.11 cuda 10.2 docker container on host 2

```
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
```

I notice that the CUDA version nvidia-smi reports is whichever is higher, the host's or the container's, but all containers were built to run against their bundled CUDA toolkit.
