Newest 'distributed-computing' Questions

0 votes

1 answer

38 views

Upsert! Operation Throws "A table can't contain duplicate column names" Error

I have a base table A and a result table B in DolphinDB. Table B was initially empty and is used to store calculated results based on table A. When trying to insert the calculated results into table B,...

RORO

1

asked Oct 24 at 9:52

3 votes

1 answer

135 views

In Apache Ignite, the Replication mode and Partition mode do not work together

I’m working with Apache Ignite 2.17.0. I load database tables into Ignite caches and run SQL queries using the SqlFieldsQuery API. Recently, I modified the cache configuration for some tables to use ...

kushal Baldev

799

asked Jul 29 at 17:31

0 votes

0 answers

62 views

Get two different nodes to access and distribute the same SQL table in Apache Spark?

I have the following code to test. I created a table on worker 1. Then I tried to read the table on worker 2 and it got TABLE_OR_VIEW_NOT_FOUND. Worker 2 is on the same computer as Master. I ran the ...

Rick C. Ferreira

1

asked Jun 16 at 19:25

0 votes

0 answers

50 views

How to best partition my data with a 32 core EMR instance and make sure I max out the parallelize feature?

I’m optimizing a PySpark pipeline that processes records with a heavily skewed categorical column (category). The data has: A few high-frequency categories (e.g., 90% of records fall into 2-3 ...

Bilal Jamil

27

asked Apr 30 at 2:51

1 vote

1 answer

114 views

Distributed REST API Calls using SPARK with maintaining consistency

I have a Spark DataFrame created from a Delta table, with one column of type STRUCT(JSON). For each row in this DataFrame, I need to make a REST API call using the JSON payload in the column. ...

uds0128

53

asked Mar 2 at 18:42

0 votes

0 answers

328 views

PyTorch DDP Multi-Node Training: ncclInternalError: Internal check failed. Bootstrap : no socket interface found

I am trying to run a multi-node training job using PyTorch's DistributedDataParallel (DDP) following this guide. However, when I launch the job with torchrun, I encounter the following NCCL error on ...

yunjeong

1

asked Jan 31 at 7:19

1 vote

0 answers

91 views

Segmentation Fault During Validation with MirroredStrategy on Multiple GPUs

I am training a model using TensorFlow 2.18.0 with the tf.distribute.MirroredStrategy across two GPUs. The training works fine on a single GPU, but when I try to run it on two GPUs, it ends with a ...

TGD

56

asked Jan 13 at 7:42

0 votes

0 answers

81 views

Vertex AI Reduction Server returning 500 Internal Error

I am looking to fine-tune a pre-trained DeBERTa model on Vertex AI with PyTorch. I'm attempting to run a distributed job, making use of the Vertex AI reduction server. I'm following this notebook: ...

purpleFudge

1

asked Jan 1 at 14:59
