299 questions with no answers
0
votes
0
answers
28
views
Why does MixCoord keep routing requests to stale QueryNodes after a Kubernetes node reboot in Milvus?
MixCoord keeps routing requests to non-existent QueryNodes after a Kubernetes worker node reboot in Milvus
I’m running a Milvus 2.5.x cluster on Kubernetes, where each worker node hosts a full set of ...
0
votes
0
answers
52
views
Distributed TensorFlow with multiple GPUs training MNIST with Optuna is stuck when training
I created a 5-GPU cluster using three nodes/machines locally with TensorFlow's MultiWorkerMirroredStrategy. One machine has the Apple M1 Pro Metal GPU, the other two nodes have NVIDIA ...
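For reference, a minimal sketch of how MultiWorkerMirroredStrategy is usually wired up for MNIST: each machine runs the same script with its own task index in TF_CONFIG. The hostnames and port are placeholders, not the asker's cluster.

# Minimal MultiWorkerMirroredStrategy sketch (hostnames/ports are placeholders).
# TF_CONFIG must be set before the strategy is created; each machine sets its
# own "index" and otherwise runs identical code.
import json
import os

import tensorflow as tf

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host-a:12345", "host-b:12345", "host-c:12345"]},
    "task": {"type": "worker", "index": 0},  # 1 and 2 on the other machines
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0

with strategy.scope():  # variables are created and synced under the strategy
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

model.fit(x_train, y_train, epochs=1, batch_size=64)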
0
votes
0
answers
44
views
How do clusters work in TensorFlow's ParameterServerStrategy?
I don't seem to understand how clusters work with ParameterServerStrategy in TensorFlow, and I need some clarification.
I have read this tutorial, but it doesn't mention or explain clearly how to ...
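As background, a sketch of what a ParameterServerStrategy cluster looks like in recent TF 2.x: the chief drives training, while worker and ps tasks just start a server and wait. The host:port values and task assignment are placeholders.

# ParameterServerStrategy cluster sketch (host:port values are placeholders).
# Every process runs this file; only the "task" entry in TF_CONFIG differs.
import json
import os

import tensorflow as tf

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["chief-host:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222"],
    },
    "task": {"type": "chief", "index": 0},  # "worker"/"ps" on the other processes
})

resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

if resolver.task_type in ("worker", "ps"):
    # Workers and parameter servers only host variables / execute ops sent by
    # the coordinator; they block here for the lifetime of the job.
    server = tf.distribute.Server(
        resolver.cluster_spec(),
        job_name=resolver.task_type,
        task_index=resolver.task_id,
        protocol="grpc",
    )
    server.join()
else:
    # The chief creates the strategy; variables are placed on the ps tasks.
    strategy = tf.distribute.ParameterServerStrategy(resolver)
    with strategy.scope():
        dense = tf.keras.layers.Dense(10)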
0
votes
0
answers
82
views
Low CIFAR-10 Accuracy (60%) in Decentralized Federated Learning (DFL) - Seeking Improvement
I implemented an algorithm in a Decentralized Federated Learning (DFL) environment. When I experimented with MNIST and Fashion-MNIST, I achieved an accuracy of 80–90%. However, when testing with CIFAR-...
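A common first step when a pipeline tuned for MNIST is reused on CIFAR-10 is stronger augmentation and per-channel normalization; a general torchvision sketch follows, not specific to the asker's DFL algorithm.

# Standard CIFAR-10 preprocessing that often recovers part of the accuracy gap
# when moving from MNIST-style pipelines (a general sketch, not the DFL setup).
import torchvision.transforms as T
from torchvision.datasets import CIFAR10

train_tf = T.Compose([
    T.RandomCrop(32, padding=4),      # translation augmentation
    T.RandomHorizontalFlip(),         # CIFAR classes are flip-invariant
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
test_tf = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

train_set = CIFAR10(root="./data", train=True, download=True, transform=train_tf)
test_set = CIFAR10(root="./data", train=False, download=True, transform=test_tf)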
0
votes
0
answers
461
views
Facing an issue connecting to a socket with DDP and PyTorch (single node, multi-GPU communication)
I am completely new to distributed programming and have been trying to port the original code that ran on a multi-node cluster to a single-node cluster with multiple GPUs. My goal is to simulate a ...
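For comparison, a minimal single-node, multi-GPU DDP initialization sketch (not the asker's code): on one machine the master address is the local host and the port just needs to be free.

# Minimal single-node, multi-GPU DDP init (a sketch; port 29500 is arbitrary).
# If MASTER_ADDR/MASTER_PORT point anywhere other than the local machine and a
# free port, init_process_group fails with a socket/connection error.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # single node: everything is local
    os.environ["MASTER_PORT"] = "29500"       # any free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)               # one GPU per process

    x = torch.ones(1, device=rank)
    dist.all_reduce(x)                        # sanity check: prints world_size
    print(f"rank {rank}: {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)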
0
votes
0
answers
38
views
Better way to integrate Kafka with Akka Cluster Sharding
We have Kafka as the bus and Akka Cluster Sharding as the application's distributed cluster, so we need to consume data from Kafka and process it in the Akka cluster.
For now we implement separate ...
0
votes
0
answers
92
views
SLURM: torch distributed: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED
I am new to PyTorch distributed, and any input will help. I have code working with a single GPU and am trying to make it distributed, but I am getting a socket connect error. Below is the code (I am ...
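One frequent cause of the "device >= 0 && device < num_gpus" assert under SLURM is using the global rank as the CUDA device index; a hedged sketch of mapping SLURM variables to a per-node local rank follows (env var names assume an srun launch, and MASTER_ADDR/MASTER_PORT are assumed to be exported by the batch script).

# Map SLURM ranks to GPUs per node, not globally (a sketch, not the asker's code).
import os

import torch
import torch.distributed as dist

global_rank = int(os.environ["SLURM_PROCID"])
local_rank = int(os.environ["SLURM_LOCALID"])   # rank within this node
world_size = int(os.environ["SLURM_NTASKS"])

torch.cuda.set_device(local_rank)               # must be < GPUs on this node
dist.init_process_group(
    backend="nccl",
    rank=global_rank,
    world_size=world_size,
    # MASTER_ADDR / MASTER_PORT are expected to be exported in the sbatch script.
)
model = torch.nn.Linear(10, 10).to(local_rank)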
0
votes
0
answers
143
views
PyTorch Distributed and Multiprocessing: Processes execute from the start of the whole script, not the place where they are created
Introduction
I'm new to PyTorch distributed and multiprocessing, and I ran into an unexpected problem:
I have learned that processes created by spawn will execute the given function, but my processes ...
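The behaviour described (children re-running from the top of the script) is what the "spawn" start method does: each child re-imports the main module, so module-level code runs again in every process unless it is guarded. A general sketch:

# With "spawn", each child re-imports this module, so module-level code runs
# again in every process; only the entry-point guard keeps the parent's work
# from being repeated (a general sketch, not the asker's code).
import torch.multiprocessing as mp

print("module-level code: runs in the parent AND in every spawned child")

def worker(rank: int) -> None:
    print(f"worker body: runs only in child {rank}")

if __name__ == "__main__":
    # Without this guard, spawning would recursively re-execute the spawn call.
    mp.spawn(worker, nprocs=2)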
1
vote
0
answers
211
views
Distributed SQL Caching in .NET 4.7.2
Has anyone used distributed SQL caching in .NET 4.7.2? I have seen many code samples for SQL caching with .NET Core, but not with .NET Framework 4.7.2.
We are currently using Redis cache in the ...
0
votes
0
answers
106
views
Using Java RMI and Reflection for a Broker without an interface
I'm attempting to develop a broker in Java. Currently, I have created a server capable of posting its services within the broker. Additionally, I have implemented a client that can invoke methods on ...
0
votes
0
answers
87
views
How to save the JavaScript runtime state
I want to be able to take a snapshot of a running JavaScript program, save it to a database, wait for some period of time, load the state back, and continue the process execution. Is it possible?
0
votes
0
answers
52
views
Akka PubSub not working across a distributed system
I have a system that is clustered and deployed on k8s; it will have multiple instances when deployed.
My code sample is below:
import akka.actor.typed.pubsub.Topic
import akka.actor.typed.scaladsl....
0
votes
0
answers
390
views
Qdrant: Which shard is on which node? It seems like all shards are on the same node
I installed the Qdrant Helm chart in a cluster with 1 master node and 4 worker nodes, and I created a collection with shard_number=2, replication_factor=2.
When I get the cluster info with the command:
curl http://192....
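Per-collection shard placement can be inspected with the collection cluster-info endpoint rather than the top-level /cluster endpoint; a hedged Python sketch follows, assuming /collections/<name>/cluster is available in the installed Qdrant version. The host URL and collection name are placeholders.

# Check where each shard of a collection lives (a sketch; adjust URL/name).
import requests

QDRANT_URL = "http://qdrant.example.internal:6333"   # placeholder address
COLLECTION = "my_collection"                          # placeholder name

info = requests.get(f"{QDRANT_URL}/collections/{COLLECTION}/cluster").json()
result = info["result"]

print("this peer:", result["peer_id"])
for shard in result.get("local_shards", []):
    print("local shard", shard["shard_id"], shard.get("state"))
for shard in result.get("remote_shards", []):
    print("remote shard", shard["shard_id"], "on peer", shard["peer_id"])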
1
vote
0
answers
407
views
How to debug the torch distributed ProcessGroup?
I want to analyse the PyTorch distributed backend interface, but I don't know how to debug it.
I have tried VS Code Python debugging, GDB attach, and the Python/C++ debugger, but subprocesses can't be debugged.
I'm wondering if ...
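One commonly used approach (a sketch, not the only answer) is to turn on PyTorch's built-in distributed debug logging and give every rank its own debugger port so the spawned subprocesses can be attached to individually; the port numbers below are arbitrary.

# Launch with verbose collective logging, e.g.:
#   TORCH_DISTRIBUTED_DEBUG=DETAIL TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO \
#       torchrun --nproc_per_node=2 train.py
import os

import debugpy  # pip install debugpy; attach from VS Code with "Remote Attach"

rank = int(os.environ.get("RANK", "0"))
debugpy.listen(("0.0.0.0", 5678 + rank))  # one port per rank: 5678, 5679, ...
if rank == 0:
    debugpy.wait_for_client()             # pause rank 0 until the IDE attaches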
1
vote
0
answers
3k
views
How to use Hugging Face's device_map on multiple GPUs?
I encountered the following issues while using the device_map provided by Hugging Face for model-parallel inference:
I am running the example code provided by Hugging Face, which can be ...
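For reference, the typical device_map usage for sharding one model across several GPUs; the checkpoint name is a placeholder and the accelerate package must be installed for device_map to work.

# Spread a model across visible GPUs with device_map="auto" (a minimal sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-org/some-causal-lm"   # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # let accelerate shard layers across GPUs
    torch_dtype=torch.float16,
)

print(model.hf_device_map)       # shows which module landed on which device

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))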