299 questions with no answers
0
votes
0
answers
28
views
Why does MixCoord keep routing requests to stale QueryNodes after a Kubernetes node reboot in Milvus?
MixCoord keeps routing requests to non-existent QueryNodes after a Kubernetes worker node reboot in Milvus
I’m running a Milvus 2.5.x cluster on Kubernetes, where each worker node hosts a full set of ...
0
votes
0
answers
52
views
Distributed TensorFlow with multiple GPUs training MNIST with Optuna is stuck when training
I created a 5-GPU cluster using three nodes/machines locally with TensorFlow's MultiWorkerMirroredStrategy. One machine has the Apple M1 Pro Metal GPU, the other two nodes have NVIDIA ...
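For reference, a minimal sketch of how MultiWorkerMirroredStrategy is usually wired up for MNIST: each machine runs the same script with its own task index in TF_CONFIG. The hostnames and port are placeholders, not the asker's cluster.

# Minimal MultiWorkerMirroredStrategy sketch (hostnames/ports are placeholders).
# TF_CONFIG must be set before the strategy is created; each machine sets its
# own "index" and otherwise runs identical code.
import json
import os

import tensorflow as tf

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host-a:12345", "host-b:12345", "host-c:12345"]},
    "task": {"type": "worker", "index": 0},  # 1 and 2 on the other machines
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0

with strategy.scope():  # variables are created and synced under the strategy
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

model.fit(x_train, y_train, epochs=1, batch_size=64)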
0
votes
0
answers
44
views
How do clusters work in TensorFlow's ParameterServerStrategy?
I don't seem to understand how clusters work with ParameterServerStrategy in TensorFlow, and I need some clarification.
I have read this tutorial, but it doesn't mention or explain clearly how to ...
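As background, a sketch of what a ParameterServerStrategy cluster looks like in recent TF 2.x: the chief drives training, while worker and ps tasks just start a server and wait. The host:port values and task assignment are placeholders.

# ParameterServerStrategy cluster sketch (host:port values are placeholders).
# Every process runs this file; only the "task" entry in TF_CONFIG differs.
import json
import os

import tensorflow as tf

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["chief-host:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
        "ps": ["ps-0:2222"],
    },
    "task": {"type": "chief", "index": 0},  # "worker"/"ps" on the other processes
})

resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

if resolver.task_type in ("worker", "ps"):
    # Workers and parameter servers only host variables / execute ops sent by
    # the coordinator; they block here for the lifetime of the job.
    server = tf.distribute.Server(
        resolver.cluster_spec(),
        job_name=resolver.task_type,
        task_index=resolver.task_id,
        protocol="grpc",
    )
    server.join()
else:
    # The chief creates the strategy; variables are placed on the ps tasks.
    strategy = tf.distribute.ParameterServerStrategy(resolver)
    with strategy.scope():
        dense = tf.keras.layers.Dense(10)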
0
votes
0
answers
82
views
Low CIFAR-10 Accuracy (60%) in Decentralized Federated Learning (DFL) - Seeking Improvement
I implemented an algorithm in a Decentralized Federated Learning (DFL) environment. When I experimented with MNIST and Fashion-MNIST, I achieved an accuracy of 80–90%. However, when testing with CIFAR-...
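A common first step when a pipeline tuned for MNIST is reused on CIFAR-10 is stronger augmentation and per-channel normalization; a general torchvision sketch follows, not specific to the asker's DFL algorithm.

# Standard CIFAR-10 preprocessing that often recovers part of the accuracy gap
# when moving from MNIST-style pipelines (a general sketch, not the DFL setup).
import torchvision.transforms as T
from torchvision.datasets import CIFAR10

train_tf = T.Compose([
    T.RandomCrop(32, padding=4),      # translation augmentation
    T.RandomHorizontalFlip(),         # CIFAR classes are flip-invariant
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
test_tf = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

train_set = CIFAR10(root="./data", train=True, download=True, transform=train_tf)
test_set = CIFAR10(root="./data", train=False, download=True, transform=test_tf)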
0
votes
0
answers
461
views
Facing an issue connecting to a socket with DDP and PyTorch (single node, multi-GPU communication)
I am completely new to distributed programming and have been trying to port the original code that ran on a multi-node cluster to a single-node cluster with multiple GPUs. My goal is to simulate a ...
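For comparison, a minimal single-node, multi-GPU DDP initialization sketch (not the asker's code): on one machine the master address is the local host and the port just needs to be free.

# Minimal single-node, multi-GPU DDP init (a sketch; port 29500 is arbitrary).
# If MASTER_ADDR/MASTER_PORT point anywhere other than the local machine and a
# free port, init_process_group fails with a socket/connection error.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # single node: everything is local
    os.environ["MASTER_PORT"] = "29500"       # any free port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)               # one GPU per process

    x = torch.ones(1, device=rank)
    dist.all_reduce(x)                        # sanity check: prints world_size
    print(f"rank {rank}: {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)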
0
votes
0
answers
38
views
Better way to integrate Kafka with Akka Cluster Sharding
We have Kafka as the bus and Akka Cluster Sharding as the application's distributed cluster, so we need to consume data from Kafka and process it in the Akka cluster.
For now we implement separate ...
0
votes
0
answers
92
views
SLURM: torch distributed: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED
I am new to PyTorch distributed, and any input will help. I have code working with a single GPU and am trying to make it distributed, but I am getting a socket connect error. Below is the code (I am ...
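One frequent cause of the "device >= 0 && device < num_gpus" assert under SLURM is using the global rank as the CUDA device index; a hedged sketch of mapping SLURM variables to a per-node local rank follows (env var names assume an srun launch, and MASTER_ADDR/MASTER_PORT are assumed to be exported by the batch script).

# Map SLURM ranks to GPUs per node, not globally (a sketch, not the asker's code).
import os

import torch
import torch.distributed as dist

global_rank = int(os.environ["SLURM_PROCID"])
local_rank = int(os.environ["SLURM_LOCALID"])   # rank within this node
world_size = int(os.environ["SLURM_NTASKS"])

torch.cuda.set_device(local_rank)               # must be < GPUs on this node
dist.init_process_group(
    backend="nccl",
    rank=global_rank,
    world_size=world_size,
    # MASTER_ADDR / MASTER_PORT are expected to be exported in the sbatch script.
)
model = torch.nn.Linear(10, 10).to(local_rank)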
0
votes
0
answers
143
views
PyTorch Distributed and Multiprocessing: Processes execute from the start of the whole script, not the place where they are created
Introduction
I'm new to PyTorch distributed and multiprocessing, and I ran into an unexpected problem:
I have learned that processes created by spawn will execute the given function, but my processes ...
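The behaviour described (children re-running from the top of the script) is what the "spawn" start method does: each child re-imports the main module, so module-level code runs again in every process unless it is guarded. A general sketch:

# With "spawn", each child re-imports this module, so module-level code runs
# again in every process; only the entry-point guard keeps the parent's work
# from being repeated (a general sketch, not the asker's code).
import torch.multiprocessing as mp

print("module-level code: runs in the parent AND in every spawned child")

def worker(rank: int) -> None:
    print(f"worker body: runs only in child {rank}")

if __name__ == "__main__":
    # Without this guard, spawning would recursively re-execute the spawn call.
    mp.spawn(worker, nprocs=2)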
1
vote
0
answers
211
views
Distributed SQL Caching in .NET 4.7.2
Has anyone used distributed SQL caching in .NET 4.7.2? I have seen many code samples for SQL caching with .NET Core, but not with .NET Framework 4.7.2.
We are currently using Redis cache in the ...
0
votes
0
answers
106
views
Using Java RMI and Reflection for a Broker without an interface
I'm attempting to develop a broker in Java. Currently, I have created a server capable of posting its services within the broker. Additionally, I have implemented a client that can invoke methods on ...
0
votes
0
answers
87
views
How to save the JavaScript runtime state
I want to be able to take a snapshot of a running JavaScript program, save it to a database, wait for some period of time, load the state back, and continue the process execution. Is it possible?
0
votes
0
answers
52
views
Akka PubSub not working across a distributed system
I have a system that is clustered and deployed on k8s; it will have multiple instances when deployed.
My code sample is below:
import akka.actor.typed.pubsub.Topic
import akka.actor.typed.scaladsl....
0
votes
0
answers
390
views
Qdrant: Which shard is on which node? It seems like all shards are on the same node
I installed the Qdrant Helm chart in a cluster with 1 master node and 4 worker nodes, and I created a collection with shard_number=2, replication_factor=2.
When I get the cluster info with the command:
curl http://192....
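Per-collection shard placement can be inspected with the collection cluster-info endpoint rather than the top-level /cluster endpoint; a hedged Python sketch follows, assuming /collections/<name>/cluster is available in the installed Qdrant version. The host URL and collection name are placeholders.

# Check where each shard of a collection lives (a sketch; adjust URL/name).
import requests

QDRANT_URL = "http://qdrant.example.internal:6333"   # placeholder address
COLLECTION = "my_collection"                          # placeholder name

info = requests.get(f"{QDRANT_URL}/collections/{COLLECTION}/cluster").json()
result = info["result"]

print("this peer:", result["peer_id"])
for shard in result.get("local_shards", []):
    print("local shard", shard["shard_id"], shard.get("state"))
for shard in result.get("remote_shards", []):
    print("remote shard", shard["shard_id"], "on peer", shard["peer_id"])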
1
vote
0
answers
407
views
How to debug the torch distributed ProcessGroup?
I want to analyse the PyTorch distributed backend interface, but I don't know how to debug it.
I have tried VS Code Python debugging, GDB attach, and the Python/C++ debugger, but subprocesses can't be debugged.
I'm wondering if ...
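One commonly used approach (a sketch, not the only answer) is to turn on PyTorch's built-in distributed debug logging and give every rank its own debugger port so the spawned subprocesses can be attached to individually; the port numbers below are arbitrary.

# Launch with verbose collective logging, e.g.:
#   TORCH_DISTRIBUTED_DEBUG=DETAIL TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO \
#       torchrun --nproc_per_node=2 train.py
import os

import debugpy  # pip install debugpy; attach from VS Code with "Remote Attach"

rank = int(os.environ.get("RANK", "0"))
debugpy.listen(("0.0.0.0", 5678 + rank))  # one port per rank: 5678, 5679, ...
if rank == 0:
    debugpy.wait_for_client()             # pause rank 0 until the IDE attaches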
1
vote
0
answers
3k
views
How to use Hugging Face's device_map on multiple GPUs?
I encountered the following issues while using the device_map provided by Hugging Face for model-parallel inference:
I am running the example code provided by Hugging Face, which can be ...
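For reference, the typical device_map usage for sharding one model across several GPUs; the checkpoint name is a placeholder and the accelerate package must be installed for device_map to work.

# Spread a model across visible GPUs with device_map="auto" (a minimal sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-org/some-causal-lm"   # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # let accelerate shard layers across GPUs
    torch_dtype=torch.float16,
)

print(model.hf_device_map)       # shows which module landed on which device

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))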