Newest 'distributed' Questions

-3 votes

0 answers

43 views

Avoiding counter update contention under high write throughput [closed]

We maintain multiple counters, and each incoming request increments or decrements one or more of them. The counters are bounded by a max value; once it is reached, we reject requests. ...

tusharRawat

631

asked Dec 26, 2025 at 20:49

0 votes

0 answers

28 views

Why does MixCoord keep routing requests to stale QueryNodes after a Kubernetes node reboot in Milvus?

MixCoord keeps routing requests to non-existent QueryNodes after a Kubernetes worker node reboot in Milvus. I’m running a Milvus 2.5.x cluster on Kubernetes, where each worker node hosts a full set of ...

Schiffer Marget

9

asked Dec 23, 2025 at 7:03

0 votes

0 answers

52 views

Distributed TensorFlow with multiple GPUs training MNIST with Optuna is stuck when training

I created a 5-GPU cluster from three local nodes/machines using tf.distribute.MultiWorkerMirroredStrategy. One machine has the Apple M1 Pro Metal GPU, the other two nodes have NVIDIA ...

vinhdiesal

1

asked Jul 4, 2025 at 19:53

0 votes

0 answers

44 views

How do clusters work in TensorFlow's ParameterServerStrategy?

I don't understand how clusters work in TensorFlow's ParameterServerStrategy, and I need some clarification. I have read this tutorial, but it doesn't mention or clearly explain how to ...

ali-saaeddin-1123581321

1

asked Feb 22, 2025 at 13:50

0 votes

0 answers

82 views

Low CIFAR-10 Accuracy (60%) in Decentralized Federated Learning (DFL) - Seeking Improvement

I implemented an algorithm in a Decentralized Federated Learning (DFL) environment. When I experimented with MNIST and Fashion-MNIST, I achieved an accuracy of 80–90%. However, when testing with CIFAR-...

ddochi

37

asked Feb 14, 2025 at 6:15

0 votes

0 answers

461 views

Issue connecting to socket with DDP and PyTorch (single node, multi-GPU communication)

I am completely new to distributed programming and have been trying to port code that originally ran on a multi-node cluster to a single node with multiple GPUs. My goal is to simulate a ...

soumya_sarkar.19

13

asked Feb 9, 2025 at 6:29

0 votes

0 answers

38 views

Better way to integrate Kafka with Akka Cluster Sharding

We have Kafka as the bus and Akka Cluster Sharding as the distributed application cluster, so we need to consume data from Kafka and process it in the Akka cluster. For now we implement separate ...

Donz

1,397

asked Jan 24, 2025 at 15:33

0 votes

0 answers

92 views

SLURM: torch distributed: CUDA call failed lazily at initialization with error: device >= 0 && device < num_gpus INTERNAL ASSERT FAILED

I am new to pytorch-distributed, and any input will help. I have code that works with a single GPU, and I am trying to make it distributed. I am getting a socket connect error. Below is the code (I am ...

GPS-999

11

asked Jan 16, 2025 at 5:56

0 votes

1 answer

42 views

How to maintain synchronization between distributed python processes?

I have a number of workstations that run long processes containing sequences like this: x = wait_while_current_is_set; y = read_voltage; z = z + y. The workstations must maintain synchronization with a ...

david

2,706

asked Dec 1, 2024 at 6:40

0 votes

0 answers

143 views

PyTorch Distributed and Multiprocessing: Processes execute from the start of the whole script, not from the place where they are created

Introduction: I'm new to PyTorch distributed and multiprocessing and ran into an unexpected problem. I have learned that processes created by spawn will execute the given function, but my processes ...

Mecreative

1

asked Oct 2, 2024 at 3:58

2 votes

2 answers

230 views

Client request failure in Raft

Imagine a 3-node Raft cluster. Each node is in sync and has log [1,2,3], and entry 3 is committed by the leader. Now the leader receives an entry 4 but fails to commit it because of an unreliable network and ...

Dumb_Pegasus

129

asked Sep 15, 2024 at 11:04

1 vote

0 answers

211 views

Distributed SQL Caching in .NET 4.7.2

Has anyone used distributed SQL caching in .NET 4.7.2? I have seen many code samples for SQL caching with .NET Core but not with .NET Framework 4.7.2. We are currently using Redis cache in the ...

Monisha

11

asked Sep 12, 2024 at 20:58

-1 votes

1 answer

394 views

SeaweedFS S3 Gateway Stuck Connecting to Incorrect gRPC Port [closed]

I've been setting up SeaweedFS on a cluster of three nodes and encountered issues when configuring the S3 gateway. The S3 gateway tries to connect to the incorrect gRPC port 28888 instead of the ...

quarks

35.7k

asked Jun 9, 2024 at 20:25

0 votes

1 answer

483 views

INSERT into local table SELECT from distributed table in ClickHouse causes "default.local_table does not exist" error on another node

I need to select data from some distributed and local tables and insert it into another standalone local table. I use SQL like this: INSERT into local_table SELECT FROM distributed_table WHERE ... . The ...

Sam Wang

1

asked Jun 3, 2024 at 12:12

1 vote

2 answers

371 views

Uniswap: using the SDK to get historical rates (and the current rate)

I am trying to use the Uniswap SDK to get historical rates between two coins in a pool. I believe the rate is simply xy = k, where k is a constant. If someone buys n coins of x, the cost in terms ...

efwefwefwefwefw wefwefwefwef

25

asked May 25, 2024 at 3:43
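
The constant-product relation xy = k mentioned in the last excerpt can be sketched as below. This is only an illustration under stated assumptions: there are no fees, the pool is a plain two-asset constant-product pool, and the function name `cost_to_buy` and its parameters are hypothetical, not part of the Uniswap SDK.

```python
from fractions import Fraction


def cost_to_buy(x_reserve, y_reserve, n):
    """Amount of y to pay in so that n units of x can be taken out.

    Assumes a fee-free pool that keeps x * y = k constant.
    Hypothetical helper for illustration; not a Uniswap SDK call.
    """
    if not 0 < n < x_reserve:
        raise ValueError("n must be positive and smaller than the x reserve")
    k = Fraction(x_reserve) * Fraction(y_reserve)  # invariant before the trade
    new_y = k / (Fraction(x_reserve) - n)          # y reserve that restores x * y = k
    return new_y - y_reserve                       # extra y the buyer must supply


if __name__ == "__main__":
    # Buying half of the x reserve from a 100/100 pool
    print(cost_to_buy(100, 100, 50))
```

Real pools charge a swap fee, and Uniswap v3 uses concentrated liquidity, so quotes from the SDK will likely differ from this idealized figure.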
