stats/opentelemetry: record retry attempts from clientStream #8342

vinothkumarr227 · 2025-05-19T09:14:38Z

RELEASE NOTES:

stats/opentelemetry: Retry attempts (grpc.previous-rpc-attempts) are now recorded as span attributes for non-transparent client retries.

codecov · 2025-05-19T09:19:34Z

Codecov Report

❌ Patch coverage is 52.63158% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.50%. Comparing base (e60a04b) to head (c0e523c).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
stats/opentelemetry/client_tracing.go	52.63%	6 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8342      +/-   ##
==========================================
- Coverage   81.64%   81.50%   -0.15%     
==========================================
  Files         413      413              
  Lines       40621    40693      +72     
==========================================
- Hits        33167    33166       -1     
- Misses       5991     6000       +9     
- Partials     1463     1527      +64

Files with missing lines	Coverage Δ
stats/opentelemetry/opentelemetry.go	`50.94% <ø> (-26.97%)`	⬇️
stats/opentelemetry/trace.go	`41.07% <ø> (-47.82%)`	⬇️
stats/opentelemetry/client_tracing.go	`55.31% <52.63%> (-32.19%)`	⬇️

... and 24 files with indirect coverage changes

purnesh42H

I remember we had separate tests for retries. This change should affect only those tests; it shouldn't affect tests that make only a single attempt.

purnesh42H

@vinothkumarr227 have you tested this to ensure it's working correctly? I was under the impression that TestTraceSpan_WithRetriesAndNameResolutionDelay would need changes to its expected values. I think it's not working as expected because you are not setting the count back on ctx after incrementing.

dfawley · 2025-08-12T22:13:25Z

stats/opentelemetry/client_tracing.go

+
+	// Client-specific Begin attributes.
+	var previousRPCAttempts int64
+	if ri.ai.previousRPCAttempts != nil {


Is it possible for this to be false?

And is there any reason this is a pointer instead of just a value? Then we wouldn't need nil checks, or to construct it explicitly.

It’s not possible for it to be nil — we always initialize it in getOrCreateCallInfo with ci.previousRPCAttempts = new(atomic.Uint32). I’ve removed the nil check as well. The pointer is used so that updates in ai are automatically reflected in ci.

Oh this is in the attemptInfo. Why are we keeping it here instead of just loading it out of the callInfo?

Also I realized that with hedging, the previous attempts accounting is racy:

1. Attempt 1 starts
2. Attempt 2 starts
3. Attempt 3 starts
4. All attempts do Begin simultaneously

It's possible for any of them to have any value now. You need to instead increment and read together:

previousAttempts := previousRPCAttempts.Add(1) - 1 // Add returns the new value; we need the old value

Then they will all see unique values.

(We don't implement hedging yet, but we need to keep in mind that we will one day.)
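To illustrate the increment-and-read pattern described in this comment, here is a minimal standalone sketch. The `callInfo` type, `nextPreviousAttempts`, and `simulateAttempts` are hypothetical names made up for the example, not the code in this PR; the only real API used is `atomic.Uint32.Add` from the standard library. Each goroutine stands in for one attempt starting concurrently, as would happen with hedging.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

// callInfo is a hypothetical stand-in for the per-call state that all
// attempts of one RPC share.
type callInfo struct {
	previousRPCAttempts atomic.Uint32
}

// nextPreviousAttempts increments the counter and returns the value it held
// before the increment, in one atomic step. Add returns the new value, so
// subtracting 1 yields the number of attempts that started earlier.
func (ci *callInfo) nextPreviousAttempts() uint32 {
	return ci.previousRPCAttempts.Add(1) - 1
}

// simulateAttempts starts n attempts concurrently and returns the value each
// one observed, sorted in ascending order.
func simulateAttempts(n int) []uint32 {
	ci := &callInfo{}
	seen := make([]uint32, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			seen[i] = ci.nextPreviousAttempts()
		}(i)
	}
	wg.Wait()
	sort.Slice(seen, func(a, b int) bool { return seen[a] < seen[b] })
	return seen
}

func main() {
	// Every attempt sees a distinct value, regardless of scheduling.
	fmt.Println(simulateAttempts(3)) // prints [0 1 2]
}
```

With a separate `Load` followed by `Add`, two goroutines could both load the same value before either adds, which is the race the comment describes.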

previousRPCAttempts represents the number of retries before the current attempt. We should record the existing retry count in attemptInfo before incrementing it. After that, we increment the retry count.

Sure, I’ll keep that in mind; we’ll do it in a future update.

No, we need to atomically read and increment it, as I explained.

Hi Doug, I’m a bit confused. Initially, I had the increment here: https://github.com/grpc/grpc-go/blob/master/stats/opentelemetry/client_metrics.go#L78. After the review comments, I moved it to the tracing code. Could you clarify what exactly you’d like to change here?

If we do the load and add separately, there is a race if multiple attempts are happening at once, with hedging. So we need to do the add and use the value it returns. It returns the incremented value, so we need to subtract 1 from it to get the original value. This way if multiple attempts are happening at the same time, then each one will get a unique value.

I think we should also see about removing the previousRPCAttempts pointer from the attemptInfo, and instead read it out of the callInfo, which is stored in the context too, right?

Thanks for the suggestion! I'll update the code accordingly. I'll also double-check how it's stored in the context.

Thanks for the feedback! I've updated the code.

dfawley · 2025-08-12T22:14:03Z

stats/opentelemetry/client_tracing.go

+	// Client-specific Begin attributes.
+	var previousRPCAttempts int64
+	if ri.ai.previousRPCAttempts != nil {
+		previousRPCAttempts = int64(ri.ai.previousRPCAttempts.Load())


Why are we loading this here but only using it inside the if below? Why not load only when needed?

No need for the condition — I’ve removed the nil check as well.

dfawley · 2025-08-12T22:14:37Z

stats/opentelemetry/client_tracing.go

+			attribute.Bool("Client", begin.Client),
+			attribute.Bool("FailFast", begin.FailFast),


Let's remove these now since they are not part of the spec

dfawley

Sorry for the delays in review here. LGTM after this one small change.

dfawley · 2025-09-05T20:50:45Z

stats/opentelemetry/client_metrics.go

-			method: determineMethod(method, opts...),
+			target:              cc.CanonicalTarget(),
+			method:              determineMethod(method, opts...),
+			previousRPCAttempts: new(atomic.Uint32),


Let's make this a non-pointer type and then we don't need the new here or the chance to get nil panics.

dfawley · 2025-09-09T20:04:08Z

stats/opentelemetry/client_metrics.go

 			target:              cc.CanonicalTarget(),
 			method:              determineMethod(method, opts...),
-			previousRPCAttempts: new(atomic.Uint32),
+			previousRPCAttempts: atomic.Uint32{},


Please delete this line. This is the zero value so it doesn't need explicit initialization.
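As a small illustration of why no explicit initialization is needed, the sketch below uses a hypothetical `callInfo` struct and `newCallInfo` constructor (assumed names, not the PR's code) with a non-pointer `atomic.Uint32` field. The zero value of `atomic.Uint32` is a usable counter starting at 0, so the field can simply be left out of the struct literal.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// callInfo is a hypothetical struct mirroring the shape discussed above.
type callInfo struct {
	target string
	method string
	// A value (not a pointer): the zero value is ready to use and can never
	// be nil.
	previousRPCAttempts atomic.Uint32
}

// newCallInfo deliberately omits previousRPCAttempts from the literal.
func newCallInfo(target, method string) *callInfo {
	return &callInfo{target: target, method: method}
}

func main() {
	ci := newCallInfo("dns:///example.test", "/pkg.Service/Method")
	fmt.Println(ci.previousRPCAttempts.Load()) // prints 0
	ci.previousRPCAttempts.Add(1)
	fmt.Println(ci.previousRPCAttempts.Load()) // prints 1
}
```

One consequence of the value field is that the struct should be shared by pointer, as above, since copying a struct that contains an atomic value is flagged by `go vet`.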

…rpc#8342)" This reverts commit c122250.

…8342)" (#8571)

This introduced flakiness in a test: Test/TraceSpan_WithRetriesAndNameResolutionDelay
Failure: https://github.com/grpc/grpc-go/actions/runs/17614152882/job/50042942932?pr=8547
Related issue: #8299
RELEASE NOTES: None

Fixed retry attempts in HandleRPC

3ba457e

purnesh42H reviewed May 19, 2025

stream.go (outdated, resolved)

purnesh42H assigned vinothkumarr227 May 19, 2025

purnesh42H added the Type: Bug and Area: Observability (includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability) labels May 19, 2025

purnesh42H added this to the 1.73 Release milestone May 19, 2025

vinothkumarr227 added 2 commits May 20, 2025 06:55

Fixed the review changes

5d19779

Fixed vet issues

4245950

vinothkumarr227 requested a review from purnesh42H May 20, 2025 07:09

eshitachandwani assigned purnesh42H and unassigned vinothkumarr227 May 20, 2025

purnesh42H reviewed May 20, 2025

stats/opentelemetry/trace.go (outdated, resolved)

stats/opentelemetry/client_tracing.go (outdated, resolved)

stats/opentelemetry/client_tracing.go (outdated, resolved)

purnesh42H assigned vinothkumarr227 and unassigned purnesh42H May 20, 2025

Fixed the review changes

5347db1

vinothkumarr227 requested a review from purnesh42H May 21, 2025 06:08

purnesh42H reviewed May 21, 2025

stats/opentelemetry/client_tracing.go (outdated, resolved)

stats/opentelemetry/trace.go (outdated, resolved)

stats/opentelemetry/trace.go (outdated, resolved)

stats/opentelemetry/e2e_test.go (outdated, resolved)

Fixed the review changes

99e88d8

vinothkumarr227 requested a review from purnesh42H May 22, 2025 07:51

purnesh42H reviewed May 26, 2025

vinothkumarr227 added 2 commits May 26, 2025 12:58

Fixed the review changes

586cf63

Fixed the test cases

1bdad7e

eshitachandwani assigned purnesh42H and unassigned vinothkumarr227 May 27, 2025

eshitachandwani requested a review from purnesh42H May 27, 2025 09:37

purnesh42H reviewed May 28, 2025

purnesh42H assigned vinothkumarr227 and unassigned purnesh42H May 28, 2025

Fixed the review changes

39c5f0d

vinothkumarr227 requested a review from dfawley August 6, 2025 12:31

dfawley removed their assignment Aug 6, 2025

Pranjali-2501 assigned vinothkumarr227 Aug 7, 2025

Fixed the review changes

9f59e51

vinothkumarr227 removed their assignment Aug 11, 2025

eshitachandwani assigned dfawley Aug 12, 2025

dfawley reviewed Aug 12, 2025

dfawley assigned vinothkumarr227 and unassigned dfawley Aug 12, 2025

Fixed the review changes

1660fcb

vinothkumarr227 requested a review from dfawley August 13, 2025 11:01

vinothkumarr227 removed their assignment Aug 13, 2025

Fixed the review changes

992343e

eshitachandwani assigned dfawley Aug 14, 2025

dfawley approved these changes Sep 5, 2025

dfawley assigned vinothkumarr227 and unassigned dfawley Sep 5, 2025

vinothkumarr227 added 3 commits September 8, 2025 04:45

small tweaks

25f2e3d

Merge remote-tracking branch 'origin/master' into stats-retry-attempts-fix-8299

e225af4

Merge remote-tracking branch 'origin/master' into stats-retry-attempts-fix-8299

f32e117

vinothkumarr227 removed their assignment Sep 9, 2025

arjan-bal assigned dfawley Sep 9, 2025

dfawley approved these changes Sep 9, 2025

dfawley assigned vinothkumarr227 and unassigned dfawley Sep 9, 2025

small tweaks

c0e523c

vinothkumarr227 removed their assignment Sep 10, 2025

eshitachandwani merged commit c122250 into grpc:master Sep 10, 2025
15 checks passed

eshitachandwani added a commit to eshitachandwani/grpc-go that referenced this pull request Sep 10, 2025

Revert "stats/opentelemetry: record retry attempts from clientStream (grpc#8342)"

66b7615

This reverts commit c122250.
