Skip to content

Conversation

@AwesomePatrol
Copy link
Contributor

Part of #12460

SpanProcessor filters Spans to mimic behavior of LogIfLong and passes them to SpanExporter.

SpanExporter writes Spans to zap.Logger as Info messages while trying to keep the same format and contents as the original bespoke solution.

Currently, the only difference are "child traces" which are not yet instrumented in Etcd's OpenTelemetry tracing. This can be later mitigated by adding "links" or copying events between spans. I also have other ideas...

It can replace bespoke tracing solution while keeping the current
behavior and log format.

Signed-off-by: Aleksander Mistewicz <amistewicz@google.com>
@codecov
Copy link

codecov bot commented Sep 26, 2025

Codecov Report

❌ Patch coverage is 67.69231% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.14%. Comparing base (d3f136a) to head (734a4c4).
⚠️ Report is 16 commits behind head on main.

Files with missing lines Patch % Lines
pkg/traceutil/processor.go 28.57% 15 Missing ⚠️
pkg/traceutil/exporter.go 86.36% 6 Missing ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
pkg/traceutil/exporter.go 86.36% <86.36%> (ø)
pkg/traceutil/processor.go 28.57% <28.57%> (ø)

... and 32 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #20728      +/-   ##
==========================================
- Coverage   69.16%   69.14%   -0.02%     
==========================================
  Files         420      424       +4     
  Lines       34817    34882      +65     
==========================================
+ Hits        24081    24120      +39     
- Misses       9338     9361      +23     
- Partials     1398     1401       +3     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d3f136a...734a4c4. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
@AwesomePatrol
Copy link
Contributor Author

This is the first step:

  • Add alternative implementation of tracing to logs
  • Start using the new implementation
  • Remove the old implementation and all references (rely on OpenTelemetry instead)

Going with k8s-ci-robot suggestion for reviewer...

/assign @ivanvc

@serathius
Copy link
Member

serathius commented Sep 30, 2025

Please provide dumps of traces for before and after you change.

@AwesomePatrol
Copy link
Contributor Author

AwesomePatrol commented Sep 30, 2025

Before change:

{"level":"info","ts":"2025-09-30T09:43:19.162953+0200","caller":"traceutil/trace.go:195","msg":"trace[696549722] put","detail":"{key:/araara; req_size:14; response_revision:13; }","duration":"6.294819ms","start":"2025-09-30T09:43:19.156646+0200","end":"2025-09-30T09:43:19.162941+0200","steps":["trace[696549722] 'process raft request'  (duration: 6.239834ms)"],"step_count":1}
{"level":"info","ts":"2025-09-30T09:42:46.210334+0200","caller":"traceutil/trace.go:195","msg":"trace[814383223] put","detail":"{key:/araara; req_size:14; response_revision:11; }","duration":"1.951959ms","start":"2025-09-30T09:42:46.208352+0200","end":"2025-09-30T09:42:46.210304+0200","steps":["trace[814383223] 'process raft request'  (duration: 1.834066ms)"],"step_count":1}

(formatted)

{
  "level": "info",
  "ts": "2025-09-30T09:42:46.210334+0200",
  "caller": "traceutil/trace.go:195",
  "msg": "trace[814383223] put",
  "detail": "{key:/araara; req_size:14; response_revision:11; }",
  "duration": "1.951959ms",
  "start": "2025-09-30T09:42:46.208352+0200",
  "end": "2025-09-30T09:42:46.210304+0200",
  "steps": [
    "trace[814383223] process raft request  (duration: 1.834066ms)"
  ],
  "step_count": 1
}
{
  "level": "info",
  "ts": "2025-09-30T09:43:19.162953+0200",
  "caller": "traceutil/trace.go:195",
  "msg": "trace[696549722] put",
  "detail": "{key:/araara; req_size:14; response_revision:13; }",
  "duration": "6.294819ms",
  "start": "2025-09-30T09:43:19.156646+0200",
  "end": "2025-09-30T09:43:19.162941+0200",
  "steps": [
    "trace[696549722] process raft request  (duration: 6.239834ms)"
  ],
  "step_count": 1
}

After change:

{"level":"info","ts":"2025-09-30T09:38:47.490681+0200","caller":"traceutil/exporter.go:68","msg":"trace[6411157f0516b202] delete_range","detail":"{range_begin:/araara; range_end:; }","duration":"1.962719ms","start":"2025-09-30T09:38:47.488702+0200","end":"2025-09-30T09:38:47.490665+0200","steps":["Send raft proposal  [+0ms]","Receive raft result  [+1ms]"],"step_count":2}
{"level":"info","ts":"2025-09-30T09:39:02.162545+0200","caller":"traceutil/exporter.go:68","msg":"trace[29e5c03c5e0c821c] delete_range","detail":"{range_begin:/araara; range_end:; }","duration":"3.755637ms","start":"2025-09-30T09:39:02.158777+0200","end":"2025-09-30T09:39:02.162533+0200","steps":["Send raft proposal  [+0ms]","Receive raft result  [+3ms]"],"step_count":2}

(formatted)

{
  "level": "info",
  "ts": "2025-09-30T09:38:47.490681+0200",
  "caller": "traceutil/exporter.go:68",
  "msg": "trace[6411157f0516b202] delete_range",
  "detail": "{range_begin:/araara; range_end:; }",
  "duration": "1.962719ms",
  "start": "2025-09-30T09:38:47.488702+0200",
  "end": "2025-09-30T09:38:47.490665+0200",
  "steps": [
    "Send raft proposal  [+0ms]",
    "Receive raft result  [+1ms]"
  ],
  "step_count": 2
}
{
  "level": "info",
  "ts": "2025-09-30T09:39:02.162545+0200",
  "caller": "traceutil/exporter.go:68",
  "msg": "trace[29e5c03c5e0c821c] delete_range",
  "detail": "{range_begin:/araara; range_end:; }",
  "duration": "3.755637ms",
  "start": "2025-09-30T09:39:02.158777+0200",
  "end": "2025-09-30T09:39:02.162533+0200",
  "steps": [
    "Send raft proposal  [+0ms]",
    "Receive raft result  [+3ms]"
  ],
  "step_count": 2
}
duration := end.Sub(start)
events := s.Events()
steps := make([]string, 0, len(events))
slices.SortFunc(events, func(a, b trace.Event) int {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it work with nesting?

Copy link
Contributor Author

@AwesomePatrol AwesomePatrol Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From PR description:

Currently, the only difference are "child traces" which are not yet instrumented in Etcd's OpenTelemetry tracing. This can be later mitigated by adding "links" or copying events between spans. I also have other ideas...

so either printing spanID so it will be easy to search in logs/Otel collector or returning the span to use for copying like in #20307

Please note that "applier span" is the only case when one trace is part of another

})
for _, event := range events {
step := fmt.Sprintf("%s %s [+%dms]",
event.Name, writeAttrs(event.Attributes), event.Time.Sub(start).Milliseconds())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we match the resolution?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to %.3fms

Comment on lines +83 to +85
if len(attrs) == 0 {
return ""
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if len(attrs) == 0 {
return ""
}
if len(attrs) == 0 {
return "{}"
}
}
}

func writeAttrs(attrs []attribute.KeyValue) string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why implement own json serializer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to keep it as close to the previous implementation as possible: https://github.com/etcd-io/etcd/blob/v3.6.5/pkg/traceutil/trace.go#L44-L55

Signed-off-by: Aleksander Mistewicz <amistewicz@google.com>
@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AwesomePatrol
Once this PR has been reviewed and has the lgtm label, please ask for approval from ivanvc. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link

@AwesomePatrol: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-coverage-report 734a4c4 link true /test pull-etcd-coverage-report

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment