# GPU Synchronization in Chrome

Chrome supports multiple mechanisms for sequencing GPU drawing operations; this
document provides a brief overview. The main focus is a high-level explanation
of when synchronization is needed and which mechanism is appropriate.

[TOC]

## Glossary

**GL Sync Object**: Generic GL-level synchronization object that can be in an
"unsignaled" or "signaled" state. The only current implementation of this is a
GL fence.

**GL Fence**: A GL sync object that is inserted into the GL command stream. It
starts out unsignaled and becomes signaled when the GPU reaches this point in the
command stream, implying that all previous commands have completed.

**Client Wait**: Block the client thread until a sync object becomes signaled,
or until a timeout occurs.

**Server Wait**: Tells the GPU to defer executing commands issued after a fence
until the fence signals. The client thread continues executing immediately and
can continue submitting GL commands.

**CHROMIUM fence sync**: A command buffer specific GL fence that sequences
operations among command buffer GL contexts without requiring driver-level
execution of previous commands.

**Native GL Fence**: A GL Fence backed by a platform-specific cross-process
synchronization mechanism.

**GPU Fence Handle**: An IPC-transportable object (typically a file descriptor)
that can be used to duplicate a native GL fence into a different process's
context.

**GPU Fence**: A Chrome abstraction that owns a GPU fence handle representing a
native GL fence, usable for cross-process synchronization.

## Use case overview

The core scenario is synchronizing read and write access to a shared resource,
for example drawing an image into an offscreen texture and compositing the
result into a final image. The drawing operations need to be completed before
reading to ensure correct output. A typical effect of wrong synchronization is
that the output contains blank or incomplete results instead of the expected
rendered sub-images, causing flickering or tearing.

"Completed" in this case means that the end result of using a resource as input
will be equivalent to waiting for everything to finish rendering, but it does
not necessarily mean that the GPU has fully finished all drawing operations at
that time.

## Single GL context: no synchronization needed

If all access to the shared resource happens in the same GL context, there is no
need for explicit synchronization. GL guarantees that commands are logically
processed in the order they are submitted. This is true both for local GL
contexts (GL calls via ui/gl/ interfaces) and for a single command buffer GL
context.

## Multiple driver-level GL contexts in the same share group: use GLFence

A process can create multiple GL contexts that are part of the same share group.
These contexts can be created in different threads within this process.

In this case, GL fences must be used for sequencing, for example:

1. Context A: draw image, create GLFence
1. Context B: server wait or client wait for GLFence, read image

[gl::GLFence](/ui/gl/gl_fence.h) and its subclasses provide wrappers for
GL/EGL fence handling methods such as `eglCreateSyncKHR` and `eglWaitSyncKHR`.
These fence objects can be used cross-thread as long as both threads' GL
contexts are part of the same share group.
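
The two-step sequence above can be modeled without a GPU. The following is a
minimal sketch, assuming a hypothetical `ToyFence` class built on a mutex and a
condition variable; it is not `gl::GLFence` and makes no driver calls. A plain
thread stands in for context A, and the calling thread plays context B doing a
client wait with a timeout.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a GL fence: starts unsignaled, signals once.
class ToyFence {
 public:
  // Called by the producer once all of its previous work has completed.
  void Signal() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      signaled_ = true;
    }
    cv_.notify_all();
  }

  // Client wait: blocks the calling thread until the fence signals or the
  // timeout expires. Returns true if the fence is signaled.
  bool ClientWait(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_.wait_for(lock, timeout, [this] { return signaled_; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_ = false;
};

// "Context A" writes the image and signals; "context B" waits, then reads.
// Returns the value read, or -1 if the wait timed out.
int DrawThenRead() {
  ToyFence fence;
  int image = 0;  // stands in for the shared texture
  std::thread context_a([&] {
    image = 42;      // draw image
    fence.Signal();  // in this model the fence signals right after the draw
  });
  const bool ready = fence.ClientWait(std::chrono::seconds(5));
  const int result = ready ? image : -1;  // read only after a successful wait
  context_a.join();
  // Once signaled, a fence stays signaled.
  assert(fence.ClientWait(std::chrono::milliseconds(0)));
  return result;
}
```

A server wait differs in that the submitting thread does not block; only the
execution of later commands is deferred. This toy model covers the client wait
case only.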

For more details, please refer to the underlying extension documentation, for example:

* https://www.khronos.org/opengl/wiki/Synchronization
* https://www.khronos.org/registry/EGL/extensions/KHR/EGL_KHR_fence_sync.txt
* https://www.khronos.org/registry/EGL/extensions/KHR/EGL_KHR_wait_sync.txt

## Implementation-dependent: same-thread driver-level GL contexts

Many GL driver implementations are based on a per-thread command queue,
with the effect that commands are processed in order even if they were issued
from different contexts on that thread without explicit synchronization.

This behavior is not part of the GL standard, and some driver implementations
use a per-context command queue where this assumption is not true.

See [issue 510232](http://crbug.com/510243#c23) for an example of a problematic
sequence:

```
// In one thread:
MakeCurrent(A);
Render1();
MakeCurrent(B);
Render2();
CreateSync(X);

// And in another thread:
MakeCurrent(C);
WaitSync(X);
Render3();
MakeCurrent(D);
Render4();
```

The only serialization guarantee is that Render2 will complete before Render3,
but Render4 could theoretically complete before Render1.

Chrome assumes that the render steps happen in order Render1, Render2, Render3,
and Render4, and requires this behavior to ensure security. If the driver doesn't
ensure this sequencing, Chrome has to emulate it using virtual contexts. (Or by
using explicit synchronization, but it doesn't do that today.) See also the
"CHROMIUM fence sync" section below.
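
One way to picture context virtualization is as several logical contexts that
all append to the command queue of a single real context, so that issue order
and execution order coincide. The sketch below is a toy model under that
assumption; `RealContext` and `VirtualContext` are hypothetical names that do
not correspond to Chrome classes.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical single driver-level context with one ordered command queue.
class RealContext {
 public:
  void Submit(std::string command) { queue_.push_back(std::move(command)); }
  const std::vector<std::string>& queue() const { return queue_; }

 private:
  std::vector<std::string> queue_;
};

// Hypothetical virtual context: it has its own identity but no queue of its
// own, so every command lands in the shared real context in issue order.
class VirtualContext {
 public:
  VirtualContext(std::string name, RealContext* real)
      : name_(std::move(name)), real_(real) {
    assert(real_ != nullptr);
  }

  void Render(const std::string& what) { real_->Submit(name_ + ":" + what); }

 private:
  std::string name_;
  RealContext* real_;
};

// Issues one render step from each of four virtual contexts and returns the
// order in which the real context received them.
std::vector<std::string> IssueFromFourContexts() {
  RealContext real;
  VirtualContext a("A", &real), b("B", &real), c("C", &real), d("D", &real);
  a.Render("Render1");
  b.Render("Render2");
  c.Render("Render3");
  d.Render("Render4");
  return real.queue();
}
```

Because there is only one queue, a driver with per-context queues has nothing
left to reorder between the four logical contexts.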

## Command buffer GL clients: use CHROMIUM sync tokens

Chrome's command buffer IPC interface uses multiple layers. There are multiple
active IPC channels (typically one per process, i.e. one per Renderer and one
for Browser). Each IPC channel has multiple scheduling groups (also called
streams), and each stream can contain multiple command buffers, which in turn
contain a sequence of GL commands.

Command buffers in the same client-side share group must be in the same stream.
Command scheduling granularity is at the stream level, and a client can choose to
create and use multiple streams with different stream priorities. Stream IDs are
arbitrary integers assigned by the client at creation time; see for example the
[viz::ContextProviderCommandBuffer](/services/viz/public/cpp/gpu/context_provider_command_buffer.h)
constructor.

The CHROMIUM sync token is intended to order operations among command buffer GL
instructions. Generating one inserts an internal fence sync command in the
stream, flushes it appropriately (see below), and produces a sync token, which
is a cross-context transportable reference to the underlying fence sync. A
`WaitSyncTokenCHROMIUM` call does **not** ensure that the underlying GL commands
have been executed at the GPU driver level, so this mechanism is not suitable
for synchronizing command buffer GL operations with a local driver-level GL
context.

See the
[CHROMIUM_sync_point](/gpu/GLES2/extensions/CHROMIUM/CHROMIUM_sync_point.txt)
documentation for details.

Commands issued within a single command buffer don't need to be synchronized
explicitly; they will be executed in the same order that they were issued.

Multiple command buffers within the same stream can use an ordering barrier to
sequence their commands. Sync tokens are not necessary. Example:

```c++
// Command buffers gl1 and gl2 are in the same stream.
Render1(gl1);
gl1->OrderingBarrierCHROMIUM();
Render2(gl2);  // will happen after Render1.
```

Command buffers that are in different streams need to use sync tokens. If both
are using the same IPC channel (i.e. same client process), an unverified sync
token is sufficient, and commands do not need to be flushed to the server:

```c++
// stream A
Render1(glA);
glA->GenUnverifiedSyncTokenCHROMIUM(out_sync_token);

// stream B: wait on the token generated in stream A.
glB->WaitSyncTokenCHROMIUM(out_sync_token);
Render2(glB);  // will happen after Render1.
```
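
To make the effect of a sync token wait concrete, here is a toy scheduler
sketch. It assumes a simplified model in which each stream is a queue of
commands and a wait blocks its stream until the matching fence has been
released by another stream. The `Command`, `Stream`, and `RunStreams` names are
hypothetical and are not Chrome's scheduler API.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <set>
#include <string>
#include <vector>

// Hypothetical command in a stream.
struct Command {
  enum class Kind { kRender, kReleaseFence, kWaitFence };
  Kind kind;
  std::string name;  // used by kRender
  uint64_t fence;    // used by kReleaseFence and kWaitFence
};

using Stream = std::deque<Command>;

// Visits the streams in the given order and runs each one until it is empty
// or blocked on an unreleased fence. Repeats until no stream makes progress.
// Returns the names of the render commands in execution order.
std::vector<std::string> RunStreams(std::vector<Stream> streams) {
  std::set<uint64_t> released;
  std::vector<std::string> executed;
  bool progress = true;
  while (progress) {
    progress = false;
    for (Stream& stream : streams) {
      while (!stream.empty()) {
        const Command& cmd = stream.front();
        if (cmd.kind == Command::Kind::kWaitFence &&
            released.count(cmd.fence) == 0) {
          break;  // Stream is blocked; try the next stream.
        }
        if (cmd.kind == Command::Kind::kRender) {
          assert(!cmd.name.empty());
          executed.push_back(cmd.name);
        } else if (cmd.kind == Command::Kind::kReleaseFence) {
          released.insert(cmd.fence);
        }
        stream.pop_front();
        progress = true;
      }
    }
  }
  return executed;
}

// Stream A renders and releases fence 1; stream B waits on it, then renders.
std::vector<std::string> SyncTokenOrderingDemo() {
  Stream a = {{Command::Kind::kRender, "Render1", 0},
              {Command::Kind::kReleaseFence, "", 1}};
  Stream b = {{Command::Kind::kWaitFence, "", 1},
              {Command::Kind::kRender, "Render2", 0}};
  // Stream B is deliberately scheduled first to show that the wait defers it.
  return RunStreams({b, a});
}
```

Even though stream B is visited first, its wait keeps `Render2` from running
until stream A has released the fence, so `Render1` is executed first.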

Command buffers that are using different IPC channels must use verified sync
tokens. Verification is a check that the underlying fence sync was flushed to
the server. Cross-process synchronization always uses verified sync tokens.
`GenSyncTokenCHROMIUM` will force a shallow flush as a side effect if necessary.
Example:

```c++
// IPC channel in process X
Render1(glX);
glX->GenSyncTokenCHROMIUM(out_sync_token);

// IPC channel in process Y: wait on the token received from process X.
glY->WaitSyncTokenCHROMIUM(out_sync_token);
Render2(glY);  // will happen after Render1.
```

Alternatively, unverified sync tokens can be converted to verified ones in bulk
by calling `VerifySyncTokensCHROMIUM`. This will wait for a flush to complete as
necessary. Use this to avoid multiple sequential flushes:

```c++
gl->GenUnverifiedSyncTokenCHROMIUM(out_sync_tokens[0]);
gl->GenUnverifiedSyncTokenCHROMIUM(out_sync_tokens[1]);
gl->VerifySyncTokensCHROMIUM(out_sync_tokens, 2);
```
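
The benefit of bulk verification can be illustrated by counting flushes. The
sketch below is a toy model, assuming that a token counts as verified once a
flush has carried its fence sync command to the server. `ToySyncToken` and
`ToyContext` are hypothetical types, not the real command buffer interfaces.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical token: only tracks whether its fence sync reached the server.
struct ToySyncToken {
  uint64_t release_count;
  bool verified_flush;
};

// Hypothetical client-side context that counts flushes to the server.
class ToyContext {
 public:
  // Inserts a fence sync command; nothing is flushed yet.
  ToySyncToken GenUnverifiedSyncToken() {
    has_unflushed_commands_ = true;
    return ToySyncToken{++next_release_count_, false};
  }

  // Inserts a fence sync command and flushes it, yielding a verified token.
  ToySyncToken GenSyncToken() {
    ToySyncToken token = GenUnverifiedSyncToken();
    FlushIfNeeded();
    token.verified_flush = true;
    return token;
  }

  // Verifies a batch of tokens using at most one flush.
  void VerifySyncTokens(const std::vector<ToySyncToken*>& tokens) {
    bool any_unverified = false;
    for (const ToySyncToken* token : tokens) {
      assert(token != nullptr);
      if (!token->verified_flush) any_unverified = true;
    }
    if (!any_unverified) return;
    FlushIfNeeded();
    for (ToySyncToken* token : tokens) token->verified_flush = true;
  }

  int flush_count() const { return flush_count_; }

 private:
  void FlushIfNeeded() {
    if (!has_unflushed_commands_) return;
    ++flush_count_;
    has_unflushed_commands_ = false;
  }

  uint64_t next_release_count_ = 0;
  bool has_unflushed_commands_ = false;
  int flush_count_ = 0;
};

// Two verified tokens generated one at a time: one flush per token.
int FlushesWithEagerVerification() {
  ToyContext gl;
  gl.GenSyncToken();
  gl.GenSyncToken();
  return gl.flush_count();
}

// Two unverified tokens verified together: a single flush covers both.
int FlushesWithBulkVerification() {
  ToyContext gl;
  ToySyncToken t0 = gl.GenUnverifiedSyncToken();
  ToySyncToken t1 = gl.GenUnverifiedSyncToken();
  gl.VerifySyncTokens({&t0, &t1});
  assert(t0.verified_flush && t1.verified_flush);
  return gl.flush_count();
}
```

In this model the eager path flushes twice while the bulk path flushes once,
which is the saving the paragraph above describes.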

### Implementation notes

Correctness of the CHROMIUM fence sync mechanism depends on the assumption that
commands issued from the command buffer service side happen in the order they
were issued in that thread. This is handled in different ways:

* Issue a glFlush on switching contexts on platforms where glFlush is sufficient
  to ensure ordering, i.e. macOS. (This approach would not be well suited to
  tiling GPUs as used on many mobile devices, where glFlush is an expensive
  operation because it may force content load/store between tile memory and
  main memory.) See for example
  [gl::GLContextCGL::MakeCurrent](/ui/gl/gl_context_cgl.cc):
  ```c++
  // It's likely we're going to switch OpenGL contexts at this point.
  // Before doing so, if there is a current context, flush it. There
  // are many implicit assumptions of flush ordering between contexts
  // at higher levels, and if a flush isn't performed, OpenGL commands
  // may be issued in unexpected orders, causing flickering and other
  // artifacts.
  ```

* Force context virtualization so that all commands are issued into a single
  driver-level GL context. This is used on Qualcomm/Adreno chipsets, see [issue
  691102](http://crbug.com/691102).

* Assume per-thread command queues without explicit synchronization. GLX
  effectively ensures this. On Windows, ANGLE uses a single D3D device
  underneath all contexts which ensures strong ordering.

GPU control tasks are processed out of band and are only partially ordered with
respect to GL commands. A gpu_control task always happens before any following
GL commands issued on the same IPC channel. It usually executes before any
preceding unflushed GL commands, but this is not guaranteed. A
`ShallowFlushCHROMIUM` ensures that any following gpu_control tasks will execute
after the flushed GL commands.

In this example, DoTask will execute after GLCommandA and before GLCommandD, but
there is no ordering guarantee relative to GLCommandB and GLCommandC:

```c++
// gles2_implementation.cc

helper_->GLCommandA();
ShallowFlushCHROMIUM();

helper_->GLCommandB();
helper_->GLCommandC();
gpu_control_->DoTask();

helper_->GLCommandD();

// Execution order is one of:
// A | DoTask B C | D
// A | B DoTask C | D
// A | B C DoTask | D
```

The shallow flush adds the pending GL commands to the service's task queue, and
this task queue is also used by incoming gpu control tasks and processed in
order. The `ShallowFlushCHROMIUM` command returns as soon as the tasks are
queued and does not wait for them to be processed.
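
The queueing behavior described above can be sketched with a toy model. It
assumes that GL commands accumulate in a client-side buffer until a flush moves
them to the service task queue, while gpu_control tasks are queued right away.
`ToyChannel` is a hypothetical class, and the model only reproduces the common
case, since a real client may flush at other times as well.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of the client side of one IPC channel.
class ToyChannel {
 public:
  // GL commands accumulate client-side until they are flushed.
  void GLCommand(std::string name) {
    assert(!name.empty());
    pending_.push_back(std::move(name));
  }

  // Moves pending GL commands to the service task queue and returns at once,
  // without waiting for them to be processed.
  void ShallowFlush() {
    for (std::string& command : pending_) {
      service_queue_.push_back(std::move(command));
    }
    pending_.clear();
  }

  // gpu_control tasks bypass the pending buffer and are queued immediately.
  void GpuControlTask(std::string name) {
    service_queue_.push_back(std::move(name));
  }

  // The service processes this queue in order.
  const std::vector<std::string>& service_queue() const {
    return service_queue_;
  }

 private:
  std::vector<std::string> pending_;
  std::vector<std::string> service_queue_;
};

// Mirrors the A / B C DoTask / D example and returns the processing order.
std::vector<std::string> RunGpuControlExample() {
  ToyChannel channel;
  channel.GLCommand("A");
  channel.ShallowFlush();
  channel.GLCommand("B");
  channel.GLCommand("C");
  channel.GpuControlTask("DoTask");
  channel.GLCommand("D");
  channel.ShallowFlush();  // Final flush so that B, C and D reach the service.
  return channel.service_queue();
}
```

With no flush between `B`, `C`, and `DoTask`, this model yields the first of
the three orders listed in the example, with `DoTask` ahead of the unflushed
commands.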

## Cross-process transport: GpuFence and GpuFenceHandle

Some platforms such as Android (most devices N and above) and ChromeOS support
synchronizing a native GL context with a command buffer GL context through a
GpuFence.

Use the static `gl::GLFence::IsGpuFenceSupported()` method to check at runtime if
the current platform has support for the GpuFence mechanism including
GpuFenceHandle transport.

The GpuFence mechanism supports two use cases:

* Create a GLFence object in a local context, convert it to a client-side
  GpuFence, duplicate it into a command buffer service-side gpu fence, and
  issue a server wait on the command buffer service side. That service-side
  wait will be unblocked when the *client-side* GpuFence signals.

* Create a new command buffer service-side gpu fence, request a GpuFenceHandle
  from it, use this handle to create a native GL fence object in the local
  context, then issue a server wait on the local GL fence object. This local
  server wait will be unblocked when the *service-side* gpu fence signals.

The [CHROMIUM_gpu_fence
extension](/gpu/GLES2/extensions/CHROMIUM/CHROMIUM_gpu_fence.txt) documents
the GLES API as used through the command buffer interface. This section contains
additional information about the integration with local GL contexts that is
needed to work with these objects.

### Driver-level wrappers

In general, you should use the static `gl::GLFence::CreateForGpuFence()` and
`gl::GLFence::CreateFromGpuFence()` factory methods to create a
platform-specific local fence object instead of using an implementation class
directly.

For Android and ChromeOS, the
[gl::GLFenceAndroidNativeFenceSync](/ui/gl/gl_fence_android_native_fence_sync.h)
implementation wraps the
[EGL_ANDROID_native_fence_sync](https://www.khronos.org/registry/EGL/extensions/ANDROID/EGL_ANDROID_native_fence_sync.txt)
extension that allows creating a special EGLFence object from which a file
descriptor can be extracted, and then creating a duplicate fence object from
that file descriptor that is synchronized with the original fence.

### GpuFence and GpuFenceHandle

A [gfx::GpuFence](/ui/gfx/gpu_fence.h) object owns a GPU fence handle
representing a native GL fence. The `AsClientGpuFence` method casts it to a
ClientGpuFence type for use with the [CHROMIUM_gpu_fence
extension](/gpu/GLES2/extensions/CHROMIUM/CHROMIUM_gpu_fence.txt)'s
`CreateClientGpuFenceCHROMIUM` call.

A [gfx::GpuFenceHandle](/ui/gfx/gpu_fence_handle.h) is an IPC-transportable
wrapper for a file descriptor or other underlying primitive object, and is used
to duplicate a native GL fence into another process. It has value semantics and
can be copied multiple times, and then consumed exactly one time. Consumers take
ownership of the underlying resource. Current GpuFenceHandle consumers are:

* The `gfx::GpuFence(gpu_fence_handle)` constructor takes ownership of the
  handle's resources without constructing a local fence.

* The IPC subsystem closes resources after sending. The typical idiom is to call
  `gfx::CloneHandleForIPC(handle)` on a GpuFenceHandle retrieved from a
  scope-lifetime object to create a copied handle that will be owned by the IPC
  subsystem.

### Sample Code

A usage example for two-process synchronization is to sequence access to a
globally shared drawable such as an AHardwareBuffer on Android, where the
writer uses a local GL context and the reader is a command buffer context in
the GPU process. The writer process draws into an AHardwareBuffer-backed
GLImage in the local GL context, then creates a gpu fence to mark the end of
drawing operations:

```c++
// This example assumes that GpuFence is supported. If not, the application
// should fall back to a different transport or synchronization method.
DCHECK(gl::GLFence::IsGpuFenceSupported());

// ... write to the shared drawable in local context, then create
// a local fence.
std::unique_ptr<gl::GLFence> local_fence = gl::GLFence::CreateForGpuFence();

// Convert to a GpuFence.
std::unique_ptr<gfx::GpuFence> gpu_fence = local_fence->GetGpuFence();
// It's ok for local_fence to be destroyed now, the GpuFence remains valid.

// Create a matching gpu fence on the command buffer context, issue
// server wait, and destroy it.
GLuint id = gl->CreateClientGpuFenceCHROMIUM(gpu_fence->AsClientGpuFence());
// It's ok for gpu_fence to be destroyed now.
gl->WaitGpuFenceCHROMIUM(id);
gl->DestroyGpuFenceCHROMIUM(id);

// ... read from the shared drawable via command buffer. These reads
// will happen after the local_fence has signaled. The local
// fence and gpu_fence don't need to remain alive for this.
```

If a process wants to consume a drawable that was produced through a command
buffer context in the GPU process, the sequence is as follows:

```c++
// Set up callback that's waiting for the drawable to be ready.
void callback(std::unique_ptr<gfx::GpuFence> gpu_fence) {
  // Create a local context GL fence from the GpuFence.
  std::unique_ptr<gl::GLFence> local_fence =
      gl::GLFence::CreateFromGpuFence(*gpu_fence);
  local_fence->ServerWait();
  // ... read from the shared drawable in the local context.
}

// ... write to the shared drawable via command buffer, then
// create a gpu fence:
GLuint id = gl->CreateGpuFenceCHROMIUM();
context_support->GetGpuFenceHandle(id, base::BindOnce(callback));
gl->DestroyGpuFenceCHROMIUM(id);
```

It is legal to create the GpuFence on a separate command buffer context instead
of on the command buffer channel that did the drawing operations, but in that
case `gl->WaitSyncTokenCHROMIUM()` or equivalent must be used to sequence the
operations between the distinct command buffer contexts as usual.