# GPU Synchronization in Chrome

Chrome supports multiple mechanisms for sequencing GPU drawing operations. This
document provides a brief overview. The main focus is a high-level explanation
of when synchronization is needed and which mechanism is appropriate.

[TOC]

## Glossary

GL Sync Object: Generic GL-level synchronization object that can be in an
"unsignaled" or "signaled" state. The only current implementation of this is a
GL fence.

GL Fence: A GL sync object that is inserted into the GL command stream. It
starts out unsignaled and becomes signaled when the GPU reaches this point in the
command stream, implying that all previous commands have completed.

Client Wait: Blocks the client thread until a sync object becomes signaled,
or until a timeout occurs.

Server Wait: Tells the GPU to defer executing commands issued after a fence
until the fence signals. The client thread continues executing immediately and
can continue submitting GL commands.

CHROMIUM fence sync: A command-buffer-specific GL fence that sequences
operations among command buffer GL contexts without requiring driver-level
execution of previous commands.

Native GL Fence: A GL Fence backed by a platform-specific cross-process
synchronization mechanism.

GPU Fence Handle: An IPC-transportable object (typically a file descriptor)
that can be used to duplicate a native GL fence into a different process's
context.

GPU Fence: A Chrome abstraction that owns a GPU fence handle representing a
native GL fence, usable for cross-process synchronization.

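To make the client-wait semantics above concrete, here is a minimal sketch of a sync object that starts unsignaled, signals once, and supports a client wait with a timeout. `ToySyncObject` and `DemoClientWait` are hypothetical names invented for illustration, and the model uses standard C++ threading primitives rather than any GL or Chrome API. A server wait has no analogue in this sketch, since it defers GPU-side execution instead of blocking a client thread.

```c++
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical toy model of a GL sync object (not a GL or Chrome API).
// It starts out unsignaled and becomes signaled exactly once.
class ToySyncObject {
 public:
  // Outcome of a client wait: the object signaled, or the timeout expired.
  enum class WaitResult { kConditionSatisfied, kTimeoutExpired };

  // Called by the producer ("GPU") side when it reaches this point.
  void Signal() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      signaled_ = true;
    }
    cv_.notify_all();
  }

  bool IsSignaled() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return signaled_;
  }

  // Client wait: blocks the calling thread until Signal() or the timeout.
  WaitResult ClientWait(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    bool signaled = cv_.wait_for(lock, timeout, [this] { return signaled_; });
    return signaled ? WaitResult::kConditionSatisfied
                    : WaitResult::kTimeoutExpired;
  }

 private:
  mutable std::mutex mutex_;
  std::condition_variable cv_;
  bool signaled_ = false;
};

// Usage: a separate thread plays the GPU and signals while the client waits.
ToySyncObject::WaitResult DemoClientWait() {
  ToySyncObject fence;
  std::thread gpu([&fence] { fence.Signal(); });
  ToySyncObject::WaitResult result = fence.ClientWait(std::chrono::seconds(5));
  gpu.join();
  return result;
}
```

Calling `ClientWait` on an object that nobody signals returns `kTimeoutExpired` once the timeout elapses, which mirrors the "or until a timeout occurs" part of the definition.
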
## Use case overview

The core scenario is synchronizing read and write access to a shared resource,
for example drawing an image into an offscreen texture and compositing the
result into a final image. The drawing operations need to be completed before
reading to ensure correct output. A typical effect of wrong synchronization is
that the output contains blank or incomplete results instead of the expected
rendered sub-images, causing flickering or tearing.

"Completed" in this case means that the end result of using a resource as input
will be equivalent to waiting for everything to finish rendering, but it does
not necessarily mean that the GPU has fully finished all drawing operations at
that time.

## Single GL context: no synchronization needed

If all access to the shared resource happens in the same GL context, there is no
need for explicit synchronization. GL guarantees that commands are logically
processed in the order they are submitted. This is true both for local GL
contexts (GL calls via ui/gl/ interfaces) and for a single command buffer GL
context.

## Multiple driver-level GL contexts in the same share group: use GLFence

A process can create multiple GL contexts that are part of the same share group.
These contexts can be created in different threads within this process.

In this case, GL fences must be used for sequencing, for example:

1. Context A: draw image, create GLFence
1. Context B: server wait or client wait for GLFence, read image

[gl::GLFence](/ui/gl/gl_fence.h) and its subclasses provide wrappers for
GL/EGL fence handling methods such as `eglCreateSyncKHR` and `eglWaitSyncKHR`.
These fence objects can be used cross-thread as long as both threads' GL
contexts are part of the same share group.

For more details, please refer to the underlying extension documentation, for example:

* https://www.khronos.org/opengl/wiki/Synchronization
* https://www.khronos.org/registry/EGL/extensions/KHR/EGL_KHR_fence_sync.txt
* https://www.khronos.org/registry/EGL/extensions/KHR/EGL_KHR_wait_sync.txt

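The two-step sequence above can be modeled with standard C++ threads, assuming a `std::promise`/`std::shared_future` pair as a stand-in for a GLFence and a vector as a stand-in for the shared texture. This is a sketch of the ordering only: `SharedImage`, `DrawThenFence`, `WaitThenRead` and `RunTwoContexts` are hypothetical names, the wait shown is a client wait, and real code would use `gl::GLFence` on contexts in one share group.

```c++
#include <cassert>
#include <functional>
#include <future>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a texture shared by two contexts in one share group.
struct SharedImage {
  std::vector<int> pixels;
};

// Context A's thread: "draw image", then "create GLFence". Fulfilling the
// promise plays the role of the fence becoming signaled.
void DrawThenFence(SharedImage& image, std::promise<void> fence) {
  image.pixels.assign(4, 0xFF);
  fence.set_value();
}

// Context B's thread: client wait for the fence, then "read image".
int WaitThenRead(const SharedImage& image, std::shared_future<void> fence) {
  fence.wait();
  int sum = 0;
  for (int p : image.pixels) sum += p;
  return sum;
}

// Runs both "contexts" on separate threads; the reader may start first.
int RunTwoContexts() {
  SharedImage image;
  std::promise<void> fence;
  std::shared_future<void> fence_signaled = fence.get_future().share();
  int result = 0;
  std::thread reader([&] { result = WaitThenRead(image, fence_signaled); });
  std::thread writer(DrawThenFence, std::ref(image), std::move(fence));
  writer.join();
  reader.join();
  return result;
}
```

Without the `fence.wait()` call the reader could observe the image before or during drawing, which corresponds to the blank or incomplete results described in the use case overview.
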
## Implementation-dependent: same-thread driver-level GL contexts

Many GL driver implementations are based on a per-thread command queue,
with the effect that commands are processed in order even if they were issued
from different contexts on that thread without explicit synchronization.

This behavior is not part of the GL standard, and some driver implementations
use a per-context command queue where this assumption is not true.

See [issue 510232](http://crbug.com/510243#c23) for an example of a problematic
sequence:

```
// In one thread:
MakeCurrent(A);
Render1();
MakeCurrent(B);
Render2();
CreateSync(X);

// And in another thread:
MakeCurrent(C);
WaitSync(X);
Render3();
MakeCurrent(D);
Render4();
```

The only serialization guarantee is that Render2 will complete before Render3,
but Render4 could theoretically complete before Render1.

Chrome assumes that the render steps happen in order Render1, Render2, Render3,
and Render4, and requires this behavior to ensure security. If the driver doesn't
ensure this sequencing, Chrome has to emulate it using virtual contexts. (It
could also use explicit synchronization, but it doesn't do that today.) See also
the "Command buffer GL clients: use CHROMIUM sync tokens" section below.

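The difference between per-thread and per-context command queues can be illustrated with a small simulation. It is a hypothetical model, not a description of any particular driver: `ExecutePerThreadQueue` executes commands in issue order, while `ExecutePerContextQueues` keeps one FIFO per context and services the FIFOs in a caller-chosen order, which stands for one of the orders a per-context driver would be free to pick.

```c++
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical command record: which context issued it, and a label.
struct Command {
  std::string context;
  std::string label;
};

// Per-thread queue: commands from all contexts on this thread execute in
// exactly the order they were issued.
std::vector<std::string> ExecutePerThreadQueue(const std::vector<Command>& issued) {
  std::vector<std::string> executed;
  for (const Command& c : issued) executed.push_back(c.label);
  return executed;
}

// Per-context queues: each context keeps its own FIFO, but the driver may
// service the FIFOs in any order. `service_order` picks one such order.
// Contexts missing from `service_order` are not executed in this simplified model.
std::vector<std::string> ExecutePerContextQueues(
    const std::vector<Command>& issued,
    const std::vector<std::string>& service_order) {
  std::map<std::string, std::vector<std::string>> queues;
  for (const Command& c : issued) queues[c.context].push_back(c.label);
  std::vector<std::string> executed;
  for (const std::string& ctx : service_order) {
    for (const std::string& label : queues[ctx]) executed.push_back(label);
  }
  return executed;
}
```

With `{"B", "A"}` as the service order, Render2 runs before Render1 even though both were issued from the same thread, which is exactly the ordering Chrome cannot rely on without virtual contexts or explicit synchronization.
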
## Command buffer GL clients: use CHROMIUM sync tokens

Chrome's command buffer IPC interface uses multiple layers. There are multiple
active IPC channels (typically one per process, i.e. one per Renderer and one
for Browser). Each IPC channel has multiple scheduling groups (also called
streams), and each stream can contain multiple command buffers, which in turn
contain a sequence of GL commands.

Command buffers in the same client-side share group must be in the same stream.
Command scheduling granularity is at the stream level, and a client can choose to
create and use multiple streams with different stream priorities. Stream IDs are
arbitrary integers assigned by the client at creation time; see for example the
[viz::ContextProviderCommandBuffer](/services/viz/public/cpp/gpu/context_provider_command_buffer.h)
constructor.

The CHROMIUM sync token is intended to order operations among command buffer GL
instructions. It inserts an internal fence sync command in the stream, flushes
it appropriately (see below), and generates a sync token from it, which is a
cross-context transportable reference to the underlying fence sync. A
`WaitSyncTokenCHROMIUM` call does not ensure that the underlying GL commands
have been executed at the GPU driver level, so this mechanism is not suitable
for synchronizing command buffer GL operations with a local driver-level GL
context.

See the
[CHROMIUM_sync_point](/gpu/GLES2/extensions/CHROMIUM/CHROMIUM_sync_point.txt)
documentation for details.

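As a rough mental model of how a sync token orders work across command buffers, the sketch below gives each command buffer a counter of released fence syncs and lets a wait stall only the waiting command buffer on the service side until that counter reaches the token's release count. `ToySyncToken`, `ToyCommand` and `ToyScheduler` are hypothetical; the real sync token and GPU scheduler carry more state and are not implemented this way.

```c++
#include <cassert>
#include <cstdint>
#include <deque>
#include <map>
#include <string>
#include <vector>

// Hypothetical sync token: names a command buffer and a fence release count.
struct ToySyncToken {
  int command_buffer_id;
  uint64_t release_count;
};

// A command in a command buffer's stream.
struct ToyCommand {
  enum Kind { kRender, kReleaseFence, kWaitSyncToken } kind;
  std::string label;      // used by kRender
  ToySyncToken token{};   // used by kWaitSyncToken
};

// Service-side scheduler. A wait never blocks the client; the service simply
// stops running that command buffer until the token's fence is released.
class ToyScheduler {
 public:
  void Submit(int id, ToyCommand cmd) { queues_[id].push_back(cmd); }

  // Queues a fence release on `id` and returns a token that refers to it.
  ToySyncToken InsertFence(int id) {
    uint64_t count = ++issued_[id];
    Submit(id, {ToyCommand::kReleaseFence, "", {}});
    return {id, count};
  }

  // Services command buffers in increasing ID order. Returns when every
  // queue is empty or blocked on an unreleased fence.
  std::vector<std::string> Run() {
    std::vector<std::string> executed;
    bool progress = true;
    while (progress) {
      progress = false;
      for (auto& [id, queue] : queues_) {
        while (!queue.empty()) {
          const ToyCommand& cmd = queue.front();
          if (cmd.kind == ToyCommand::kWaitSyncToken &&
              released_[cmd.token.command_buffer_id] < cmd.token.release_count) {
            break;  // Blocked: move on to other command buffers.
          }
          if (cmd.kind == ToyCommand::kRender) executed.push_back(cmd.label);
          if (cmd.kind == ToyCommand::kReleaseFence) ++released_[id];
          queue.pop_front();
          progress = true;
        }
      }
    }
    return executed;
  }

 private:
  std::map<int, std::deque<ToyCommand>> queues_;
  std::map<int, uint64_t> issued_;
  std::map<int, uint64_t> released_;
};
```

`Run` services command buffers in increasing ID order, so without a wait, command buffer 1 would render first. With the wait, command buffer 1 is deferred until command buffer 2 has released its fence, and the client never blocks.
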
Commands issued within a single command buffer don't need to be synchronized
explicitly; they will be executed in the same order that they were issued.

Multiple command buffers within the same stream can use an ordering barrier to
sequence their commands. Sync tokens are not necessary. Example:

```c++
// Command buffers gl1 and gl2 are in the same stream.
Render1(gl1);
gl1->OrderingBarrierCHROMIUM();
Render2(gl2); // will happen after Render1.
```

Command buffers that are in different streams need to use sync tokens. If both
are using the same IPC channel (i.e. same client process), an unverified sync
token is sufficient, and commands do not need to be flushed to the server:

```c++
// stream A
Render1(glA);
glA->GenUnverifiedSyncTokenCHROMIUM(out_sync_token);

// stream B: wait on the token generated by stream A.
glB->WaitSyncTokenCHROMIUM(out_sync_token);
Render2(glB); // will happen after Render1.
```

Command buffers that are using different IPC channels must use verified sync
tokens. Verification is a check that the underlying fence sync was flushed to
the server. Cross-process synchronization always uses verified sync tokens.
`GenSyncTokenCHROMIUM` will force a shallow flush as a side effect if necessary.
Example:

```c++
// IPC channel in process X
Render1(glX);
glX->GenSyncTokenCHROMIUM(out_sync_token);
// ... send the sync token to process Y ...

// IPC channel in process Y
glY->WaitSyncTokenCHROMIUM(out_sync_token);  // token received from process X
Render2(glY); // will happen after Render1.
```

Alternatively, unverified sync tokens can be converted to verified ones in bulk
by calling `VerifySyncTokensCHROMIUM`. This will wait for a flush to complete as
necessary. Use this to avoid multiple sequential flushes:

```c++
gl->GenUnverifiedSyncTokenCHROMIUM(out_sync_tokens[0]);
gl->GenUnverifiedSyncTokenCHROMIUM(out_sync_tokens[1]);
gl->VerifySyncTokensCHROMIUM(out_sync_tokens, 2);
```

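The rules in this section form a simple decision procedure, sketched below. `CommandBufferLocation`, `SyncMechanism` and `ChooseMechanism` are hypothetical types for illustration only; they summarize the text above and do not correspond to Chromium code.

```c++
#include <cassert>

// Hypothetical description of where a command buffer lives.
struct CommandBufferLocation {
  int channel_id;         // IPC channel, typically one per client process
  int stream_id;          // scheduling group within the channel
  int command_buffer_id;  // command buffer within the stream
};

enum class SyncMechanism {
  kNone,                 // same command buffer: commands run in issue order
  kOrderingBarrier,      // same stream: OrderingBarrierCHROMIUM is enough
  kUnverifiedSyncToken,  // same channel, different streams
  kVerifiedSyncToken,    // different channels (e.g. different processes)
};

// Returns the cheapest mechanism described above for making work on
// `consumer` happen after work already issued on `producer`.
SyncMechanism ChooseMechanism(const CommandBufferLocation& producer,
                              const CommandBufferLocation& consumer) {
  if (producer.channel_id != consumer.channel_id)
    return SyncMechanism::kVerifiedSyncToken;
  if (producer.stream_id != consumer.stream_id)
    return SyncMechanism::kUnverifiedSyncToken;
  if (producer.command_buffer_id != consumer.command_buffer_id)
    return SyncMechanism::kOrderingBarrier;
  return SyncMechanism::kNone;
}
```

The channel is checked first because stream IDs are assigned by each client and are only meaningful within one IPC channel.
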
### Implementation notes

Correctness of the CHROMIUM fence sync mechanism depends on the assumption that
commands issued from the command buffer service side happen in the order they
were issued in that thread. This is handled in different ways:

* Issue a glFlush on switching contexts on platforms where glFlush is sufficient
  to ensure ordering, i.e. MacOS. (This approach would not be well suited to
  tiling GPUs, as used in many mobile devices, where glFlush is an expensive
  operation: it may force content load/store between tile memory and main
  memory.) See for example
  [gl::GLContextCGL::MakeCurrent](/ui/gl/gl_context_cgl.cc):
  ```c++
  // It's likely we're going to switch OpenGL contexts at this point.
  // Before doing so, if there is a current context, flush it. There
  // are many implicit assumptions of flush ordering between contexts
  // at higher levels, and if a flush isn't performed, OpenGL commands
  // may be issued in unexpected orders, causing flickering and other
  // artifacts.
  ```

* Force context virtualization so that all commands are issued into a single
  driver-level GL context. This is used on Qualcomm/Adreno chipsets, see [issue
  691102](http://crbug.com/691102).

* Assume per-thread command queues without explicit synchronization. GLX
  effectively ensures this. On Windows, ANGLE uses a single D3D device
  underneath all contexts, which ensures strong ordering.

GPU control tasks are processed out of band and are only partially ordered with
respect to GL commands. A gpu_control task always happens before any following
GL commands issued on the same IPC channel. It usually executes before any
preceding unflushed GL commands, but this is not guaranteed. A
`ShallowFlushCHROMIUM` ensures that any following gpu_control tasks will execute
after the flushed GL commands.

In this example, DoTask will execute after GLCommandA and before GLCommandD, but
there is no ordering guarantee relative to GLCommandB and GLCommandC:

```c++
// gles2_implementation.cc

helper_->GLCommandA();
ShallowFlushCHROMIUM();

helper_->GLCommandB();
helper_->GLCommandC();
gpu_control_->DoTask();

helper_->GLCommandD();

// Execution order is one of ("|" separates the guaranteed ordering points):
// A | DoTask B C | D
// A | B DoTask C | D
// A | B C DoTask | D
```

The shallow flush adds the pending GL commands to the service's task queue, and
this task queue is also used by incoming gpu control tasks and processed in
order. The `ShallowFlushCHROMIUM` command returns as soon as the tasks are
queued and does not wait for them to be processed.

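The queueing behavior described above can be sketched as a toy client with a client-side buffer of pending GL commands and a service-side FIFO. `ToyCommandBufferClient` is a hypothetical class: a shallow flush moves pending commands into the FIFO, and a gpu_control task is appended to the FIFO directly. Because this model never flushes on its own, it always produces the first of the three orders listed; if GLCommandB or GLCommandC reached the service queue before the control task, one of the other two orders would result.

```c++
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Hypothetical model of the client/service split: GL commands are buffered
// on the client until a shallow flush, gpu_control tasks are posted directly.
class ToyCommandBufferClient {
 public:
  void IssueGLCommand(const std::string& name) { pending_.push_back(name); }

  // Moves pending GL commands into the service queue. Like
  // ShallowFlushCHROMIUM, it does not wait for them to execute.
  void ShallowFlush() {
    for (const std::string& name : pending_) service_queue_.push_back(name);
    pending_.clear();
  }

  // gpu_control tasks bypass the client-side buffer.
  void PostControlTask(const std::string& name) {
    service_queue_.push_back(name);
  }

  // The service processes its queue in order. Any still-pending commands
  // are flushed first so that the returned order is complete.
  std::vector<std::string> DrainService() {
    ShallowFlush();
    std::vector<std::string> executed(service_queue_.begin(),
                                      service_queue_.end());
    service_queue_.clear();
    return executed;
  }

 private:
  std::vector<std::string> pending_;
  std::deque<std::string> service_queue_;
};
```

Replaying the example above against this model, GLCommandA is already in the service queue when DoTask is posted, and GLCommandD is only issued afterwards, so both guarantees hold regardless of when B and C are flushed.
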
## Cross-process transport: GpuFence and GpuFenceHandle

Some platforms such as Android (most devices N and above) and ChromeOS support
synchronizing a native GL context with a command buffer GL context through a
GpuFence.

Use the static `gl::GLFence::IsGpuFenceSupported()` method to check at runtime if
the current platform has support for the GpuFence mechanism including
GpuFenceHandle transport.

The GpuFence mechanism supports two use cases:

* Create a GLFence object in a local context, convert it to a client-side
  GpuFence, duplicate it into a command buffer service-side gpu fence, and
  issue a server wait on the command buffer service side. That service-side
  wait will be unblocked when the client-side GpuFence signals.

* Create a new command buffer service-side gpu fence, request a GpuFenceHandle
  from it, use this handle to create a native GL fence object in the local
  context, then issue a server wait on the local GL fence object. This local
  server wait will be unblocked when the service-side gpu fence signals.

The [CHROMIUM_gpu_fence
extension](/gpu/GLES2/extensions/CHROMIUM/CHROMIUM_gpu_fence.txt) documents
the GLES API as used through the command buffer interface. This section contains
additional information about the integration with local GL contexts that is
needed to work with these objects.

### Driver-level wrappers

In general, you should use the static `gl::GLFence::CreateForGpuFence()` and
`gl::GLFence::CreateFromGpuFence()` factory methods to create a
platform-specific local fence object instead of using an implementation class
directly.

For Android and ChromeOS, the
[gl::GLFenceAndroidNativeFenceSync](/ui/gl/gl_fence_android_native_fence_sync.h)
implementation wraps the
[EGL_ANDROID_native_fence_sync](https://www.khronos.org/registry/EGL/extensions/ANDROID/EGL_ANDROID_native_fence_sync.txt)
extension, which allows creating a special EGL sync object from which a file
descriptor can be extracted, and then creating a duplicate fence object from
that file descriptor that is synchronized with the original fence.

### GpuFence and GpuFenceHandle

A [gfx::GpuFence](/ui/gfx/gpu_fence.h) object owns a GPU fence handle
representing a native GL fence. The `AsClientGpuFence` method casts it to a
ClientGpuFence type for use with the [CHROMIUM_gpu_fence
extension](/gpu/GLES2/extensions/CHROMIUM/CHROMIUM_gpu_fence.txt)'s
`CreateClientGpuFenceCHROMIUM` call.

A [gfx::GpuFenceHandle](/ui/gfx/gpu_fence_handle.h) is an IPC-transportable
wrapper for a file descriptor or other underlying primitive object, and is used
to duplicate a native GL fence into another process. It has value semantics and
can be copied multiple times, and then consumed exactly one time. Consumers take
ownership of the underlying resource. Current GpuFenceHandle consumers are:

* The `gfx::GpuFence(gpu_fence_handle)` constructor takes ownership of the
  handle's resources without constructing a local fence.

* The IPC subsystem closes resources after sending. The typical idiom is to call
  `gfx::CloneHandleForIPC(handle)` on a GpuFenceHandle retrieved from a
  scope-lifetime object to create a copied handle that will be owned by the IPC
  subsystem.

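The ownership rule in the second bullet can be illustrated with POSIX file descriptors. This sketch assumes a POSIX system; `ToyFenceHandle`, `CloneForIpc` and `IsOpen` are hypothetical stand-ins (not `gfx::GpuFenceHandle` or `gfx::CloneHandleForIPC`), and a pipe descriptor stands in for a native fence fd. Cloning with `dup()` is an assumption about how such a clone might work: it gives the "IPC subsystem" its own descriptor to close after sending, while the scope-lifetime owner keeps the original.

```c++
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

#include <utility>

// Hypothetical RAII owner of a fence file descriptor. Whoever holds it
// closes the descriptor when done, like a handle consumer.
class ToyFenceHandle {
 public:
  explicit ToyFenceHandle(int fd) : fd_(fd) {}
  ToyFenceHandle(const ToyFenceHandle&) = delete;
  ToyFenceHandle& operator=(const ToyFenceHandle&) = delete;
  ToyFenceHandle(ToyFenceHandle&& other) noexcept
      : fd_(std::exchange(other.fd_, -1)) {}
  ~ToyFenceHandle() {
    if (fd_ >= 0) close(fd_);
  }
  int fd() const { return fd_; }

 private:
  int fd_;
};

// Gives the IPC side its own descriptor (dup() returns -1 on failure, which
// this sketch does not handle), so closing it leaves the original intact.
ToyFenceHandle CloneForIpc(const ToyFenceHandle& handle) {
  return ToyFenceHandle(dup(handle.fd()));
}

// Returns true if `fd` is still an open descriptor in this process.
bool IsOpen(int fd) { return fcntl(fd, F_GETFD) != -1; }
```

The scope-lifetime owner and the IPC copy each close exactly one descriptor, so neither side can accidentally close the other's resource.
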
### Sample Code

A usage example for two-process synchronization is to sequence access to a
globally shared drawable such as an AHardwareBuffer on Android, where the
writer uses a local GL context and the reader is a command buffer context in
the GPU process. The writer process draws into an AHardwareBuffer-backed
GLImage in the local GL context, then creates a gpu fence to mark the end of
drawing operations:

```c++
// This example assumes that GpuFence is supported. If not, the application
// should fall back to a different transport or synchronization method.
DCHECK(gl::GLFence::IsGpuFenceSupported());

// ... write to the shared drawable in local context, then create
// a local fence.
std::unique_ptr<gl::GLFence> local_fence = gl::GLFence::CreateForGpuFence();

// Convert to a GpuFence.
std::unique_ptr<gfx::GpuFence> gpu_fence = local_fence->GetGpuFence();
// It's ok for local_fence to be destroyed now, the GpuFence remains valid.

// Create a matching gpu fence on the command buffer context, issue
// server wait, and destroy it.
GLuint id = gl->CreateClientGpuFenceCHROMIUM(gpu_fence->AsClientGpuFence());
// It's ok for gpu_fence to be destroyed now.
gl->WaitGpuFenceCHROMIUM(id);
gl->DestroyGpuFenceCHROMIUM(id);

// ... read from the shared drawable via command buffer. These reads
// will happen after the local_fence has signalled. The local_fence
// and gpu_fence don't need to remain alive for this.
```

If a process wants to consume a drawable that was produced through a command
buffer context in the GPU process, the sequence is as follows:

```c++
// Set up callback that's waiting for the drawable to be ready.
void callback(std::unique_ptr<gfx::GpuFence> gpu_fence) {
  // Create a local context GL fence from the GpuFence.
  std::unique_ptr<gl::GLFence> local_fence =
      gl::GLFence::CreateFromGpuFence(*gpu_fence);
  local_fence->ServerWait();
  // ... read from the shared drawable in the local context.
}

// ... write to the shared drawable via command buffer, then
// create a gpu fence:
GLuint id = gl->CreateGpuFenceCHROMIUM();
context_support->GetGpuFenceHandle(id, base::BindOnce(&callback));
gl->DestroyGpuFenceCHROMIUM(id);
```

It is legal to create the GpuFence on a separate command buffer context instead
of on the command buffer context that did the drawing operations, but in that
case `gl->WaitSyncTokenCHROMIUM()` or equivalent must be used to sequence the
operations between the distinct command buffer contexts as usual.
				381	It is legal to create the GpuFence on a separate command buffer context instead
				382	of on the command buffer channel that did the drawing operations, but in that
				383	case gl->WaitSyncTokenCHROMIUM() or equivalent must be used to sequence the
				384	operations between the distinct command buffer contexts as usual.