Skip to content

proposal: cmd/compile: floating point to integer conversions should saturate on overflow #76264

@dr2chase

Description

@dr2chase

Proposal Details

What

Change (implementation-defined) overflow behavior for Go's conversions from float{32,64} to [u]int{8,16,32,64} to be saturating on all platforms. Put this under a GOEXPERIMENT for 1.26, if that goes well, make this the implementation-defined behavior in 1.27, and change the spec in 1.28. This is already directly supported for conversions to 32 and 64 bit integers on arm64, loong64, riscv64, s390x, ppc64, and wasm; the notable outlier is amd64.

int64(NaN) is an additional concern; hardware produces 3 different results:

  • 0 - arm64, loong64, wasm
  • MININT - amd64, s390x, ppc64
  • MAXINT - riscv64

Mandating a single answer for NaN conversion will add some code to signed conversions for some languages; defining all result bits zero except for the sign bit only requires an additional change (beyond others already made) on Riscv64.

Why

Variations in overflow behavior lead to unportability and we don't have any particular tools for checking this. Removing this unnecessarily implementation-defined behavior would remove this source of portability bugs and also make the out-of-range results easier to describe and understand. We've had the same bug reported twice against x/time/rate because of this implementation defined behavior (#71154, #18762)

For conversions to 8 and 16-bit integers, the overflow results are not intuitive even when they agree across platforms. I am slightly concerned that there will be more of these in our glorious machine-learning future, and that their current behavior will cause bugs. Changing the behavior of these conversions will result in a simpler and more consistent specification.

Compatibility issues

Technically, this is implementation-defined behavior, and for conversions to 32 and 64 bit integers this is merely normalizing the implementation-defined behavior to be what is supported by (some) popular hardware. The "compatibility issues" are actually "portability issues".

A not-experiment-gated version of this was tried, and was found to break some external code (that was quickly modified to remove the break).

Within Google, testing generated 147 apparent failures. Bisection succeeded for 124 of these (that is, there are at least 124 real failures) and 92 of these were traced to an at-best-confused conversion in a testing library. For the other 32 bisected failures, neither the code nor the test showed any indication of intent in the case of overflow; apparently someone had observed a behavior and declared it golden.

The largest compatibility change is the conversions to 8 and 16-bit types. Because no hardware supports this, overflow behavior across platforms is somewhat more consistent, and the implementation cost increases somewhat across all platforms. If, however, the former behavior/performance is required/desired, it is easily obtained by first casting the floating point number to int64, then to the smaller integer type.

Implementation plans/costs

For conversions to 32 and 64 bit signed integers, most hardware already supports saturating conversions. Amd64 doesn't, and there's some variation in NaN conversion that we might want to eliminate. The details below show only 64-bit architectures; the pattern is similar for 32-bit, plus softfloat work.

Hardware support

Saturating conversions to 32 and 64-bit integers are already supported by many 64-bit hardware platforms. The exceptions are amd64 and some older mips64 chips (FCSRNAN2008=0). Starred platforms have non-zero NaN conversion.

arm64 riscv64* ppc64* wasm s390x* sparcv9 mips64,loong64
f32->s32 FCVTSSW FCVTWS - I32TruncSatF32S CFEBRA FSTOI TRUNCFW
f32->u32 FCVTZUSW FCVTWUS - I32TruncSatF32U - - -
f32->s64 FCVTSS FCVTLS - I64TruncSatF32S CGEBRA FSTOX TRUNCFV
f32->u64 FCVTZUS FCVTLUS - I64TruncSatF32U - - -
f64->s32 FCVTSDW FCVTWD FCTIWZ I32TruncSatF64S CFDBRA FDTOI TRUNCDW
f64->u32 FCVTZUDW FCVTWUD FCTIWUZ I32TruncSatF64U - - -
f64->s64 FCVTSD FCVTLD FCTIDZ I64TruncSatF64S CGDBRA FDTOX TRUNCDV
f64->u64 FCVTZUD FCVTLUD FCTIDUZ I64TruncSatF64U - - -

Signed 64-bit conversion on Amd64

func Int64(x float64) (result int64) {
	result = int64(x)     // CVTTSD2SQ (-)1 << 63 if NaN, +Inf, or -Inf
	if result == -1<<63 { // error value
		// should be one comparison of x and 0, and two conditional moves (but isn't yet)
		if !(x < 0) { // x > 0 || x is NaN
			result = 0
		}
		if x > 0 {
			result = 0x7fff_ffff_ffff_ffffw
		} 
	}
	return
}

Normalizing NaN before hardware conversion to [u]int{32,64} if necessary (riscv64, ppc64, s390x)

func Int64(x float64) (result int64) {
	if x != x { // NaN // compiles to conditional move if there is one.
		x = 0
	}
	result = int64HW(x)
}

Unsigned 32/64 from signed conversion (amd64, s390x, sparcv9, loong64)

func Uint32(x float32) (result uint32) {
	const W = 32
	const cutoff = 1 << (W - 1)
	if !(x >= cutoff) { // likely; also NaN
		result = uint32(int32HW(x)) // first signed conversion
		if !(x >= 0) {            // unlikely; also handles NaN
			result = 0
		}
	} else {
		y := x - cutoff
		z := int32HW(y) // second signed conversion
		if amd64 && z < 0 { // amd64 positive overflow
			z-- // or, z = cutoff-1
		}
		result = uint32(z) | cutoff
	}
	return
}

8/16 unsigned from unsigned float conversion (arm64, riscv64, wasm)

func Uint8(x float64) (result uint8) {
	const cutoff = 1<<(intY_Size) // intY_Size is 16 or 8
	result = uintY(uint64HW(x))
	if riscv64 && !(x < cutoff) { // optional riscv64 NaN correction 
		result = 0
	}
	if x >= cutoff {
		result = cutoff - 1
	}
}
 

8/16 unsigned from signed float conversion (amd64, s390x, sparcv9, loong64)

func Uint8(x float64) (result uint8) {
	const Y = 8
	const cutoff = 1 << (Y - 1)
	if !(x >= cutoff) { // likely; also NaN
		result = uint8(int64HW(x)) // first signed conversion
		if !(x >= 0) {           // unlikely; also NaN
			result = 0
		}
	} else {
		result = cutoff - 1
	}
	return
}

8/16 signed from signed float conversion (all platforms)

func Int8(x float64) int8 {
	const Y = 8
	const W = 64
	const OVERFLOW = (1<<(Y-1) - 1)
	s := int64HW(x)
	signs := uint64(s >> (Y - 1)) // want all 1s or 0s
	if signs+1 < 2 {              // all 1s or 0s, +1, yields 0 or 1
		return int8(s) // includes 0 for NaN on arm64, loong64, wasm
	}
	// Some kind of overflow or amd64 && , ppc64, s390x, riscv64} nonzero NaN conversion
	s = (s >> (W - 1)) ^ OVERFLOW // +/- overflow 
	if (amd64 || ppc64 || s390x || riscv64) && x != x {
		s = 0 // NaN
	}
	If amd64 && x > 0 { // NaN has low bits zero, and NaN is not > 0.
		s = OVERFLOW
	}
	return int8(s)
}

Existing conversion overflow behavior

Here 87381=1.5p16, big32=1.5p32, big64=1.5p64, values chosen so that they overflow conversions into both signed and unisgned integer types but are not infinity. Saturated results (that is, the goal of this proposal) appear in bold italic.

conversion 386 amd64 arm arm64 loong64 riscv64 s390x ppc64 ppc64le conversion
int8(+87381) 0x+55 0x+55 0x+55 0x+55 0x+55 0x+55 0x+55 0x+55 0x+55 int8(+87381)
int8(+big32) 0 0 -1 -1 -1 -1 -1 -1 -1 int8(+big32) DIFFER
int8(+big64) 0 0 -1 -1 -1 -1 -1 -1 -1 int8(+big64) DIFFER
int8(-87381) 0x-55 0x-55 0x-55 0x-55 0x-55 0x-55 0x-55 0x-55 0x-55 int8(-87381)
int8(-big32) 0 0 0 0 0 0 0 0 0 int8(-big32)
int8(-big64) 0 0 0 0 0 0 0 0 0 int8(-big64)
int16(+87381) 0x+5555 0x+5555 0x+5555 0x+5555 0x+5555 0x+5555 0x+5555 0x+5555 0x+5555 int16(+87381)
int16(+big32) 0 0 -1 -1 -1 -1 -1 -1 -1 int16(+big32) DIFFER
int16(+big64) 0 0 -1 -1 -1 -1 -1 -1 -1 int16(+big64) DIFFER
int16(-87381) 0x-5555 0x-5555 0x-5555 0x-5555 0x-5555 0x-5555 0x-5555 0x-5555 0x-5555 int16(-87381)
int16(-big32) 0 0 0 0 0 0 0 0 0 int16(-big32)
int16(-big64) 0 0 0 0 0 0 0 0 0 int16(-big64)
int32(+big32) MININT32 MININT32 MAXINT32 MAXINT32 MAXINT32 MAXINT32 MAXINT32 MAXINT32 MAXINT32 int32(+big32) DIFFER
int32(+big64) MININT32 MININT32 MAXINT32 MAXINT32 MAXINT32 MAXINT32 MAXINT32 MAXINT32 MAXINT32 int32(+big64) DIFFER
int32(-big32) MININT32 MININT32 MININT32 MININT32 MININT32 MININT32 MININT32 MININT32 MININT32 int32(-big32)
int32(-big64) MININT32 MININT32 MININT32 MININT32 MININT32 MININT32 MININT32 MININT32 MININT32 int32(-big64)
int64(+big64) 0 MININT64 0x-100000000 MAXINT64 MAXINT64 MAXINT64 MAXINT64 MAXINT64 MAXINT64 int64(+big64) DIFFER
int64(-big64) 0 MININT64 0 MININT64 MININT64 MININT64 MININT64 MININT64 MININT64 int64(-big64) DIFFER
uint8(+87381) 0x+55 0x+55 0x+55 0x+55 0x+55 0x+55 0x+55 0x+55 0x+55 uint8(+87381)
uint8(+big32) 0 0 0x+ff 0x+ff 0x+ff 0x+ff 0x+ff 0x+ff 0x+ff uint8(+big32) DIFFER
uint8(+big64) 0 0 0x+ff 0x+ff 0x+ff 0x+ff 0x+ff 0x+ff 0x+ff uint8(+big64) DIFFER
uint8(-87381) 0x+ab 0x+ab 0x+ab 0x+ab 0x+ab 0x+ab 0x+ab 0x+ab 0x+ab uint8(-87381)
uint8(-big32) 0 0 0 0 0 0 0 0 0 uint8(-big32)
uint8(-big64) 0 0 0 0 0 0 0 0 0 uint8(-big64)
uint16(+87381) 0x+5555 0x+5555 0x+5555 0x+5555 0x+5555 0x+5555 0x+5555 0x+5555 0x+5555 uint16(+87381)
uint16(+big32) 0 0 0x+ffff 0x+ffff 0x+ffff 0x+ffff 0x+ffff 0x+ffff 0x+ffff uint16(+big32) DIFFER
uint16(+big64) 0 0 0x+ffff 0x+ffff 0x+ffff 0x+ffff 0x+ffff 0x+ffff 0x+ffff uint16(+big64) DIFFER
uint16(-87381) 0x+aaab 0x+aaab 0x+aaab 0x+aaab 0x+aaab 0x+aaab 0x+aaab 0x+aaab 0x+aaab uint16(-87381)
uint16(-big32) 0 0 0 0 0 0 0 0 0 uint16(-big32)
uint16(-big64) 0 0 0 0 0 0 0 0 0 uint16(-big64)
uint32(+big32) 0x+55555555 0x+55555555 MAXUINT32 0x+55555555 0x+55555555 0x+55555555 0x+55555555 0x+55555555 0x+55555555 uint32(+big32) DIFFER
uint32(+big64) 0 0 MAXUINT32 MAXUINT32 MAXUINT32 MAXUINT32 MAXUINT32 MAXUINT32 MAXUINT32 uint32(+big64) DIFFER
uint32(-87381) 0x+fffeaaab 0x+fffeaaab 0 0x+fffeaaab 0x+fffeaaab 0x+fffeaaab 0x+fffeaaab 0x+fffeaaab 0x+fffeaaab uint32(-87381) DIFFER
uint32(-big32) 0x+aaaaaaab 0x+aaaaaaab 0 0x+aaaaaaab 0x+aaaaaaab 0x+aaaaaaab 0x+aaaaaaab 0x+aaaaaaab 0x+aaaaaaab uint32(-big32) DIFFER
uint32(-big64) 0 0 0 0 0 0 0 0 0 uint32(-big64)
uint64(+big64) 0 1<<63 0x+ffffffff00000000 MAXUINT64 MAXUINT64 MAXUINT64 MAXUINT64 MAXUINT64 MAXUINT64 uint64(+big64) DIFFER
uint64(-87381) 0x+fffffffffffeaaab 0x+fffffffffffeaaab 0x+fffffffffffeaaab 0 0x+fffffffffffeaaab 0x+fffffffffffeaaab 0 0x+fffffffffffeaaab 0x+fffffffffffeaaab uint64(-87381) DIFFER
uint64(-big32) 0x+fffffffeaaaaaaab 0x+fffffffeaaaaaaab 0x+fffffffeaaaaaaab 0 0x+fffffffeaaaaaaab 0x+fffffffeaaaaaaab 0 0x+fffffffeaaaaaaab 0x+fffffffeaaaaaaab uint64(-big32) DIFFER
uint64(-big64) 0 1<<63 0 0 1<<63 1<<63 0 1<<63 1<<63 uint64(-big64) DIFFER
conversions 386 amd64 arm arm64 loong64 riscv64 s390x ppc64 ppc64le conversions
int64(+Inf) 0 MININT64 0x-100000000 MAXINT64 MAXINT64 MAXINT64 MAXINT64 MAXINT64 MAXINT64 int64(+Inf) DIFFER
int64(-Inf) 0 MININT64 0 MININT64 MININT64 MININT64 MININT64 MININT64 MININT64 int64(-Inf) DIFFER
int64(NaN) 0 MININT64 0 0 0 MAXINT64 MININT64 MININT64 MININT64 int64(NaN) DIFFER
int8(+Inf) 0 0 -1 -1 -1 -1 -1 -1 -1 int8(+Inf) DIFFER
int8(-Inf) 0 0 0 0 0 0 0 0 0 int8(-Inf)
int8(NaN) 0 0 0 0 0 -1 0 0 0 int8(NaN) DIFFER
uint32(+Inf) 0 0 MAXUINT32 MAXUINT32 MAXUINT32 MAXUINT32 MAXUINT32 MAXUINT32 MAXUINT32 uint32(+Inf) DIFFER
uint32(-Inf) 0 0 0 0 0 0 0 0 0 uint32(-Inf)
uint32(NaN) 0 0 0 0 0 MAXUINT32 0 0 0 uint32(NaN) DIFFER
uint8(+Inf) 0 0 255 255 255 255 255 255 255 uint8(+Inf) DIFFER
uint8(-Inf) 0 0 0 0 0 0 0 0 0 uint8(-Inf)
uint8(NaN) 0 0 0 0 0 255 0 0 0 uint8(NaN) DIFFER

Metadata

Metadata

Assignees

No one assigned

    Labels

    LanguageChangeSuggested changes to the Go languageLanguageChangeReviewDiscussed by language change review committeeLanguageProposalIssues describing a requested change to the Go language specification.Proposal

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions