Proposal Details
What
Change the (implementation-defined) overflow behavior of Go's conversions from float{32,64} to [u]int{8,16,32,64} to be saturating on all platforms. Put this under a GOEXPERIMENT for 1.26; if that goes well, make this the implementation-defined behavior in 1.27, and change the spec in 1.28. This is already directly supported for conversions to 32- and 64-bit integers on arm64, loong64, riscv64, s390x, ppc64, and wasm; the notable outlier is amd64.
int64(NaN) is an additional concern; hardware produces 3 different results:
- 0: arm64, loong64, wasm
- MININT: amd64, s390x, ppc64
- MAXINT: riscv64
Mandating a single answer for NaN conversion would add some code to signed conversions on some platforms; a weaker requirement, that all result bits be zero except possibly the sign bit (so either 0 or MININT), only requires an additional change (beyond others already made) on riscv64.
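For concreteness, the proposed semantics can be written as ordinary Go (illustrative only; the name satInt64 is made up, and NaN → 0 is assumed here because that is what the implementation sketches below produce):

package satconv

import "math"

// satInt64 describes the proposed saturating semantics for int64(x):
// out-of-range values and ±Inf clamp to math.MinInt64/math.MaxInt64,
// and NaN (assumed) converts to 0.
func satInt64(x float64) int64 {
	switch {
	case x != x: // NaN
		return 0
	case x >= math.MaxInt64: // the constant rounds up to 2^63 as a float64
		return math.MaxInt64
	case x <= math.MinInt64:
		return math.MinInt64
	}
	return int64(x) // in range, so the result is already the same on every platform
}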
Why
Variations in overflow behavior lead to unportable code, and we don't have any particular tools for checking this. Removing this unnecessarily implementation-defined behavior would remove this source of portability bugs and also make the out-of-range results easier to describe and understand. We've had the same bug reported twice against x/time/rate because of this implementation-defined behavior (#71154, #18762).
For conversions to 8 and 16-bit integers, the overflow results are not intuitive even when they agree across platforms. I am slightly concerned that there will be more of these in our glorious machine-learning future, and that their current behavior will cause bugs. Changing the behavior of these conversions will result in a simpler and more consistent specification.
Compatibility issues
Technically, this is implementation-defined behavior, and for conversions to 32- and 64-bit integers this merely normalizes the implementation-defined behavior to what is already supported by (some) popular hardware. The "compatibility issues" are actually "portability issues".
A version of this change that was not gated behind an experiment was tried, and it was found to break some external code (which was quickly modified to remove the breakage).
Within Google, testing generated 147 apparent failures. Bisection succeeded for 124 of these (that is, there are at least 124 real failures), and 92 of those were traced to an at-best-confused conversion in a testing library. For the other 32 bisected failures, neither the code nor the test showed any indication of intent in the case of overflow; apparently someone had observed a behavior and declared it golden.
The largest compatibility change is for the conversions to 8- and 16-bit types. Because no hardware supports these directly, the current overflow behavior is somewhat more consistent across platforms, and the implementation cost of saturating increases somewhat on all platforms. If the former behavior (or its performance) is required or desired, it is easily obtained by first converting the floating-point number to int64 and then to the smaller integer type, as sketched below.
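For example (an illustrative sketch; the variable names are made up):

package main

import "fmt"

func main() {
	x := 87381.0 // overflows int8; today int8(x) is 0x55 == 85 on all platforms (see the table below)

	// Workaround that keeps the current result even after this proposal:
	// convert to int64 first (in range here, so well defined), then narrow;
	// the int64 -> int8 step truncates, as integer-to-integer conversions always do.
	old := int8(int64(x)) // 85 (0x55), before and after the change

	// Direct conversion: also 85 today, but under this proposal it would
	// saturate to 127 on every platform.
	proposed := int8(x)

	fmt.Println(old, proposed)
}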
Implementation plans/costs
For conversions to 32- and 64-bit signed integers, most hardware already supports saturating conversions. The exception is amd64; there is also some variation in NaN conversion that we might want to eliminate. The details below show only 64-bit architectures; the pattern is similar for 32-bit architectures, plus softfloat work.
Hardware support
Saturating conversions to 32- and 64-bit integers are already supported by many 64-bit hardware platforms. The exceptions are amd64 and some older mips64 chips (FCSRNAN2008=0). Starred platforms convert NaN to a non-zero value.
| conversion | arm64 | riscv64* | ppc64* | wasm | s390x* | sparcv9 | mips64, loong64 |
|---|---|---|---|---|---|---|---|
| f32->s32 | FCVTSSW | FCVTWS | - | I32TruncSatF32S | CFEBRA | FSTOI | TRUNCFW |
| f32->u32 | FCVTZUSW | FCVTWUS | - | I32TruncSatF32U | - | - | - |
| f32->s64 | FCVTSS | FCVTLS | - | I64TruncSatF32S | CGEBRA | FSTOX | TRUNCFV |
| f32->u64 | FCVTZUS | FCVTLUS | - | I64TruncSatF32U | - | - | - |
| f64->s32 | FCVTSDW | FCVTWD | FCTIWZ | I32TruncSatF64S | CFDBRA | FDTOI | TRUNCDW |
| f64->u32 | FCVTZUDW | FCVTWUD | FCTIWUZ | I32TruncSatF64U | - | - | - |
| f64->s64 | FCVTSD | FCVTLD | FCTIDZ | I64TruncSatF64S | CGDBRA | FDTOX | TRUNCDV |
| f64->u64 | FCVTZUD | FCVTLUD | FCTIDUZ | I64TruncSatF64U | - | - | - |
Signed 64-bit conversion on amd64
func Int64(x float64) (result int64) {
	result = int64(x) // CVTTSD2SQ; produces -1<<63 for NaN, ±Inf, and out-of-range values
	if result == -1<<63 { // error value
		// should be one comparison of x against 0, and two conditional moves (but isn't yet)
		if !(x < 0) { // x > 0 || x is NaN
			result = 0
		}
		if x > 0 {
			result = 0x7fff_ffff_ffff_ffff
		}
	}
	return
}
Normalizing NaN before hardware conversion to [u]int{32,64} if necessary (riscv64, ppc64, s390x)
func Int64(x float64) (result int64) {
	if x != x { // NaN; compiles to a conditional move if the platform has one
		x = 0
	}
	result = int64HW(x)
	return
}
Unsigned 32/64 from signed conversion (amd64, s390x, sparcv9, loong64)
func Uint32(x float32) (result uint32) {
const W = 32
const cutoff = 1 << (W - 1)
if !(x >= cutoff) { // likely; also NaN
result = uint32(int32HW(x)) // first signed conversion
if !(x >= 0) { // unlikely; also handles NaN
result = 0
}
} else {
y := x - cutoff
z := int32HW(y) // second signed conversion
if amd64 && z < 0 { // amd64 positive overflow
z-- // or, z = cutoff-1
}
result = uint32(z) | cutoff
}
return
}
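To make the cutoff trick above concrete, here is a self-contained, runnable model (not the compiler's lowering): int32HWModel stands in for a saturating hardware conversion with arm64-style NaN → 0 behavior, so the amd64-only fixup is omitted. The subtraction x - cutoff is exact for x in [2^31, 2^32), which is why OR-ing the high bit back in recovers the right answer.

package main

import (
	"fmt"
	"math"
)

// int32HWModel models a saturating float32 -> int32 hardware conversion
// (arm64-style: NaN -> 0, out-of-range values clamp).
func int32HWModel(x float32) int32 {
	const bound = 1 << 31
	switch {
	case x != x: // NaN
		return 0
	case x >= bound:
		return math.MaxInt32
	case x <= -bound:
		return math.MinInt32
	}
	return int32(x)
}

// uint32FromFloat32 follows the sketch above: values below 2^31 use one
// signed conversion; larger values subtract 2^31, convert, and OR the
// high bit back in.
func uint32FromFloat32(x float32) uint32 {
	const cutoff = 1 << 31
	if !(x >= cutoff) { // also catches NaN
		r := uint32(int32HWModel(x))
		if !(x >= 0) { // negative or NaN
			r = 0
		}
		return r
	}
	y := x - cutoff           // exact for x in [2^31, 2^32)
	z := int32HWModel(y)      // saturates for x >= 2^32
	return uint32(z) | cutoff // 0xFFFFFFFF when z saturated to MaxInt32
}

func main() {
	for _, x := range []float32{1.5, -1, 3e9, 1e10, float32(math.NaN())} {
		fmt.Printf("uint32FromFloat32(%g) = %#x\n", x, uint32FromFloat32(x))
	}
}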
8/16 unsigned from unsigned float conversion (arm64, riscv64, wasm)
func Uint8(x float64) (result uint8) {
	const cutoff = 1 << intY_Size // intY_Size is 16 or 8
	result = uintY(uint64HW(x))
	if riscv64 && !(x < cutoff) { // optional riscv64 NaN correction
		result = 0
	}
	if x >= cutoff {
		result = cutoff - 1
	}
	return
}
8/16 unsigned from signed float conversion (amd64, s390x, sparcv9, loong64)
func Uint8(x float64) (result uint8) {
	const Y = 8
	const cutoff = 1 << Y // 256; values at or above this saturate
	if !(x >= cutoff) { // likely; also NaN
		result = uint8(int64HW(x)) // first signed conversion
		if !(x >= 0) { // unlikely; also NaN
			result = 0
		}
	} else {
		result = cutoff - 1
	}
	return
}
8/16 signed from signed float conversion (all platforms)
func Int8(x float64) int8 {
	const Y = 8
	const W = 64
	const OVERFLOW = 1<<(Y-1) - 1
	s := int64HW(x)
	signs := uint64(s >> (Y - 1)) // want all 1s or all 0s
	if signs+1 < 2 { // all 1s or all 0s; +1 yields 0 or 1
		return int8(s) // includes 0 for NaN on arm64, loong64, wasm
	}
	// Some kind of overflow, or an {amd64, ppc64, s390x, riscv64} nonzero NaN conversion.
	s = (s >> (W - 1)) ^ OVERFLOW // +/- overflow
	if (amd64 || ppc64 || s390x || riscv64) && x != x {
		s = 0 // NaN
	}
	if amd64 && x > 0 { // amd64 reports positive overflow as -1<<63, too; NaN is not > 0, so this keeps the NaN fix
		s = OVERFLOW
	}
	return int8(s)
}
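As a sanity check, the lowering above can be exercised against a pure-Go stand-in for int64HW (here modeled with arm64-style saturating behavior and NaN → 0, with the pseudo platform flags all false). This is an illustration, not a claim about the compiler:

package main

import (
	"fmt"
	"math"
)

// Pseudo platform flags from the sketch; only the arm64-style path is exercised here.
const amd64, ppc64, s390x, riscv64 = false, false, false, false

// int64HW models an arm64-style saturating float64 -> int64 conversion (NaN -> 0).
func int64HW(x float64) int64 {
	switch {
	case x != x: // NaN
		return 0
	case x >= math.MaxInt64: // the constant rounds up to 2^63 as a float64
		return math.MaxInt64
	case x <= math.MinInt64:
		return math.MinInt64
	}
	return int64(x)
}

func Int8(x float64) int8 { // the lowering sketch from above, unchanged
	const Y, W = 8, 64
	const OVERFLOW = 1<<(Y-1) - 1
	s := int64HW(x)
	signs := uint64(s >> (Y - 1))
	if signs+1 < 2 {
		return int8(s)
	}
	s = (s >> (W - 1)) ^ OVERFLOW
	if (amd64 || ppc64 || s390x || riscv64) && x != x {
		s = 0
	}
	if amd64 && x > 0 {
		s = OVERFLOW
	}
	return int8(s)
}

func main() {
	for _, x := range []float64{1, -1, 87381, -87381, math.Inf(1), math.Inf(-1), math.NaN()} {
		fmt.Printf("Int8(%g) = %d\n", x, Int8(x)) // expect 1, -1, 127, -128, 127, -128, 0
	}
}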
Existing conversion overflow behavior
Here 87381=1.5p16, big32=1.5p32, and big64=1.5p64; the values are chosen so that they overflow conversions into both signed and unsigned integer types but are not infinity. Saturated results (that is, the goal of this proposal) appear in bold italic.
| conversion | 386 | amd64 | arm | arm64 | loong64 | riscv64 | s390x | ppc64 | ppc64le | conversion | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| int8(+87381) | 0x+55 | 0x+55 | 0x+55 | 0x+55 | 0x+55 | 0x+55 | 0x+55 | 0x+55 | 0x+55 | int8(+87381) | |
| int8(+big32) | 0 | 0 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | int8(+big32) | DIFFER |
| int8(+big64) | 0 | 0 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | int8(+big64) | DIFFER |
| int8(-87381) | 0x-55 | 0x-55 | 0x-55 | 0x-55 | 0x-55 | 0x-55 | 0x-55 | 0x-55 | 0x-55 | int8(-87381) | |
| int8(-big32) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | int8(-big32) | |
| int8(-big64) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | int8(-big64) | |
| int16(+87381) | 0x+5555 | 0x+5555 | 0x+5555 | 0x+5555 | 0x+5555 | 0x+5555 | 0x+5555 | 0x+5555 | 0x+5555 | int16(+87381) | |
| int16(+big32) | 0 | 0 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | int16(+big32) | DIFFER |
| int16(+big64) | 0 | 0 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | int16(+big64) | DIFFER |
| int16(-87381) | 0x-5555 | 0x-5555 | 0x-5555 | 0x-5555 | 0x-5555 | 0x-5555 | 0x-5555 | 0x-5555 | 0x-5555 | int16(-87381) | |
| int16(-big32) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | int16(-big32) | |
| int16(-big64) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | int16(-big64) | |
| int32(+big32) | MININT32 | MININT32 | MAXINT32 | MAXINT32 | MAXINT32 | MAXINT32 | MAXINT32 | MAXINT32 | MAXINT32 | int32(+big32) | DIFFER |
| int32(+big64) | MININT32 | MININT32 | MAXINT32 | MAXINT32 | MAXINT32 | MAXINT32 | MAXINT32 | MAXINT32 | MAXINT32 | int32(+big64) | DIFFER |
| int32(-big32) | MININT32 | MININT32 | MININT32 | MININT32 | MININT32 | MININT32 | MININT32 | MININT32 | MININT32 | int32(-big32) | |
| int32(-big64) | MININT32 | MININT32 | MININT32 | MININT32 | MININT32 | MININT32 | MININT32 | MININT32 | MININT32 | int32(-big64) | |
| int64(+big64) | 0 | MININT64 | 0x-100000000 | MAXINT64 | MAXINT64 | MAXINT64 | MAXINT64 | MAXINT64 | MAXINT64 | int64(+big64) | DIFFER |
| int64(-big64) | 0 | MININT64 | 0 | MININT64 | MININT64 | MININT64 | MININT64 | MININT64 | MININT64 | int64(-big64) | DIFFER |
| uint8(+87381) | 0x+55 | 0x+55 | 0x+55 | 0x+55 | 0x+55 | 0x+55 | 0x+55 | 0x+55 | 0x+55 | uint8(+87381) | |
| uint8(+big32) | 0 | 0 | 0x+ff | 0x+ff | 0x+ff | 0x+ff | 0x+ff | 0x+ff | 0x+ff | uint8(+big32) | DIFFER |
| uint8(+big64) | 0 | 0 | 0x+ff | 0x+ff | 0x+ff | 0x+ff | 0x+ff | 0x+ff | 0x+ff | uint8(+big64) | DIFFER |
| uint8(-87381) | 0x+ab | 0x+ab | 0x+ab | 0x+ab | 0x+ab | 0x+ab | 0x+ab | 0x+ab | 0x+ab | uint8(-87381) | |
| uint8(-big32) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | uint8(-big32) | |
| uint8(-big64) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | uint8(-big64) | |
| uint16(+87381) | 0x+5555 | 0x+5555 | 0x+5555 | 0x+5555 | 0x+5555 | 0x+5555 | 0x+5555 | 0x+5555 | 0x+5555 | uint16(+87381) | |
| uint16(+big32) | 0 | 0 | 0x+ffff | 0x+ffff | 0x+ffff | 0x+ffff | 0x+ffff | 0x+ffff | 0x+ffff | uint16(+big32) | DIFFER |
| uint16(+big64) | 0 | 0 | 0x+ffff | 0x+ffff | 0x+ffff | 0x+ffff | 0x+ffff | 0x+ffff | 0x+ffff | uint16(+big64) | DIFFER |
| uint16(-87381) | 0x+aaab | 0x+aaab | 0x+aaab | 0x+aaab | 0x+aaab | 0x+aaab | 0x+aaab | 0x+aaab | 0x+aaab | uint16(-87381) | |
| uint16(-big32) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | uint16(-big32) | |
| uint16(-big64) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | uint16(-big64) | |
| uint32(+big32) | 0x+55555555 | 0x+55555555 | MAXUINT32 | 0x+55555555 | 0x+55555555 | 0x+55555555 | 0x+55555555 | 0x+55555555 | 0x+55555555 | uint32(+big32) | DIFFER |
| uint32(+big64) | 0 | 0 | MAXUINT32 | MAXUINT32 | MAXUINT32 | MAXUINT32 | MAXUINT32 | MAXUINT32 | MAXUINT32 | uint32(+big64) | DIFFER |
| uint32(-87381) | 0x+fffeaaab | 0x+fffeaaab | 0 | 0x+fffeaaab | 0x+fffeaaab | 0x+fffeaaab | 0x+fffeaaab | 0x+fffeaaab | 0x+fffeaaab | uint32(-87381) | DIFFER |
| uint32(-big32) | 0x+aaaaaaab | 0x+aaaaaaab | 0 | 0x+aaaaaaab | 0x+aaaaaaab | 0x+aaaaaaab | 0x+aaaaaaab | 0x+aaaaaaab | 0x+aaaaaaab | uint32(-big32) | DIFFER |
| uint32(-big64) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | uint32(-big64) | |
| uint64(+big64) | 0 | 1<<63 | 0x+ffffffff00000000 | MAXUINT64 | MAXUINT64 | MAXUINT64 | MAXUINT64 | MAXUINT64 | MAXUINT64 | uint64(+big64) | DIFFER |
| uint64(-87381) | 0x+fffffffffffeaaab | 0x+fffffffffffeaaab | 0x+fffffffffffeaaab | 0 | 0x+fffffffffffeaaab | 0x+fffffffffffeaaab | 0 | 0x+fffffffffffeaaab | 0x+fffffffffffeaaab | uint64(-87381) | DIFFER |
| uint64(-big32) | 0x+fffffffeaaaaaaab | 0x+fffffffeaaaaaaab | 0x+fffffffeaaaaaaab | 0 | 0x+fffffffeaaaaaaab | 0x+fffffffeaaaaaaab | 0 | 0x+fffffffeaaaaaaab | 0x+fffffffeaaaaaaab | uint64(-big32) | DIFFER |
| uint64(-big64) | 0 | 1<<63 | 0 | 0 | 1<<63 | 1<<63 | 0 | 1<<63 | 1<<63 | uint64(-big64) | DIFFER |

| conversion | 386 | amd64 | arm | arm64 | loong64 | riscv64 | s390x | ppc64 | ppc64le | conversion | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| int64(+Inf) | 0 | MININT64 | 0x-100000000 | MAXINT64 | MAXINT64 | MAXINT64 | MAXINT64 | MAXINT64 | MAXINT64 | int64(+Inf) | DIFFER |
| int64(-Inf) | 0 | MININT64 | 0 | MININT64 | MININT64 | MININT64 | MININT64 | MININT64 | MININT64 | int64(-Inf) | DIFFER |
| int64(NaN) | 0 | MININT64 | 0 | 0 | 0 | MAXINT64 | MININT64 | MININT64 | MININT64 | int64(NaN) | DIFFER |
| int8(+Inf) | 0 | 0 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | int8(+Inf) | DIFFER |
| int8(-Inf) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | int8(-Inf) | |
| int8(NaN) | 0 | 0 | 0 | 0 | 0 | -1 | 0 | 0 | 0 | int8(NaN) | DIFFER |
| uint32(+Inf) | 0 | 0 | MAXUINT32 | MAXUINT32 | MAXUINT32 | MAXUINT32 | MAXUINT32 | MAXUINT32 | MAXUINT32 | uint32(+Inf) | DIFFER |
| uint32(-Inf) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | uint32(-Inf) | |
| uint32(NaN) | 0 | 0 | 0 | 0 | 0 | MAXUINT32 | 0 | 0 | 0 | uint32(NaN) | DIFFER |
| uint8(+Inf) | 0 | 0 | 255 | 255 | 255 | 255 | 255 | 255 | 255 | uint8(+Inf) | DIFFER |
| uint8(-Inf) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | uint8(-Inf) | |
| uint8(NaN) | 0 | 0 | 0 | 0 | 0 | 255 | 0 | 0 | 0 | uint8(NaN) | DIFFER |