C++ proposal: There are exactly 8 bits in a byte
open-std.org | 288 points by Twirrim | 9 months ago
Previously, in JF's "Can we acknowledge that every real computer works this way?" series: "Signed Integers are Two’s Complement" <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p09...>
Maybe specifying that floats are always IEEE floats should be next? Though that would obsolete this Linux kernel classic so maybe not.
https://github.com/torvalds/linux/blob/master/include/math-e...
I'm literally giving a talk next week whose first slide is essentially "Why IEEE 754 is not a sufficient description of floating-point semantics", and I'm sitting here trying to figure out what needs to be thrown out of the talk to make it fit the time slot.
One of the most surprising things about floating-point is that very little is actually IEEE 754; most things are merely IEEE 754-ish, and there's a long tail of fiddly things that are different that make it only -ish.
The IEEE 754 standard has been updated several times, often by relaxing previous mandates in order to make various hardware implementations retroactively compliant (e.g., adding Intel's 80-bit floats as a standard floating-point size).
It'll be interesting if the "-ish" bits are still "-ish" with the current standard.
The first 754 standard (1985) was essentially formalization of the x87 arithmetic; it defines a "double extended" format. It is not mandatory:
> Implementations should support the extended format corresponding to the widest basic format supported.
_if_ it exists, it is required to have at least as many bits as the x87 long double type.¹
The language around extended formats changed in the 2008 standard, but the meaning didn't:
> Language standards or implementations should support an extended precision format that extends the widest basic format that is supported in that radix.
That language is still present in the 2019 standard. So nothing has ever really changed here. Double-extended is recommended, but not required. If it exists, the significand and exponent must be at least as large as those of the Intel 80-bit format, but they may also be larger.
---
¹ At the beginning of the standardization process, Kahan and Intel engineers still hoped that the x87 format would gradually expand in subsequent CPU generations until it became what is now the standard 128b quad format; they didn't understand the inertia of binary compatibility yet. So the text only set out minimum precision and exponent range. By the time the standard was published in 1985, it was understood internally that they would never change the type, but by then other companies had introduced different extended-precision types (e.g. the 96-bit type in Apple's SANE), so it was never pinned down.
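(Aside, not from the parent comment: a quick way to see which extended format, if any, a given C++ toolchain binds to long double. On x86 Linux it is typically the 80-bit x87 format, i.e. 64 significand bits and a max exponent of 16384; MSVC and AArch64 just map long double to binary64.)

    #include <cstdio>
    #include <limits>

    int main() {
        std::printf("long double: %d significand bits, max exponent %d, sizeof = %zu\n",
                    std::numeric_limits<long double>::digits,
                    std::numeric_limits<long double>::max_exponent,
                    sizeof(long double));
    }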
The first 754 standard did remove some 8087 features, mainly the "projective" infinity, and it slightly changed the definition of the remainder function, so it was not completely compatible with the 8087.
The Intel 80387 was made compliant with the final standard, and by that time there were competing FPUs also compliant with it, e.g. the Motorola 68881.
I'm interested by your future talk, do you plan to publish a video or a transcript?
> there's a long tail of fiddly things that are different that make it only -ish.
Perhaps a way to fill some time would be gradually revealing parts of a convoluted Venn diagram or mind-map of the fiddly things. (That is, assuming there's any sane categorization.)
Hi! I'm JF. I half-jokingly threatened to do IEEE float in 2018 https://youtu.be/JhUxIVf1qok?si=QxZN_fIU2Th8vhxv&t=3250
I wouldn't want to lose the Linux humor tho!
That line is actually from a famous Dilbert cartoon.
I found this snapshot of it, though it's not on the real Dilbert site: https://www.reddit.com/r/linux/comments/73in9/computer_holy_...
Whether double floats can silently have 80 bit accumulators is a controversial thing. Numerical analysis people like it. Computer science types seem not to because it's unpredictable. I lean towards, "we should have it, but it should be explicit", but this is not the most considered opinion. I think there's a legitimate reason why Intel included it in x87, and why DSPs include it.
Numerical analysis people do not like it. Having _explicitly controlled_ wider accumulation available is great. Having compilers deciding to do it for you or not in unpredictable ways is anathema.
It isn’t harmful, right? Just like getting a little accuracy from a fused multiply add. It just isn’t useful if you can’t depend on it.
It can be harmful. In GCC while compiling a 32 bit executable, making an std::map< float, T > can cause infinite loops or crashes in your program.
This is because when you insert a value into the map, it has 80 bit precision, and that number of bits is used when comparing the value you are inserting during the traversal of the tree.
After the float is stored in the tree, it's clamped to 32 bits.
This can cause the element to be inserted in the wrong place in the tree, and this breaks the assumptions of the algorithm, leading to the crash or infinite loop.
Compiling for 64 bits or explicitly disabling x87 float math makes this problem go away.
I have actually had this bug in production and it was very hard to track down.
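For the curious, here is a minimal sketch of the underlying hazard, separate from the std::map code itself. Whether it actually misfires depends on the compiler version and flags; the classic trigger is 32-bit x87 code generation with excess precision enabled (e.g. -m32 -mfpmath=387 without -ffloat-store).

    #include <cstdio>

    int main() {
        volatile float a = 1.0f, b = 3.0f;   // volatile so the division isn't folded away
        float x = a / b;                      // may be kept in an 80-bit x87 register
        volatile float y = x;                 // forced through memory: rounded to 32 bits
        std::printf("x == y: %d\n", x == y);  // can print 0 while x still has excess precision
    }

A comparator with that property is not a strict weak ordering, which is exactly what std::map assumes.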
10 years ago, a coworker had a really hard time root-causing a bug. I shoulder-debugged it by noticing the bit patterns: it was a miscompile of LLVM itself by GCC, where GCC was using an x87 fldl/fstpl move for a union { double; int64; }. The active member was actually the int64, and GCC chose an FP move based on what the first member of the union was... but the int64 happened to be the representation of an sNaN, so the instructions quietly transformed it into a qNaN as part of moving. The "fix" was to change the order of the union's members in LLVM. The bug is still open, though it's had recent activity: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58416
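A hedged sketch of the kind of pattern involved; the union and the bit pattern here are illustrative, not LLVM's actual code. The integer's bits happen to form a signaling NaN when viewed as a double, and an x87 fldl/fstpl round trip quietly sets the quiet bit. Whether a given compiler actually routes the copy through the x87 depends on the target and flags (the bug needs a 32-bit x87 build).

    #include <cstdint>
    #include <cstdio>

    union Value {
        double  d;
        int64_t i;   // the active member in the miscompiled code
    };

    int main() {
        Value v;
        v.i = 0x7FF4000000000000LL;  // as a double, this bit pattern is a signaling NaN
        // If the copy below is compiled as fldl/fstpl (32-bit x87 code), the sNaN is
        // quieted in transit and w.i comes back as 0x7FFC000000000000.
        Value w = v;
        std::printf("%016llx\n", static_cast<unsigned long long>(w.i));
    }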
It also affected Emacs compilation, and the fix is in trunk now.
Wow, 11 years for such a banal, minimal code trigger. I really don't quite understand how we can have this scale of infrastructure in operation when these kinds of infrastructure software bugs exist. This is not just gcc. The whole working house of cards is an achievement in itself, and also a reminder that good enough is all that is needed.
I also highly doubt you could get 1 in 1000 developers to successfully debug this issue if it were happening in the wild, and far fewer could actually fix it.
If you think that’s bad let me tell you about the time we ran into a bug in memmove.
It had to be an unaligned memmove and using a 32 bit binary on a 64 bit system, but still! memmove!
And this bug existed for years.
This caused our database replicas to crash every week or so for a long time.
What use case do you have that requires indexing a hashmap by a floating point value? Keep in mind, even with a compliant implementation that isn't widening your types behind your back, you still have to deal with NaN.
In fact, Rust has the Eq trait specifically to keep f32/f64s out of hash tables, because NaN breaks them really bad.
std::map is not a hash map. It's a tree map. It supports range queries, upper and lower bound queries. Quite useful for geometric algorithms.
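For anyone unfamiliar with the distinction, a small sketch of the ordered queries a tree map answers directly (the keys and values here are arbitrary):

    #include <cstdio>
    #include <map>

    int main() {
        std::map<float, const char*> m{{0.5f, "a"}, {1.5f, "b"}, {2.5f, "c"}, {3.5f, "d"}};
        // All entries with keys in [1.0, 3.0), something a hash map cannot give you directly.
        for (auto it = m.lower_bound(1.0f); it != m.lower_bound(3.0f); ++it)
            std::printf("%g -> %s\n", it->first, it->second);
    }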
Rust's BTreeMap, which is much closer to what std::map is, also requires Ord (ie types which claim to possess total order) for any key you can put in the map.
However, Ord is an ordinary safe trait. So while we're claiming to be totally ordered, we're allowed to be lying; the resulting type is crap, but it's not unsafe. So, as with sorting, the algorithms inside these container types, unlike in C or C++, actually must not blow up horribly when we were lying (or, as is common in real software, simply clumsy and mistaken).
The infinite loop would be legal (but I haven't seen it) because that's not unsafe, but if we end up with Undefined Behaviour that's a fault in the container type.
This is another place where in theory C++ gives itself license to deliver better performance at the cost of reduced safety but the reality in existing software is that you get no safety but also worse performance. The popular C++ compilers are drifting towards tacit acceptance that Rust made the right choice here and so as a QoI decision they should ship the Rust-style algorithms.
> you still have to deal with NaN.
Detecting and filtering out NaNs is both trivial and reliable as long as nobody instructs the compiler to break basic floating point operations (so no ffast-math). Dealing with a compiler that randomly changes the values of your variables is much harder.
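A minimal sketch of the filter-at-the-boundary approach described above (the function name is made up):

    #include <cmath>
    #include <map>

    // Reject NaN keys before they can poison the map's ordering. std::isnan is
    // reliable as long as -ffast-math (or equivalent) is not in effect.
    bool insert_checked(std::map<float, int>& m, float key, int value) {
        if (std::isnan(key)) return false;
        m[key] = value;
        return true;
    }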
That's purely a problem of Rust being wrong.
Floats have a total order, Rust people just decided to not use it.
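In C++ terms (since that is the thread's topic), the total order in question is IEEE-754's totalOrder predicate, exposed since C++20 as std::strong_order; Rust exposes the same predicate as f64::total_cmp, it just isn't the default comparison for f64. A sketch of opting into it as a map comparator:

    #include <compare>
    #include <map>

    // std::strong_order gives every value, including NaNs, a definite position:
    // negative NaNs sort below -inf, positive NaNs above +inf.
    struct TotalLess {
        bool operator()(double a, double b) const {
            return std::strong_order(a, b) < 0;
        }
    };

    using NanTolerantMap = std::map<double, int, TotalLess>;  // name is just for the example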
Are you mixing up long double with float?
Old Intel CPUs only had long double; 32-bit and 64-bit floats were a compiler hack on top of the 80-bit floating point stack.
It’s absolutely harmful. It turns computations that would be guaranteed to be exact (e.g. head-tail arithmetic primitives used in computational geometry) into “maybe it’s exact and maybe it’s not, it’s at the compiler’s whim” and suddenly your tests for triangle orientation do not work correctly and your mesh-generation produces inadmissible meshes, so your PDE solver fails.
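For readers wondering what those primitives look like, here is a sketch of the classic "two-sum" head/tail building block (Knuth's TwoSum). The guarantee that s + err reproduces a + b exactly holds only if every operation is rounded to double exactly once; excess x87 precision or double rounding silently voids it, which is the harm being described.

    #include <cstdio>

    void two_sum(double a, double b, double& s, double& err) {
        s = a + b;
        double bb = s - a;                 // the portion of b that actually landed in s
        err = (a - (s - bb)) + (b - bb);   // the rounding error of the addition, exactly
    }

    int main() {
        double s, err;
        two_sum(1.0, 1e-16, s, err);
        std::printf("s = %.17g, err = %.17g\n", s, err);  // err recovers the lost low-order bits
    }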
Thank you, I found this hint very interesting. Is there a source you wouldn't mind pointing me to for those "head, tail" methods?
I am assuming it relates to the kinds of "variable precision floating point with bounds" methods used in CGAL and the like; Googling turns up this survey paper:
https://inria.hal.science/inria-00344355/PDF/p.pdf
Any additional references welcome!
Note here is a good starting point for the issue itself: http://www.cs.cmu.edu/~quake/triangle.exact.html
References for the actual methods used in Triangle: http://www.cs.cmu.edu/~quake/robust.html
If not done properly, double rounding (rounding to extended precision, then rounding to working precision) can actually introduce a larger approximation error than rounding to nearest working precision directly. (Decimal analogy: 0.49 rounded first to one decimal place gives 0.5, which then rounds to 1, while rounding 0.49 to an integer directly gives 0.) So it can actually make some numerical algorithms perform worse.
I suppose it could be harmful if you write code that depends on it without realizing it, and then something changes so it stops doing that.
I get what you mean and agree, and have seen almost traumatized rants against ffast-math from the very same people.
After digging, I think this is the kind of thing I'm referring to:
https://people.eecs.berkeley.edu/~wkahan/JAVAhurt.pdf
https://news.ycombinator.com/item?id=37028310
I've seen other course notes, I think also from Kahan, extolling 80-bit hardware.
Personally I am starting to think that, if I'm really thinking about precision, I had maybe better just use fixed point, but this again is just a "lean" that could prove wrong over time. Somehow we use floats everywhere and it seems to work pretty well, almost unreasonably so.
Yeah. Kahan was involved in the design of the 8087, so he’s always wanted to _have_ extended precision available. What he (and I, and most other numerical analysts) are opposed to is the fact that (a) language bindings historically had no mechanism to force rounding to float/double when necessary, and (b) compilers commonly spilled x87 intermediate results to the stack as doubles, leading to intermediate rounding that was extremely sensitive to optimization and subroutine calls, making debugging numerical issues harder than it should be.
Modern floating-point is much more reproducible than fixed-point, FWIW, since it has an actual standard that’s widely adopted, and these excess-precision issues do not apply to SSE or ARM FPUs.
I was curious about float16, and TIL that the 2008 revision of the standard includes it as an interchange format.
Note that this type (which Rust calls "f16", currently in nightly, and which a C-like language would probably name "half") is not the only popular 16-bit floating point type; some people want https://en.wikipedia.org/wiki/Bfloat16_floating-point_format instead.
The IEEE FP16 format is what is useful in graphics applications, e.g. for storing color values.
The Google BF16 format is useful essentially only for machine learning/AI applications, because its precision is insufficient for anything else. In exchange for the very low precision, its exponent range equals that of FP32, which makes overflows and underflows less likely.
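A hedged sketch of why BF16 pairs so naturally with FP32: its 1-8-7 layout is literally the top 16 bits of an IEEE binary32 (1-8-23), so a crude conversion is a truncation. Real converters round to nearest even and take care with NaN payloads; this only illustrates the layout relationship.

    #include <cstdint>
    #include <cstring>

    uint16_t bf16_from_float_truncate(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof bits);       // same sign/exponent layout as FP32
        return static_cast<uint16_t>(bits >> 16);  // keep sign, 8 exponent bits, top 7 fraction bits
    }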
Permalink (press 'y' anywhere on GitHub): https://github.com/torvalds/linux/blob/4d939780b70592e0f4bc6....
That file hasn't been touched in over 19 years. I don't think we have to worry about the non-permalink url breaking any time soon.
Which one? Remember, the decimal IEEE 754 floating point formats exist too. Do folks in banking use IEEE decimal formats? I remember we used to have different math libs to link against depending on the format, but this was like 40 years ago.
Binding float to the IEEE 754 binary32 format would not preclude use of decimal formats; they have their own bindings (e.g. _Decimal64 in C23). (I think they're still a TR for C++, but I haven't been keeping track).
Nothing prevents banks (or anyone else) from using a compiler where "float" means binary floating point while some other native or user-defined type supports decimal floating point. In fact, that's probably for the best, since they'll probably have exacting requirements for that type so it makes sense for the application developer to write that type themselves.
I was referring to banks using decimal libraries because they work in base 10 numbers, and I recall a big announcement many years ago when the stock market officially switched from fractional stock pricing to cents "for the benefit of computers and rounding", or some such excuse. It always struck me as strange, since binary fixed and floating point represent those particular quantities exactly, without rounding error. Now with normal dollars and cents calculations, I can see why a decimal library might be preferred.
During an internship in 1986 I wrote C code for a machine with 10-bit bytes, the BBN C/70. It was a horrible experience, and the existence of the machine in the first place was due to a cosmic accident of the negative kind.
I wrote code on a DECSYSTEM-20; the C compiler was not officially supported. It had a 36-bit word and a 7-bit byte. Yep, when you packed five bytes into a word there was a bit left over.
And I was tasked with reading a tape with binary data in 8-bit format. Hilarity ensued.
That is so strange. If it were 9-bit bytes, that would make sense: 8 bits + parity. Then a word is just 32 bits + 4 parity bits.
7 bits matches ASCII, so you can implement entire ASCII character set, and simultaneously it means you get to fit one more character per byte.
Using RADIX-50, or SIXBIT, you could fit more but you'd lose ASCII-compatibility
8 bits in a byte exist in the first place because "obviously"(*) a byte is a 7-bit char + parity.
(*) For some value of "obviously".
Hah. Why did they do that?
Which part of it?
8-bit tape? Probably the format the hardware worked in... not actually sure, I haven't used real tapes, but it's plausible.
36 bit per word computer? Sometimes 0..~4Billion isn't enough. 4 more bits would get someone to 64 billion, or +/- 32 billion.
As it turns out, my guess was ALMOST correct
https://en.wikipedia.org/wiki/36-bit_computing
Paraphrasing: legacy keying systems were based on records of up to 10 printed decimal digits of accuracy for input. 35 bits would be required to match that +/- input, but 36 works better as a machine word and for operations on six 6-bit (yuck?) characters; some 'smaller' machines used a 36-bit large word with 12- or 18-bit small words. Why the yuck? 6 bits is only 64 characters total, so these systems only supported UPPERCASE ALWAYS, numeric digits, and some other characters.
Somehow this machine found its way onto The Heart of Gold in a highly improbable chain of events.
I programmed the Intellivision CPU, which had a 10-bit "decle". A wacky machine. It wasn't powerful enough for C.
I've worked on a machine with 9-bit bytes (and 81-bit instructions) and others with 6-bit ones; neither had a C compiler.
The Nintendo 64 had 9-bit RAM, but C viewed it as 8-bit. The 9th bit was only there for the RSP (GPU).
I think the pdp-10 could have 9 bit bytes, depending on decisions you made in the compiler. I notice it's hard to Google information about this though. People say lots of confusing, conflicting things. When I google pdp-10 byte size it says a c++ compiler chose to represent char as 36 bits.
PDP-10 byte size is not fixed. Bytes can be 0 to 36 bits wide. (Sure, 0 is not very useful; still legal.)
I don't think there is a C++ compiler for the PDP-10. One of the C compilers does have a 36-bit char type.
I was summarizing this from a Google search. https://isocpp.org/wiki/faq/intrinsic-types#:~:text=One%20wa....
As I read it, this link may be describing a hypothetical rather than real compiler. But I did not parse that on initial scan of the Google result.
Do you have any links/info on how that 0-bit byte worked? It sounds like just the right thing for a Friday afternoon read ;D
It should be in the description for the byte instructions: LDB, DPB, IBP, and ILDB. http://bitsavers.org/pdf/dec/pdp10/1970_PDP-10_Ref/
Basically, loading a 0-bit byte from memory gets you a 0. Depositing a 0-bit byte will not alter memory, but may do an ineffective read-modify-write cycle. Incrementing a 0-bit byte pointer will leave it unchanged.
10-bit arithmetic is actually not uncommon on FPGAs these days and is used in production in relatively modern applications.
10-bit C, however, ..........
How so? Arithmetic on FPGAs usually uses the minimum size that works, because any size over that will use more resources than needed.
9-bit bytes are pretty common in block RAM though, with the extra bit being used either for ECC or for user storage.
10-bit C might be close to non-existent, but I've heard that quite a few DSP are word addressed. In practice this means their "bytes" are 32 bits.
sizeof(uint32_t) == 1
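Which is why portable code spells that assumption out today; a one-liner the proposal would effectively make always-true:

    #include <climits>

    // Refuse to build on word-addressed DSPs and other CHAR_BIT != 8 targets.
    static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");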
D made a great leap forward with the following:
1. bytes are 8 bits
2. shorts are 16 bits
3. ints are 32 bits
4. longs are 64 bits
5. arithmetic is 2's complement
6. IEEE floating point
and a big chunk of wasted time trying to abstract these away and getting it wrong anyway was saved. Millions of people cried out in relief!
Oh, and Unicode was the character set. Not EBCDIC, RADIX-50, etc.
Zig is even better:
1. u8 and i8 are 8 bits.
2. u16 and i16 are 16 bits.
3. u32 and i32 are 32 bits.
4. u64 and i64 are 64 bits.
5. Arithmetic is an explicit choice. '+' overflowing is illegal behavior (it will crash in Debug and ReleaseSafe), '+%' is 2's complement wrapping, and '+|' is saturating arithmetic. Edit: forgot to mention @addWithOverflow(), which provides a tuple of the original type and a u1; there's also std.math.add(), which returns an error on overflow.
6. f16, f32, f64, f80, and f128 are the respective bit-length IEEE floating point types.
The question of the length of a byte doesn't even matter. If someone wants to compile to machine whose bytes are 12 bits, just use u12 and i12.
Zig allows any uX and iX in the range of 1 - 65,535, as well as u0
u0?? Why?
Sounds like zero-sized types in Rust, which are used as marker types (e.g. "this struct owns this lifetime"). They can also be used to turn a HashMap into a HashSet by storing a zero-sized value. In Go, a struct member of [0]func() (an array of functions with exactly 0 elements) is used to make a type uncomparable, as func() cannot be compared.
Same deal with Rust.
I've heard that Rust wraps around by default?
Rust has two possible behaviours: panic or wrap. By default debug builds with panic, release builds with wrap. Both behaviours are 100% defined, so the compiler can't do any shenanigans.
There are also helper functions and types for unchecked/checked/wrapping/saturating arithmetic.
LLVM has:
i1 is 1 bit
i2 is 2 bits
i3 is 3 bits
…
i8388608 is 2^23 bits
(https://llvm.org/docs/LangRef.html#integer-type)
On the other hand, it doesn’t make a distinction between signed and unsigned integers. Users must take care to use special signed versions of operations where needed.
How does 5 work in practice? Surely no one is actually checking if their arithmetic overflows, especially from user-supplied or otherwise external values. Is there any use for the normal +?
You think no one checks if their arithmetic overflows?
I'm sure it's not literally no one but I bet the percent of additions that have explicit checks for overflow is for all practical purposes indistinguishable from 0.
Lots of secure code checks for overflow
fillBufferWithData(buffer, data, offset, size)
You want to know that offset + size don't wrap past 32 bits (or 64) and end up with nonsense and a security vulnerability. (A sketch of this check is below.)

Eh, I like the nice names. Byte=8, short=16, int=32, long=64 is my preferred scheme when implementing languages. But either is better than C and C++.
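A sketch of the kind of check the fillBufferWithData example above calls for, using GCC/Clang's __builtin_add_overflow (the function signature and buffer types here are hypothetical):

    #include <cstddef>
    #include <cstdint>

    bool fill_buffer_with_data(uint8_t* buffer, size_t buffer_len,
                               const uint8_t* data, size_t offset, size_t size) {
        size_t end;
        // Reject the request if offset + size wraps, or if the known-good sum
        // runs past the end of the buffer.
        if (__builtin_add_overflow(offset, size, &end) || end > buffer_len)
            return false;
        for (size_t i = 0; i < size; ++i)
            buffer[offset + i] = data[i];
        return true;
    }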
It would be "nice" if not for C setting a precedent for these names to have unpredictable sizes. Meaning you have to learn the meaning of every single type for every single language, then remember which language's semantics apply to the code you're reading. (Sure, I can, but why do I have to?)
[ui][0-9]+ (and similar schemes) on the other hand anybody can understand at the first glance.
> D made a great leap forward
> and a big chunk of wasted time trying to abstract these away and getting it wrong anyway was saved. Millions of people cried out in relief!
Nah. It is actually pretty bad. Type names with explicit sizes (u8, i32, etc) are way better in every way.
> Type names with explicit sizes (u8, i32, etc) are way better in every way
Until one realizes that the entire namespace of innn, unnn, fnnn, etc., is reserved.
"1. bytes are 8 bits"
How big is a bit?
This doesn't feel like a serious question, but in case this is still a mystery to you… the name bit is a portmanteau of binary digit, and as indicated by the word "binary", there are only two possible digits that can be used as values for a bit: 0 and 1.
So trinary and quaternary digits are trits and quits?
Yes, trit is commonly used for ternary logic. "quit" I have never heard in such a context.
A bit is a measure of information theoretical entropy. Specifically, one bit has been defined as the uncertainty of the outcome of a single fair coin flip. A single less than fair coin would have less than one bit of entropy; a coin that always lands heads up has zero bits, n fair coins have n bits of entropy and so on.
https://en.m.wikipedia.org/wiki/Information_theory
https://en.m.wikipedia.org/wiki/Entropy_(information_theory)
That is a bit in information theory. It has nothing to do with the computer/digital engineering term being discussed here.
This comment I feel sure would repulse Shannon in the deepest way. A (digital, stored) bit, abstractly seeks to encode and make useful through computation the properties of information theory.
Your comment must be sarcasm or satire, surely.
I do not know or care what would Mr. Shannon think. What I do know is that the base you chose for the logarithm on the entropy equation has nothing to do with the amount of bits you assign to a word on a digital architecture :)
How philosophical do you want to get? Technically, voltage is a continuous signal, but we sample only at clock cycle intervals, and if the sample at some cycle is below a threshold, we call that 0. Above, we call it 1. Our ability to measure whether a signal is above or below a threshold is uncertain, though, so for values where the actual difference is less than our ability to measure, we have to conclude that a bit can actually take three values: 0, 1, and we can't tell but we have no choice but to pick one.
The latter value is clearly less common than 0 and 1, but how much less? I don't know, but we have to conclude that the true size of a bit is probably something more like 1.00000000000000001 bits rather than 1 bit.
A bit is either a 0 or 1. A byte is the smallest addressable piece of memory in your architecture.
Technically the smallest addressable piece of memory is a word.
I don't think the term word has any consistent meaning. Certainly x86 doesn't use the term word to mean smallest addressable unit of memory. The x86 documentation defines a word as 16 bits, but x86 is byte addressable.
ARM is similar, ARM processors define a word as 32-bits, even on 64-bit ARM processors, but they are also byte addressable.
As best as I can tell, it seems like a word is whatever the size of the arithmetic or general purpose register is at the time that the processor was introduced, and even if later a new processor is introduced with larger registers, for backwards compatibility the size of a word remains the same.
Depends on your definition of addressable.
Lots of CISC architectures allow memory accesses in various units even if they call general-purpose-register-sized quantities "word".
Iirc the C standard specifies that all memory can be accessed via char*.
Every ISA I've ever used has used the term "word" to describe a 16- or 32-bit quantity, while having instructions to load and store individual bytes (8 bit quantities). I'm pretty sure you're straight up wrong here.
That's only true on a word-addressed machine; most CPUs are byte-addressed.
The difference between address A and address A+1 is one byte. By definition.
Some hardware may raise an exception if you attempt to retrieve a value at an address that is not a multiple of its size (for sizes greater than one byte), but that has no bearing on the definition of a byte.
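A tiny illustration of that definition in C++ terms (nothing here is specific to one architecture):

    #include <cstdint>
    #include <cstdio>

    int main() {
        std::uint32_t arr[2] = {0, 0};
        const char* lo = reinterpret_cast<const char*>(&arr[0]);
        const char* hi = reinterpret_cast<const char*>(&arr[1]);
        // Consecutive addresses differ by one byte, so a 4-byte object spans
        // four consecutive addresses.
        std::printf("%td\n", hi - lo);  // prints 4 where CHAR_BIT == 8
    }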
Which … if your heap always returns N bit aligned values, for some N … is there a name for that? The smallest heap addressable segment?
If your detector is sensitive enough, it could be just a single electron that's either present or absent.
That's a bit self-pat-on-the-back-ish, isn't it, Mr. Bright, the author of D language? :)
Of course!
Over the years I've known some engineers who, as a side project, wrote some great software. Nobody was interested in it. They'd come to me and ask why that was. I'd suggest writing articles about their project, and being active on the forums. Otherwise, who would ever know about it?
They said that was unseemly, and wouldn't do it.
They wound up sad and bitter.
The "build it and they will come" is a stupid Hollywood fraud.
BTW, the income I receive from D is $0. It's my gift. You'll also note that I've suggested many times improvements that could be made to C, copying proven ideas in D. Such as this one:
https://www.digitalmars.com/articles/C-biggest-mistake.html
C++ has already adopted many ideas from D.
> https://www.digitalmars.com/articles/C-biggest-mistake.html
To be fair, this one lies on the surface for anyone trying to come up with an improved C. It's one of the first things that gets corrected in nearly all C derivatives.
> C++ has already adopted many ideas from D.
Do you have a list?
Especially for the "adopted from D" bit rather than being a evolutionary and logical improvement to the language.
Yeah, this is something Java got right as well. It got "unsigned" wrong, but it got standardizing primitive bits correct:
byte = 8 bits
short = 16
int = 32
long = 64
float = 32 bit IEEE
double = 64 bit IEEE
I like the Rust approach more: usize/isize are the native integer types, and with every other numeric type, you have to mention the size explicitly.
On the C++ side, I sometimes use an alias that contains the word "short" for 32-bit integers. When I use them, I'm explicitly assuming that the numbers are small enough to fit in a smaller than usual integer type, and that it's critical enough to performance that the assumption is worth making.
<cstdint> has int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, and uint64_t. I still go back and forth between uint64_t, size_t, and unsigned int, but am defaulting to uint64_t more and more, even if it doesn't matter.
> you have to mention the size explicitly
It's unbelievably ugly. Every piece of code working with any kind of integer screams "I am hardware dependent in some way".
E.g. in a structure representing an automobile, the number of wheels has to be some i8 or i16, which looks ridiculous.
Why would you take a language in which you can write functional pipelines over collections of objects, and make it look like assembler?
If you don't care about the size of your number, just use isize or usize.
If you do care, then isn't it better to specify it explicitly than trying to guess it and having different compilers disagreeing on the size?
A type called isize is some kind of size. It looks wrong for something that isn't a size.
Then just define a type alias, which is good practice if you want your types to be more descriptive: https://doc.rust-lang.org/reference/items/type-aliases.html
Nope! Because then you will also define an alias, and Suzy will define an alias, and Bob will define an alias, ...
We should all agree on int and uint; not some isize nonsense, and not bobint or suzyint.
Ok, it is obvious that you are looking for something to complain about and don't want to find a solution. That is not a productive attitude in life, but whatever floats your boat. Have a good day.