The point of this post, though, is even something as simple as "give me this string as an integer" doesn't have an answer that doesn't come with "are you OK with this best effort parse under these edge cases? Oh and we use this number as error, so you can't parse that".
Like… edge cases? It's parsing a number! We're not talking about I/O on hard vs soft intr NFS mounts, here. There's a right answer.
strlen(), on valid null terminated strings, doesn't come with caveats like "oh we can't measure strings of length 99".
But sure, C is turing complete. It is possible to solve any problem a turing machine can solve.
> understand the target platform and the target compiler’s behavior.
Somewhat true, but C is pretty close to translating directly to machine code, even if most compilers now do so many complex things the assembly can be pretty far off. My point being is that if you have a type int in your program, it's specifically tied to the byte size of an integer on the target platform. While it can be 8, 16, 32, 64 bits, it's defined based on what the target platform supports efficiently.
So, when you say, "it's purely the language", I have to disagree. The language means different things on different platforms but it's still defined exactly on the target platform. And it's efficient on that platform.
Nowadays, we prefer correct vs. efficient, which I do agree with, of course. But, I also understand why C is like it is. It is possible to claim it's a problem of the language but I would argue that it is not. C gives us barebones and working with it we have to know this. If that's not needed then sure, other languages will be easier to work with.
> C is pretty close to translating directly to machine code
The C standard defines only its abstract machine, not actual hardware.
> The language means different things on different platforms but it's still defined exactly on the target platform
It's implemented to support a target platform, so that programs behave as if they ran on the abstract machine.
It'd be nice if we could move more stuff from UB to implementation defined.
Do keep in mind that target platform can change, in this regard. E.g. IIRC OpenBSD doesn't guarantee the ABI backward compatibility that Linux does, and can change things like size of int if they want, between versions.
> I also understand why C is like it is
Yup. It can be true that I understand why, and still understand that it's 2026.
Is their misbehavior part of the spec as well? If not, we can always add the correct behavior to the spec and let anyone who implemented a broken version deal with fixing every program compiled using it.
For strtoul and friends, maybe? 7.24.1 is pretty dense, but the key parts are "the expected form of the subject sequence is a sequence of letters and digits representing an integer with the radix specified by base, optionally preceded by a plus or minus sign […] If the correct value is outside the range of representable values […] ULONG_MAX […] is returned".
So the "expected form" allows a minus sign, but then it's clearly "outside the range of representable values" for strtoul to try parsing a negative value. So maybe it should return ULONG_MAX on those.
So arguably a minus sign present could already be treated as an error, and still be standard compliant. Unless I'm misreading.
Passing a negative value to a function that is specifically for converting strings into unsigned numbers is pretty much an error. In the case of functions that return an unsigned number, at least, negative return values can represent errors.
It’s more fun when the result can be signed though. Maybe strcmp with the representation of the LONG_MAX, and if it doesn’t match, call strtol and watch for a LONG_MAX indicating an error.
C is a bit messy. Would be nicer to return a struct with a possible error and the desired value, Golang style.
If you pass a negative number as a string to a function that converts strings representing positive numbers into positive long integers, then yes, it should return an error status instead of the wrong result.
While snprintf() is better than sprintf(), I find that it's easy for people to not check if the return value is bigger than the provided size. Sure, it prevents a buffer overflow, but there could still be a string truncation problem.
Similar to how strlcpy() is not a slam dunk fix to the strcpy() problem.
If someone uses sprintf() you have to go faffing around to check whether they've thought about the destination buffer size. The size of the structure may be buried far away through several layers of other APIs/etc.
Using snprintf() doesn't solve this in any way, but checking whether the new use of snprintf() checks the return value is relatively simple. Again, there's still no guarantee that there aren't other problems with snprintf() but, in our experience, we found that once people were forced to use it over sprintf() and had things checked in PR reviews we found that the number of instances of misuse dropped dramatically.
It wasn't the switch of functions that reduced the number of problems we saw, but the outright banning of the known footgun `sprintf()` and the careful auditing and replacement of it with `snprintf()` that served as a whole load of reference copies for how to use it. We spread the work of replacing `sprintf()` around the team so that everyone got to do some of the switches and everyone got to review the changes. And we found a whole load of possible problems (most of which were very unlikely to ever lead to a crash or corruption.)
The same would apply if you picked any other known footgun and did similar refactoring/rewrites/auditing/etc.
Anyway, I haven't done C commercially/professionally for about 5 years now. I do miss it though.
> they don't like seem to like that unsigned parsing will accept negative numbers and then automatically wrap them to their unsigned equivalents, nor do they like that C number parsing often bails with best effort on non-numeric trailing data rather than flagging it an error, nor do they like that ULONG_MAX is used as a sentinel value by sscanf.
That's right. I don't like asking it to parse the number contained inside a string, and getting a different number as a result.
That's just simply not the right answer.
> I'm not sure what they mean by "output raw" vs "output"
I can see how that's very unclear. Changed now to "Readable".
Are you talking about creating a pointer (more than one item) past an array, or dereferencing that pointer? Both are currently UB.
For the former, I kinda get it. It may need to be there for cases like with segmented address space where p+10 could actually be a value less than p, for the eventually generated assembly. Maybe it should be fine to create such a pointer, but have it be "indeterminate value" or whatever, if you try to compare that pointer to anything? I don't know enough about compiler internals to say one way or the other.
Dereferencing, though, can only be UB. There may not be a "value" behind that address. There may be a motor that's been I/O mapped, or a self destruct button.
I think I would defer to someone more of a language lawyer than we, but I'm not sure what you're describing can be expressed in the C abstract machine. If a pointer is invalid, not pointing to an object, then I'm not sure it means anything to "read from there".
I know what you mean, but I'm just not sure you're describing something that fits what C "is". We program C to the abstract machine specified in the standard (5.1.2), and the compiler's job is to translate that into something with identical behavior on particular hardware. Piercing the layers down to actual hardware or assembly isn't really done.
Even "volatile" just says (basically) "touching this object has side effects". It implies no double-loading, speculative store, etc, but doesn't say "don't emit assembly instructions to load this unless the program logic path takes the route where the C program does load it".
The standard is not using ancient language when it refers to "objects with static storage duration" instead of "heap" or ".data segment". It is the true class of objects in the abstract machine.
Wouldn't that make a compiler that emitted bounds checks violate the standard, since it would not be emitting the actual memory operations if you deref out of bounds?
I agree. The point of the post is not to enumerate and explain the implications of all 283 uses of the word "undefined" in the standard. Nor enumerate all the things that are undefined by omission.
The point of the post is to say it's not possible to avoid them. Or at least, no human since the invention of C in 1972 has.
And if it's not succeeded for 54 years, "try harder", or "just never make a mistake", is at least not the solution.
The (one!) exploitable flaw found by Mythos in OpenBSD was an impressive endorsement of the OpenBSD developers, and yet as the post says, I pointed it at the simplest of their code and found a heap of UB.
Now, is it exploitable that `find` also reads the uninitialized auto variable `status` (UB) from a `waitpid(&status)` before checking if `waitpid()` returned error? (not reported) I can't imagine an architecture or compiler where it would be, no.
FTA:
> The following is not an attempt at enumerating all the UB in the world. It’s merely making the case that UB is everywhere, and if nobody can do it right, how is it even fair to blame the programmer? My point is that ALL nontrivial C and C++ code has UB.
> Now, is it exploitable that `find` also reads the uninitialized auto variable `status` (UB) from a `waitpid(&status)` before checking if `waitpid()` returned error? (not reported) I can't imagine an architecture or compiler where it would be, no.
The only signal handler find installs is for SIGINFO, and it uses the SA_RESTART flag, so EINTR can be ruled out. The pid argument is definitely valid as you can't reach the above if it wasn't, and there's no other way for the child process to be reaped[1], so no ECHILD.
A check should probably be added in case the situation changes in the future, triggering spooky action at a distance, or were that code to be copy+pasted somewhere where the invariants didn't hold. But I think the current code in its current context is, strictly speaking, correct as-is.
[1] OpenBSD lacks the kernel features for such surprises that might theoretically be possible on Linux.
Indeed. That's why I didn't deem it worth reporting.
But in my code, I would have fixed for the reasons you mention. Sprinkle enough of these around, and some low percentage will in the future have its assumption invalidated.
No. You only get EINTR when a signal handler fires and you didn't use the SA_RESTART flag with sigaction. If you don't install any signal handlers, or you use SA_RESTART on all handlers, or you've blocked/masked all signals (or at least the ones with handlers), you won't get EINTR.
When writing library code, it's important to consider EINTR because you can't know about signal dispositions. Though, the common practice of looping on EINTR kind of defeats the purpose.
> And if it's not succeeded for 54 years, "try harder", or "just never make a mistake", is at least not the solution.
And I 100% agree. UB is way overused by these standards for how dangerous it is, and as a consequence using C (and C++) for anything nontrivial amounts to navigating a minefield.
I think as compilers got smarter, UB changed somewhat in meaning. Originally the compilers didn't perform such complex analysis, and while invoking UB could break your program, it would still do something reasonable.
You see a reasoning here, basically when all those C compiler benchmarks started, vendors moved from what Frank Allen described, to anything goes to win SPEC something benchmarks.
"Oh, it was quite a while ago. I kind of stopped when C came out. That was a big blow. We were making so much good progress on optimizations and transformations. We were getting rid of just one nice problem after another. When C came out, at one of the SIGPLAN compiler conferences, there was a debate between Steve Johnson from Bell Labs, who was supporting C, and one of our people, Bill Harrison, who was working on a project that I had at that time supporting automatic optimization...The nubbin of the debate was Steve's defense of not having to build optimizers anymore because the programmer would take care of it. That it was really a programmer's issue.... Seibel: Do you think C is a reasonable language if they had restricted its use to operating-system kernels? Allen: Oh, yeah. That would have been fine. And, in fact, you need to have something like that, something where experts can really fine-tune without big bottlenecks because those are key problems to solve. By 1960, we had a long list of amazing languages: Lisp, APL, Fortran, COBOL, Algol 60. These are higher-level than C. We have seriously regressed, since C developed. C has destroyed our ability to advance the state of the art in automatic optimization, automatic parallelization, automatic mapping of a high-level language to the machine. This is one of the reasons compilers are ... basically not taught much anymore in the colleges and universities."
-- Fran Allen interview, Excerpted from: Peter Seibel. Coders at Work: Reflections on the Craft of Programming
“Implementation defined behaviour”: compiler author chooses, and documents the choice.
A lot of UB should be implementation defined behaviour instead; this would much better match programmers’ intuitions as they reason about their code - you can even see it in the comments of this post: it’s always things like “this hardware supports / doesn’t support unaligned accesses”, it’s never nasal demons.
That isn't true, for UB the compiler is allowed to assume the UB can never happen. For example if you dereference a pointer and only after check if it is NULL, the compiler can remove the NULL check, since it is clearly impossible (nevermind that you might be on a microcontroller where NULL is a valid address).
The fallout of this are quite large!
If behaviour is implementation defined the compiler has to stick to one consistent behaviour. No such need for UB, you can get different behaviour bu changing unrelated code, by changing between debug and release or just because of what garbage happened to be on the stack.
Since the compiler is allowed to assume the UB doesn't happen it will also sometimes look like the compiler miscompiled your code elsewhere, but what actually happened was some inlining followed by extrapolating "this can never happen".
UB is often surprising: I have seen unaligned loads crash on x86 due to it bring UB in C (even though x86 is generally fine with it). But once a newer compiler decided that it was fine to vectorise that code (since it clearly aligned) the CPU was no longer happy with it.
>That is, the compiler de facto defines what happens when you compile UB code.
That is not what undefined behavior is though, that is unspecified behavior.
The entire point of undefined behavior is to cover the cases where the compiler can't define the semantics of your program either because doing so is genuinely not possible, or is incredibly onerous to deduce, or would require introducing runtime checks whose performance cost is at odds with C and C++'s predominant use cases.
Both are wrong. It means "this standard does not constrain the behaviour of code that does this".
It's entirely legal for implementations to have predictable behaviour, documented or not, for code that is undefined by the standard. In their quest for maxxing benchmark performance they generally choose not to, but there's really nothing in any standard that stops you from making an implementation that prioritises safety.
Every implementation so far has predictable behavior in all cases. Sometimes the rules for predicting it are very obscure. But it's all fully defined within the compiler's binary code. And none of them link to nasal portals.
How do you propose to predict the behavior of a true race condition with only the binary, faithfully translated by the compiler?
Moreover, this is at best an incredibly pedantic point, not something that changes how programmers need to approach UB. You can't review the source code of a compiler that hasn't been written yet.
I didn't suggest that implementations should entirely eliminate every form of UB. There is plenty of middle ground. For example, you could easily limit the consequences of integer overflow by specifying or partially specifying overflow behaviour, with very little runtime cost.
I'm not suggesting you change how you write code, but with a better implementation the code that you do write - that lives in the real world where mistakes are made - might work better. How is that being pedantic?
An interesting case where compiler writers did something like that is casting via union members, but I'm running out of time, so we can talk about that another day.
Turns out that when you're implementing network applications, the set of things that could happen also depends on what the script kiddie on the other side of the globe feels like this morning.
Some would prefer less excitement than this.
C code should be more predictable and easier to reason about than using a macro assembler. To the extent it is not, the language has failed.
Given that it sits at the heart of the network stack, kernel and device drivers for every major operating system, is in many, many embedded devices in the World around us, and is responsible for making decent chunk of the global economy keep moving, that’s quite a failure case.
Perhaps some professional programmers know how to write secure software in a language with undefined behaviour. Maybe we should think about that more rather than just writing off an obviously huge success as a failure?
You're missing the point. Volatile forces two loads of a value that may have changed in the middle. So the value of "x" may depend on the time/order of load.
Which is, if I understand correctly, the entire point of volatile. Don't use it if you don't want that behavior.
And in fact, in the example given, if there is something (another thread or whatever) that can change the value of x, then you don't know what either number will be. Well, in that circumstance, without volatile, it may print the same number both times, but you still don't know what the number will be (unless the read gets optimized away entirely).
I suspect that many undefined behaviors reflect the inability of the standard committee to come to a consensus on the nuances involved. “Punt to the implementers” is a way to allow every tool vendor to select their own expected behavior in those cases.
You seem to be operating under the assumption "undefined behavior" means "the compiler authors can decide what to do." That's not what it means. It means "any program that causes this behavior to be triggered is not a valid C program, the programmer knows this and did not submit an invalid program, and the programmer explicitly prevented this from happening elsewhere in ways automated analysis cannot detect. Proceed with compilation knowing this branch is impossible."
The spelling for compiler authors getting to choose a behavior is "implementation defined", as the other comment mentions.
It means the C standard does not specify what the program does. Other documents may still specify what the program does. And the program definitely still does something, whether specified or not.
Why is that missing the point? Loading it twice, possibly with different values, is the intended behavior. It's only undefined because the C spec doesn't specify the order of the loads (unlike most other languages which have a perfectly well-defined order for side effects in a single expression).
No I'm just repeating what the original comment said, which is that it's explicitly UB:
"5.1.2.4.1 says any volatile access - including just reading it - is a side effect. 6.5.1.2 says that unsequenced side effects on the same scalar object (in this case, x) are UB. 6.5.3.3.8 tells us that the evaluations of function arguments are indeterminately sequenced w.r.t. each other."
If function arguments were sequenced with respect to each other, it wouldn't be a problem.
But actually, maybe the original comment is wrong. Presumably "indeterminately sequenced" and "unsequenced" mean different things, although I don't have a copy of the standard at hand to check.
Why? Just for this edge case? It could be faster and/or allow smaller code size to allow this to be undefined.
Undefined is also different from "depends on the compiler", because which behavior is chosen can even depend on the circumstances, whatever code appears before and/or after it.
That said, UB in code, such as this example of ordering of reads of volatile parameters being undefined, does not automatically mean that code that uses it is bad. It may very well be that the function being called doesn't misbehave either way.
That’s the point of the whole article. It’s not worth the speed gain to have a language that nobody can safely use because you can’t really prevent UB when you write it.
> It may very well be that the function being called doesn't misbehave either way.
The function being good or bad has nothing to do with the UB. The UB occurs before the function is called.
There is no uninitialized variable, I explicitly initialized it to 5.
And yes indeed, C could do what Rust does and define the order of evaluation for function arguments.
If the argument expressions are indeed side-effect-less, the compiler can always make use of the "as-if" rule and legally reorder the computation anyway, for example to alleviate register pressure.
> Or at least, no human since the invention of C in 1972 has.
No human without proper tools maybe, but what about seL4? It goes beyond proving the code is UB-free and actually formally verifies the code works as intended. And the code is written in C. (the proofs of course aren't)
The proof is interesting because it goes beyond just proving the C code is correct. For some platforms, they compile the code with an ordinary compiler, and verify that the machine code does what the C code is supposed to do. (that's because just writing correct C code doesn't help you if you trigger a compiler bug)
This works even if the compiler (in this case, GCC) isn't verified - they verify a specific output of the compiler, not that the compiler always generates machine code correctly.
> The point of the post is to say it's not possible to avoid them. Or at least, no human since the invention of C in 1972 has.
What are you talking about? UB was coined only in the first C standard, in 1989. Prior to that there was no "If you do this, anything can happen". It was "If you do this, that will happen".
> UB was coined only in the first C standard, in 1989
Pre 1989, when C did not have a standard, was the behavior unspecified or undefined? That is, of course, a trick question. Because in this context the very definitions of the words come from the standard itself.
Before a language gets a specification, is the de facto specification the five words "you know what I mean"?
The very definition of "UB" in C later became "[…] this document imposes no requirements". Is that not the same thing as "there is to specification (yet)"?
It sounds very zen, but "a non existing specification imposes no requirements".
But I don't think it's meaningful to argue the semantic difference before the (in-context) existence of the words "undefined" vs "unspecified".
> Prior to that there was no "If you do this, anything can happen".
Of course it was. You relied on "common sense".
> It was "If you do this, that will happen".
Haha, of course it wasn't. Before a specification there is neither a definition of "this" nor "that".
Unless you mean ye olde "the compiler implementation is the specification". In which case we'll get dragged into "what even is a language" and "what is the sound of one hand clapping?".
Or, alternatively, it's as true then as it is today. If you go by "GCC x.y.z on platform Z kernel Y, (etc…) is the specification" then there is no UB.
> UB was coined only in the first C standard, in 1989. Prior to that there was no "If you do this, anything can happen".
I.e., the context is, before UB existed as a concept, how would these things be categorized. And I was trying to offer the correction that, before UB existed, it wasn't "all behavior is defined" but rather many behaviors depend on your particular local environment. While that may technically be implementation defined, the current standard requires that implementation defined be documented, and UB-like edge cases were most definitely not documented anywhere consistently in the old days!
No, that's actually UB. The important bit here is "compiler defined" -- UB means the compiler is allowed to assume it never happens while compiling.
Consider, for example, an implementation defined function f() -- which can also diverge/crash horribly, etc.
If I write
if p {
print("p is true")
} else {
g()
}
if p {
f()
}
Then either we:
- print p is true and execute f
- do nothing
This is true regardless of if f immediately crashes the computer, nasal demons, whatever -- that's implementation defined.
UB means f may never happen.
And that means the compiler may optimize this to just:
g()
Notice the difference here -- the print never happens!, and g always happens.
You can see why this is concerning when you write code like
if dry_run {
print("would run rm -rf /")
} else {
run("rm -rf /")
}
if dry_run {
// oops: some_debug_string is NULL and will segfault!
print(some_debug_string);
}
I see what you're going for, but I don't see how your example is UB. If `p` is a pointer, and, after your `if (p)` check, `p` is dereferenced unconditionally, then yes, your check for `p == NULL` could be removed, and the code under the `if` would be removed as well. But the example you've constructed is not UB.
If doesn't matter what 'p' is in their example. The point is: if 'f' is undefined behavior (rather than just impl-defined), then the optimizer concludes that the "if p { f() }" can never happen... which means that we're allowed to assume that 'if p { ... } else { ... }' (in the first part of the example) will always take the else branch. The compiler will optimize accordingly and just always call g() unconditionally.
> if nobody can do it right, how is it even fair to blame the programmer? My point is that ALL
It's fair to blame the programmer for the choice of programming in a language like this, if it was in fact their choice. As you've so eloquently put, choosing those languages is essentially equivalent to choosing UB, so starting a new project with one of them is 100% blameworthy when the UB is inevitably found.
Not all projects are green field. But sure, new modules can be written in other languages. And C is, as cross-language barriers go, fairly easy to interface with.
> A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned71) for the referenced type, the behavior is undefined.
Like… edge cases? It's parsing a number! We're not talking about I/O on hard vs soft intr NFS mounts, here. There's a right answer.
strlen(), on valid null terminated strings, doesn't come with caveats like "oh we can't measure strings of length 99".
But sure, C is turing complete. It is possible to solve any problem a turing machine can solve.
> understand the target platform and the target compiler’s behavior.
This is neither. This is purely the language.
reply