If you ever write a parser for C, you would know the most annoying part is the ambiguities. A well-designed language should allow users to map their high-level abstract intents to some concrete language constructs and express them using the least amount of typing possible. Therefore, these constructs must be as expressive and flexible as possible, and their grammar must be simple and context-free. Ambiguities are the opposite of this, and sadly for C, it is full of them, from declarations to expressions. Theoretically, most of these can be avoided by checking whether an identifier is a type. In other words, you must build a type table while parsing C. I hate this idea for multiple reasons, which I won’t get into now (more on that later). I want to write down a list of ambiguities I’ve encountered, and how I (hackily) solve them.
1. Pointer declarations are ambiguous
The infamous pointer declaration syntax:
foo * bar;
Is the above statement a declaration or a multiplicative expression? You can’t know for sure without checking the type table.
2. Declarations with parentheses are ambiguous
How can you differentiate between a declaration using parentheses and a function call?
foo(bar);
This could be a declaration of the variable bar with the type foo (meaningless parentheses), or it could be calling the function foo with bar.
Things get messier when you combine this with the pointer declaration syntax to declare function pointers.
foo(*bar)()
;
This could mean that bar is a pointer to a function that returns foo, or it could mean that foo is a function that takes in bar and returns a function pointer. A more complicated example would be `foo(*bar(baz))[5]
`. This can either be a function declaration or an expression with two function calls.
3. Anonymous argument declarations are ambiguous
Most people probably know this, but C allows functions to have unnamed arguments:
int MyFunc(int a, float b, char c); // valid
int MyFunc(int, float, char); // valid
Furthermore, when your arguments are pointers to functions, C has a special syntax where you don’t need to put parentheses around the argument’s name:
int MyFunc(int (*func)(int)); // valid
int MyFunc(int func(int)); // valid
Combining these two features will lead to cases like:
int MyFunc(foo (bar));
Does this argument have the name bar and a type foo, or is it an anonymous pointer to a function that takes in bar and returns foo? It depends on whether bar is a type or not.
4. The cast operator is ambiguous
We’ve been discussing declarations for a while now, so let’s move on to expressions. The cast syntax in C is just a pair of prefix parentheses, and we all know how “unambiguous” parentheses in C are.
(foo)(bar); // Is this a cast or a function call?
(foo)*bar; // Is this a cast of a dereferenced pointer or a simple multiplication?
Because parentheses are also valid inside declarations, there are 3 ways to interpret:
foo (*(bar)(baz))[3]; // I'll leave it to the reader as an exercise
5. Polyfixic operators can be ambiguous
This is an interesting observation that I realized while writing an expression parser. C doesn’t have this problem, but I want to share with you anyway. This is about the fixity of operators – things like prefix, infix, postfix, etc. Imagine you have an operator # that can be prefix, infix, or postfix1—what is the order of evaluation of:
a # # b; // Is it `(a#) # b` or `a # (#b)`?—You don’t know.
The same applies to any combination of operators that covers all fixities. If operator # can only be prefix or infix, and $ can be both infix or postfix, then `a $ # b
` is ambiguous.
Another problem with having infix-postfix operators is lookahead. If you want to write an LL(1) parser, you can’t know whether or not the current token is infix or postfix—you must read more tokens to know. This is why C and most other programming languages don’t have an infix-postfix operator.
Dealing with ambiguity in C
In his LL and LR in Context: Why Parsing Tools Are Hard article, Josh Haberman gives us 3 ways to deal with this:
Operator precedence: prefer operators with higher precedence (as defined by the language), regardless of fixity. If you care, you can ensure at least N prefixes before switching to infix or always matching the longest postfix possible, for example.
Prioritized choice: always prefer one alternative over another. An example of this is C++’s most vexing parse. Another example would be the dangling else problem.
Semantic predicates: Turing-complete code, like building a type table. Technically speaking, I consider the previous two simpler versions of this, but I want to emphasize their simplicity. You obviously can go the opposite way and have a more extreme version of this if your language is badly designed.
The third option is out because I don’t want to build a type table. Since C has no polyfixity problem, let’s ignore option 1. All that’s left for us is prioritizing choices. I want to stress that we aren’t actually “fixing” the problem. The only correct choice is building a type table, which we’ve already said we don’t want to. All that we can do now is make a couple of heuristic guesses that hopefully match most of our use cases.
In case of `foo * bar
`, I always parse it as a pointer declaration if it is at the beginning of a statement because I never type a meaningless multiplication. If it’s inside an expression (a = foo * bar
), then it’s a multiplication because you can’t have declarations inside expressions. The same goes for the parentheses example. If it’s at the beginning of a statement and there’s nothing between foo and “(“ (most of my more complicated declarations usually have a “*” in between), then I parse it as a function call because it's more common.
foo (*bar[2]); // This is parsed as a function call
foo * (*bar[2]); // This is parsed as a declaration (and not an expression because of the first rule)
Regarding anonymous function pointer arguments and the cast operator, I never use the former, and I’m still figuring out how to deal with the latter. I plan to have some simple code that checks for foo in `(foo)(bar)
`. If foo is a single identifier, I will parse it as a type cast because I never write `(MyFunc)(arg)
`. If foo is a sub-expression or contains multiple identifiers, it will get slightly more complicated because I could imagine writing `(cond ? func1 : func2)(arg)
`. In that case, maybe checking whether the expression inside the parentheses is a type (or arithmetic) expression is not that hard.
The problem with C’s grammar
Most of C’s ambiguities stem from the declaration syntax and the cast operator. I believe most of these design choices made sense when it’s first being developed. I’ve heard before that Brian and Dennis, in interviews and the K&R book, have explained why some C syntaxes are designed the way they are: declaration syntax, array syntax, cast syntax, operator precedence, etc. I’ve never looked up those explanations, so it would be great if a reader could fill me in on why a specific syntax/feature was designed that way, and what kind of parser were they writing that was more suitable with that? But with more than half a decade later, what should a better design look like?
Cast syntax is quite simple, so I will tackle it first. You have 4 options:
(foo)bar; // C-style
foo(bar); // My personal favorite
cast(foo)bar; // You can replace `cast` with any keywords you like
cast(foo, bar);
The first has multiple problems, as we have seen. The second is my personal preference because it looks like a function call, which is what I desired. The only downside is that it doesn’t play well with pointer casting, e.g., `foo*(bar)
`. If anyone knows how to fix this, please let me know. So your only remaining choices are the last two, and they are quite similar. The latter requires an extra token “,” but also clearer if bar is a subexpression. People may be confused about the order of evaluation of `cast(foo)bar + 10
`, but not `cast(foo, bar+10)
`.
On the Ambiguity of the Syntax of Declarations
Ginger Bill2, the creator of Odin, has written about this before, where he categorized it as 3 different styles:
Qualifier-focused (Pascal family) has specific keywords/qualifiers/specifiers for declarations
var foo = 123; const bar = “string”;
Type-focused (C family) puts the type at the beginning of the declaration
int x = 123; char const* bar = “string“;
Name-focused3 (Odin/JAI/etc) puts the name at the beginning of the declaration
x := 123; bar :: “string”;
Some languages use a hybrid approach. Python uses qualifier-focused for functions/classes/imports and name-focused for variables. On the other hand, C uses an incomplete type-focused syntax for its pointers and arrays: the * and [] symbols are actually bound to the name and not the type. These get fixed in most modern C-like languages.
int *a, **b, c; // int *a; int **b; int c;
int* arr[10]; // declare arr as array 10 of pointer to int
Each style has different pros and cons, which I won’t get into now. What I want to focus4 on is which style leads to more ambiguities down the line. At first glance, you may be thinking that type-focused has more ambiguities (e.g, `foo * bar
`), but that isn’t true at all. `bar: foo*
` is easily disambiguous because of the “:” symbol. If you remove it and swap the “*” position, then `bar * foo
` is still ambiguous. Vice versa, none of the following are ambiguous at all:
bar foo* (name-focused)
*foo bar (type-focused)
bar:*foo (name-focused)
foo*:bar (type-focused)
So the actual thing that makes a language construct (declaration) distinct and different from another language construct (expression) is to have some kind of special markers for it, either special keywords (qualifier-focused) or symbols (name-focused).
Why I hate building a type table
First of all, because it’s more work. Just design your language properly, don’t shift all the work onto users. This makes it harder for you to read/understand the code, for compiler vendors to make/maintain (fast) compilers, and for people to make tools (editors, static analyzers, debuggers) for the language.
Secondly, a type table doesn’t make sense when you’re writing a parser for an editor to have syntax highlighting and code indexing. This is also one of the reasons I hate LSP5. The idea of using a compiler frontend (clangd) to parse source files in a code editor makes completely no sense to me because what the compiler sees is irrelevant to what you, as the user, need to see, which makes the two types of parser fundamentally different6.
The point of a compiler is to work with syntactically/semantically valid files, while the point of an editor is to constantly/simultaneously invalidate those files. Every key press, every open/close parenthesis, every character will be perceived as invalid by the compiler. Because of this, these parsers don’t work well with half-written code, and all they can do is report an error and exit early, without any kind of fallback. A code editor can’t work this way. It can’t pause, but must always continue, trying to parse and link together different sections of the file as much as possible. It can’t just crash if the user types something invalid or recursively references other constructs. It can’t reparse the same file multiple times, each with different input to the preprocessor, like a compiler, because it must only display each file once and independently of each other. If I #if
my code out, I still want syntax highlighting to work inside it. If I change the include order of compilation or some compiler settings, the editor shouldn’t show something different. If I have multiple constructs with the same name, because I’m modifying the code, the editor must understand and display all of them properly.
Because of this, a type table isn’t that useful for an editor. You want the editor to know if this particular piece of text is a declaration or not, without knowing what its type is, because the name is more important to the compiler than the type is. More generally, what you actually want is a hash table of names, where the editor can look up information regarding those names and decide how it should color/navigate them. You never want a complete standard spec-compliant language parser. That is way overkill, and guarantees a lot of wasted/inefficient work for a slower and worse result. This is also why having a separate preprocessor on top of the base language with its own syntax and semantics is a bad decision. It makes your language’s grammar less context-free and less unifying, which in turn makes it harder for the editor to retrieve these names.
Thanks for reading! If you enjoyed it, consider subscribing for more.
-Trần Thành Long
I call these operators “polyfixic”, inspired by “polymorphic”.
His real name is Bill Hall, for anyone who wondered, like I did.
See what I did there? Don't want to brag, but my mom thinks I'm very funny, which is why she lets me live in her basement.
Using a compiler frontend is not exclusive to LSPs. Other editors and tools utilize similar mechanisms without involving any server protocol. The definition of LSP also doesn’t mandate the use of a compiler frontend, and even without it, LSPs still face other significant issues. However, in practice, most LSPs follow this trend, and people often conflate these two concepts. In my opinion, LSP represents the worst of both worlds, as it not only suffers from some fundamental problems but also faces additional challenges due to the misuse of compiler frontends in these misplaced contexts.
Editor parsing and compiler parsing are indeed different. However, there are some similar stages where deduplication makes sense. Unfortunately, this isn’t possible in our current ecosystem, due to all popular compilers being shipped and treated as black boxes. This is also one aspect of the current ecosystem, which I mentioned in the previous post, that should be reevaluated.
In this language, are you askin' how to maintain separate namespaces in symbol table, such that you can have foo(bar) and int(bar) be distinct, or do you want a parser lookahead type thing for it, or do you intend to have the function call syntax also be something different like the different casting is (assuming there's traditional things like function calls in this language) p.s. whats the dwelling name of the language in case we refer to it? im workin' on a custom shell called 'blush'
Your not giving me a lot of room to respond... you are saying on one hand, i am making a mistake by offering opinions on this compare and contrast of what you yourself have called a hypothetical language.. im responding to your hypothetical... if youd prefer, i could just not respond and unsubscribe. im giving you thoughtful things, thuoughtful ideas, as well as pointing you in the direction of your question, and you keep saying "what makes you think im making a language" or "i think your confused", ive read everything youve said, and ive given you intelligible comments to read. Your hypothetical situation meant i wanted more information about the hypothetical landscape that you yourself have called a hypothetical language. If you didnt want any one to respond to it, such that NOTHING out of confusion could arise, you should have made an exclaimer, "DONT RESPOND", because thats the impression you are giving me, staggering me between "i think you are confused" and "im not talking about a language". Well, you are talking about a language, and im giving you my input on that language. Why do you want ppl to respond to you, if you dont think any of it is relevant? Thank you for wasting my time.