Lexer (The GNU C Preprocessor Internals) (2025)

Overview

The lexer is contained in the file lex.cc. It is a hand-coded lexer, and not implemented as a state machine. It can understand C, C++ and Objective-C source code, and has been extended to allow reasonably successful preprocessing of assembly language. The lexer does not make an initial pass to strip out trigraphs and escaped newlines, but handles them as they are encountered in a single pass of the input file. It returns preprocessing tokens individually, not a line at a time.

It is mostly transparent to users of the library, since the library’s interface for obtaining the next token, cpp_get_token, takes care of lexing new tokens, handling directives, and expanding macros as necessary. However, the lexer does expose some functionality so that clients of the library can easily spell a given token, such as cpp_spell_token and cpp_token_len. These functions are useful when generating diagnostics, and for emitting the preprocessed output.

Lexing a token

Lexing of an individual token is handled by _cpp_lex_direct and its subroutines. In its current form the code is quite complicated, with read ahead characters and such-like, since it strives to not step back in the character stream in preparation for handling non-ASCII file encodings. The current plan is to convert any such files to UTF-8 before processing them. This complexity is therefore unnecessary and will be removed, so I’ll not discuss it further here.

The job of _cpp_lex_direct is simply to lex a token. It is not responsible for issues like directive handling, returning lookahead tokens directly, multiple-include optimization, or conditional block skipping. It necessarily has a minor rôle to play in memory management of lexed lines. I discuss these issues in a separate section (see Lexing a line).

The lexer places the token it lexes into storage pointed to by the variable cur_token, and then increments it. This variable is important for correct diagnostic positioning. Unless a specific line and column are passed to the diagnostic routines, they will examine the line and col values of the token just before the location that cur_token points to, and use that location to report the diagnostic.

The lexer does not consider whitespace to be a token in its own right. If whitespace (other than a new line) precedes a token, it sets the PREV_WHITE bit in the token’s flags. Each token has its line and col variables set to the line and column of the first character of the token. This line number is the line number in the translation unit, and can be converted to a source (file, line) pair using the line map code.

The first token on a logical, i.e. unescaped, line has the flag BOL set for beginning-of-line. This flag is intended for internal use, both to distinguish a ‘#’ that begins a directive from one that doesn’t, and to generate a call-back to clients that want to be notified about the start of every non-directive line with tokens on it. Clients cannot reliably determine this for themselves: the first token might be a macro, and the tokens of a macro expansion do not have the BOL flag set. The macro expansion may even be empty, and the next token on the line certainly won’t have the BOL flag set.
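The flag scheme above can be pictured with a small sketch. This is illustrative only: the names mirror the description here, but the real definitions live in cpplib.h and differ in detail.

```c
#include <assert.h>

/* Toy flag bits and token layout mirroring the description above;
   not the real cpplib definitions.  */
#define PREV_WHITE (1 << 0)  /* Token was preceded by whitespace.  */
#define BOL        (1 << 1)  /* First token on a logical line.  */

typedef struct
{
  unsigned int line;   /* Line in the translation unit.  */
  unsigned int col;    /* Column of the token's first character.  */
  unsigned char flags; /* PREV_WHITE, BOL, ...  */
} toy_token;

/* A '#' begins a directive only if it is the first token on its
   logical line, i.e. has the BOL flag set.  */
static int
begins_directive (const toy_token *tok, int is_hash)
{
  return is_hash && (tok->flags & BOL) != 0;
}
```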

New lines are treated specially; exactly how the lexer handles them is context-dependent. The C standard mandates that directives are terminated by the first unescaped newline character, even if it appears in the middle of a macro expansion. Therefore, if the state variable in_directive is set, the lexer returns a CPP_EOF token, which is normally used to indicate end-of-file, to indicate end-of-directive. In a directive a CPP_EOF token never means end-of-file. Conveniently, if the caller was collect_args, it already handles CPP_EOF as if it were end-of-file, and reports an error about an unterminated macro argument list.

The C standard also specifies that a new line in the middle of the arguments to a macro is treated as whitespace. This white space is important in case the macro argument is stringized. The state variable parsing_args is nonzero when the preprocessor is collecting the arguments to a macro call. It is set to 1 when looking for the opening parenthesis to a function-like macro, and 2 when collecting the actual arguments up to the closing parenthesis, since these two cases need to be distinguished sometimes. One such time is here: the lexer sets the PREV_WHITE flag of a token if it meets a new line when parsing_args is set to 2. It doesn’t set it if it meets a new line when parsing_args is 1, since then code like

#define foo() bar
foo
baz

would be output with an erroneous space before ‘baz’:

foo
 baz

This is a good example of the subtlety of getting token spacing correct in the preprocessor; there are plenty of tests in the testsuite for corner cases like this.

The lexer is written to treat each of ‘\r’, ‘\n’, ‘\r\n’ and ‘\n\r’ as a single new line indicator. This allows it to transparently preprocess MS-DOS, Macintosh and Unix files without their needing to pass through a special filter beforehand.
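The single-newline-indicator rule can be sketched as a small helper. This is a hypothetical function in the spirit of handle_newline described below, not the actual cpplib code: given a pointer to a ‘\r’ or ‘\n’, it consumes the full one- or two-character sequence and returns a pointer just past it.

```c
#include <assert.h>

/* Consume one newline indicator: '\r', '\n', "\r\n" or "\n\r".
   P must point at a '\r' or '\n'.  Returns a pointer past the
   indicator.  Toy sketch, not cpplib's handle_newline.  */
static const char *
toy_handle_newline (const char *p)
{
  char first = *p++;
  /* A following newline character of the *other* kind belongs to the
     same indicator; two of the same kind are two separate newlines.  */
  if ((*p == '\r' || *p == '\n') && *p != first)
    p++;
  return p;
}
```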

We also decided to treat a backslash, either ‘\’ or the trigraph ‘??/’, separated from one of the above newline indicators by non-comment whitespace only, as intending to escape the newline. It tends to be a typing mistake, and cannot reasonably be mistaken for anything else in any of the C-family grammars. Since handling it this way is not strictly conforming to the ISO standard, the library issues a warning wherever it encounters it.

Handling newlines like this is made simpler by doing it in one place only. The function handle_newline takes care of all newline characters, and skip_escaped_newlines takes care of arbitrarily long sequences of escaped newlines, deferring to handle_newline to handle the newlines themselves.

The most painful aspect of lexing ISO-standard C and C++ is handling trigraphs and backslash-escaped newlines. Trigraphs are processed before any interpretation of the meaning of a character is made, and unfortunately there is a trigraph representation for a backslash, so it is possible for the trigraph ‘??/’ to introduce an escaped newline.

Escaped newlines are tedious because theoretically they can occur anywhere—between the ‘+’ and ‘=’ of the ‘+=’ token, within the characters of an identifier, and even between the ‘*’ and ‘/’ that terminates a comment. Moreover, you cannot be sure there is just one—there might be an arbitrarily long sequence of them.

So, for example, the routine that lexes a number, parse_number, cannot assume that it can scan forwards until the first non-number character and be done with it, because this could be the ‘\’ introducing an escaped newline, or the ‘?’ introducing the trigraph sequence that represents the ‘\’ of an escaped newline. If it encounters a ‘?’ or ‘\’, it calls skip_escaped_newlines to skip over any potential escaped newlines before checking whether the number has been finished.
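The scanning pattern just described can be sketched in miniature. These are toy functions, not the real parse_number and skip_escaped_newlines; for brevity the sketch handles only the plain ‘\’ form of an escaped newline, not the ‘??/’ trigraph spelling.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Skip a run of zero or more escaped newlines ("\\\n" only here)
   starting at P.  Toy sketch of skip_escaped_newlines.  */
static const char *
toy_skip_escaped_newlines (const char *p)
{
  while (p[0] == '\\' && p[1] == '\n')
    p += 2;
  return p;
}

/* Copy the digits of a number starting at P into BUF, looking through
   escaped newlines; returns a pointer just past the number.  */
static const char *
toy_parse_number (const char *p, char *buf)
{
  for (;;)
    {
      if (*p == '\\')
        {
          const char *q = toy_skip_escaped_newlines (p);
          if (q == p)
            break;              /* A lone backslash ends the number.  */
          p = q;                /* Escaped newline(s) skipped; retry.  */
          continue;
        }
      if (!isdigit ((unsigned char) *p))
        break;
      *buf++ = *p++;
    }
  *buf = '\0';
  return p;
}
```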

Similarly code in the main body of _cpp_lex_direct cannot simply check for a ‘=’ after a ‘+’ character to determine whether it has a ‘+=’ token; it needs to be prepared for an escaped newline of some sort. Such cases use the function get_effective_char, which returns the first character after any intervening escaped newlines.
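The idea behind get_effective_char fits in a few lines. This is a toy sketch of that lookahead under the same simplification as before (plain ‘\’ escapes only, no trigraphs), not the actual function.

```c
#include <assert.h>

/* Return the first character at or after P that is not hidden behind
   escaped newlines ("\\\n" only here).  Toy sketch in the spirit of
   get_effective_char.  */
static char
toy_get_effective_char (const char *p)
{
  while (p[0] == '\\' && p[1] == '\n')
    p += 2;
  return *p;
}
```

With this, checking for ‘+=’ after a ‘+’ becomes a single call rather than a hand-written loop at every such site.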

The lexer needs to keep track of the correct column position, including counting tabs as specified by the -ftabstop= option. This should be done even within C-style comments; they can appear in the middle of a line, and we want to report diagnostics in the correct position for text appearing after the end of the comment.
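Tab-aware column accounting amounts to rounding up to the next tab stop. The helper below is a hypothetical illustration of that arithmetic, not cpplib’s actual code; columns are 1-based, and the tab stop corresponds to the value given to -ftabstop=.

```c
#include <assert.h>

/* Advance a 1-based column COL past character C.  A tab moves to the
   next multiple-of-TABSTOP boundary; anything else advances by one.
   Toy sketch of the -ftabstop= accounting described above.  */
static unsigned int
toy_advance_column (unsigned int col, char c, unsigned int tabstop)
{
  if (c == '\t')
    return ((col - 1) / tabstop + 1) * tabstop + 1;
  return col + 1;
}
```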

Some identifiers, such as __VA_ARGS__ and poisoned identifiers, may be invalid and require a diagnostic. However, if they appear in a macro expansion we don’t want to complain with each use of the macro. It is therefore best to catch them during the lexing stage, in parse_identifier. In both cases, whether a diagnostic is needed or not is dependent upon the lexer’s state. For example, we don’t want to issue a diagnostic for re-poisoning a poisoned identifier, or for using __VA_ARGS__ in the expansion of a variable-argument macro. Therefore parse_identifier makes use of state flags to determine whether a diagnostic is appropriate. Since we change state on a per-token basis, and don’t lex whole lines at a time, this is not a problem.

Another place where state flags are used to change behavior is whilst lexing header names. Normally, a ‘<’ would be lexed as a single token. After a #include directive, though, it should be lexed as a single token as far as the nearest ‘>’ character. Note that we don’t allow the terminators of header names to be escaped; the first ‘"’ or ‘>’ terminates the header name.
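The no-escapes rule for header names makes them trivial to lex: scan for the first terminator, full stop. The function below is a toy sketch of the ‘<…>’ case, not cpplib’s lexer.

```c
#include <assert.h>
#include <string.h>

/* Lex a <...> header name starting at P.  The first '>' terminates
   the name; a backslash has no escaping power here.  Copies the name
   between the brackets into BUF and returns 0, or returns -1 if P
   does not start a well-formed header name.  Toy sketch only.  */
static int
toy_lex_header_name (const char *p, char *buf, size_t bufsz)
{
  const char *end;
  if (*p != '<')
    return -1;
  end = strchr (p + 1, '>');    /* First '>' wins, no escapes.  */
  if (end == NULL || (size_t) (end - p) > bufsz)
    return -1;
  memcpy (buf, p + 1, end - p - 1);
  buf[end - p - 1] = '\0';
  return 0;
}
```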

Interpretation of some character sequences depends upon whether we are lexing C, C++ or Objective-C, and on the revision of the standard in force. For example, ‘::’ is a single token in C++, but in C it is two separate ‘:’ tokens and almost certainly a syntax error. Such cases are handled by _cpp_lex_direct based upon command-line flags stored in the cpp_options structure.

Once a token has been lexed, it leads an independent existence. The spelling of numbers, identifiers and strings is copied to permanent storage from the original input buffer, so a token remains valid and correct even if its source buffer is freed with _cpp_pop_buffer. The storage holding the spellings of such tokens remains until the client program calls cpp_destroy, probably at the end of the translation unit.

Lexing a line

When the preprocessor was changed to return pointers to tokens, one feature I wanted was some sort of guarantee regarding how long a returned pointer remains valid. This is important to the stand-alone preprocessor, the future direction of the C family front ends, and even to cpplib itself internally.

Occasionally the preprocessor wants to be able to peek ahead in the token stream. For example, after the name of a function-like macro, it wants to check the next token to see if it is an opening parenthesis. Another example is that, after reading the first few tokens of a #pragma directive and not recognizing it as a registered pragma, it wants to backtrack and allow the user-defined handler for unknown pragmas to access the full #pragma token stream. The stand-alone preprocessor wants to be able to test the current token with the previous one to see if a space needs to be inserted to preserve their separate tokenization upon re-lexing (paste avoidance), so it needs to be sure the pointer to the previous token is still valid. The recursive-descent C++ parser wants to be able to perform tentative parsing arbitrarily far ahead in the token stream, and then to be able to jump back to a prior position in that stream if necessary.

The rule I chose, which is fairly natural, is to arrange that the preprocessor lex all tokens on a line consecutively into a token buffer, which I call a token run, and when meeting an unescaped new line (newlines within comments do not count either), to start lexing back at the beginning of the run. Note that we do not lex a line of tokens at once; if we did that parse_identifier would not have state flags available to warn about invalid identifiers (see Invalid identifiers).

In other words, accessing tokens that appeared earlier in the current line is valid, but since each logical line overwrites the tokens of the previous line, tokens from prior lines are unavailable. In particular, since a directive only occupies a single logical line, this means that the directive handlers like the #pragma handler can jump around in the directive’s tokens if necessary.

Two issues remain: what about tokens that arise from macro expansions,and what happens when we have a long line that overflows the token run?

Since we promise clients that we preserve the validity of pointers that we have already returned for tokens that appeared earlier in the line, we cannot reallocate the run. Instead, on overflow it is expanded by chaining a new token run on to the end of the existing one.
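The chain-on-overflow idea can be shown with toy structures. These are hypothetical types, not cpplib’s actual token run implementation; the point is that because a full run is never reallocated, pointers into earlier runs stay valid as the line grows.

```c
#include <assert.h>
#include <stdlib.h>

#define RUN_SIZE 4  /* Tiny capacity so overflow is easy to see.  */

/* A fixed-size buffer of tokens, chained to the next run.  Tokens are
   ints here purely for illustration.  */
typedef struct toy_run
{
  int tokens[RUN_SIZE];
  unsigned int used;
  struct toy_run *next;
} toy_run;

/* Append VALUE to the chain headed by RUN, chaining a fresh run on
   overflow rather than reallocating; returns a pointer to the stored
   token, which remains valid for the life of the chain.  */
static int *
toy_push_token (toy_run *run, int value)
{
  while (run->used == RUN_SIZE)
    {
      if (run->next == NULL)
        run->next = calloc (1, sizeof (toy_run));
      run = run->next;
    }
  run->tokens[run->used] = value;
  return &run->tokens[run->used++];
}
```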

The tokens forming a macro’s replacement list are collected by the #define handler, and placed in storage that is only freed by cpp_destroy. So if a macro is expanded in the line of tokens, the pointers to the tokens of its expansion that are returned will always remain valid. However, macros are a little trickier than that, since they give rise to three sources of fresh tokens. They are the built-in macros like __LINE__, and the ‘#’ and ‘##’ operators for stringizing and token pasting. I handled this by allocating space for these tokens from the lexer’s token run chain. This means they automatically receive the same lifetime guarantees as lexed tokens, and we don’t need to concern ourselves with freeing them.

Lexing into a line of tokens solves some of the token memory management issues, but not all. The opening parenthesis after a function-like macro name might lie on a different line, and the front ends definitely want the ability to look ahead past the end of the current line. So cpplib only moves back to the start of the token run at the end of a line if the variable keep_tokens is zero. Line-buffering is quite natural for the preprocessor, and as a result the only time cpplib needs to increment this variable is whilst looking for the opening parenthesis to, and reading the arguments of, a function-like macro. In the near future cpplib will export an interface to increment and decrement this variable, so that clients can share full control over the lifetime of token pointers too.

The routine _cpp_lex_token handles moving to new token runs, calling _cpp_lex_direct to lex new tokens, or returning previously-lexed tokens if we stepped back in the token stream. It also checks each token for the BOL flag, which might indicate a directive that needs to be handled, or require a start-of-line call-back to be made. _cpp_lex_token also handles skipping over tokens in failed conditional blocks, and invalidates the control macro of the multiple-include optimization if a token was successfully lexed outside a directive. In other words, its callers do not need to concern themselves with such issues.
