@* Introduction to the grammar. @:title@>
The most intricate part of \.{CWEAVE} is its mechanism for converting \Cee-like code into \TeX\ code, and we shall consider this aspect of the program now. This parsing mechanism must be able to deal with fragmentary constructions whose overall ``part of speech'' is not known (e.g., those within~`\pb'), which implies that recognition must in principle be able to proceed without any context information. Therefore a ``bottom-up'' approach is used, which collects successive tokens, and only takes action once it has seen enough to recognise a complete match with a syntax rule. Bottom-up parsing is a very powerful technique (it can handle grammars that are not tractable by many other parsing methods); it is one of the many significant contributions to computer science that the designer of the original \.{WEB} system, D.~E. Knuth, has made. Even so, the technique used here is less powerful than that of traditional bottom-up parsers produced by parser generators, since those can use information derived from the full left-context (and a small amount of right-context) to help decide difficult situations. In practice such context is not often required to unambiguously parse program fragments, and most situations where there is a possibility of incorrect recognition can be avoided by careful specification of the syntax rules to be used; this does however make formulation of the grammar a slightly subtle matter.

As we have already seen, the input is represented as a sequence of {\sl scraps}, each of which specifies a {\sl category\/} and a {\sl translation}. The category defines a syntactic class, and the translation is a token list that represents \TeX\ code; the latter is effectively just a single string of characters, although it is stored in a different form for efficiency reasons. Rules of syntax and semantics tell us how to combine adjacent scraps into larger ones, and if we are lucky an entire \Cee\ text that starts out as hundreds of small scraps will join together into one gigantic scrap whose translation is the desired \TeX\ code. If we are unlucky (i.e., if the input is not properly formed for recognition by our parser), we will be left with several scraps that don't combine; their translations will simply be output, one by one.

The parsing process is governed by a set of reduction rules, each of which specifies how a sequence of consecutive scraps with certain given categories can be combined into a single new scrap, and it also tells what category the new scrap has and how its translation is obtained from those of the original scraps. The set of reduction rules is given by a table, which is actually an initialiser for a static array, starting in section@#rules@>. In many cases a rule is simply given by listing the category codes forming its pattern (its left hand side) and the category code of the resulting scrap (the right hand side). For example, one of the reduction rules could be represented as
$$ \hbox{|expression| |binop| |expression|}\Rightarrow\hbox{|expression|}. $$
For such simple rules the translations of the original scraps are simply concatenated to form the translation of the resulting scrap. The line in the table corresponding to this rule actually looks like this:
$$ \hbox{|{2,{{expression,binop,expression}},{expression, NULL}}|}.
$$
The `2' is the number used to identify the rule, which is useful when tracing the activities of the parser, and the |NULL| signifies that the default translation applies; the braces reflect the structure used to represent syntax rules, which has some additional components that are not specified explicitly here, and that are therefore initialised to~0. For other rules specific additional items are to be inserted in between the translations of the scraps that match the pattern. For instance, here is another rule present in the table:
$$ \hbox{|{ 6,{{expression, comma, expression}},{expression,"__p1_"}}|}. $$
Here the translation is specified by the ``format string'' |"__p1_"|, where the underscores stand for the translations of the three participating scraps, the `\.{p1}' for an |opt| token with argument~`\.1' (eventually producing `\.{\\penalty10}') that comes after the translation of the second scrap, with category |comma|. That translation will almost always be |","|, since there are no rules with |comma| as their right hand side, but if a comment follows the comma, it will have been incorporated into its scrap while preparing for the reduction, and in this case the optional break will follow the comment (and have no effect, because there is already a forced break after comments). The interpretation of all characters that can appear in the format strings can be found in section@#format@>.

To handle all cases that can arise in programs, in combination with a large number of formatting options that might be selected, our mechanism allows rules of a somewhat more general form than those presented above, but the extensions are used only for a few rules. Consider the following rule:
$$ \hbox{|{132,{{lpar,statement,statement},1}, {statement,"_B_"},forced_statements}|}. $$
Here, the `1' indicates that the first scrap from the left in the pattern, |lpar|, serves as context only, and does not take part in the actual reduction. The rule therefore reduces a sequence of two consecutive statements into a single one, inserting a |break_space| (from the `\.B' format character) between their translations, but only if the statements are preceded by a left parenthesis (category~|lpar|); this restriction means that the rule will only apply to the first two semicolon-separated expressions following `|for|~|(|'. The entry |forced_statements| means moreover that the rule will not be used at all unless a `\.{+f}'~or~`\.{+a}' command line option was specified, which forces line breaks between consecutive statements; this rule then avoids such line breaks between the controlling expressions of a |for|-loop, since it takes precedence (see below) over the rule that normally combines statements.

@ The rules are applied from left to right, as follows. Suppose we are currently working on the sequence of scraps with categories $c_1\,c_2\ldots c_n$. We try first to find a rule whose pattern matches an initial substring $c_1\,c_2\ldots\,$, and if no such rule exists, we try to find a rule applicable to the next substring $c_2\,c_3\ldots\,$; if that fails, we try to match $c_3\,c_4\ldots\,$, etc. When several patterns match, starting at the same scrap, the longest one is selected (no two rules can have identical patterns). For instance, there is a rule that reduces `|struct_like| |expression| |lbrace|' to `|struct_head|', and another one that reduces `|struct_like| |expression|' to `|int_like|'; the latter is only applied when no |lbrace| scrap follows. This is a sensible choice, since the longer pattern is the more specific one, and without the longest-match convention it could never apply. One should be aware however that this only works because the scrap with category |lbrace| represents a single token that requires no reduction to create it, for otherwise the two-scrap reduction would be applied before the three-scrap reduction would have a chance to match.

One might say that we use a left-to-right eager strategy for choosing reductions; this strategy is chosen on heuristic grounds, and there is no guarantee that it will find a successful sequence of reductions if one exists. In other words, if we interchange left and right hand sides of the rules and view them as {\sl production\/} rules of a context free grammar, then the language generated by this grammar can be much larger than the one recognised by our parser. For instance, the context free grammar would, from the rules given above, generate expressions with multiple operators where implicit parentheses are placed arbitrarily (and therefore it would be ambiguous); our parsing strategy however will always place implicit parentheses to the left. (We don't care if implicit parentheses are placed incorrectly, because it does not influence typesetting.)
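The effect of this strategy can be tried out in miniature. The following free-standing C program is a sketch only: it is not code from \.{CWEAVE}, all names in it are invented, and it backs up just one position after each reduction (the real parser backs up by a precomputed amount, as will be explained below). It repeatedly applies the leftmost, longest matching rule to an array of category codes, and here ends up with a single |expression| scrap.

  #include <stdio.h>
  #include <string.h>

  enum { expression=1, binop, unop }; /* a toy subset of the categories */

  typedef struct
  { int lhs[3], lhs_len; /* pattern of categories */
    int rhs;             /* category of the resulting scrap */
  } toy_rule;

  static toy_rule rules[] =
  { { {expression,binop,expression}, 3, expression },
    { {unop,expression}, 2, expression }
  };

  /* reduce cats[0..*n-1] repeatedly: leftmost position first,
     and at equal positions the longest pattern wins */
  static void toy_reduce(int* cats, int* n)
  { int pos=0;
    while (pos<*n)
    { int r, matched=-1, len=0;
      for (r=0; r<(int)(sizeof rules/sizeof *rules); ++r)
        if (rules[r].lhs_len<=*n-pos && rules[r].lhs_len>len
         && memcmp(rules[r].lhs,cats+pos,rules[r].lhs_len*sizeof(int))==0)
          matched=r, len=rules[r].lhs_len;
      if (matched<0) ++pos; /* no match here; move right */
      else /* replace the matched scraps by a single one */
      { cats[pos]=rules[matched].rhs;
        memmove(cats+pos+1,cats+pos+len,(size_t)(*n-pos-len)*sizeof(int));
        *n-=len-1;
        if (pos>0) --pos; /* a new match may now start further left */
      }
    }
  }

  int main(void)
  { int cats[]={ unop, expression, binop, expression }, n=4, i;
    toy_reduce(cats,&n);
    for (i=0; i<n; ++i) printf("%d ",cats[i]); /* prints `1': one expression */
    putchar('\n');
    return 0;
  }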
It is an interesting theoretic problem to find an algorithm that will transform a set of reduction rules into a context free grammar generating exactly the language recognised; that would help to verify that the rules will work as intended. We must be cautious in formulating the rules, not only giving rules that are sufficient to reduce all desired programs, but also avoiding rules that could in certain circumstances match with precedence over the ones we intended for the situation.

Some of the formatting of the output is controlled by the \TeX~format in the file \.{cwebxmac.tex}, rather than directly by \.{CWEAVE}. For instance, the `\.{\\penalty10}' mentioned above is actually written as `\.{\\31}'; the macro `\.{\\3}' will in fact raise the penalty for any enclosing pair of braces or parentheses; thus breaks in argument lists or initialiser lists will be avoided in favour of ones outside those lists. There are other effects not visible in the grammar, like optional breaks automatically associated with certain operators. On the whole it is a delicate interplay between the set of reduction rules, the reduction mechanism, the macro definitions, and the (mathematical) typesetting rules of \TeX\ that determines how programs will be typeset. This division of labour is quite convenient (e.g., it would tremendously complicate the task of \.{CWEAVE} if it had to decide which optional breaks should actually be taken), but it has the unfortunate consequence that it is not easy to gain a comprehensive understanding of the translation process. Indeed there have been quite a few surprises during the construction of the system, and we cannot be sure that there are no more to come; yet we believe the situation is more transparent than in the original |WEAVE|, since we have avoided rules that are oriented towards the target language (\TeX) rather than towards the source language (\Cee).

@ Here is a list of the category codes that scraps can have.
The category codes that apply to reserved words (e.g., |while_like|, but also |declaration| for |va_dcl|) as well as |expression| (that is used for |type_defined| identifiers in their typedef declaration) are sufficiently high that they can be distinguished from |ilk| values that are not category codes, like |type_defined|, |TeX_like|, |NULL_like|, |const_like|, and~|typedef_like|. A number of categories can occur in scraps, but do not occur in any of the reduction rules, since they are handled by other means; they have values exceeding |max_rule_cat|. The macro |valid_cat| checks whether |c| is a category that might match in a rule; it uses its argument twice, so its argument should not cause a side effect. If this section is changed, section@#category output@> should be changed correspondingly.

@d max_category end_expr /* largest scrap category */
@d max_rule_cat return_like /* largest category in rules */
@d valid_cat(c) ((c)>0 && (c)<=max_rule_cat)

@< Typedef and enumeration declarations @>=
enum @/ @:categories@>
{ unop = 1, /* a unary operator like `|!|' */
  binop, /* a binary operator like `|<|' */
  unorbinop, /* an operator that can be either, like `|-|' */
  select, /* structure selection: `|.|' or `|->|' */
  question, /* a question mark operator */
  lbrace, rbrace, lpar, rpar, lbrack, rbrack, /* `|{|', \ `|}|', `(', \ `)', `[', \ `]' */
  comma, semi, colon, colcol, magic, /* `,', `;', `:', `$\CC$', \:; */
  subscript, /* an array subscript, like `|[]|' or `|[i++]|' */
  struct_head, /* the beginning of a struct specifier, like `|struct s{|' */
  short_lbrace, short_struct_head, /* from `\.{\{@@;}', for one-liners */
  compound_statement, /* a complete compound statement */
  statement, /* a complete statement, possibly compound */
  function, /* a complete function definition */
  function_head, /* a function identifier followed by formal parameters */
  parameters, /* parameters in function declaration, or casting operator like `|(int)|' */
  label, /* a statement label */
  if_head, /* `|if|' followed by a (parenthesised) expression */
  if_else_head, /* \lq|if @t\dots@>@; else|', \lq|while(@t\dots@>)|' or \lq|switch(@t\dots@>)|' */
  do_head, /* `|do @t\dots@>@; while|' */
  mod_scrap, /* module name */
  declarator, /* abstract declarator, like `|(*)(int,char*[])|' */
  declaration, /* a complete declaration */
  expression, /* an expression, possibly a single identifier */
  while_like, /* `|for|', `|while|', `|switch|' */
  do_like, /* `|do|' */
  if_like, /* `|if|' */
  else_like, /* `|else|' */
  int_like, /* `|int|', `|char|', `|extern|', \dots */
  case_like, /* `|case|', `|default|' */
  sizeof_like, /* `|sizeof|' */
  struct_like, /* `|struct|', `|union|', `|enum|' */
  return_like, /* `|return|', `|break|', `|continue|', `|goto|' */
  lproc, /* `\&\#' and following identifier starting preprocessor directive */
  rproc, /* end of a preprocessor directive */
  insert, /* comment or other syntactically inert item */
  begin_expr, end_expr /* \:[ and \:] */
};

@ As we have already seen, tokens are converted to elementary scraps by the function |C_read|; these scraps form the `terminal symbols' of our grammar. The translation of tokens to scraps is largely governed by the static array |trans_ini|, whose initialisation values we shall now give.
@d yes_math 1 /* should be in math mode */ @d no_math 2 /* should be in horizontal mode */ @d maybe_math 0 /* works in either horizontal or math mode */ @< Initialiser for |trans_ini| @>= { '!', unop, yes_math, "\\R" }, @/ @.\\R@> { '~', unop, yes_math, "\\CM" }, @/ @.\\CM@> { '/', binop, yes_math, "/" }, @/ { '<', binop, yes_math, "<" }, @/ { '>', binop, yes_math, ">" }, @/ { '.', select, yes_math, "." }, @/ { '=', binop, yes_math, "\\K" }, @/ @.\\K@> { '|', binop, yes_math, "\\OR" }, @.\\OR@> { or, binop, yes_math, "\\OR" }, @/ { '^', binop, yes_math, "\\XOR" }, @/ @.\\XOR@> { '%', binop, yes_math, "\\MOD" }, @/ @.\\MOD@> { '+', unorbinop, yes_math, "+" }, @/ { '-', unorbinop, yes_math, "-" }, @/ { '*', unorbinop, yes_math, "*" }, @/ { '&', unorbinop, yes_math, "\\AND" }, @/ @.\\AND@> { '?', question, yes_math, "\\?" }, @/ { '(', lpar, yes_math, "(" }, @/ { ')', rpar, yes_math, ")" }, @/ { '[', lbrack, maybe_math, "[" }, @/ { ']', rbrack, maybe_math, "]" }, @/ { '{', lbrace, yes_math, "\\{" }, @/ { '}', rbrace, yes_math, "\\}" }, @/ { ',', comma, yes_math, "," }, @/ { ';', semi, yes_math, ";" }, @/ { ':', colon, maybe_math, ":" }, @/ { '#', insert, maybe_math, "\\#" }, @/ @.\\\#@> /* this should occur only in macro definitions */ { at_sign_image, insert, maybe_math, "@@" }, /* this should not occur in legal \Cee~text */ @) { not_eq, binop, yes_math, "\\I" }, @/ @.\\I@> { lt_eq, binop, yes_math, "\\Z" }, @/ @.\\Z@> { gt_eq, binop, yes_math, "\\G" }, @/ @.\\G@> { eq_eq, binop, yes_math, "\\E" }, @/ @.\\E@> { and_and, binop, yes_math, "\\W" }, @/ @.\\W@> { or_or, binop, yes_math, "\\V" }, @/ @.\\V@> { plus_plus, unop, yes_math, "\\PP" }, @/ @.\\PP@> { minus_minus, unop, yes_math, "\\MM" }, @/ @.\\MM@> { minus_gt, select, yes_math, "\\MG" }, @/ @.\\MG@> { gt_gt, binop, yes_math, "\\GG" }, @/ @.\\GG@> { lt_lt, binop, yes_math, "\\LL" }, @/ @.\\LL@> { mul_assign, binop, yes_math, "\\KK*" }, @/ @.\\KK@> { div_assign, binop, yes_math, "\\KK/" }, @/ { mod_assign, binop, yes_math, "\\KK\\MOD" }, @/ @.\\MOD@> { plus_assign, binop, yes_math, "\\KK+" }, @/ { minus_assign, binop, yes_math, "\\KK-" }, @/ { left_assign, binop, yes_math, "\\KK\\LL" }, @/ { right_assign, binop, yes_math, "\\KK\\GG" }, @/ { and_assign, binop, yes_math, "\\KK\\AND" }, @/ @.\\AND@> { xor_assign, binop, yes_math, "\\KK\\XOR" }, @/ @.\\XOR@> { or_assign, binop, yes_math, "\\KK\\OR" }, @/ @.\\OR@> { thin_space, insert, yes_math, "\\," }, @/ @.\\,@> { pseudo_semi, magic, maybe_math, "" }, @/ { force_expr_open, begin_expr, maybe_math, "" }, @/ { force_expr_close, end_expr, maybe_math, "" }, @/ { join, insert, no_math, "\\J" }, @/ @.\\J@> { ellipsis, int_like, yes_math, "\\ldots" }, @/ { sh_sh, binop, yes_math, "\\SS" }, @/ @.\\SS@> { colon_colon, colcol, yes_math, "\\CC" } @ Certain tokens that lead to fixed scraps are not included in the |trans_ini| array because their translations involve non-character tokens. Since there are only a few of them the easiest solution is to install each one explicitly into the |token_trans| array. 
@d start_scrap(s,c,m) p=&token_trans[s],p->cat=c, p->mathness=5*(m)
@d end_scrap p->trans=text_ptr, freeze_text();

@< Install the translations of tokens involving line breaks @>=
{ scrap* p;
  start_scrap(math_break,insert,maybe_math); /* \:\v */
  app(opt), app('0'); end_scrap;
@/start_scrap(line_break,insert,no_math); /* \:/ */
  app(force); end_scrap;
@/start_scrap(end_preproc,rproc,no_math); /* end of preprocessor directive */
  app(force); end_scrap;
@/start_scrap(' ',insert,no_math); /* space within preprocessor directive */
  app(break_space); end_scrap;
@/start_scrap(big_line_break,insert,no_math); /* \:) */
  app(big_force); end_scrap;
@/start_scrap(backup_line,insert,no_math); /* \:\\ */
  app(backup); end_scrap;
@/start_scrap(no_line_break,insert,no_math); /* \:+ */
  app(cancel),app(relax),app(break_space),app(relax),app(cancel); end_scrap;
@/start_scrap(include_preproc,insert,yes_math); /* \:p */
  app(force),app_str("\\ATP"),app(force); end_scrap;
@.\\ATP@>
}

@ When \.{CWEAVE} is compiled with the |DEBUG| switch, it can display its parsing steps. The order of strings in |cat_name| must match that in the |enum| declaration in section@#categories@>.

@c
@:category output@>
#ifdef DEBUG
void print_cat (int c) /* symbolic printout of a category */
{ static char* cat_name[]=
  { "unop", "binop", "op", "select"
  , "?", "{", "}", "(", ")", "[", "]", ",", ";", ":", "::", "@@;"
  , "subscr", "struct_head", "short_{", "short_struct_head"
  , "cmp_stmt", "stmt"
  , "function", "function_head", "params", "label"
  , "if_head", "if_else_head", "do_head"
  , "mod_name", "declarator", "decl", "exp", "for", "do", "if", "else"
  , "int", "case", "sizeof", "struct", "return"
  , "#{", "#}", "insert", "@@[", "@@]"
  };
  if (c<=max_category && c>0) printf(cat_name[c-1]);
  else printf ("IMPOSSIBLE");
}
#endif /* |DEBUG| */

@ Another major class of terminal symbols is formed by the reserved words. If a name exists in the hash table with an |ilk| specifying a reserved word, then |id_lookup| will return the reserved word when called with that name, and |C_read| will use the |ilk| to set the category of the resulting scrap. So all that has to be done is to get all the reserved words into the hash table with the right ilks initially. The simplest way to do this is to call |id_lookup| for all reserved words with the proper |ilk| at the beginning of each run of \.{CWEAVE}. Fortunately there are not too many reserved words. The code below uses the fact that instead of using pointers to beginning and end of a string for |id_lookup|, one may also pass a single pointer to a null-terminated string provided the other pointer is null.
@^reserved words@>

@< Store all reserved words @>=
{ int i;
  static char* int_likes[]=
  { "auto","char","double","extern","float","int","long","register"
  , "short","signed","static","unsigned","void" };
  static char* defined_types[] =
  { "FILE", "size_t", "ptrdiff_t", "wchar_t"
  , "jmp_buf", "sig_atomic_t", "fpos_t", "div_t", "ldiv_t"
  , "clock_t","time_t"
  , "va_list" };
  static char* return_likes[]= {"break","continue","goto","return"};
  int int_like_nr=array_size(int_likes),
      defined_type_nr=array_size(defined_types),
      return_like_nr=array_size(return_likes);
  for (i=0; i<int_like_nr; ++i) id_lookup(int_likes[i],NULL,int_like);
  for (i=0; i<defined_type_nr; ++i) id_lookup(defined_types[i],NULL,type_defined);
  for (i=0; i<return_like_nr; ++i) id_lookup(return_likes[i],NULL,return_like);
  if (C_plus_plus) @< Store additional reserved words of \Cpp\ @>
}
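As an aside, the |array_size| macro used above is not shown in this section; presumably it is the usual |sizeof(a)/sizeof(a)[0]| idiom. The following free-standing sketch (not \.{CWEAVE} code; the table and the function |store| are invented stand-ins for the name table and |id_lookup|) illustrates both that idiom and the null-terminated-string convention just described.

  #include <stdio.h>
  #include <string.h>

  #define array_size(a) ((int)(sizeof(a)/sizeof(a)[0])) /* the presumed idiom */

  enum { normal, toy_int_like, toy_return_like }; /* invented ilk codes */

  static struct { const char* name; size_t len; int ilk; } table[64];
  static int entries=0;

  /* stand-in for id_lookup(w,end,ilk): when end==NULL, w is a
     null-terminated string rather than a (begin,end) pair of pointers */
  static void store(const char* w, const char* end, int ilk)
  { table[entries].name=w;
    table[entries].len = end==NULL ? strlen(w) : (size_t)(end-w);
    table[entries++].ilk=ilk;
  }

  int main(void)
  { static const char* int_likes[]= {"auto","char","double","int","long"};
    static const char* return_likes[]= {"break","continue","goto","return"};
    int i;
    for (i=0; i<array_size(int_likes); ++i) store(int_likes[i],NULL,toy_int_like);
    for (i=0; i<array_size(return_likes); ++i) store(return_likes[i],NULL,toy_return_like);
    printf("%d reserved words stored\n",entries); /* prints 9 */
    return 0;
  }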
@ The main difference between \Cee\ and \Cpp, as far as \.{CWEAVE} is concerned, is that the latter has a number of additional reserved words. Most of them are sufficiently like some \Cee~reserved word (or category) that we can simply make them behave like that \Cee~symbol, without changing the syntax. For `\&{new}' and `\&{operator}', some additional syntax rules will be needed however; nevertheless we do not need to extend the set of syntactic categories. For `\&{operator}' we abuse the category |case_like|, since its proper use is rather restricted (`|case|' is always followed by an expression, while `|default|', `\&{private}' and its relatives are always followed by a colon), so there will be no confusion with `\&{operator}', which is always followed by an operator symbol.

@< Store additional reserved words of \Cpp\ @>=
{ id_lookup("asm", NULL, int_like);
  id_lookup("class", NULL, struct_like);
  id_lookup("delete", NULL, sizeof_like);
  id_lookup("friend", NULL, int_like);
  id_lookup("inline", NULL, int_like);
  id_lookup("new", NULL, sizeof_like);
  id_lookup("operator", NULL, case_like);
  id_lookup("private", NULL, case_like);
  id_lookup("protected", NULL, case_like);
  id_lookup("public", NULL, case_like);
  id_lookup("this", NULL, expression);
  id_lookup("virtual", NULL, int_like);
}

@ There are a few more kinds of elementary scraps that the functions we have given before can produce, which we mention here for completeness. Ordinary identifiers get category |expression|, and their names will be expanded on output as argument to a control sequence that provides the proper formatting. For strings, constants, verbatim constructions, and \TeX~strings, the applicable control sequences and the constituent characters (escaped with backslashes where necessary) are written explicitly into token memory; their scraps also have category |expression|. Comments are converted to scraps of category |insert|, and their contents are also stored literally; in case of `\pb' fragments a reference to a text marked with |inner_text_flag| is stored, for the production of which the parsing mechanism has in fact already been invoked. By contrast, module names are stored by reference to the name table, just like identifiers, and their scraps have category |mod_scrap|; in their case the parsing mechanism may be called during the {\sl output\/} process, if any `\pb' constructions occur.

The final \TeX\ output produced for elementary scraps will often be marked with special control sequences. Ordinary multi-character identifiers are enclosed in `\.{\\\\\{}$\,\ldots\,$\.\}' (single character identifiers are merely preceded by a space; they will be set in math italic), identifiers whose |ilk| is |TeX_like| will become control sequences that are also enclosed in `\.{\\\\\{}$\,\ldots\,$\.\}' (to establish italic type), reserved words are enclosed in `\.{\\\&\{}$\,\ldots\,$\.\}', strings and all-caps identifiers in `\.{\\.\{}$\,\ldots\,$\.\}', constants in `\.{\\T\{}$\,\ldots\,$\.\}', and verbatim constructions in `\.{\\vb\{}$\,\ldots\,$\.\}'. Comments are enclosed in `\.{\\C\{}$\,\ldots\,$\.\}' and usually followed by `\.{\\6}' (a forced break), and module names take the form `\.{\\X$n$:}$\,\ldots\,$\.{\\X}' where |n| is the section number (since module names have |mathness==yes_math|, there is no danger that the final `\.{\\X}' will disable a following space when coming from a `\hbox{\.{\v@@< ... @@>\v}}' construction).

@* The parsing mechanism. @:title@>
Conceptually the working sequence of scraps is like a deck of cards, in which we repeatedly replace a sequence of consecutive cards by a single new card. Since such replacements never increase the number of cards, we can use sequential allocation for the current sequence of scraps, and our only difficulty will be how to conveniently fill the holes that might be left after each reduction step.
Now reduction usually takes place near the beginning of the scrap sequence (assuming that the scrap sequence makes syntactical sense) because that is where we are looking first, and we want to avoid shifting down the whole remainder of the scrap sequence each time. Therefore the sequence of scraps, which initially occupies the positions from |scrap_base| to |scrap_ptr|, is allowed to have a hole in its middle, the low and high end of which are pointed to by variables |lo_ptr| and~|hi_ptr|. There is also a variable that points to the place where reductions are currently taking place, which is the parsing pointer~|pp|. It will always point into the area below the hole, and when it approaches the hole so closely that a potential reduction might involve scraps from above, the situation is remedied by sliding down scraps to the lower region, effectively raising the hole. Therefore the scraps in the higher region are those that have never been considered for a reduction yet. Eventually all scraps have been moved down (i.e., we have |hi_ptr==scrap_ptr|), and after that has happened a scrap with category~0 (which is not otherwise used) is copied down to signal the imminent end of the reduction process. When finally no more rules match the scraps in the lower region, the parsing stops. @< Global variables @>= scrap_pointer pp; /* current position for reducing scraps */ scrap_pointer lo_ptr; /* end of sequence of scraps that have been examined */ scrap_pointer hi_ptr; /* first scrap that has not been examined */ @ The |mathness| is an attribute of scraps that says whether their translation is to be processed in a math mode context or not. Since the translation can be concatenated from a large number of other scraps, there can be switches in and out of math inside the translation, and we need to specify the mathness at each of the boundaries. For some scraps it either makes no difference whether their translation is processed in math mode or not, or the required mathness is to be determined by the grammatical context rather than by the scrap itself. Such scraps have mathness |maybe_math| at both ends; otherwise a definite mathness is specified at either end. The least significant pair of bits of the |mathness| field of a scrap controls the right boundary, and the pair of bits to its left controls the left boundary. @d left_math(a) (a->mathness>>2) @d right_math(a) (a->mathness&0x3) @ If we combine two scraps neither of which has mathness |maybe_math| at its boundaries, then a `\.\$' is inserted in between if and only if the mathnesses at the common boundary do not agree; if a scrap with |maybe_math| joins one with a definite mathness, that mathness is propagated across the former scrap to its other boundary. In order to implement this, we maintain two mathness values while building up a text: |init_mathness| and |cur_mathness| which represent the values at the left and right boundaries of the part contributed so far; these are local variables of whatever function is concatenating translations, but they should be called by these names since they are addressed by the macros below. As a consequence of the left-to-right order of combining translations, a |maybe_math| scrap that is combined with scraps with definite mathnesses, will actually be set in the mode inherited from its left (unless it appears as the leftmost scrap in a reduction); this can be used to make certain symbols, such as colons, behave in two slightly different ways depending on their syntactic function. 
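In isolation, this boundary bookkeeping can be pictured by a small free-standing C program (a sketch only, not code from \.{CWEAVE}; the function |add_trans_sketch| and the |pack| macro are invented for the occasion, though they mirror |left_math|, |right_math| and the |add_trans| macro given below; note that packing the same mathness at both ends amounts to multiplication by~5, the factor used by |start_scrap| earlier).

  #include <stdio.h>

  enum { maybe_math, yes_math, no_math }; /* as in the code above */
  #define left_math(m) ((m)>>2)
  #define right_math(m) ((m)&0x3)
  #define pack(l,r) (((l)<<2)+(r)) /* pack(m,m)==5*m */

  /* append translation t (of mathness mt) after material of combined
     mathness m, writing `$' when definite mathnesses disagree */
  static int add_trans_sketch(int m, const char* t, int mt)
  { if (left_math(mt)!=maybe_math)
    { if (right_math(m)==maybe_math)
        m=pack(left_math(mt),right_math(m)); /* propagate across maybe_math */
      else if (right_math(m)!=left_math(mt)) putchar('$'); /* math shift */
      m=pack(left_math(m),right_math(mt));
    }
    fputs(t,stdout);
    return m;
  }

  int main(void)
  { int m=pack(maybe_math,maybe_math);
    m=add_trans_sketch(m,"x",pack(yes_math,yes_math));  /* an identifier */
    m=add_trans_sketch(m,"\\5",pack(no_math,no_math));  /* sample break text */
    m=add_trans_sketch(m,"y",pack(yes_math,yes_math));
    printf("\nfinal mathness: %d|%d\n",left_math(m),right_math(m));
    return 0;
  }

Running it prints `\.{x\$\\5\$y}': a math shift is interpolated at each boundary where a definite mathness changes, while |maybe_math| material simply inherits the mode of its left neighbour, which is exactly the propagation described above.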
(This method is not infallible however, as a comment following the symbol will always force it to be processed in horizontal mode; this happens because |insert| scraps are tacked onto the scrap before them before any ordinary reduction can affect it.)

@ Before scraps requiring some definite mathness are contributed, we invoke |set_mode(yes_math)| or |set_mode(no_math)| as appropriate; the first time this happens will determine the value of |init_mathness|.

@d set_mode(x)
  if (cur_mathness==maybe_math) cur_mathness=init_mathness=x;
  else if (cur_mathness!=x) {@; app('$'); cur_mathness=x; }
  else @; /* expect semicolon */

@ The macro |app_trans| is invoked with a |scrap_pointer| as argument, and appends its translation as a single token; |add_trans| will in addition to this administrate |init_mathness| and |cur_mathness|, and interpolate any necessary math shifts.

@d app_trans(a) app_tok(text_flag+text_index((a)->trans))
@d add_trans(a)
{ scrap_pointer scr=a; /* so that evaluating |a| may have side effects */
  if (left_math(scr)!=maybe_math)
  { if (cur_mathness==maybe_math) init_mathness=left_math(scr);
    else if (cur_mathness!=left_math(scr)) app('$');
    cur_mathness=right_math(scr);
  }
  app_trans(scr);
}

@ The function call |fuse(s,n)| will concatenate the translations of |n| scraps starting with |*s|, taking care of the mathnesses, and install the resulting text into |s->trans|.

@c
void fuse (scrap_pointer s, int n)
{ int cur_mathness=maybe_math, init_mathness=maybe_math;
  scrap_pointer p=s;
  check_toks(n); check_text();
  do add_trans(p++)@; while (--n>0); /* gather all the translations */
  s->trans=text_ptr; freeze_text();
  s->mathness=(init_mathness<<2)+cur_mathness;
}

@ An |int_like| identifier following a |struct_like| token, a selection operator (`|.|' or `|->|'), or the special code \:;, is to be typeset as an ordinary identifier. The function |make_nonreserved| alters the flag of the token representing such an identifier occurrence. The scrap representing the |int_like| identifier should not be formed by any reduction, but come directly from |C_read|, so in principle we expect the translation of our scrap to be an unnested text consisting of a single token. However, a comment (or other |insert|) directly following the |int_like| identifier may complicate this picture slightly, because |insert| scraps are tacked onto the previous scrap before it gets the chance to take part in any reduction; this means our token may be buried inside one or more levels of text nesting, but still is the very first token of the translation. Any such levels of nesting are soaked off by the |while| loop below; after this process a single-token text should remain containing a reserved word, of which we replace |res_flag| by |id_flag| to make it print as an ordinary identifier.

@c
void make_nonreserved (scrap_pointer p)
{ text_pointer q=p->trans; token t;
  while (text_flag<=(t=text_begin(q)[0]) && t<inner_text_flag)
    q=text_at(t-text_flag); /* soak off the levels of text nesting */
  text_begin(q)[0]=t-res_flag+id_flag; /* replace |res_flag| by |id_flag| */
}

@ The function |first_ident| locates the first identifier occurring in the translation of a scrap, descending recursively into any subtexts; it returns |NULL| if no identifier is present at all.

@c
id_pointer first_ident(text_pointer p)
{ token_pointer q; token t;
  if (p>=text_ptr) confusion("first_ident");
@.first\_ident@>
  for (q=text_begin(p); q<text_end(p); ++q)
    if ((t=*q)>=id_flag)
    { if (t<res_flag) return id_at(t-id_flag); /* identifier found */
      if (t>=text_flag) /* then descend into the subtext */
      { id_pointer r=first_ident(text_at(t%id_flag));
        if (r!=NULL) return r;
      }
    }
  return NULL;
}

@ The cross-reference list of an identifier gets an ``underlined'' entry for each section where the identifier is defined; this happens for instance for the identifier naming the function in a definition like `|int f(int x) {@;}|', after a |struct_like| token (declaring a structure-, union-, or enumeration tag) and for all identifiers in an enumeration list. This is accomplished by the invocation of |make_underlined| at appropriate times.

@c
void make_underlined (scrap_pointer p)
  /* underline entry for first identifier in |p->trans| */
{ id_pointer name=first_ident(p->trans); /* name of first identifier */
  if (name==NULL) return;
    /* this happens for unsyntactical things like `|int 3;|' */
  { sixteen_bits head=xref_index(name->xref),* r=&head; int n;
    while ((n=xnum(*r)&num_mask)!=0 && n<section_count) r=&xlink(*r);
    if (n==section_count) xnum(*r)|=def_flag; /* mark the existing entry */
    else /* insert a fresh defining cross-reference */
    { (++xref_ptr)->num=def_flag+section_count; xref_ptr->link=*r;
      if (r==&head) name->xref=xref_ptr; @+ else *r=xref_index(xref_ptr);
    }
  }
}
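The manoeuvre just performed on the cross-reference list is the classic insertion into a sorted singly-linked list through a pointer to a link field. Detached from \.{CWEAVE}'s packed representation it looks like this (a free-standing sketch; the node layout and the |def_flag| value are invented for the illustration).

  #include <stdio.h>
  #include <stdlib.h>

  #define def_flag 0x8000u /* invented value; marks a defining reference */
  #define num_mask 0x7FFFu

  typedef struct node { unsigned num; struct node* link; } node;

  /* ensure a defining entry for section sec in the list sorted by
     increasing section number; head plays the role of name->xref */
  static node* underline(node* head, unsigned sec)
  { node** r=&head;
    while (*r!=NULL && ((*r)->num&num_mask)<sec) r=&(*r)->link;
    if (*r!=NULL && ((*r)->num&num_mask)==sec) (*r)->num|=def_flag;
    else /* insert a fresh node in front of *r */
    { node* x=malloc(sizeof *x);
      x->num=sec|def_flag; x->link=*r; *r=x;
    }
    return head;
  }

  int main(void)
  { node* list=NULL; node* p;
    list=underline(list,3); list=underline(list,7); list=underline(list,3);
    for (p=list; p!=NULL; p=p->link)
      printf("section %u%s\n", p->num&num_mask,
             p->num&def_flag ? " (defining)" : "");
    return 0;
  }

Here the repeated |underline(list,3)| produces a single entry for section~3, just as |make_underlined| marks an existing reference rather than duplicating it.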
@ We will now consider how the reduction rules themselves are represented and used. As we have seen, a rule must define a sequence of categories for its left hand side, and for its right hand side a category and a prescription for constructing its translation. In addition, some categories of the left hand side may be marked as context, so that they will not take part in the reduction, and there is a way to specify conditional loading of rules. A few more pieces of information are included for convenience and efficiency.

Individual reduction rules are stored in a structure called |reduction|. It is organised in a way that allows for semi-static initialisation, i.e., the essential parts of information are stored near the beginning of the structure or of one of its sub-structures, so that they can be defined by an initialiser expression, while some further fields are computed from them and assigned at startup time. Within the fields that are statically initialised some fields that usually are~0 are put at the end, so that in the default case they can be omitted from the initialiser.

The field |id| holds an identification number for the rule, which is used in debugging. Then follows the left hand side information, consisting of an array of at most |max_lhs_length| categories (which include those of a possible context; if less than the maximal number of categories are present they are padded with zeros), followed by integers |context| and |length|. The field |context| specifies which categories, if any, form the context: this can be a sequence of one or more categories at either end of the left hand side of the rule, but not at both ends. If |context==0| (as is the case if no explicit initialiser is specified), there are no context categories; when |context>0|, the first |context| categories form the left context, and when |context<0|, the last |abs(context)| categories form the right context. (In practice it is wise to use only token categories (ones that do not require reduction to be formed) for a right context, unless one can be quite sure that no unintended reduction will affect the categories taking part in the intended reduction before the right context has been reduced.) There must be at least one category that is not part of the context, lest a ``reduction'' increase the number of scraps. The right hand side of a rule specifies a category and a string used as a format to build up the translation; for the common case that the translation is formed by concatenating the translations of all scraps of the left hand side (not including those of the context), a null pointer may be given instead of a format string. The following field |mask| can be used to specify selective loading of rules at startup time: any bit set in it will suppress loading under some condition dependent on the setting of option flags in the call of \.{CWEAVE}. The field |displacement| is computed at startup time to record the number of positions (usually negative) by which the parsing pointer~|pp| should be changed after application of the rule.

@d max_lhs_length 4

@< Typedef and enumeration declarations @>=
typedef struct
{ short id; /* for debugging */
  struct {@; eight_bits category[max_lhs_length]; signed char context,length; } lhs;
  struct {@; eight_bits category; char *translation; } rhs;
  sixteen_bits mask;
  short displacement;
} reduction;
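With these declarations the table lines quoted in the introduction are ordinary initialisers. Spelled out in a free-standing sketch (with toy category values and an invented mask constant; the real table has several hundred entries):

  #include <stdio.h>

  typedef unsigned char eight_bits;
  typedef unsigned short sixteen_bits;
  enum { expression=1, binop, comma, lpar, statement }; /* toy subset */
  enum { forced_statements=0x1 }; /* invented mask bit for the illustration */
  #define max_lhs_length 4

  typedef struct
  { short id;
    struct { eight_bits category[max_lhs_length]; signed char context,length; } lhs;
    struct { eight_bits category; char* translation; } rhs;
    sixteen_bits mask;
    short displacement;
  } reduction;

  static reduction toy_table[] =
  { {   2, {{expression,binop,expression}}, {expression, NULL} },
    {   6, {{expression,comma,expression}}, {expression, "__p1_"} },
    { 132, {{lpar,statement,statement},1},  {statement, "_B_"}, forced_statements }
  };

  int main(void) /* mimic the length computation done by install_rule */
  { int r;
    for (r=0; r<3; ++r)
    { int i=0;
      while (i<max_lhs_length && toy_table[r].lhs.category[i]!=0) ++i;
      printf("rule %d: pattern length %d\n", toy_table[r].id, i);
    }
    return 0;
  }

Fields omitted from an initialiser, such as |length| and |displacement|, are zero until |install_rule| fills them in at startup time.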
@ We shall organise the rules in a ``trie'' structure for fast matching. If |q| points to a trie node reached after matching some sequence of categories, and that sequence corresponds to the left hand side (including context) of some rule (which should be unique), then |q->rule| points to that rule, otherwise |q->rule==NULL|. If that sequence of categories is a proper prefix of the left hand side of a rule (which may happen whether or not |q->rule==NULL|), and $c$ is the next category in that left hand side (which implies |0<c&&c<=max_rule_cat|), then |q->next[c-1]| is the index of the trie node reached after a further match of~$c$. The entries of |q->next| that do not correspond to any such successor node are set to~0, which is unambiguous because the root of the trie does not figure as a successor of any node. We do not attempt a sparse representation (which would avoid storage of such 0's), but we do use a compact |eight_bits| representation for the entries of |q->next|; this saves a considerable amount of space, since there is a total of |max_rule_cat*max_no_of_nodes| such entries.

@< Typedef and enumeration declarations @>=
typedef struct {@; reduction* rule; eight_bits next[max_rule_cat]; } trie_node;

@ Trie nodes are allocated from an array |trie_node_mem|, with the root of the trie at |trie_node_mem[0]|. We introduce some macros that will help us find our way around the trie. The address of the successor of the trie node pointed to by |q| for category |c| can be written as |successor(q,c)|. The absence of such a successor can be found by testing |no_successor(q,c)|. If |x| is the address of any node in the trie (except the root), then we can make that node the successor of |q| for category |c| by invoking |set_successor(q,c,x)|.

@d trie_root (&trie_node_mem[0])
@d successor(q,c) (&trie_node_mem[(q)->next[(c)-1]]) /* category codes start at 1 */
@d no_successor(q,c) ((q)->next[(c)-1]==0)
@d set_successor(q,c,x) ((q)->next[(c)-1]=(eight_bits)((x)-trie_node_mem))

@< Global... @>=
trie_node trie_node_mem[max_no_of_nodes];
int node_no = 1; /* number of trie nodes allocated */
#ifdef DEBUG
boolean install_failed=false;
#endif

@ Trie nodes are allocated in a straightforward sequential way. We don't trust that uninitialised statically allocated pointers will be |NULL| (although they should), especially not on machines where |NULL| is not represented as ``all bits cleared'', so we do a bit of extra work here. For the entries of the array |next| we expect no problems however (since |eight_bits| is an integral type), so we do not explicitly initialise them.

@c
trie_node *get_new_trie_node(void)
{ if (node_no>=max_no_of_nodes)
    overflow("trie node");
@.trie node capacity exceeded@>
  trie_node_mem[node_no].rule=NULL;
  return &trie_node_mem[node_no++];
}

@ The function |install_rule| installs a reduction, and if \.{CWEAVE} was compiled with |DEBUG| set, it also performs some checks on the validity of the rule; if any check fails the variable |install_failed| is set to |true|. Since |print| has a variable number of arguments, the macro |rule_error| does not incorporate them but just prepends |print| to the argument list; the replacement text of |rule_error| can therefore not be parenthesised, and the macro should be used with some care.

@d rule_error install_failed=true,print /* report a problematic syntax rule */
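Returning briefly to the trie: since the successor entries are just |eight_bits| indices into |trie_node_mem|, insertion and longest-prefix search take only a few lines each. The following free-standing C program demonstrates the mechanism with tiny, invented capacities (it is a sketch, not \.{CWEAVE}'s code; plain integers stand in for |reduction| pointers).

  #include <stdio.h>

  typedef unsigned char eight_bits;
  #define max_rule_cat 5    /* toy value */
  #define max_no_of_nodes 16

  typedef struct { int rule; eight_bits next[max_rule_cat]; } trie_node;
  static trie_node trie_node_mem[max_no_of_nodes];
  static int node_no=1;

  #define trie_root (&trie_node_mem[0])
  #define successor(q,c) (&trie_node_mem[(q)->next[(c)-1]])
  #define no_successor(q,c) ((q)->next[(c)-1]==0)
  #define set_successor(q,c,x) ((q)->next[(c)-1]=(eight_bits)((x)-trie_node_mem))

  static void insert(int* pattern, int len, int rule)
  { trie_node* q=trie_root; int i;
    for (i=0; i<len; ++i)
    { if (no_successor(q,pattern[i]))
        set_successor(q,pattern[i],&trie_node_mem[node_no++]);
      q=successor(q,pattern[i]);
    }
    q->rule=rule;
  }

  static int match(int* cats, int n) /* longest match from cats[0], or 0 */
  { trie_node* q=trie_root; int rule=0, i;
    for (i=0; i<n && !no_successor(q,cats[i]); ++i)
      if ((q=successor(q,cats[i]))->rule!=0) rule=q->rule;
    return rule;
  }

  int main(void)
  { int p1[]={1,2}, p2[]={1,2,3}, input[]={1,2,3,4};
    insert(p1,2,100); insert(p2,3,101);
    printf("%d\n", match(input,4)); /* prints 101: the longer match wins */
    return 0;
  }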
@ A global variable |rule_mask| is set at startup time according to the relevant option flags; any rules~|r| for which |rule_mask & r->mask!=0| are suppressed. For each set of mutually exclusive settings, a number of bits in |rule_mask| is reserved equal to the size of the set; the current setting will have its corresponding bit set while the others are cleared. Therefore the bits set in |r->mask| specify the option settings for which the rule is disabled, and the default state |r->mask==0| means that the rule always applies, regardless of any optional settings.

@< Global... @>=
sixteen_bits rule_mask;

@ The function |install_rule| enters a rule into the trie structure, and if |DEBUG| is defined, performs some sanity checks on the rule.

@< Prototypes @>=
void install_rule (reduction* rule);

@~The length of a rule's left hand side can be found because category~0 is not used for scraps.

@c
void install_rule(reduction *rule)
{ if ((rule_mask & rule->mask)==0)
  { eight_bits* p=rule->lhs.category, i=0;
    while (i<max_lhs_length && p[i]!=0) ++i;
    rule->lhs.length=i;
#ifdef DEBUG
    @< Check left-hand side of |rule| @>
    @< Check right-hand side of |rule| @>
#endif
    @< Install |rule| in the trie structure @>
    @< Compute |rule->displacement|, and modify |rule->lhs.length| and |rule->context| @>
  }
}

@ The left-hand side should not be only context, and all categories should be legal ones.

@< Check left... @>=
{ if (rule->lhs.length<=abs(rule->lhs.context))
    rule_error("\nNo scraps to replace in rule %d.\n", rule->id);
@.No scraps to replace...@>
  for (i=0; i<rule->lhs.length; ++i)
    if (!valid_cat(p[i]))
      rule_error("\nUnknown category %d in LHS of rule %d.\n", p[i], rule->id);
@.Unknown category...@>
}

@ The right-hand side should have a valid category, and unless its translation is |NULL| (or |""| which we treat as if it were |NULL|), it should contain as many times |'_'| as there are non-context categories in the left-hand side.

@< Check right... @>=
{ int c=rule->rhs.category; char* s=rule->rhs.translation;
  if (!valid_cat(c))
    rule_error("\nUnknown category %d in RHS of rule %d.\n", c, rule->id);
@.Unknown category...@>
  if (s!=NULL)
  { if (*s=='\0') s=rule->rhs.translation=NULL; /* replace empty string */
    else
    { i=0;
      do
        if (*s!='p') i+= *s++=='_'; /* count underscores */
        else if (++s,isdigit((eight_bits)*s)) ++s; /* skip digit and advance */
        else rule_error("\nDigit should follow 'p' in format of rule %d.\n"
@.Digit should follow 'p'...@>
                       , rule->id);
      while (*s!='\0');
      if (i!=rule->lhs.length-abs(rule->lhs.context))
        rule_error("\nCount of '_' not equal to length LHS in rule %d.\n"
@.Count of '\_' ...@>
                  , rule->id);
    }
  }
}

@ Since trie nodes are not represented sparsely, insertion is easy.

@< Install |rule| in the trie structure @>=
{ trie_node* q=trie_root;
  for (i=0; i<rule->lhs.length; ++i)
  { if (no_successor(q,p[i])) set_successor(q,p[i],get_new_trie_node());
    q=successor(q,p[i]);
  }
#ifdef DEBUG
  if (q->rule!=NULL)
    rule_error("\nIdentical left-hand sides in rules %d and %d.\n"
@.Identical left-hand sides...@>
              , q->rule->id, rule->id);
#endif
  q->rule=rule;
}
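The little scan validating format strings can be exercised separately. Here is a direct transcription as a free-standing sketch (the function name is invented; the real code replaces empty strings by |NULL| before this point, whence the extra guard):

  #include <stdio.h>
  #include <ctype.h>

  /* count '_' placeholders in format s, checking that every 'p' is
     followed by a digit; returns -1 for a malformed string */
  static int count_placeholders(const char* s)
  { int i=0;
    if (*s=='\0') return 0; /* handled beforehand in the real program */
    do
      if (*s!='p') i += *s++=='_';
      else if (++s, isdigit((unsigned char)*s)) ++s;
      else return -1; /* 'p' not followed by a digit */
    while (*s!='\0');
    return i;
  }

  int main(void)
  { printf("%d\n", count_placeholders("__p1_")); /* prints 3 */
    printf("%d\n", count_placeholders("_B_"));   /* prints 2 */
    return 0;
  }

For the comma rule quoted in the introduction the count~3 indeed equals the number of non-context scraps in its left hand side.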
@ We compute |displacement| conservatively, based on local considerations; alternatively we might also consider the whole set of rules to find larger (less negative) values that would make parsing go a bit faster. A rule can have a left hand side of length |max_lhs_length|. This means that it is safe to move |pp| so that it will afterwards be |max_lhs_length-1| positions to the left of the first scrap that may have got a new category. After a rule has been installed, there is no need to record the full length of the left hand side, including context, since this is implicit from the place in the trie where the pointer to this rule is located; rather we store the number of scraps that will be replaced. Similarly it is more useful to know the offset of the first scrap to be replaced (which is~0 in case of a right context) rather than the value of |context| as stored at initialisation.

@< Compute |rule->displacement|... @>=
{ int k=rule->lhs.context,d;
  if (k<0) {@; rule->lhs.length+=k; k=0; } @+else rule->lhs.length-=k;
  d=1-max_lhs_length+k; /* this cannot be positive */
  if (rule->lhs.category[k]==rule->rhs.category) /* no category change */
  { ++d;
#ifdef DEBUG
    if (rule->lhs.length==1)
      rule_error("\nNo categories change in rule %d.\n", rule->id);
@.No categories change...@>
#endif
  }
  rule->lhs.context=k; rule->displacement=d; /* if positive, an error was reported */
}

@ The function |match| tests whether the category pointed to by~|p|, and its successors, match the categories in the trie structure, starting at the root of the trie, and up to a node that contains a rule. Multiple matches are possible, in which case the longest one takes precedence. We avoid using a side effect in the argument of |valid_cat|.

@c
reduction *match (scrap_pointer p)
{ trie_node* q=trie_root; reduction* rule=NULL; int c;
  while (c=p++->cat,valid_cat(c) && !no_successor(q,c))
    if ((q=successor(q,c))->rule!=NULL) rule=q->rule;
  return rule;
}

@ When a matching rule has been found, the function |reduce| is called to perform the corresponding actions. At that point |pp| points to the first scrap involved in the match, and the argument |rule| to |reduce| points to the matching rule. If a rule has a left hand side of length~1 (not counting context) and also the default translation (plain concatenation), then all that is to be done is to change the category of a scrap, and part of the processing can be skipped.

@c
void reduce (reduction* rule)
{ int k=rule->lhs.context, l=rule->lhs.length;
  scrap_pointer s = pp+k, p=s; /* position of the new scrap */
  char* f=rule->rhs.translation; /* format string for translation */
  s->cat=rule->rhs.category;
  if (l>1 || f!=NULL) /* otherwise ready already */
  { if (f==NULL) fuse(s,l), p+=l; /* default translation */
    else @< Generate token list according to the format string |f| @>@;@/
    if (l>1) @< Fill vacant scrap positions @>
  }
  @< Print a snapshot of the scrap list if debugging @>@;@/
  @< Set |pp| to |max(pp+rule->displacement,scrap_base)| @>
}

@ When we have applied a reduction to the sequence of scraps, we usually remove scraps (we never create more scraps than we remove), thereby creating a small ``hole'' in the sequence. We fix that hole by sliding scraps downward across it, thereby moving the hole upwards, until it reaches the ``official'' hole at |lo_ptr|; then |lo_ptr| is adjusted so that the small hole is incorporated in the official hole. During the translation process the pointer~|p| was moved across all scraps that took part in the reduction, so the scraps to move are at positions~|i| with |p<=i<lo_ptr|.

@< Fill vacant scrap positions @>=
{ scrap_pointer q=s+1; /* position after the newly formed scrap */
  while (p<lo_ptr) *q++ = *p++; /* slide scraps down across the hole */
  lo_ptr=q; /* the small hole is now part of the official one */
}

@ Finally the parsing pointer is moved back by the |displacement| amount that was computed when the rule was installed, but it is never moved below |scrap_base|.

@< Set |pp| to |max(pp+rule->displacement,scrap_base)| @>=
if (pp<scrap_base-rule->displacement) pp=scrap_base; @+ else pp+=rule->displacement;
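The sliding performed by `Fill vacant scrap positions' is plain array copying; detached from the program (with integers standing in for scraps, and invented names) it looks like this:

  #include <stdio.h>

  int main(void)
  { int scrap[8]={10,20,30,40,50,60,70,80};
    int lo=6;     /* scraps 0..5 form the lower region */
    int s=1, p=4; /* a rule replaced scraps 1..3 by one new scrap at 1 */
    int q=s+1, i;
    while (p<lo) scrap[q++]=scrap[p++]; /* slide the remainder down */
    lo=q;                               /* the hole is absorbed at lo */
    for (i=0; i<lo; ++i) printf("%d ",scrap[i]); /* prints 10 20 50 60 */
    putchar('\n');
    return 0;
  }

After the copy the index |lo| has moved down by the number of removed scraps, which is exactly how |lo_ptr| absorbs the hole.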
@ We need no extensive coding mechanism for describing translations, since they all follow a similar pattern. In all cases all the translations of the scraps in the left hand side (not including the context scraps) are used exactly once, in left to right order (violation of these principles would result in very strange effects indeed for the printed output). The only things that need to be added are formatting controls like |indent|, |force| or |break_space|, math shifts (but these are already taken care of by |add_trans|), and white space. We may also specify calls of |make_underlined| or |make_nonreserved| for certain scraps in the left hand side. Although the items inserted are of a modest variety, one should realise that their presence is the only reason we need to parse at all; without them the translations could have been computed by purely lexical means.

In format strings an underscore indicates the translation of the next scrap of the left hand side. Other characters each encode a formatting control; the character `\.p' encodes |opt| and is followed by a digit that becomes its argument. The characters `\.!' and `\.\$' respectively cause |make_underlined| and |make_nonreserved| to be called for the next scrap; `\.\ ' and~`\.\~' produce a space in the translation, where the latter is non-breakable and the former forces horizontal mode. To just force horizontal or math mode there are `\.h'~and~`\.m'; the latter avoids the possibility of a completely empty formula by adding a space inside math mode. The precise meaning of these and other formatting characters is easily read off from the code below. The `\.o' and `\.r' format characters (the latter is used only in compatibility mode) affect the math category used by \TeX\ for the next symbol (character or control sequence); the syntax rules using them do not put braces around that symbol, since these could also capture a following comment, causing a \TeX\ error.

The number of free tokens we require to be available by calling |check_toks| is a conservative estimate, based on a hypothetical ``worst case'' reduction rule with a left hand side of length~4 and with 8 additional items in its translation (counting format codes with multi-character translations with multiplicity), whose mathnesses alternate, so that a maximal number of math shifts is required.
@< Generate token list... @>=
{ int cur_mathness=maybe_math, init_mathness=maybe_math;
  check_toks(23); check_text();
  do switch (*f++)
  @:format@>
  { case '+': app(indent); break;
    case '-': app(outdent); break;
    case 'p': app(opt); app(*f++); break;
    case 'f': set_mode(no_math); app(force); break;
    case 'F': set_mode(no_math); app(big_force); break;
    case 'b': set_mode(no_math); app(backup); break;
    case 'B': set_mode(no_math); app(break_space); break;
    case 't': set_mode(yes_math); app_str("\\a"); break;
@.\\a@> /* next item in tab space */
    case ',': set_mode(yes_math); app_str("\\,"); break;
@.\\,@> /* thin space */
    case 'h': set_mode(no_math); break; /* force horizontal mode */
    case 'm': set_mode(yes_math); app(' '); break;
      /* force math mode, avoid `\.{\$\$}' */
    case 'o': set_mode(yes_math); app_str("\\m"); break;
@.\\m@> /* make ``mathord'' */
    case 'r': set_mode(yes_math); app_str("\\MRL"); break;
@.\\MRL@> /* make ``mathrel'' */
    case '!': make_underlined(p); break;
    case '$': make_nonreserved(p); break;
    case ' ': set_mode(no_math); app(' '); break;
    case '~': app('~'); break;
    default: printf("%c: ",f[-1]);
      confusion("illegal character in format string");
    case '_': add_trans(p++);
  }
  while (*f!='\0');
  s->trans=text_ptr; freeze_text();
  s->mathness=(init_mathness<<2)+cur_mathness;
}

@ We have now seen how a match is made, and what is done once a matching rule is found; we still have to consider how everything is set up properly, and how rules are repeatedly applied until no more reduction is possible, responding properly to successful and failing matches. All this is performed by the function |translate|; as we have seen it is called by |do_C| and |finish_C| after scraps have been stored from |scrap_base| to |scrap_ptr| of the |scrap_info| array, and it returns a pointer to the text representing the result of parsing all those scraps. We start by appending a dummy scrap if either no scraps are present at all, or some tokens remain that have not been packed into a scrap yet; the latter can only be due to `\.{@@t...@@>}' as a final item in compatibility mode. Then we set |lo_ptr| and |hi_ptr| appropriately, and begin to apply rules as long as possible. When this is done and more than one scrap remains, their translations are wrapped rather bluntly together to a single text.

@c
text_pointer translate (void) /* converts a sequence of scraps */
{ pp=lo_ptr=hi_ptr=scrap_base;
  if (scrap_ptr==pp || dangling_tokens()) /* then append dummy scrap */
  {@; check_scrap(); pack_scrap(insert,no_math); }
  @< If tracing, print an indication of where we are @>
  @< Reduce the scraps using the rules until no more rules apply @>
  @< Combine the irreducible scraps that remain, and |return| the resulting text @>
}

@ Before applying |match|, we must make sure it has good input (at least |max_lhs_length| scraps). If a match at |pp| exists, |reduce| will perform the required processing and updating of |pp| (in this case |pp| is certainly not increased); if not, we move to the right and try again.

@< Reduce the scraps... @>=
do
{ reduction *rule;
  @< Make sure the entries |pp| through |pp+max_lhs_length-1| of |scrap_info| are defined, or that |lo_ptr->cat==0| @>
  if ((rule=match(pp))!=NULL) reduce(rule);
  else
  {@; ++pp; @< Take special action if |pp->cat| is |end_expr| or |rproc| @> }
} while (pp<lo_ptr);
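To see concretely what the format interpreter above produces, here is a toy version handling just `\.\_', `\.B' and the `\.{p}'--digit pair (a free-standing sketch: it ignores mathness and all other codes, and it assumes that |break_space| is eventually written as the macro `\.{\\5}', in the same way as |force| is written `\.{\\6}').

  #include <stdio.h>

  /* expand f, taking successive strings from trans for each '_';
     'B' inserts a breakable space, 'p' a penalty with digit argument */
  static void expand(const char* f, const char** trans)
  { while (*f!='\0')
      switch (*f++)
      { case '_': fputs(*trans++,stdout); break;
        case 'B': fputs("\\5",stdout); break;  /* assumed break_space macro */
        case 'p': printf("\\3%c",*f++); break; /* opt with digit argument */
        default: break; /* other codes omitted in this sketch */
      }
    putchar('\n');
  }

  int main(void)
  { static const char* t[]={ "x", ",", "y" };
    expand("__p1_",t); /* the format of the rule for expression comma expression */
    return 0;
  }

For the comma rule shown in the introduction this prints `\.{x,\\31y}', the `\.{\\31}' being the optional break with penalty~10 mentioned there.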
@ The module named above must guarantee that |match| can inspect |max_lhs_length| consecutive scraps; it therefore copies scraps down across the hole as they are needed, and once |hi_ptr| has reached |scrap_ptr| it stores a scrap with |lo_ptr->cat=0|, which will prevent any matches to scraps beyond those that are actually present. All scraps of category |insert| pass across the hole between |lo_ptr| and |hi_ptr|, and we take the opportunity to remove them, tacking their translation to the scrap below the hole. We even make sure that |hi_ptr->cat!=insert|, so that the scraps with which the |insert| scrap is combined will not have undergone any ordinary reduction yet. The only possible remaining |insert| scrap is one at the very start of the list; it will be handled at the end of all reductions.

@< Make sure the entries... @>=
{ scrap_pointer lo_min = pp+max_lhs_length;
  while (lo_ptr<lo_min && lo_ptr->cat!=0)
    if (hi_ptr>=scrap_ptr) lo_ptr->cat=0; /* store a sentinel */
    else
    { *lo_ptr++ = *hi_ptr++;
      while (hi_ptr<scrap_ptr && hi_ptr->cat==insert)
      {@; *lo_ptr = *hi_ptr++; fuse(lo_ptr-1,2); }
    }
}

@ The category pairs |lproc|--|rproc| and |begin_expr|--|end_expr| are special in that they don't occur in any rules, but rather serve only as markers. When the material in between will not reduce any further the whole construction will be wrapped up, their translations concatenated, and the result treated as an |insert| respectively an |expression|. We do not actually form an |insert| scrap if it is not at the start of the scrap sequence, but rather combine everything directly with the scrap immediately before the |lproc| scrap, leaving the category of that scrap as it is; this is necessary, because the code that combines ordinary |insert| scraps with their predecessor only looks at scraps when they cross the hole from |hi_ptr| to~|lo_ptr|. We use here that |rproc-lproc==end_expr-begin_expr==1|. The variable names |s| and |p| are chosen with the same meaning as in |reduce|, so that we could reuse a module.

@< Take special action... @>=
if (pp->cat==end_expr || pp->cat==rproc)
{ int start=pp->cat-1; /* the opening category matching |pp->cat| */
  scrap_pointer s=pp, p=pp+1;
  while ((--s)->cat!=start && s>scrap_base) {}
  if (s->cat==start) /* if opening symbol is missing, take no action */
  { if (start==begin_expr) s->cat=expression;
    else if (s==scrap_base) s->cat=insert; @+ else --s; /* position of new scrap */
    fuse(s,(int)(p-s));
    @< Fill vacant scrap positions @> /* using values of |p| and |s| */
    pp = s-scrap_base<max_lhs_length-1 ? scrap_base : s-(max_lhs_length-1);
  }
}

@ When no more reductions can be made and more than one scrap remains, their translations are simply concatenated into one text; an |insert| scrap at the very beginning is first merged into the scrap following it, and if so requested the irreducible scrap sequence is reported.

@< Combine the irreducible scraps... @>=
{ scrap_pointer j;
  if (scrap_base->cat==insert && lo_ptr>scrap_base+1)
  { fuse(scrap_base,2); /* merge initial |insert| into the next scrap */
    j=scrap_base; j->cat=j[1].cat; --lo_ptr;
    while (++j<lo_ptr) *j=j[1]; /* shift down the remaining scraps */
  }
  @< If only irreducible scraps remain, print them @>
  check_toks(1); check_text();
  for (j=scrap_base; j<lo_ptr; ++j) app_trans(j);
  freeze_text(); return text_ptr-1;
}

@ The amount of detail that is displayed during parsing is recorded in the variable |tracing|.

@< Global... @>=
#ifdef DEBUG
int tracing; /* how much parsing details to show */
#endif

@ When parsing fails to reduce everything to a single scrap, pleasing results will probably not be obtained; it is therefore advisable to run \.{CWEAVE} with |tracing==trace1| before a final version of a |CWEB| program is fixed. In order to allow this without changing the source file itself, we initialise |tracing| to |trace1| if the flag `\.{+d}' is supplied to \.{CWEAVE}.

@< Set initial values @>=
#ifdef DEBUG
tracing = flags['d'] ? trace1 : trace0;
#endif

@ The following code is activated by the `\.{+d}' flag or the \:1 control code.

@< If only irreducible scraps remain, print them @>=
#ifdef DEBUG
{ if (tracing==trace1 && lo_ptr>=scrap_base+2)
  { print("\nIrreducible scrap sequence at line %d in section %d:\n"
@.Irreducible scrap sequence...@>
         ,cur_line, section_count);
    mark_harmless();
    for (j=scrap_base; j<lo_ptr-1; ++j) print_cat(j->cat), putchar(' ');
    print_cat(j->cat); new_line(); /* |term_line_empty| is still valid */
  }
}
#endif
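The constants |trace0| up to |trace3| are defined elsewhere in the program; only their relative order matters for the comparisons used here. Presumably they amount to something like this sketch (the values are invented):

  enum
  { trace0, /* be silent */
    trace1, /* report any sequence of scraps that fails to reduce completely */
    trace2, /* additionally show every reduction as it is applied */
    trace3  /* additionally show the mathness at every scrap boundary */
  };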
@ When full tracing is enabled the following message indicates which piece of \Cee~text is being parsed (but for section bodies it will generally show the first line of the next section, since that has already been fetched).

@< If tracing, print an indication of where we are @>=
#ifdef DEBUG
{ if (tracing>=trace2)
  { print("\nTracing after l.%d:\n", cur_line);
@.Tracing after...@>
    if (loc>buffer+50) /* shorten long lines to keep within margins */
    {@; printf("..."); term_write (loc-50,50); }
    else term_write(buffer,loc-buffer);
    new_line(); /* |term_line_empty| is still valid */
  }
}
#endif

@ After each reduction, full tracing will print a line starting with the rule number, followed by a display of all the categories of scraps which have been considered until now, i.e., those at positions below |lo_ptr|. The scrap that was produced by this reduction, which is pointed to by~|s|, has its category highlighted by enclosing it in inverted angle brackets. If tracing is set to~3, extra-full tracing is active, and mathnesses at the boundaries of scraps are indicated.

@d math_char(x) ((x)==yes_math ? '+' : (x)==no_math ? '-' : '?')

@< Print a snapshot of the scrap list if debugging @>=
#ifdef DEBUG
{ scrap_pointer k; /* pointer into |scrap_info| */
  if (tracing>=trace2)
  { print("\n%3d:", rule->id);
    for (k=scrap_base; k<lo_ptr; ++k)
    { putchar(' ');
      if (tracing==trace3) putchar(math_char(left_math(k)));
      if (k==s) putchar('>'), print_cat(k->cat), putchar('<');
      else print_cat(k->cat);
      if (tracing==trace3) putchar(math_char(right_math(k)));
    }
    print("%s\n", hi_ptr<scrap_ptr ? " ..." : "");
      /* indicate whether any scraps remain above the hole */
  }
}
#endif