ABSTRACT
Computer program input generally has some structure; in fact, every computer program that does input can be thought of as defining an ``input language'' which it accepts. An input language may be as complex as a programming language, or as simple as a sequence of numbers. Unfortunately, usual input facilities are limited, difficult to use, and often are lax about checking their inputs for validity.
Yacc provides a general tool for describing the input to a computer program. The Yacc user specifies the structures of his input, together with code to be invoked as each such structure is recognized. Yacc turns such a specification into a subroutine that handles the input process; frequently, it is convenient and appropriate to have most of the flow of control in the user's application handled by this subroutine.
Yacc is written in portable C. The class of specifications accepted is a very general one: LALR(1) grammars with disambiguating rules.
INTRODUCTION
Yacc is a piece of computer software that serves as the standard parser.
Generator on Unix systems. The name is an acronym for "Yet Another Compiler Compiler." It generates a parser (the part of a compiler that tries to make sense of the input) based on a grammar written in BNF notation. Yacc generates the code for the parser in the C programming language.
Yacc was developed by Steven C. Johnson at AT&T for the Unix operating system. Later compatible programs were written, such as Berkeley Yacc, GNU bison, MKS yacc and Abraxas yacc. Each offer slight improvements and additional features over the original Yacc, but the concept has remained the same.
Since the parser generated by Yacc requires a lexical analyzer, it is often used in combination with a lexical analyzer generator, in most cases the Lex program. The IEEE POSIX P1003.2 standard defines the functionality and requirements to both Lex and Yacc.
Yacc used to be available as the default parser generator on most Unix systems. It has since been supplanted as the default by more recent, largely compatible, programs such asBerkeley Yacc, GNU bison, MKS yacc and Abraxas pcyacc. An updated version of the original AT&T version is included as part of Sun's OpenSolaris project. Each offers slight improvements and additional features over the original yacc, but the concept has remained the same. Yacc has also been rewritten for other languages,including Ratfor, ML, Ada,Pascal, Java, Python and Common Lisp.
IMPORTANCE
It is possible to create a simple parser using Lex alone. by making extensive use of the user-defined states (ie start-conditions). However, such a parser quickly becomes unmaintainable, as the number of user-defined states tends to explode.
Once our input file syntax contains complex structures, such as "balanced" brackets, or contains elements which are context-sensitive, we should be considering yacc.
"Context-sensitive" in this case means that a word or symbol can have different interpretations, depending on where it appears in the input language. For example in C, the '*' character is used for both multiplication, and to specify indirection (ie to dereference a pointer to a piece of memory). It's meaning is "context-sensitive".
THE YACC SPECIFICATION
Like lex, yacc has it's own specification language. A yacc specification is structured along the same lines as a Lex specification.
%{
/* C declarations and includes */
%}
/* Yacc token and type declarations */
%%
/* Yacc Specification
in the form of grammer rules like this:
*/
symbol : symbols tokens
{ $$ = my_c_code($1); }
;
%%
/* C language program (the rest) */
The Yacc Specification rules are the place where you "glue" the various tokens together that lex has conviniently provided to you.
Each grammar rule defines a symbol in terms of:
other symbols
tokens (or terminal symbols) which come from the lexer.
Each rule can have an associated action, which is executed after all the component symbols of the rule have been parsed. Actions are basically C-program statements surrounded by curly braces.
YACC RULES
Yacc Grammar Rules
Simple Rule
Yacc rules define what is a legal sequence of tokens in our specification language. In our case, lets look at the rule for a simple, executable menu-command:
menu_item : LABEL EXEC
;
This rule defines a non-terminal symbol, menu_item in terms of the two tokens LABEL and EXEC. Tokens are also known as "terminal symbols", because the parser does not need to expand them any further. Conversely, menu_item is a "non-terminal symbol" because it can be expanded into LABEL and EXEC.
Alternate Rules
We've just hit our first complication: Any given menu-item may also have the keyword DEFAULT appear between the label and the executable command. Yacc allows us to have, multiple alternate definitions ofmenu_item, like this:
menu_item : LABEL EXEC
| LABEL DEFAULT EXEC
;
Note that the colon (:) semi-colon (;) and or-symbol (|) are part of the yacc syntax - they are not part of our menu-file definition. All yacc rules follow the basic syntax shown above and must end in a semi-colon. We've put the semi-colon on the next line for clarity, so that it does not get confused with our syntax-definitions. This is not a strict requirement, either, but another convention of style that we will adhere to.
Litteral Characters in a Rule
There is a way to include litteral text within a rule, but it requires that the lexer pass the characters to the parser one-by-one, as tokens.
For example, remember that menu-items may have an icon-file instead of a label, like this:
</usr/X11R6/icons/mini.netscape.xpm> exec netscape
When our lexer encounters a < or > it returns the character as a token
We can include litteral characters in a grammar rule, like this:
menu_item : LABEL EXEC '\n'
| '<' LABEL '>' EXEC '\n'
;
Where the second form of the menu_item is a used when specifying an icon-file instead of a text-label.
Recursive Rules
So far, we've defined a single menu-item, whereas our menu-file may contain any number of such menu-items. Yacc handles this allowing recursive rules, like this:
menu_items : menu_item
| menu_items menu_item
;
By defining menu_items in terms of itself, we now have a rule which means "one or more menu items".
Note that we could also have written our recursive definition the other way round, as:
| menu_item menu_items
but, due to the internals of yacc, this builds a less memory-efficient parser. Refer to the section "Recursion" in the Yacc/Bison documentation for the reasons behind this.
Empty Rules
Referring back to the rule for a single menu_item, there is another way we could accomodate the optional DEFAULT keyword; by defining an empty rule, like this:
menu_item : LABEL default EXEC '\n'
;
DEFAULT : /* empty */
| DEFAULT
;
The comment /* empty */ is ignored by yacc, and can be omitted, but again, it is conventional to include it for any empty rules.
Strange as it may seem, the absence of the keyword DEFAULT is also a valid rule! Yacc acknowledges the empty rule for "default" when it sees it's current look-ahead token is EXEC, and not DEFAULT. See the section "Look-Ahead" in the Bison documentation for more information about "look-ahead".
To understand why this 2nd approach might be considered better than our earlier one, we need to explore Yacc Actions.
YACC ACTIONS
Actions within yacc rules take the form:
menu_item : LABEL EXEC '\n'
{ C-program statements }
;
So far, we have only considered the tokens LABEL and EXEC as single-valued integers which are passed from the lexer to the parser. What we really need, is access to the text-strings associated with these tokens (ie their semantic value).
We could do this using a global variable (like token_txt in our spam-checking program), except that yacc executes the action after it has read all the tokens up to that point. Hence the string value for EXECwould overwrite the one for LABEL before we had a chance to use it. We could use seperate global variables for the LABEL and EXEC strings, but this won't always work, because sometimes yacc has to read a token in advance before it can
In order to accomodate a variety of different token-types, yylval is declared as a union of different types.
A YACC IMPLEMENTATION -
Compilation Sequence
You code patterns and input them to lex. It will read your patterns and generate C code for a lexical analyzer or scanner. You code a grammar and input it to yacc. Yacc will read your grammar and generate C code for a syntax analyzer or parser.
ANOTHER EXAMPLE
Olvwm2dtwmrc Menu Converter
Our example program will read an olvwm menu file, with the intent of afterwards writing out an equivalent menu, for a different window manager.
Let's review a typical openwin-menu file:
# comments
Workspace TITLE
Editors MENU $OPENWINHOME/lib/openwin-menu-e
Properties PROPERTIES
SEPARATOR
Netscape exec netscape
</usr/X11R6/icons/mini.netscape.xpm> exec netscape
"Long command" exec netscape \
http://www.luv.asn.au/
"Meditation" MENU
"Coffee" TITLE
"drift" exec xlock -remote -nolock -mode drift
"flame" exec xlock -remote -nolock -mode flame
"Meditation" END PIN
"Optional Menu" INCLUDE maybe-existant-menu
"Exit" EXIT
Most entries are single-line entries, begining with a label or icon filename. We also have various keywords to take into account, plus the more complex structure of a sub-menu.
REFRENCES:
http://opengroup.org/onlinepubs/007908775/xcu/yacc.html
http://en.wikipedia.org/wiki/Yacc
http://dinosaur.compilertools.net/yacc/index.html
http://dinosaur.compilertools.net/
http://opengroup.org/onlinepubs/007908775/xcu/yacc.html