EiffelParse Tutorial

OVERVIEW

Parsing is the task of analyzing the structure of documents such as programs, specifications or other structured texts.

Many systems need to parse documents. The best-known examples are compilers, interpreters and other software development tools; but as soon as a system provides its users with a command language, or processes input data with a non-trivial structure, it will need parsing facilities.

This chapter describes the EiffelParse library, which you can use to process documents of many different types. It provides a simple and flexible parsing scheme, resulting from the full application of object-oriented principles.

Because it concentrates on the higher-level structure, the EiffelParse library requires auxiliary mechanisms for identifying a document's lexical components: words, numbers and other such elementary units. To address this need it is recommended, although not required, to complement EiffelParse with the EiffelLex library studied in the previous chapter.

Figure 1 shows the inheritance structure of the classes discussed in this chapter.

Figure 1: Parse class structure

WHY USE THE EIFFELPARSE LIBRARY

Let us fist look at the circumstances under which you may want - or not want - to use the EiffelParse library.

The EiffelParse library vs. parser generators

Parsing is a heavily researched area of computing science and many tools are available to generate parsers. In particular, the popular Yacc tool, originally developed for Unix, is widely used to produce parsers.

In some cases Yacc or similar tools are perfectly adequate. It is also sometimes desirable to write a special-purpose parser for a language, not relying on any parser generator. Several circumstances may, however, make the Parse library attractive:

  • The need to interface the parsing tasks with the rest of an object-oriented system (such as a compiler or more generally a "document processor" as defined below) in the simplest and most convenient way.
  • The desire to apply object-oriented principles as fully as possible to all aspects of a system, including parsing, so as to gain the method's many benefits, in particular reliability, reusability and extendibility.
  • The need to tackle languages whose structure is not easily reconciled with the demands of common parser generator, which usually require the grammar to be LALR (1). (The EiffelParse library uses a more tolerant LL scheme, whose only significant constraint is absence of left-recursivity; the library provides mechanisms to detect this condition, which is easy to correct.)
  • The need to define several possible semantic treatments on the same syntactic structure.

The last reason may be the most significant practical argument in favor of using EiffelParse. Particularly relevant is the frequent case of a software development environment in which a variety of tools all work on the same basic syntactic structure. For example an environment supporting a programming language such as Pascal or Eiffel may include a compiler, an interpreter, a pretty-printer, software documentation tools (such as Eiffel's short and flat-short facilities), browsing tools and several other mechanisms that all need to perform semantic actions on software texts that have the same syntactic structure. With common parser generators such as Yacc, the descriptions of syntactic structure and semantic processing are inextricably mixed, so that you normally need one new specification for each tool. This makes design, evolution and reuse of specifications difficult and error-prone.

In contrast, the EiffelParse library promotes a specification style whereby syntax and semantics are kept separate, and uses inheritance to allow many different semantic descriptions to rely on the same syntactic stem. This will make EiffelParse particularly appropriate in such cases.

A word of caution

At the time of publication the EiffelParse library has not reached the same degree of maturity as the other libraries presented in this book. It should thus be used with some care. You will find at the end of this chapter a few comments about the work needed to bring the library to its full realization.

AIMS AND SCOPE OF THE EIFFELPARSE LIBRARY

To understand the EiffelParse library it is necessary to appreciate the role of parsing and its place in the more general task of processing documents of various kinds.

Basic terminology

First, some elementary conventions. The word document will denote the texts to be parsed. The software systems which perform parsing as part of their processing will be called document processors.

Typical document processors are compilers, interpreters, program checkers, specification analyzers and documentation tools. For example the EiffelStudio environment contains a number of document processors, used for compiling, documentation and browsing.

Parsing, grammars and semantics

Parsing is seldom an end in itself; rather, it serves as an intermediate step for document processors which perform various other actions.

Parsing takes care of one of the basic tasks of a document processor: reconstructing the logical organization of a document, which must conform to a certain syntax (or structure), defined by a grammar.

Note: The more complete name syntactic grammar avoids any confusion with the lexical grammars discussed in the EiffelLex Tutorial. By default, "grammar" with no further qualification will always denote a syntactic grammar. A syntactic grammar normally relies on a lexical grammar, which gives the form of the most elementary components - the tokens - appearing in the syntactic structure.

Once parsing has reconstructed the structure of a document, the document processor will perform various operations on the basis of that structure. For example a compiler will generate target code corresponding to the original text; a command language interpreter will execute the operations requested in the commands; and a documentation tool such as the short and flat-short commands for Eiffel will produce some information on the parsed document. Such operations are called semantic actions. One of the principal requirements on a good parsing mechanism is that it should make it easy to graft semantics onto syntax, by adding semantic actions of many possible kinds to the grammar.

The EiffelParse library provides predefined classes which handle the parsing aspect automatically and provide the hooks for adding semantic actions in a straightforward way. This enables developers to write full document processors - handling both syntax and semantics - simply and efficiently.

As noted at the beginning of this chapter, it is possible to build a single syntactic base and use it for several processors (such as a compiler and a documentation tool) with semantically different goals, such as compilation and documentation. In the EiffelParse library the semantic hooks take the form of deferred routines, or of routines with default implementations which you may redefine in descendants.

LIBRARY CLASSES

The EiffelParse library contains a small number of classes which cover common document processing applications. The classes, whose inheritance structure was shown at the beginning of this chapter, are:

  • CONSTRUCT , describing the general notion of syntactical construct.
  • AGGREGATE , describing constructs of the "aggregate" form.
  • CHOICE , describing constructs of the "choice" form.
  • REPETITION , describing constructs of the "repetition" form.
  • TERMINAL , describing "terminal" constructs with no further structure.
  • KEYWORD , describing how to handle keywords.
  • L_INTERFACE , providing a simple interface with the lexical analysis process and the Lex library.
  • INPUT , describing how to handle the input document.

EXAMPLES

The EiffelStudio delivery includes (in the examples/library/parse subdirectory) a simple example using the EiffelParse Library classes. The example is a processor for "documents" which describe computations involving polynomials with variables. The corresponding processor is a system which obtains polynomial specifications and variable values from a user, and computes the corresponding polynomials. This example illustrates the most important mechanisms of the EiffelParse Library and provides a guide for using the facilities described in this chapter. The components of its grammar appear as illustrations in the next sections.

CONSTRUCTS AND PRODUCTIONS

A set of documents possessing common properties, such as the set of all valid Eiffel classes or the set of all valid polynomial descriptions, is called a language.

In addition to its lexical aspects, the description of a language includes both syntactic and semantic properties. The grammar - the syntactic specification - describes the structure of the language (for example how an Eiffel class is organized into a number of clauses); the semantic specification defines the meaning of documents written in the language (for example the run-time properties of instances of the class, and the effect of feature calls).

To discuss the EiffelParse library, it is simpler to consider "language' as a purely syntactic notion; in other words, a language is simply the set of documents conforming to a certain syntactic grammar (taken here to include the supporting lexical grammar). Any semantic aspect will be considered to belong to the province of a specific document processor for the language, although the technique used for specifying the grammar will make it easy to add the specification of the semantics, or several alternative semantic specifications if desired.

This section explains how you may define the syntactic base - the grammar.

Constructs

A grammar consists of a number of constructs, each representing the structure of documents, or document components, called the specimens of the construct. For example, a grammar for Eiffel will contain constructs such as Class, Feature_clause and Instruction. A particular class text is a specimen of construct Class.

Each construct will be defined by a production, which gives the structure of the construct's specimens. For example, a production for Class in an Eiffel grammar should express that a class (a specimen of the Class construct) is made of an optional Indexing part, a Class_header, an optional Formal_generics part and so on. The production for Indexing will indicate that any specimen of this construct - any Indexing part - consists of the keyword indexing followed by zero or more specimens of Index_clause.

Although some notations for syntax descriptions such as BNF allow more than one production per construct, the EiffelParse library relies on the convention that every construct is defined by at most one production. Depending on whether there is indeed such a production, the construct is either non-terminal or terminal:

  • A non-terminal construct (so called because it is defined in terms of others) is specified by a production, which may be of one of three types: aggregate, choice and repetition. The construct will accordingly be called an aggregate, choice or repetition construct.
  • A terminal construct has no defining production. This means that it must be defined outside of the syntactical grammar. Terminals indeed come from the lexical grammar. Every terminal construct corresponds to a token type (regular expression or keyword) of the lexical grammar, for which the parsing duty will be delegated to lexical mechanisms, assumed in the rest of this chapter to be provided by the Lex library although others may be substituted if appropriate.

All specimens of terminal constructs are instances of class TERMINAL . A special case is that of keyword constructs, which have a single specimen corresponding to a keyword of the language. For example, if is a keyword of Eiffel. Keywords are described by class KEYWORD , an heir of TERMINAL .

The rest of this section concentrates on the parsing-specific part: non-terminal constructs and productions. Terminals will be studied in the discussion of how to interface parsing with lexical analysis.

Varieties of non-terminal constructs and productions

An aggregate production defines a construct whose specimens are obtained by concatenating ("aggregating") specimens of a list of specified constructs, some of which may be optional. For example, the production for construct Conditional in an Eiffel grammar may read:Conditional [=] if Then_part_list [Else_part] end

This means that a specimen of Conditional (a conditional instruction) is made of the keyword if, followed by a specimen of Then_part_list, followed by zero or one specimen of Else_part (the square brackets represent an optional component), followed by the keyword end.

Note: {{{1}}}

A choice production defines a construct whose specimens are specimens of one among a number of specified constructs. For example, the production for construct Type in an Eiffel grammar may read:Type [=] Class_type | Class_type_expanded | Formal_generic_name | Anchored | Bit_type

This means that a specimen of Type is either a specimen of Class_type, or a specimen of Class_type_expanded etc.

Finally, a repetition production defines a construct whose specimens are sequences of zero or more specimens of a given construct (called the base of the repetition construct), separated from each other by a separator. For example, the production for construct Compound in an Eiffel grammar may read Compound [=] {Instruction ";" ...}

This means that a specimen of Compound is made of zero or more specimens of Instruction, each separated from the next (if any) by a semicolon.

These three mechanisms - aggregate, choice and repetition - suffice to describe the structure of a wide array of practical languages. Properties which cannot be handled by them should be dealt with through semantic actions, as explained below.

An example grammar

The example directory included in the delivery implements a processor for a grammar describing a simple language for expressing polynomials. A typical "document" in this language is the line

x; y: x * (y + 8 - (2 * x))

The beginning of the line, separated from the rest by a colon, is the list of variables used in the polynomial, separated by semicolons. The rest of the line is the expression defining the polynomial.

Using the conventions defined above, the grammar may be written as:Line [=] Variables ":" Sum Variables [=] {Identifier ";" ...} Sum [=] {Diff "+" ...} Diff [=] {Product "-" ...} Product [=] {Term " * " ...} Term [=] Simple_var Int_constant Nested Nested [=] "(" Sum ")"

This grammar assumes a terminal Identifier, which must be defined as a token type in the lexical grammar. The other terminals are keywords, shown as strings appearing in double quotes, for example "+".

PARSING CONCEPTS

The EiffelParse library supports a parsing mechanism based on the concepts of object-oriented software construction.

Class CONSTRUCT

The deferred class CONSTRUCT describes the general notion of construct; instances of this class and its descendants are specimens of the constructs of a grammar.

Deferred though it may be, CONSTRUCT defines some useful general patterns; for example, its procedure process appears as:
parse if parsed then semantics end
where procedures parse and semantics are expressed in terms of some more specific procedures, which are deferred. This defines a general scheme while leaving the details to descendants of the class.

Such descendants, given in the library, are classes AGGREGATE , CHOICE , REPETITION and TERMINAL . They describe the corresponding types of construct, with features providing the operations for parsing their specimens and applying the associated semantic actions.

Building a processor

To build a processor for a given grammar, you write a class, called a construct class, for every construct of the grammar, terminal or non-terminal. The class should inherit from AGGREGATE , CHOICE , REPETITION or TERMINAL depending on the nature of the construct. It describes the production for the construct and any associated semantic actions.

To complete the processor, you must choose a "top construct" for that particular processor, and write a root class. In accordance with the object-oriented method, which implies that "roots" and "tops" should be chosen last, these steps are explained at the end of this chapter.

The next sections explain how to write construct classes, how to handle semantics, and how to interface parsing with the lexical analysis process. All these tasks rely on a fundamental data abstraction, the notion of abstract syntax tree.

Abstract syntax trees

The effect of processing a document with a processor built from a combination of construct classes is to build an abstract syntax tree for that document, and to apply any requested semantic actions to that tree.

The syntax tree is said to be abstract because it only includes important structural information and does not retain the concrete information such as keywords and separators. Such concrete information, sometimes called "syntactic sugar", serves only external purposes but is of no use for semantic processing.

The combination of Eiffel techniques and libraries yields a very simple approach to building and processing abstract syntax trees. Class CONSTRUCT is a descendant of the Data Structure Library class TWO_WAY_TREE , describing a versatile implementation of trees; so, as a consequence, are CONSTRUCT's own descendants. The effect of parsing any specimen of a construct is therefore to create an instance of the corresponding construct class. This instance is (among other things) a tree node, and is automatically inserted at its right place in the abstract syntax tree.

As noted in the discussion of trees, class TWO_WAY_TREE makes no formal distinction between the notions of tree and tree node. So you may identify the abstract syntax tree with the object (instance of CONSTRUCT ) representing the topmost construct specimen in the structure of the document being analyzed.

The production function

A construct class describes the syntax of a given construct through a function called production, which is a direct representation of the corresponding production. This function is declared in CONSTRUCT as production: LINKED_LIST [CONSTRUCT] -- Right-hand side of the production for the construct deferred end

Function production remains deferred in classes AGGREGATE , CHOICE and REPETITION . Every effective construct class that you write must provide an effecting of that function. It is important for the efficiency of the parsing process that every effective version of production be a once function. Several examples of such effectings are given below.

Classes AGGREGATE , CHOICE , REPETITION and TERMINAL also have a deferred function construct_name of type STRING, useful for tracing and debugging. This function should be effected in every construct class to return the string name of the construct, such as "INSTRUCTION" or "CLASS" for construct classes in a grammar of Eiffel. For efficiency reasons, the construct_name function should also be a once function. The form of such a function will always be the same, as illustrated by the following example which may appear in the construct class INSTRUCTION in a processor for Eiffel: construct_name: STRING -- Symbolic name of the construct once Result := "INSTRUCTION" end

The examples of the next few sections, which explain how to write construct classes, are borrowed from the small "Polynomial" language mentioned above, which may be found in the examples directory in the ISE Eiffel delivery.

PREPARING GRAMMARS

Having studied the EiffelParse library principles, let us see how to write grammar productions for various kinds of construct. The main task is to write the production function for each construct class.

The production function for a descendant of AGGREGATE will describe how to build a specimen of the corresponding function from a sequence of specimens of each of the constituent constructs. Writing this function from the corresponding production is straightforward.

As an example, consider the production function of class LINE for the Polynomial example language. The corresponding production is
Line [=] Variables ":" Sum
where Variables and Sum are other constructs, and the colon ":" is a terminal. This means that every specimen of Line consists of a specimen of Variables, followed by a colon, followed by a specimen of Sum.

Here is the corresponding production function as it appears in class LINE: production: LINKED_LIST [CONSTRUCT] local var: VARIABLES sum: SUM once create Result.make Result.forth create var.make put (var) keyword (":") create sum.make put (sum) end

As shown by this example, the production function for an aggregate construct class should declare a local entity (here var and sum) for each non-keyword component of the right-hand side, the type of each entity being the corresponding construct class (here VARIABLES and SUM).

The body of the function should begin with create Result.make Result.forthto create the object containing the result. Then for each non-keyword component, represented by the local entity component (this applies to var and sum in the example), there should be a sequence of two instructions, of the form create component.make put (component)

For any keyword of associated string symbol, such as the colon ":" in the example, there should be a call to keyword (symbol)

The order of the various calls to put (for non-keywords) and keyword (for keywords) must be the order of the components in the production. Also, every create component.make instruction must occur before the corresponding call to put (symbol).

All components in the above example are required. In the general case an aggregate production may have optional components. To signal that a component component of the right-hand side is optional, include a call of the form component.set_optional

This call may appear anywhere after the corresponding create component instruction. The recommended place is just after the call to put, as in create component put (symbol) component.set_optional

Choices

The production function for a descendant of CHOICE will describe how to build a specimen of the corresponding function as a specimen of one of the alternative constructs.

As an example, consider the production function of class TERM for the Polynomial example language. The corresponding production isTerm [=] Simple_var Poly_integer Nested
where Simple_var, Poly_integer and Nested are other constructs. This means that every specimen of Term consists of one specimen of any one of these three constructs. Here is the corresponding production function as it appears in class TERM: production: LINKED_LIST [CONSTRUCT] local id: SIMPLE_VAR val: POLY_INTEGER nest: NESTED once create Result.make Result.forth createid.make put (id) create val.make put (val) create nest.make put (nest) end

As shown by this example, the production function for a choice construct class must declare a local entity - here id, val and nest - for each alternative component of the right-hand side. The type of each entity is the corresponding construct class - here SIMPLE_VAR, POLY_INTEGER and NESTED.

The body of the function must begin by create Result.make Result.forth

Then for each alternative component represented by a local entity component (in the example this applies to id, val and nest) there should be two instructions of the form create component.make put (component)

Caution: The order of the various calls to put is irrelevant in principle. When a document is parsed, however, the choices will be tried in the order given; so if you know that certain choices occur more frequently than others it is preferable to list them first to speed up the parsing process.

Repetitions

The production function for a descendant of REPETITION will describe how to build a specimen of the corresponding function as a sequence or zero or more (or, depending on the grammar, one or more) specimens of the base construct. The class must also effect a feature separator of type STRING, usually as a constant attribute. (This feature is introduced as deferred in class REPETITION .)

As an example, consider the construct Variables in the Polynomial example language. The right-hand side of the corresponding production is
Variables [=] {Identifier ";" ...}
where Identifier is another construct, and the semicolon ";" is a terminal. This means that every specimen of Variables consists of zero or more specimens of Identifier, separated from each other (if more than one) by semicolons.

Here are the corresponding production function and separator attribute as they appear in class VARIABLES: production: LINKED_LIST [IDENTIFIER] local base: VAR once create Result.make Result.forth create base.make put (base) end separator: STRING = ";"

As shown by this example, function production is built along the same ideas as for aggregates and choices, except that here only one component, base, is required; its type must be the class corresponding to the construct serving as the base of the repetition, VAR in the example.

INTERFACE TO LEXICAL ANALYSIS

One more type of construct class remains to be discussed: terminal construct classes. Since terminal constructs serve to elevate lexical tokens (regular expressions and keywords) to the dignity of syntactical construct, we must first take a look at how the EiffelParse library classes collaborate with their counterparts in the EiffelLex library.

The notion of lexical interface class

To parse a document, you need to get tokens from a lexical analyzer. This is achieved by making some construct classes, in particular those describing terminals, descendants of one of the lexical classes.

The best technique is usually to write a class covering the lexical needs of the language at hand, from which all construct classes that have some lexical business will inherit. Such a class is called a lexical interface class.

Lexical interface classes usually follow a common pattern. To take advantage of this uniformity, the EiffelParse library includes a deferred class L_INTERFACE which describes that pattern. Specific lexical interface classes may be written as descendants of L_INTERFACE.

L_INTERFACE is a simple deferred class, with a deferred procedure obtain_analyzer. It is an heir of METALEX.

Obtaining a lexical analyzer

An effective descendant of L_INTERFACE must define procedure obtain_analyzer so that it records into the lexical analyzer the regular expressions and keywords of the language at hand. In writing obtain_analyzer you may use any one of three different techniques, each of which may be the most convenient depending on the precise context, to obtain the required lexical analyzer:

  • You may build the lexical analyzer by defining its regular expressions one by one, using the procedures described in the presentation of METALEX, in particular put_expression and put_keyword.
  • You may use use procedure retrieve_analyzer from METALEX to retrieve an analyzer which a previous session saved into a file.
  • Finally, you may write a lexical grammar file (or reuse an existing one) and process it on the spot by using procedure read_grammar from METALEX.

A lexical interface class

An example of a lexical interface class is POLY_LEX for the Polynomial example language. Here is the complete text of that class:indexing description: "Lexical interface class for the Polynomial language" class POLY_LEX inherit L_INTERFACE CONSTANTS undefine consistent, copy, is_equal, setup end feature {NONE} obtain_analyzer -- Create lexical analyzer for the Polynomial language do ignore_case keywords_ignore_case build_expressions build_keywords set_separator_type (blanks) end build_expressions -- Define regular expressions -- for the Polynomial language do put_expression (special_expression, special, "Special") put_expression ("*('a'..'z')", simple_identifier, "Simple_identifier") put_expression ("+('0'..'9')", integer_constant, "Integer_constant") put_expression ("+('\t'|'\n'|' ')", blanks, "Blanks") end special_expression: STRING -- Regular expression describing Special once create Result.make (80) Result.append ("('\050'..'\057')|") Result.append ("('\072'..'\076')|") Result.append ("'['|']'|'|'|'{'|'}'|%"->%"|%":=%"") end build_keywords -- Define keywords (special symbols) -- for the Polynomial language do put_keyword ("+", special) put_keyword ("-", special) put_keyword (";", special) put_keyword (":", special) put_keyword ("(", special) put_keyword (")", special) put_keyword ("*", special) end end

This class illustrates the straightforward scheme for writing lexical interface classes. It introduces constants such as Special to represent the regular expressions supported, and effects procedure obtain_analyzer. The role of this procedure is to define lexical conventions (here through calls to ignore_case and keywords_ignore_case), to record the regular expressions (through calls to put_expression, packaged in a procedure build_expressions for clarity), and records the keywords (through calls to put_keyword, packaged in build_keywords).

All the classes of a document processor that need to interact with the lexical analysis should inherit from a lexical interface class such as POLY_LEX. This is true in particular of the root class of a processor, as discussed below.

More on terminal constructs

Terminal construct classes are examples of classes that need to interact with the lexical analysis, and should thus inherit from the lexical interface class.

Class TERMINAL includes a deferred function token_type of type INTEGER. Every effective descendant of TERMINAL should effect this feature as a constant attribute, whose value is the code for the associated regular expression, obtained from the lexical interface class. As every other construct class, such a descendant should also effect construct_name as a once function. For example, in the Polynomial language, class INT_CONSTANT has the following text:class INT_CONSTANT inherit TERMINAL CONSTANTS feature token_type: INTEGER do Result := integer_constant end feature {NONE} construct_name: STRING once Result := "INT_CONSTANT" end end

SPECIFYING THE SEMANTICS

As mentioned at the beginning of this chapter, parsing is usually done not for itself but as a way to perform some semantic processing. The EiffelParse Library classes define the general framework for grafting such semantics onto a syntactical stem.

Semantic procedures

The principal procedures for defining semantic actions are pre_action and post_action. These are features of class CONSTRUCT. Procedure pre_action describes the actions to be performed before a construct has been recognized; post_action, the actions to be performed after a construct has been recognized.

As defined in CONSTRUCT, both pre_action and post_action do nothing by default. Any construct class which is a descendant of CONSTRUCT may redefine one or both so that they will perform the semantic actions that the document processor must apply to specimens of the corresponding construct. These procedures are called automatically during processing, before and after the corresponding structures have been parsed.

For TERMINAL, only one semantic action makes sense. To avoid any confusion, post_action is renamed action in that class and pre_action is renamed unused_pre_action to indicate that it is irrelevant.

Often, the semantic procedures need to compute various elements of information. These may be recorded using appropriate attributes of the corresponding construct classes.

Note: Readers familiar with the theory of parsing and compiling will see that this scheme, using the attributes of Eiffel classes, provides a direct implementation of the "attribute grammar" mechanism.

Polynomial semantics

As an example let us examine the semantics of the Product construct for the polynomial language. It is a repetition construct, with Term as the base construct; in other words a specimen of Product is a sequence of one or more terms, representing the product term1 * term2 ... * termn. Here is the post_action procedure in the corresponding class PRODUCT:


post_action local int_value: INTEGER do if not no_components then from child_start if not child_after then int_value := 1 end until child_after loop child.post_action nt_value := int_value * info.child_value child_forth end info.set_child_value (int_value) end end

Here each relevant construct class has an attribute info used to record the semantic information associated with polynomials and their components, such as child_value, an INTEGER. The post_action takes care of computing the product of all child_values for the children. First, of course, post_action must recursively be applied to each child, to compute its own child_value.

Note: Recall that an instance of CONSTRUCT is also a node of the abstract syntax tree, so that all the TWO_WAY_TREE features such as child_value, child_start, child_after and many others are automatically available to access the syntactical structure.

Keeping syntax and semantics separate

For simple examples such as the Polynomial language, it is convenient to use a single class to describe both the syntax of a construct (through the production function and associated features) and its semantics (action routines and associated features).

For more ambitious languages and processors, however, it is often preferable to keep the two aspects separate. Such separation of syntax and semantics, and in particular the sharing of the same syntax for different processors with different semantic actions, is hard or impossible to obtain with traditional document processing tools such as Yacc on Unix. Here is how to achieve it with the EiffelParse library:

  • First write purely syntactic classes, that is to say construct classes which only effect the syntactical part (in particular function production). As a consequence, these classes usually remain deferred. The recommended convention for such syntactic classes is to use names beginning with S_, for example S_INSTRUCTION or S_LOOP.
  • Then for each construct for which a processor defines a certain semantics, define another class, called a semantic class, which inherits from the corresponding syntactic class. The recommended convention for semantic classes is to give them names which directly reflect the corresponding construct name, as in INSTRUCTION or LOOP.

To build a semantic class in in step 2 it is often convenient to use multiple inheritance from a syntactic class and a "semantics-only" class. For example in a processor for Eiffel class INSTRUCTION may inherit from both IS_INSTRUCTION and from a semantics-only class INSTRUCTION_PROPERTIES which introduces the required semantic features.

One of the advantages of this scheme is that it makes it easy to associate two or more types of processing with a single construct, by keeping the same syntactic class (such as IS_INSTRUCTION) but choosing a different pure-semantics class each time.

As noted earlier in this chapter, this is particularly useful in an environment where different processors need to perform differents actions on specimens of the same construct. In an Eiffel environment, for example, processors that manipulate classes and other Eiffel construct specimens may include a compiler, an interpreter, a flattener (producing the flat form), a class abstracter (producing the short or flat-short form), and various browsing tools such as those provided by Eiffel Software.

For obvious reasons of convenience and ease of maintenance, it is desirable to let these processors share the same syntactic descriptions. The method just described, relying on multiple inheritance, achieves this goal.

HOW PARSING WORKS

Classes AGGREGATE, CHOICE, TERMINAL and REPETITION are written in such a way that you do not need to take care of the parsing process. They make it possible to parse any language built according to the rules given - with one limitation, left recursion, discussed below. You can then concentrate on writing the interesting part - semantic processing.

To derive the maximum benefit from the EiffelParse library, however, it is useful to gain a little more insight into the way parsing works. Let us raise the veil just enough to see any remaining property that is relevant to the building of parsers and document processors.

The parsing technique

The EiffelParse library relies on a general approach known as recursive descent, meaning that various choices will be tried in sequence and recursively to recognize a certain specimen.

If a choice is attempted and fails (because it encounters input that does not conform to what is expected), the algorithm will try remaining choices, after having moved the input cursor back to where it was before the choice that failed. This process is called backtracking. It is handled by the parsing algorithms in an entirely automatic fashion, without programmer intervention.

Left recursion

Recursive descent implies the danger of infinite looping when parsing is attempted for left-recursive productions of the form
A [=] A ...
or, more generally, cases in which the left recursion is indirect, as in
A [=] B ... B [=] C ... ... L [=] A ...

Direct left recursion is easy to avoid, but indirect recursion may sneak in in less trivial ways.

To determine whether the production for a construct is directly or indirectly left-recursive, use the query left_recursion from class CONSTRUCT.

Backtracking and the commit procedure

Another potential problem may arise from too much backtracking. In contrast with left recursion, this is a performance issue, not a threat to the correctness of the parsing algorithm. Automatic backtracking is in fact essential to the generality and flexibility of the recursive descent parsing algorithm; but too much of it may degrade the efficiency of the parsing mechanism.

Two techniques are available to minimize backtracking. One, mentioned above, is to organize the production functions for choice construct classes so that they list the most frequent cases first. The other is to use the commit procedure in the production functions for aggregate constructs.

A call to commit in an aggregate A is a hint to the parser, which means:

If you get to this point in trying to recognize a specimen of A as one among several possible choices for a choice construct C, and you later fail to obtain an A, then forget about other choices for C: you won't be able to find a C here. You may go back to the next higher-level choice before C - or admit failure if there is no such choice left.

Such a hint is useful when you want to let the parser benefit from some higher-level knowledge about the grammar, which is not directly deducible from the way the productions have been written.

Here is an example. The production function for NESTED in the Polynomial language, which attempts to parse specimens of the form
(s)where s is a specimen of SUM, is written as production: LINKED_LIST [CONSTRUCT] local expression: SUM once create Result.make Result.forth keyword ("(") commit create expression.make put (expression) keyword (")") end

The commit after the recognition of the keyword "(" is there to use the following piece of higher-level knowledge:

No choice production of the grammar that has NESTED as one of its alternatives has another alternative construct whose specimens could begin with an opening parenthesis "(".

Because of this property, if the parser goes so far as to recognize an opening parenthesis as part of parsing any construct C for which NESTED is an alternative, but further tokens do not match the structure of NESTED specimens, then we will have failed to recognize not only a NESTED but also a C.

Note: Some readers will have recognized commit as being close to the Prolog "cut" mechanism.

In this example, NESTED is used in only one right-hand side production: the choice production for TERM, for which the other alternatives are SIMPLE_VAR and POLY_INTEGER, none of whose specimens can include an opening parenthesis.

The use of commit assumes global knowledge about the grammar and its future extensions, which is somewhat at odds with the evolutionary approach suggested by the Eiffel method. Applied improperly, this mechanism could lead to the rejection of valid texts as invalid. Used with care, however, it helps in obtaining high-performance parsing without impairing too much the simplicity of preparing parsers and other document processors.

BUILDING A DOCUMENT PROCESSOR

We are ready now to put together the various elements required to build a document processor based on the EiffelParse library.

The overall picture

The documents to be processed will be specimens of a certain construct. This construct is called the top construct for that particular processing.

Caution: Be sure to note that with the EiffelParse library there is no room for a concept of top construct of a grammar: the top construct is only defined with respect to a particular processor for that grammar.
Attempting to define the top of a grammar would be contrary to the object-oriented approach, which de-emphasizes any notion of top component of a system.
Different processors for the same grammar may use different top constructs.

A document processor will be a particular system made of construct classes, complemented by semantic classes, and usually by other auxiliary classes. One of the construct classes corresponds to the top construct and is called the top construct class.

Note: This notion of top construct class has a natural connection to the notion of root class of a system, as needed to get executable software. The top construct class could indeed be used as root of the processor system. In line with the previous discussion, however, it appears preferable to keep the top construct class (which only depends on the syntax and remains independent of any particular processor) separate from the system's root class. With this approach the root class will often be a descendant of the top construct class.
This policy was adopted for the Polynomial language example as it appears in the delivery: the processor defined for this example uses LINE as the top construct class; the root of the processor system is a class PROCESS, which inherits from LINE.

Steps in the execution of a document processor

As any root class of a system, the root of a document processor must have a creation procedure which starts the execution. Here the task of this class is the following:

  1. Define an object representing the input document to be processed; this will be an instance of class INPUT.
  2. Obtain a lexical analyzer applicable to the language, and connect it with the document.
  3. Select an input file, containing the actual document to be processed.
  4. Process the document: in other words, parse it and, if parsing is successful, apply the semantics.
  5. To execute these steps in a simple and convenient fashion, it is useful to declare the root class as a descendant of the lexical interface class. The root class, being an heir to the top construct class, will also be a descendant of CONSTRUCT.

Connecting with lexical analysis

To achieve the effect of steps 1 and 2 , a simple call instruction suffices: just call the procedure build, inherited from L_INTERFACE using as argument document, a feature of type INPUT, obtained from METALEX (the lexical analyzer generator class) through L_INTERFACE. The call, then, is just: build (document)

Although you may use this line as a recipe with no need for further justification, it is interesting to see what build does. Feature document describes the input document to be processed; it is introduced as a Once function in class CONSTRUCT to ensure that all instances of CONSTRUCT share a single document - in other words, that all processing actions apply to the same document. The text of build is: build (doc: INPUT) -- Create lexical analyzer and set doc -- to be the input document. require document_exists: doc /= void do metalex_make obtain_analyzer make_analyzer doc.set_lexical (analyzer) end

The call to obtain_analyzer defines the regular grammar for the language at hand. Recall that obtain_analyzer is deferred in L_INTERFACE; its definition for the POLY_LEX example was given above. The call to make_analyzer freezes the regular grammar and produces a usable lexical analyzer, available through the attribute analyzer obtained from METALEX. Finally, the call to set_lexical, a procedure of class INPUT, ensures that all lexical analysis operations will use analyzer as the lexical analyzer.

Starting the actual processing

The call build ( document takes care of steps 1 and 2 of the root's creation procedure. Step 3 selects the file containing the input document; this is achieved through the call
document.set_input_file (some_file_name)
where set_input_file, from class INPUT, has a self-explanatory effect.

Finally, step 4 (processing the document) is simply a call to procedure process, obtained from CONSTRUCT . Recall that this procedure simply executes
parse if parsed then semantics end

The structure of a full example

The polynomial example provides a simple example of a full document processor, which you may use as a guide for your own processors. The root class of that example is PROCESS. Its creation procedure, make, follows the above scheme precisely; here is its general form: root_line: LINE make local text_name: STRING do create root_line.make build (root_line.document) ... Instructions prompting the user for the name of the file to be parsed, and assigning it to text_name ... root_line.document.set_input_file (text_name) root_line.process end

Although it covers a small language, this example may serve as a blueprint for most applications of the EiffelParse library.

FUTURE WORK

It was mentioned at the beginning of this chapter that further work is desirable to make the EiffelParse library reach its full bloom. Here is a glimpse of future improvements.

Expressions

Many languages include an expression construct having the properties of traditional arithmetic expressions:

  • An expression is a succession of basic operands and operators.
  • The basic operands are lexical elements, such as identifiers and constants.
  • Operators are used in prefix mode (as in - a) or infix mode (as in b - a).
  • Each operator has a precedence level; precedence levels determine the abstract syntactic structure of expressions and consequently their semantics. For example the abstract structure of a + b * c shows this expression to be the application of the operator + to a and to the application of the operator * to b and c. That this is the correct interpretation of the instruction follows from the property that * has a higher precedence ("binds more tightly") than +.
  • Parentheses pairs, such as ( ) or [ ], can be used to enforce a structure different from what the precedence rules would imply, as in (a + b) * c.
  • Some infix operators may be applied to more than two arguments; in this case it must be clear whether they are right-associative (in other words, a ^ b ^ c means a ^ (b ^ c), the conventional interpretation if ^ denotes the power operator) or left-associative.

It is of course possible to apply the EiffelParse library in its current state to support expressions, as illustrated by this extract from the Polynomial grammar given in full above:Variables [=] {Identifier ";" ...} Sum [=] {Diff "+" ...} Diff [=] {Product "-" ...} Product [=] {Term "*" ...}

The problem then is not expressiveness but efficiency. For such expressions the recursive descent technique, however well adapted to the higher-level structures of a language, takes too much time and generates too many tree nodes. Efficient bottom-up parsing techniques are available for this case.

The solution is straightforward: write a new heir EXPRESSION to class CONSTRUCT. The preceding discussion of expressions and their properties suggests what kinds of feature this class will offer: define a certain terminal as operator, define a terminal as operand type, set the precedence of an operator, set an operator as left-associative or right-associative and so on. Writing this class based on this discussion is indeed a relatively straightforward task, which can be used as a programming exercise.

Beyond the addition of an EXPRESSION class, some changes in the data structures used by EiffelParse may also help improve the efficiency of the parsing process.

Yooc

To describe the syntax of a language, it is convenient to use a textual format such as the one that has served in this chapter to illustrate the various forms of production. The correspondence between such a format and the construct classes is straightforward; for example, as explained above, the production
Line [=] Variables ":" Sum
will yield the classclass LINE inherit AGGREGATE feature production: LINKED_LIST [CONSTRUCT] local var: VARIABLES sum: SUM once create Result.make Result.forth create var.make put (var) keyword (":") create sum.make put (sum) end ... end

This transformation of the textual description of the grammar into its equivalent Eiffel form is simple and unambiguous; but it is somewhat annoying to have to perform it manually.

A tool complementing the EiffelParse library and known as YOOC ("Yes! an Object-Oriented Compiler", a name meant as an homage to the venerable Yacc) has been planned for future releases of EiffelParse. Yooc, a translator, will take a grammar specification as input and transform it into a set of parsing classes, all descendants of CONSTRUCT and built according to the rules defined above. The input format for syntax specification, similar to the conventions used throughout this chapter, is a variant of LDL (Language Description Language), a component of the ArchiText structural document processing system.

Further reading

The following article describes some advanced uses of the EiffelParse library as well as a Yooc-like translator called PG: Per Grape and Kim Walden: Automating the Development of Syntax Tree Generators for an Evolving Language, in Proceedings of TOOLS 8 (Technology of Object-Oriented Languages and Systems), Prentice Hall, 1992, pages 185-195.