lex(1)                           User Commands                          lex(1)



NAME
       lex - generate programs for lexical tasks

SYNOPSIS
       lex [-cntv] [-e | -w] [-V -Q [y | n]] [file...]

DESCRIPTION
       The  lex  utility generates C programs to be used in lexical processing
       of character input, and that can be used as an interface to yacc. The C
       programs  are  generated  from lex source code and conform to the ISO C
       standard. Usually, the lex utility writes the program it  generates  to
       the  file  lex.yy.c. The state of this file is unspecified if lex exits
       with a non-zero exit status. See EXTENDED DESCRIPTION  for  a  complete
       description of the lex input language.

OPTIONS
       The following options are supported:

       -c              Indicates C-language action (default option).



       -e              Generates  a  program  that  can  handle EUC characters
                       (cannot be used with the -w  option).  yytext[]  is  of
                       type unsigned char[].



       -n              Suppresses  the  summary  of statistics usually written
                       with the -v option. If no table sizes are specified  in
                       the lex source code and the -v option is not specified,
                       then -n is implied.



       -t              Writes the resulting program to standard output instead
                       of lex.yy.c.



       -v              Writes  a  summary  of  lex  statistics to the standard
                       error. (See the discussion of lex table sizes under the
                       heading  Definitions in lex.) If table sizes are speci-
                       fied in the lex source code, and if the  -n  option  is
                       not specified, the -v option may be enabled.



       -w              Generates  a  program  that  can  handle EUC characters
                       (cannot be used with the  -e  option).  Unlike  the  -e
                       option, yytext[] is of type wchar_t[].



       -V              Prints out version information on standard error.



       -Q[y|n]         Prints  out version information to output file lex.yy.c
                       by using -Qy. The -Qn option does not print out version
                       information and is the default.



OPERANDS
       The following operand is supported:

       file     A  pathname  of  an  input file. If more than one such file is
                specified, all files will be concatenated to produce a  single
                lex  program.  If no file operands are specified, or if a file
                operand is -, the standard input will be used.



OUTPUT
       The lex output files are described below.

   Stdout
       If the -t option is specified, the text file of C source code output of
       lex will be written to standard output.

   Stderr
       If the -t option is specified, informational, error, and warning messages
       concerning the contents of lex source code input will be written to the
       standard error.

       If the -t option is not specified:

       1.  Informational, error, and warning messages concerning the contents of
           lex source code input will be written to either the standard output
           or standard error.


       2.  If  the  -v option is specified and the -n option is not specified,
           lex statistics will also be written to standard error.  These  sta-
           tistics may also be generated if table sizes are specified with a %
           operator in the Definitions in lex section (see  EXTENDED  DESCRIP-
           TION), as long as the -n option is not specified.


   Output Files
       A text file containing C source code will be written to lex.yy.c, or to
       the standard output if the -t option is present.

EXTENDED DESCRIPTION
       Each input file contains lex source code, which is a table  of  regular
       expressions  with  corresponding actions in the form of C program frag-
       ments.

       When lex.yy.c is compiled and linked with the lex  library  (using  the
       -l l  operand  with  c89  or cc), the resulting program reads character
       input from the standard input and partitions it into strings that match
       the given expressions.

       When an expression is matched, these actions will occur:

         o  The input string that was matched is left in yytext as a null-ter-
            minated string; yytext is either an external character array or  a
            pointer to a character string. As explained in Definitions in lex,
            the type can be explicitly selected using the %array  or  %pointer
            declarations, but the default is %array.

         o  The  external  int  yyleng  is  set  to the length of the matching
            string.

         o  The expression's corresponding program  fragment,  or  action,  is
            executed.


       During  pattern matching, lex searches the set of patterns for the sin-
       gle longest possible match. Among rules that match the same  number  of
       characters, the rule given first will be chosen.

       The general format of lex source is:


       Definitions
       %%
       Rules
       %%
       User Subroutines



       The  first  %%  is required to mark the beginning of the rules (regular
       expressions and actions); the second %% is required only if  user  sub-
       routines follow.

       Any line in the Definitions in lex section beginning with a blank char-
       acter will be assumed to be a C program fragment and will be copied  to
       the  external definition area of the lex.yy.c file. Similarly, anything
       in the Definitions in lex section included between delimiter lines con-
       taining  only  %{  and %} will also be copied unchanged to the external
       definition area of the lex.yy.c file.

       Any such input (beginning with a blank character or within  %{  and  %}
       delimiter lines) appearing at the beginning of the Rules section before
       any rules are specified will be written to lex.yy.c after the  declara-
       tions  of variables for the yylex function and before the first line of
       code in yylex. Thus, user variables local  to  yylex  can  be  declared
       here, as well as application code to execute upon entry to yylex.

       The  action  taken  by lex when encountering any input beginning with a
       blank character or within %{ and %} delimiter lines  appearing  in  the
       Rules  section  but  coming  after  one or more rules is undefined. The
       presence of such input may result in an  erroneous  definition  of  the
       yylex function.
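
       As a minimal sketch of this layout, the following source places one
       fragment in the external definition area and another at the top of
       yylex. The variable name len and the line-length task are illustrative
       assumptions, not part of this manual. The sketch defines its own yywrap
       and main so that it does not depend on the lex library.

```lex
%{
/* Definitions section: copied to the external definition area. */
#include <stdio.h>
%}
%%
%{
    /* Start of the Rules section: becomes local code at the top of yylex. */
    long len = 0;       /* hypothetical per-line character counter */
%}
\n      { printf("%ld\n", len); len = 0; }
.       { len++; }
%%
/* User Subroutines section: copied to lex.yy.c after yylex. */
int yywrap(void) { return 1; }

int main(void)
{
    yylex();
    return 0;
}
```

       Because no action returns, yylex is entered only once here, so len
       keeps its value from one line of input to the next.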

   Definitions in lex
       Definitions  in  lex  appear before the first %% delimiter. Any line in
       this section not contained between %{ and %} lines  and  not  beginning
       with  a blank character is assumed to define a lex substitution string.
       The format of these lines is:

       name   substitute

       If a name does not meet the requirements for identifiers in the  ISO  C
       standard,  the  result is undefined. The string substitute will replace
       the string { name } when it is used in a rule. The name string is  rec-
       ognized  in  this context only when the braces are provided and when it
       does not appear within a bracket expression or within double-quotes.

       In the Definitions in lex section, any line beginning with a % (percent
       sign)  character  and  followed  by an alphanumeric word beginning with
       either s or S defines a set of start  conditions.  Any  line  beginning
       with  a % followed by a word beginning with either x or X defines a set
       of exclusive start conditions. When the generated scanner is  in  a  %s
       state,  patterns  with  no state specified will be also active; in a %x
       state, such patterns will not be active. The rest of  the  line,  after
       the  first  word, is considered to be one or more blank-character-sepa-
       rated names of start conditions. Start condition names are  constructed
       in  the  same  way as definition names. Start conditions can be used to
       restrict the matching of regular expressions to one or more  states  as
       described in Regular expressions in lex.
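
       The following sketch combines a substitution string with an exclusive
       start condition. The names DIGIT and COMMENT are illustrative
       assumptions. While the scanner is in the COMMENT state, only the rules
       prefixed with <COMMENT> are active, so numbers inside a comment are not
       reported.

```lex
%{
#include <stdio.h>
%}
DIGIT   [0-9]
%x COMMENT
%%
"/*"                BEGIN COMMENT;      /* enter the exclusive state */
<COMMENT>"*/"       BEGIN INITIAL;      /* leave it again */
<COMMENT>.|\n       ;                   /* discard the comment body */
{DIGIT}+            printf("number: %s\n", yytext);
.|\n                ;
%%
int yywrap(void) { return 1; }
int main(void) { yylex(); return 0; }
```

       Had COMMENT been declared with %s instead of %x, the unprefixed
       {DIGIT}+ rule would also stay active inside comments.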

       Implementations  accept  either of the following two mutually exclusive
       declarations in the Definitions in lex section:

       %array          Declare the type of  yytext  to  be  a  null-terminated
                       character array.



       %pointer        Declare  the  type of yytext to be a pointer to a null-
                       terminated character string.



       Note: When using the %pointer option, you may not also use  the  yyless
       function to alter yytext.

       %array  is  the  default. If %array is specified (or neither %array nor
       %pointer is specified), then the correct way to make an external refer-
       ence to yytext is with a declaration of the form:

              extern char yytext[];


       If %pointer is specified, then the correct external reference is of the
       form:

              extern char *yytext;


       lex will accept declarations in the Definitions in lex section for set-
       ting  certain  internal  table sizes. The declarations are shown in the
       following table.

              Table Size Declaration in lex

       Declaration   Description                          Default
       %p n          Number of positions                  2500
       %n n          Number of states                     500
       %a n          Number of transitions                2000
       %e n          Number of parse tree nodes           1000
       %k n          Number of packed character classes   10000
       %o n          Size of the output array             3000


       Programs generated by lex need either the -e or  -w  option  to  handle
       input that contains EUC characters from supplementary codesets. If nei-
       ther of these options is specified, yytext is of the type  char[],  and
       the generated program can handle only ASCII characters.

       When  the  -e option is used, yytext is of the type unsigned char[] and
       yyleng gives the total number of bytes in the matched string. With this
       option,  the  macros input(), unput(c), and output(c) should do a byte-
       based I/O in the same way as with the regular ASCII lex. Two more vari-
       ables  are  available  with  the  -e option, yywtext and yywleng, which
       behave the same as yytext and yyleng would under the -w option.

       When the -w option is used, yytext is of the type wchar_t[] and  yyleng
       gives  the  total  number  of characters in the matched string.  If you
       supply your own  input(),  unput(c),  or  output(c)  macros  with  this
       option,  they  must return or accept EUC characters in the form of wide
       character (wchar_t). This allows a  different  interface  between  your
       program and the lex internals, to expedite some programs.

   Rules in lex
       The Rules in lex source files are a table in which the left column con-
       tains regular expressions and the right column contains actions (C pro-
       gram fragments) to be executed when the expressions are recognized.

       ERE action
       ERE action
       ...


       The  extended  regular  expression (ERE) portion of a row will be sepa-
       rated from action by one or more blank characters. A regular expression
       containing  blank  characters  is recognized under one of the following
       conditions:

         o  The entire expression appears within double-quotes.

         o  The blank characters appear within double-quotes or square  brack-
            ets.

         o  Each blank character is preceded by a backslash character.
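
       The three forms above can be sketched as follows. The patterns and
       messages are illustrative only.

```lex
%{
#include <stdio.h>
%}
%%
"hello world"   printf("quoted\n");       /* entire expression quoted */
see[ ]you       printf("bracketed\n");    /* blank inside square brackets */
good\ bye       printf("escaped\n");      /* blank preceded by a backslash */
.|\n            ;
%%
int yywrap(void) { return 1; }
int main(void) { yylex(); return 0; }
```

       In each rule the first unprotected blank ends the expression and the
       action begins after it.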


   User Subroutines in lex
       Anything  in  the  user  subroutines section will be copied to lex.yy.c
       following yylex.

   Regular Expressions in lex
       The lex utility supports the set of Extended Regular Expressions (EREs)
       described  on  regex(5)  with the following additions and exceptions to
       the syntax:

       "..."

           Any string enclosed in double-quotes will represent the  characters
           within  the  double-quotes  as  themselves,  except  that backslash
           escapes (which appear in the following table) are  recognized.  Any
           backslash-escape  sequence  is terminated by the closing quote. For
           example, "\01""1" represents a single string:  the  octal  value  1
           followed by the character 1.



       <state>r

       <state1, state2, ...>r

           The  regular  expression r will be matched only when the program is
           in one of the start conditions indicated by state, state1,  and  so
           forth. For more information, see Actions in lex. As an exception to
           the typographical conventions of the rest of this document, in this
           case  <state>  does  not  represent a metavariable, but the literal
           angle-bracket characters surrounding a symbol. The start  condition
           is  recognized  as  such only at the beginning of a regular expres-
           sion.



       r/x

           The regular expression r will be matched only if it is followed  by
           an occurrence of regular expression x. The token returned in yytext
           will only match r. If the trailing portion of r matches the  begin-
           ning  of  x,  the  result  is  unspecified. The r expression cannot
           include further trailing context or the $ (match-end-of-line) oper-
           ator;  x  cannot  include the ^ (match-beginning-of-line) operator,
           nor trailing context, nor the $ operator. That is, only one  occur-
           rence  of  trailing context is allowed in a lex regular expression,
           and the ^ operator only can be used at the  beginning  of  such  an
           expression.   A  further  restriction  is that the trailing-context
           operator / (slash) cannot be grouped within parentheses.
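
       A short sketch of trailing context, using an illustrative "px" suffix
       that is not taken from this manual. The end of r ([0-9]+) cannot match
       the beginning of x ("px"), so the result is well specified.

```lex
%{
#include <stdio.h>
%}
%%
[0-9]+/"px"     printf("pixels: %s\n", yytext);   /* yytext holds digits only */
[0-9]+          printf("number: %s\n", yytext);
[a-z]+          printf("word: %s\n", yytext);
.|\n            ;
%%
int yywrap(void) { return 1; }
int main(void) { yylex(); return 0; }
```

       For the input 12px, the first rule is selected with yytext holding 12;
       the characters px remain in the input and are scanned next.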



       {name}

           When name is one of the substitution symbols from  the  Definitions
           section,  the  string,  including  the  enclosing  braces,  will be
           replaced by the substitute value.  The  substitute  value  will  be
           treated  in  the extended regular expression as if it were enclosed
           in parentheses. No substitution will occur if {name} occurs  within
           a bracket expression or within double-quotes.



       Within an ERE, a backslash character is considered to begin an escape
       sequence (\\, \a, \b, \f, \n, \r, \t, \v). In addition, the escape
       sequences in the following table will be recognized.

       A  literal  newline  character  cannot  occur within an ERE; the escape
       sequence \n can be used to represent a  newline  character.  A  newline
       character cannot be matched by a period operator.

       Escape Sequences in lex

       \digits         Description: A backslash character followed by the
                       longest sequence of one, two or three octal-digit
                       characters (01234567). If all of the digits are 0
                       (that is, representation of the NUL character), the
                       behavior is undefined.
                       Meaning: The character whose encoding is represented
                       by the one-, two- or three-digit octal integer.
                       Multi-byte characters require multiple, concatenated
                       escape sequences of this type, including the leading
                       \ for each byte.

       \xdigits        Description: A backslash character followed by the
                       longest sequence of hexadecimal-digit characters
                       (01234567abcdefABCDEF). If all of the digits are 0
                       (that is, representation of the NUL character), the
                       behavior is undefined.
                       Meaning: The character whose encoding is represented
                       by the hexadecimal integer.

       \c              Description: A backslash character followed by any
                       character not described in this table (or among
                       \\, \a, \b, \f, \n, \r, \t, \v).
                       Meaning: The character c, unchanged.


       The order of precedence given to extended regular expressions  for  lex
       is as shown in the following table, from high to low.

       Note:    The  escaped characters entry is not meant to imply that these
                are operators, but they are included  in  the  table  to  show
                their  relationships  to  the true operators. The start condi-
                tion, trailing context and anchoring notations have been omit-
                ted  from  the  table  because  of  the placement restrictions
                described in this section; they can only appear at the  begin-
                ning or ending of an ERE.




       ERE Precedence in lex

       collation-related bracket symbols   [= =]  [: :]  [. .]
       escaped characters                  \<special character>
       bracket expression                  [ ]
       quoting                             "..."
       grouping                            ( )
       definition                          {name}
       single-character RE duplication     * + ?
       concatenation
       interval expression                 {m,n}
       alternation                         |


       The  ERE anchoring operators (^ and $) do not appear in the table. With
       lex regular expressions, these operators are restricted in  their  use:
       the  ^  operator can only be used at the beginning of an entire regular
       expression, and the $ operator only at the end. The operators apply  to
       the   entire   regular  expression.  Thus,  for  example,  the  pattern
       (^abc)|(def$) is undefined; it can instead be written as  two  separate
       rules,  one  with  the regular expression ^abc and one with def$, which
       share a common action via the special | action (see below). If the pat-
       tern  were  written ^abc|def$, it would match either of abc or def on a
       line by itself.
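
       The two-rule form described above might be sketched like this. The
       message text is illustrative.

```lex
%{
#include <stdio.h>
%}
%%
^abc    |
def$    printf("anchored match: %s\n", yytext);
.|\n    ;
%%
int yywrap(void) { return 1; }
int main(void) { yylex(); return 0; }
```

       The | action on the first rule makes it share the action of the rule
       that follows, so each anchor applies to its own complete expression.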

       Unlike the general ERE rules, embedded anchoring is not allowed by most
       historical  lex implementations. An example of embedded anchoring would
       be for patterns such as (^)foo($) to match foo when it exists as a com-
       plete  word. This functionality can be obtained using existing lex fea-
       tures:

       ^foo/[ \n]|
       " foo"/[ \n]    /* found foo as a separate word */


       Notice also that $ is a form of trailing context (it is  equivalent  to
       /\n) and as such cannot be used with regular expressions containing
       another instance of the operator (see the preceding discussion of
       trailing context).

       The  additional regular expressions trailing-context operator / (slash)
       can be used as an ordinary character if presented within double-quotes,
       "/";  preceded by a backslash, \/; or within a bracket expression, [/].
       The start-condition < and > operators are special only in a start
       condition at the beginning of a regular expression; elsewhere in the
       regular expression they are treated as ordinary characters.

       The following examples clarify  the  differences  between  lex  regular
       expressions  and  regular expressions appearing elsewhere in this docu-
       ment. For regular expressions of the form r/x, the string matching r is
       always  returned;  confusion  may arise when the beginning of x matches
       the trailing portion of r. For example, given  the  regular  expression
       a*b/cc  and  the  input aaabcc, yytext would contain the string aaab on
       this match. But given the regular expression x*/xy and the input  xxxy,
       the  token xxx, not xx, is returned by some implementations because xxx
       matches x*.

       In the rule ab*/bc, the b* at the end of r will extend r's  match  into
       the beginning of the trailing context, so the result is unspecified. If
       this rule were ab/bc, however, the rule matches the text ab when it  is
       followed  by the text bc. In this latter case, the matching of r cannot
       extend into the beginning of x, so the result is specified.

   Actions in lex
       The action to be taken when an ERE is matched can be a C program  frag-
       ment  or  the special actions described below; the program fragment can
       contain one or more C statements, and can also include special actions.
       The  empty  C statement ; is a valid action; any string in the lex.yy.c
       input that matches the pattern portion of such a  rule  is  effectively
       ignored or skipped. However, the absence of an action is not valid, and
       the action lex takes in such a condition is undefined.

       The specification for an action, including  C  statements  and  special
       actions, can extend across several lines if enclosed in braces:

       ERE <one or more blanks> { program statement
       program statement }

       The  default action when a string in the input to a lex.yy.c program is
       not matched by any expression is to copy  the  string  to  the  output.
       Because  the  default behavior of a program generated by lex is to read
       the input and copy it to the output, a minimal lex source program  that
       has  just  %% generates a C program that simply copies the input to the
       output unchanged.
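
       As a hedged illustration of the default action combined with the empty
       statement, the one-rule source below (a common idiom, not taken from
       this manual) discards blanks and tabs at the ends of lines and relies
       on the default action to copy everything else. It assumes that main
       and yywrap are supplied by the lex library.

```lex
%%
[ \t]+$     ;
```

       The second %% is omitted because no user subroutines follow.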

       Four special actions are available:

       |       ECHO;      REJECT;      BEGIN


       |               The action | means that the action for the next rule is
                       the  action  for  this  rule.  Unlike  the  other three
                       actions, | cannot be enclosed in  braces  or  be  semi-
                       colon-terminated.  It  must be specified alone, with no
                       other actions.



       ECHO;           Writes the contents of the string yytext on the output.



       REJECT;         Usually only a single expression is matched by a  given
                       string in the input. REJECT means "continue to the next
                       expression that matches the current input," and  causes
                       whatever  rule  was the second choice after the current
                       rule to be executed for the same input. Thus,  multiple
                       rules  can be matched and executed for one input string
                       or overlapping input strings. For  example,  given  the
                       regular  expressions xyz and xy and the input xyz, usu-
                       ally only the regular expression xyz would  match.  The
                       next  attempted  match would start after z. If the last
                       action in the xyz rule is REJECT, both this  rule  and
                       the xy rule would be executed. The REJECT action may be
                       implemented in such a fashion that flow of control does
                       not  continue  after  it, as if it were equivalent to a
                       goto to another part of yylex. The use  of  REJECT  may
                       result in somewhat larger and slower scanners.



       BEGIN           The action:


                       BEGIN newstate;


                       switches  the  state  (start condition) to newstate. If
                       the string newstate has not been declared previously as
                       a  start  condition  in the Definitions in lex section,
                       the results are unspecified. The initial state is indi-
                       cated by the digit 0 or the token INITIAL.
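
       A sketch of REJECT, assuming the task is to count every occurrence of
       he, including the one inside she. The counter names s and h are
       illustrative.

```lex
%{
#include <stdio.h>
static int s = 0;       /* occurrences of "she" */
static int h = 0;       /* occurrences of "he" */
%}
%%
she     { s++; REJECT; }    /* count it, then fall to the next-best rule */
he      { h++; }
.|\n    ;
%%
int yywrap(void) { return 1; }

int main(void)
{
    yylex();
    printf("she=%d he=%d\n", s, h);
    return 0;
}
```

       Without REJECT, the she rule would consume all three characters and
       the embedded he would never be seen by the second rule.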



       The  functions  or  macros  described below are accessible to user code
       included in the lex input. It is unspecified whether they appear in the
       C  code  output of lex, or are accessible only through the -l l operand
       to c89 or cc (the lex library).

       int yylex(void)         Performs lexical analysis on the input; this is
                               the primary function generated by the lex util-
                               ity. The function returns zero when the end  of
                               input is reached; otherwise it returns non-zero
                               values (tokens) determined by the actions  that
                               are selected.



       int yymore(void)        When called, indicates that when the next input
                               string is recognized, it is to be  appended  to
                               the current value of yytext rather than replac-
                               ing it; the value in yyleng is adjusted accord-
                               ingly.



       int yyless(int n)       Retains n initial characters in yytext, NUL-
                               terminated, and treats the remaining characters
                               as if they had not been read; the value in
                               yyleng is adjusted accordingly.



       int input(void)         Returns the next character from the  input,  or
                               zero  on end-of-file. It obtains input from the
                               stream pointer yyin, although possibly  via  an
                               intermediate  buffer.  Thus,  once scanning has
                               begun, the effect of altering the value of yyin
                               is  undefined.  The  character  read is removed
                               from the input stream of  the  scanner  without
                               any processing by the scanner.



       int unput(int c)        Returns  the  character  c to the input; yytext
                               and yyleng are undefined until the next expres-
                               sion  is matched. The result of using unput for
                               more characters than have been input is unspec-
                               ified.
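
       A minimal sketch of yyless under the default %array declaration. The
       patterns and messages are illustrative assumptions. The first rule
       matches a name together with the following parenthesis, then gives the
       parenthesis back so that it is scanned again.

```lex
%{
#include <stdio.h>
%}
%%
[a-z]+"("   {
                yyless(yyleng - 1);     /* keep the name, return the "(" */
                printf("function name: %s\n", yytext);
            }
"("         printf("open paren\n");
.|\n        ;
%%
int yywrap(void) { return 1; }
int main(void) { yylex(); return 0; }
```

       After the call, yytext holds only the name and yyleng has been reduced
       by one; the returned character is then matched by the second rule.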



       The  following  functions  appear  only  in  the lex library accessible
       through the -l l operand; they can therefore be redefined by a portable
       application:

       int yywrap(void)

           Called  by  yylex  at  end-of-file;  the default yywrap always will
           return 1. If the application requires yylex to continue  processing
           with  another  source  of input, then the application can include a
           function yywrap, which associates another file  with  the  external
           variable FILE *yyin and will return a value of zero.
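
       The following sketch shows one way an application might supply its own
       yywrap to scan several files in turn. The variables files, nfiles, and
       nlines are hypothetical, and the line-counting rules are only a
       stand-in for real work.

```lex
%{
#include <stdio.h>
static char **files;        /* hypothetical: remaining file operands */
static int nfiles;          /* hypothetical: how many remain */
static long nlines = 0;
%}
%%
\n      nlines++;
.       ;
%%
/* Replaces the library yywrap: switch yyin to the next file, if any. */
int yywrap(void)
{
    if (yyin != NULL && yyin != stdin)
        fclose(yyin);
    while (nfiles > 0) {
        yyin = fopen(*files++, "r");
        nfiles--;
        if (yyin != NULL)
            return 0;       /* more input: yylex continues scanning */
    }
    return 1;               /* no more input: yylex returns zero */
}

int main(int argc, char *argv[])
{
    files = argv + 1;
    nfiles = argc - 1;
    if (nfiles > 0 && yywrap() != 0)
        return 1;           /* none of the named files could be opened */
    yylex();
    printf("%ld\n", nlines);
    return 0;
}
```

       With no file operands the scanner reads standard input, and the
       replacement yywrap simply returns 1 at end-of-file.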



       int main(int argc, char *argv[])

           Calls  yylex to perform lexical analysis, then exits. The user code
           can contain main to perform application-specific operations,  call-
           ing yylex as applicable.



       The  reason  for  breaking  these functions into two lists is that only
       those functions in libl.a can  be  reliably  redefined  by  a  portable
       application.

       Except  for input, unput and main, all external and static names gener-
       ated by lex begin with the prefix yy or YY.

USAGE
       Portable applications are warned that in the Rules in lex  section,  an
       ERE  without  an  action is not acceptable, but need not be detected as
       erroneous by lex. This may result in compilation or run-time errors.

       The purpose of input is to take characters off  the  input  stream  and
       discard  them as far as the lexical analysis is concerned. A common use
       is to discard the body of a comment once the beginning of a comment  is
       recognized.
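
       A sketch of that use of input, assuming C-style comment delimiters.
       This manual states that input returns zero at end-of-file; the extra
       comparison with EOF is a defensive assumption for implementations that
       behave differently.

```lex
%{
#include <stdio.h>
%}
%%
"/*"    {
            int c, prev = 0;

            /* Read and discard characters up to the closing delimiter. */
            while ((c = input()) != 0 && c != EOF) {
                if (prev == '*' && c == '/')
                    break;
                prev = c;
            }
        }
%%
int yywrap(void) { return 1; }
int main(void) { yylex(); return 0; }
```

       Text outside comments is matched by no rule and is therefore copied to
       the output by the default action.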

       The lex utility is not fully internationalized in its treatment of reg-
       ular expressions in the lex source code or generated lexical  analyzer.
       It would seem desirable to have the lexical analyzer interpret the reg-
       ular expressions given in the lex source according to  the  environment
       specified when the lexical analyzer is executed, but this is not possi-
       ble with the current lex technology. Furthermore, the  very  nature  of
       the lexical analyzers produced by lex must be closely tied to the lexi-
       cal requirements of the input language being described, which will fre-
       quently  be  locale-specific  anyway. (For example, writing an analyzer
       that is used for French text will not automatically be useful for  pro-
       cessing other languages.)

EXAMPLES
       Example 1: Using lex

       The following is an example of a lex program that implements a rudimen-
       tary scanner for a Pascal-like syntax:

       %{
       /* need this for the calls to atoi() and atof() below */
       #include <stdlib.h>
       /* need this for printf(), fopen() and stdin below */
       #include <stdio.h>
       %}

       DIGIT    [0-9]
       ID       [a-z][a-z0-9]*
       %%

       {DIGIT}+  {
                                  printf("An integer: %s (%d)\n", yytext,
                                  atoi(yytext));
                                  }

       {DIGIT}+"."{DIGIT}*        {
                                  printf("A float: %s (%g)\n", yytext,
                                  atof(yytext));
                                  }

       if|then|begin|end|procedure|function        {
                                  printf("A keyword: %s\n", yytext);
                                  }

       {ID}                       printf("An identifier: %s\n", yytext);

       "+"|"-"|"*"|"/"            printf("An operator: %s\n", yytext);

       "{"[^}\n]*"}"              ;  /* eat up one-line comments */

       [ \t\n]+                   ;  /* eat up white space */

       .                          printf("Unrecognized character: %s\n", yytext);

       %%

       int main(int argc, char *argv[])
       {
               ++argv, --argc;  /* skip over program name */
               if (argc > 0) {
                       yyin = fopen(argv[0], "r");
                       if (yyin == NULL) {
                               perror(argv[0]);
                               return 1;
                       }
               } else {
                       yyin = stdin;
               }

               yylex();
               return 0;
       }

ENVIRONMENT VARIABLES
       See environ(5) for descriptions of the following environment  variables
       that  affect  the execution of lex: LANG, LC_ALL, LC_COLLATE, LC_CTYPE,
       LC_MESSAGES, and NLSPATH.

EXIT STATUS
       The following exit values are returned:

       0        Successful completion.



       >0       An error occurred.



ATTRIBUTES
       See attributes(5) for descriptions of the following attributes:


       ATTRIBUTE TYPE           ATTRIBUTE VALUE
       Availability             SUNWbtool
       Interface Stability      Standard


SEE ALSO
       yacc(1), attributes(5), environ(5), regex(5), standards(5)

NOTES
       If routines such as yyback(), yywrap(), and yylock() in .l (ell)  files
       are  to be external C functions, the command line to compile a C++ pro-
       gram must define the __EXTERN_C__ macro. For example:

       example%  CC -D__EXTERN_C__ ... file



SunOS 5.10                        22 Aug 1997                           lex(1)