Manual: (OSF1-V5.1-alpha)
lex(1)								       lex(1)



NAME

  lex -	Generates programs for lexical tasks

SYNOPSIS

  lex [-ct] [-n	 | -v] [file...]

  [Tru64 UNIX]	The following syntax applies when the CMD_ENV environment
  variable is set to svr4:

  lex [-crt] [-n  | -v]	[-V] [-Qy  | -Qn] [file...]

STANDARDS

  Interfaces documented	on this	reference page conform to industry standards
  as follows:

  lex:	XPG4, XPG4-UNIX

  Refer	to the standards(5) reference page for more information	about indus-
  try standards	and associated tags.

OPTIONS

  -c  Writes C code to the file	lex.yy.c. This is the default.

  -n  Suppresses the statistics	summary. When you set your own table sizes
      for the finite state machine, lex	automatically produces this summary
      if you do	not select this	flag.

  -r  [Tru64 UNIX]  Writes RATFOR code to the file lex.yy.r. (There is no
      RATFOR compiler for Tru64	UNIX.)

  -t  Writes to	standard output	instead	of writing to a	file.

  -v  Provides a summary of the	generated finite state machine statistics.

  -V  [Tru64 UNIX]  Outputs lex	version	number to standard error. Requires
      the environment variable CMD_ENV to be set to svr4.

  -Q[y|n]
      [Tru64 UNIX]  Determines whether the lex version number is written to
      the output file. The -Qn option does not do so and is the	default.
      Requires the environment variable	CMD_ENV	to be set to svr4.









DESCRIPTION

  The lex command uses the rules and actions contained in file to generate a
  program, lex.yy.c, which can be compiled with	the cc command.	 That program
  can then receive input, break	the input into the logical pieces defined by
  the rules in file, and run program fragments contained in the	actions	in
  file.

  The generated	program	is a C Language	function called	yylex(). The lex com-
  mand stores yylex() in a file	named lex.yy.c.	 You can use yylex() alone to
  recognize simple, 1-word input, or you can use it with other C Language
  programs to perform more difficult input analysis functions.	For example,
  you can use lex to generate a	program	that tokenizes an input	stream before
  sending it to	a parser program generated by the yacc command.

  The yylex() function analyzes	the input stream using a program structure
  called a finite state	machine. This structure	allows the program to exist
  in only one state (or	condition) at a	time.  A finite	number of states are
  allowed. The rules in	file determine how the program moves from one state
  to another in	response to the	input that the program receives.
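  The one-state-at-a-time behavior described above can be sketched by hand
  in C. The following is a minimal illustration, not code that lex generates.
  The state names and the functions step() and run_machine() are hypothetical,
  and the two patterns (a run of digits, or a letter followed by letters and
  digits) are chosen only for this example:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical state names; lex numbers its own states internally. */
enum state { START, IN_NUMBER, IN_WORD, FAIL };

/* One transition: the next state depends only on the current state
   and the current input character. */
static enum state step(enum state s, int c)
{
    switch (s) {
    case START:
        if (isdigit(c)) return IN_NUMBER;
        if (isalpha(c)) return IN_WORD;
        return FAIL;
    case IN_NUMBER:
        return isdigit(c) ? IN_NUMBER : FAIL;
    case IN_WORD:
        return isalnum(c) ? IN_WORD : FAIL;
    default:
        return FAIL;
    }
}

/* Run the machine over a whole string and report the state it ends in. */
enum state run_machine(const char *text)
{
    enum state s = START;

    assert(text != NULL);
    for (; *text != '\0' && s != FAIL; text++)
        s = step(s, (unsigned char)*text);
    return s;
}
```

  For example, run_machine("x25") ends in IN_WORD, while run_machine("25x")
  ends in FAIL because no transition leaves IN_NUMBER on a letter.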

  The lex command reads	its skeleton finite state machine from the file
  /usr/ccs/lib/ncpform or /usr/ccs/lib/ncform. Use the environment variable
  LEXER	to specify another location for	lex to read from.

  If you do not	specify	a file,	lex reads standard input. It treats multiple
  files	as a single file.

  Input	File Format


  The input file can contain three sections:  definitions, rules, and user
  subroutines. Each section must be separated from the others by a line	con-
  taining only the delimiter, %%.  The format is as follows:

       definitions
       %%
       rules
       %%
       user_subroutines

  The purpose and format of each of these sections are described under the
  headings that	follow.

  Definitions Section


  If you want to use variables in rules, you must define them in the defini-
  tions	section. The variables make up the left	column,	and their definitions
  make up the right column.  For example, to define D as a numerical digit,
  enter:

       D       [0-9]

  You can use a	defined	variable in the	rules section by enclosing the vari-
  able name in braces, {D}.

  In the definitions section, you can set either of the	following two mutu-
  ally exclusive declarations:

  %array
      Declare the type of yytext to be a null-terminated character array.

  %pointer
      Declare the type of yytext to be a pointer to a null-terminated
      character	string.	Use of the %pointer definition selects the
      /usr/ccs/lib/ncpform skeleton.

  In the definitions section, you can also set table sizes for the resulting
  finite state machine.	The default sizes are large enough for small pro-
  grams.  You may want to set larger sizes for more complex programs:

  %p  number
      Number of	positions is number (default 5000)

  %n  number
      Number of	states is number (default 2500)

  %e  number
      Number of	parse tree nodes is number (default 2000)

  %a  number
      Number of	transitions is number (default 5000)

  %k  number
      Number of	packed character classes is number (default 2000)

  %o  number
      Number of	output slots is	number (default	5000)

  If extended characters appear	in regular expression strings, you may need
  to reset the output array size with the %o parameter (possibly to array
  sizes	in the range 10,000 to 20,000).	 This reset reflects the much larger
  number of extended characters	relative to the	number of ASCII	characters.

  Rules	Section


  The rules section is required, and it	must be	preceded by the	%% delimiter,
  even if you do not have a definitions	section. The lex command does not
  recognize rules without the delimiter.

  In this section, the left column contains the	pattern	to be recognized in
  an input file	to yylex().  The right column contains the C program fragment
  executed when	that pattern is	recognized.

  Patterns can include extended	characters with	one exception: extended	char-
  acters may not appear	in range specifications	within character class
  expressions surrounded by brackets.

  The columns are separated by a tab. For example, to search files for the
  word LEAD and	replace	it with	GOLD, perform the following steps:

   1.  Create a	file called transmute.l	containing the lines:


	    %%
	    (LEAD)  printf("GOLD");

   2.  Then issue the following	commands to the	shell:
	    lex	transmute.l
	    cc -o transmute lex.yy.c -ll

   3.  You can test the	resulting program with the command:
	    transmute <transmute.l



  This command echoes the contents of transmute.l, with	the occurrences	of
  LEAD changed to GOLD.

  Each pattern may have	a corresponding	action,	that is, a fragment of C
  source code to execute when the pattern is matched.  Each statement must
  end with a ; (semicolon).  If	you use	more than one statement	in an action,
  you must enclose all of them in {} (braces). A second	delimiter, %%, must
  follow the rules section if you have a user subroutine section.

  When yylex() matches a string	in the input stream, it	copies the matched
  text to an external character	array, yytext, before it executes any actions
  in the rules section.

  You can use the following operators to form patterns that you	want to
  match:

  x, y
      Matches the characters written.

  [ ] Matches any one character in the enclosed range ([.-.]) or the enclosed
      list ([...]). [abcx-z] matches a, b, c, x, y, or z.

  " " Matches the enclosed character or	string even if it is an	operator.
      "$" prevents lex from interpreting the $ character as an operator.

  \   Acts the same as double quotes.  \$ prevents lex from interpreting the
      $	character as an	operator.

  *   Matches zero or more occurrences of the single-character regular
      expression immediately preceding it.  x* matches zero or more repeated
      literal characters x.

  +   Matches one or more occurrences of the single-character regular expres-
      sion immediately preceding it.

  ?   Matches either zero or one occurrence of the single-character regular
      expression immediately preceding it.

  ^   Matches the character only at the	beginning of a line.  ^x matches an x
      at the beginning of a line.

  [^] Matches any character except for the characters following	the ^.
      [^xyz] matches any character but x, y, or	z.

  .   Matches any character except the newline character.

  $   Matches the end of a line.

  |   Matches either of two expressions.  x|y matches either x or y.

  /   Matches one extended regular expression (ERE) only when followed by a
      second ERE. It reads only	the first token	into yytext.  Given the	regu-
      lar expression a*b/cc and	the input aaabcc, yytext would contain the
      string aaab on this match.

  ( ) Matches the pattern in the ( ) (parentheses). This is used for group-
      ing. It reads the	whole pattern into yytext. A group in parentheses can
      be used in place of any single character in any other pattern.
      (xyz123) matches the pattern xyz123 and reads the	whole string into
      yytext.

  {}  Matches the character as defined in the definitions section.  If D is
      defined as numeric digits, {D} matches all numeric digits.

  {m,n}
      Matches m-to-n occurrences of the	specified character.  x{2,4} matches
      2, 3, or 4 occurrences of	x.
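
  The trailing-context behavior of the / operator can be sketched by hand in
  C. This is only an illustration of the a*b/cc example given above, not
  lex output, and match_ab_before_cc() is a hypothetical helper name. It
  returns the number of characters that would be placed in yytext:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hand-written stand-in for the pattern a*b/cc.
 * Returns the length of the text that would be stored in yytext,
 * or 0 if the pattern does not match at the start of input.
 */
size_t match_ab_before_cc(const char *input)
{
    size_t n = 0;

    assert(input != NULL);
    while (input[n] == 'a')        /* a* : zero or more a characters */
        n++;
    if (input[n] != 'b')           /* b  : exactly one b */
        return 0;
    n++;
    /* /cc : the next two characters must be cc, but are not consumed */
    if (input[n] != 'c' || input[n + 1] != 'c')
        return 0;
    return n;
}
```

  With the input aaabcc the function returns 4, the length of aaab. The two
  c characters are checked but remain available for the next match.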


  If a line begins with	only a space, lex copies it to the lex.yy.c output
  file.	If the line is in the definitions section of file, lex copies it to
  the declarations section of lex.yy.c.	If the line is in the rules section,
  lex copies it	to the program code section of lex.yy.c.



  User Subroutines Section


  The lex library has three subroutines	defined	as macros that you can use in
  the rules.

  input( )
      Reads a character	from yyin.

  unput( )
      Returns a character to the input stream after it is read.

  output( )
      Writes a character to yyout.

  You can override these three macros by writing your own code for these rou-
  tines	in the user subroutines	section. But if	you write your own routines,
  you must undefine these macros in the	definitions section as follows:

       %{
       #undef input
       #undef unput
       #undef output
       %}
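
  As a rough sketch of what replacement routines have to provide, the
  following C code pairs a read routine with a pushback routine. The names
  my_input(), my_unput(), and set_source() are hypothetical, the text comes
  from a string rather than from yyin, and end of input is assumed here to
  be signaled by a return value of 0:

```c
#include <assert.h>
#include <stddef.h>

#define PUSHBACK_MAX 16              /* assumed limit for this sketch */

static const char *source = "";      /* text still to be read */
static char pushback[PUSHBACK_MAX];  /* characters returned by my_unput() */
static size_t pushed = 0;            /* number of characters pushed back */

/* Select the string that my_input() reads and clear the pushback stack. */
void set_source(const char *text)
{
    assert(text != NULL);
    source = text;
    pushed = 0;
}

/* Return the next character, most recently pushed back first;
   return 0 when nothing is left. */
int my_input(void)
{
    if (pushed > 0)
        return (unsigned char)pushback[--pushed];
    if (*source == '\0')
        return 0;
    return (unsigned char)*source++;
}

/* Put one character back so that the next my_input() call returns it. */
void my_unput(int c)
{
    assert(pushed < PUSHBACK_MAX);
    pushback[pushed++] = (char)c;
}
```

  The pushback stack is what lets a scanner look ahead at a character and
  then return it when the character turns out not to belong to the current
  match.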

  When you are using lex as a simple transformer/recognizer for	stdin to
  stdout piping, you can avoid writing the framework by	using libl.a (the lex
  library). It has a main routine that calls yylex() for you.

  External names generated by lex all begin with the prefix yy,	as in yyin,
  yyout, yylex,	and yytext.

  Putting Spaces in an Expression


  Normally, spaces or tabs end a rule and, therefore, the expression that
  defines a rule.  However, you	can enclose the	spaces or tab characters in
  "" (double quotes) to	include	them in	the expression.	Use quotes around all
  spaces in expressions	that are not already within sets of [ ]	(brackets).

  Other	Special	Characters


  The lex program recognizes many of the normal	C language special charac-
  ters.	 These character sequences are as follows:

  Sequence   Meaning
  \n         Newline
  \t         Tab
  \b         Backspace
  \\         Backslash
  \digits    The character whose encoding is represented
             by the three-digit octal number
  \xdigits   The character whose encoding is represented
             by the hexadecimal integer

  Do not use the actual	newline	character in an	expression.


  When using these special characters in an expression,	you do not need	to
  enclose them in quotes.  Every character, except these special characters
  and the previously described operator	symbols, is always a text character.




  Matching Rules


  When more than one expression	can match the current input, lex chooses the
  longest match	first.	Among rules that match the same	number of characters,
  the rule that	occurs first is	chosen.	 For example:

       integer keyword action...;
       [a-z]+ identifier action...;

  If the preceding rules are given in that order and integers is the input
  word,	lex matches the	input as an identifier because [a-z]+ matches eight
  characters, while integer matches only seven.	 However, if the input is
  integer, both	rules match seven characters. The keyword rule is selected
  because it occurs first. A shorter input, such as int, does not match	the
  expression rule integer and causes lex to select the rule identifier.
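
  The selection logic can be sketched by hand in C. This is an illustration
  of the two rules above, not lex output; the names classify(),
  match_length(), and the token values are hypothetical, and the [a-z] test
  assumes an ASCII-compatible character set:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum token { NO_MATCH, KEYWORD, IDENTIFIER };

/* Length matched at the start of input by the literal pattern integer. */
static size_t match_keyword(const char *input)
{
    const char *kw = "integer";
    size_t n = strlen(kw);

    return strncmp(input, kw, n) == 0 ? n : 0;
}

/* Length matched at the start of input by the pattern [a-z]+. */
static size_t match_identifier(const char *input)
{
    size_t n = 0;

    while (input[n] >= 'a' && input[n] <= 'z')
        n++;
    return n;
}

/* Length of the match that is finally chosen: the longest one. */
size_t match_length(const char *input)
{
    size_t kw, id;

    assert(input != NULL);
    kw = match_keyword(input);
    id = match_identifier(input);
    return kw >= id ? kw : id;
}

/* Longest match wins; on a tie the rule listed first (keyword) wins. */
enum token classify(const char *input)
{
    size_t kw, id;

    assert(input != NULL);
    kw = match_keyword(input);
    id = match_identifier(input);
    if (kw == 0 && id == 0)
        return NO_MATCH;
    return kw >= id ? KEYWORD : IDENTIFIER;
}
```

  Here classify("integers") yields IDENTIFIER with a match length of 8,
  classify("integer") yields KEYWORD, and classify("int") yields IDENTIFIER.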

  Matching a String with Wildcard Characters


  Because lex chooses the longest match	first, do not use rules	containing
  expressions like .* (for example: '.*').

  The preceding	rule might seem	like a good way	to recognize a string in sin-
  gle quotes.  However,	the lexical analyzer reads far ahead, looking for a
  distant single quote to complete the long match.  If a lexical analyzer
  with such a rule gets	the following input, it	matches	the whole string:

       'first' quoted string here, 'second' here

  To find the smaller strings, first and second, use the following rule:

       '[^'\n]*'

  This rule stops after	matching 'first'.
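
  The difference between the two rules can be checked with a hand-written
  C sketch. The functions match_dot_star() and match_no_quote() are
  hypothetical stand-ins for '.*' and '[^'\n]*' respectively, not lex
  output. Each returns the number of characters matched at the start of the
  input:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for '.*' : runs from the opening quote to the LAST quote
   on the current line, because the longest match is chosen. */
size_t match_dot_star(const char *input)
{
    size_t i, last = 0;

    assert(input != NULL);
    if (input[0] != '\'')
        return 0;
    for (i = 1; input[i] != '\0' && input[i] != '\n'; i++)
        if (input[i] == '\'')
            last = i + 1;
    return last;                   /* 0 if no closing quote was found */
}

/* Stand-in for '[^'\n]*' : stops at the FIRST closing quote. */
size_t match_no_quote(const char *input)
{
    size_t i = 1;

    assert(input != NULL);
    if (input[0] != '\'')
        return 0;
    while (input[i] != '\0' && input[i] != '\n' && input[i] != '\'')
        i++;
    return input[i] == '\'' ? i + 1 : 0;
}
```

  For the sample input line, match_dot_star() returns 36 (everything
  through 'second'), while match_no_quote() returns 7, the length of
  'first'.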

  Errors of this type are not far-reaching because the . (dot) operator	does
  not match a newline character.  Therefore, expressions like .* stop on the
  current line.  Do not try to defeat this with expressions like [.\n]+. The
  lexical analyzer tries to read the entire input file,	and an internal
  buffer overflow occurs.

  Finding Strings within Strings


  The lex program partitions the input stream and does not search for all
  possible matches of each expression.	Each character is accounted for	once
  and only once.  For example, to count	occurrences of both she	and he in an
  input	text, try the following	rules:

       she   s++;
       he    h++;
       \n    |
       .     ;

  The last two rules ignore everything besides he and she. However, because
  she includes he, lex does not	recognize the instances	of he that are
  included in she.

  To override this choice, use the REJECT action.  This	directive tells	lex
  to go	to the next rule.  The lex command then	adjusts	the position of	the
  input	pointer	to where it was	before the first rule was executed, and	exe-
  cutes	the second choice rule.	For example, to	count the included instances
  of he, use the following rules:

       she    {s++; REJECT;}
       he     {h++; REJECT;}
       \n     |
       .      ;

  After	counting the occurrences of she, lex rejects the input stream and
  then counts the occurrences of he. In	this case, you can omit	the REJECT
  action on he because she includes he but not vice versa. In other cases, it
  may be difficult to determine	which input characters are in both classes.

  In general, REJECT is	useful whenever	the purpose of lex is not to parti-
  tion the input stream	but to detect all examples of some items in the
  input, and the instances of these items may overlap or include each other.
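
  The effect of REJECT on the she and he counts can be simulated by hand in
  C. This sketch is not lex output; scan() and struct counts are
  hypothetical names. The use_reject flag selects between the partitioning
  rules and the rules that use REJECT. Falling back to the one-character
  . rule after a REJECT is modeled by advancing a single character:

```c
#include <assert.h>
#include <string.h>

struct counts { int she; int he; };

/* Count she and he in text. With use_reject set to 0 every character is
   consumed by exactly one match; with use_reject nonzero a counted match
   is rejected and only one character is consumed. */
struct counts scan(const char *text, int use_reject)
{
    struct counts c = { 0, 0 };

    assert(text != NULL);
    while (*text != '\0') {
        if (strncmp(text, "she", 3) == 0) {
            c.she++;
            text += use_reject ? 1 : 3;
        } else if (strncmp(text, "he", 2) == 0) {
            c.he++;
            text += use_reject ? 1 : 2;
        } else {
            text++;                /* the \n and . rules: skip one character */
        }
    }
    return c;
}
```

  For the input "she said he", scan() without REJECT counts one she and one
  he. With REJECT it counts one she and two instances of he, because the he
  inside she is no longer hidden.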

NOTES

  Because lex uses fixed names for intermediate	and output files, you can
  have only one	lex-generated program in a given directory. If the -t option
  is not specified, informational, error, and warning messages are written to
  stdout. If the -t option is specified, informational,	error, and warning
  messages are written to stderr.

  [Tru64 UNIX]	The yytext array has a default dimension of 200, controlled
  by the constant YYLMAX. If the programmer needs to allow a larger array,
  the YYLMAX constant may be redefined as follows from within the lex command
  file:

       %{
       #undef YYLMAX
       #define YYLMAX 8192
       %}

  Two other arrays, yysbuf and yylstate, also use YYLMAX.

  The lex program can be compiled as a C program with -std0, -std, or -std1
  mode.	It can also be compiled	as a C++ program. If YY_NOPROTO	is defined on
  the compilation command line,	function prototypes are	not generated.

EXAMPLES

   1.  The following command draws lex instructions from the file lexcommands
       and places the output in	lex.yy.c:
	    lex	lexcommands

   2.  The file	lexcommands contains an	example	of a lex program that would
       be put into a lex command file.	The following program converts upper-
       case to lowercase, removes spaces at the	end of a line, and replaces
       multiple	spaces with single spaces:


	    %%
	    [A-Z] putchar(tolower(yytext[0]));
	    [ ]+$ ;
	    [ ]+ putchar(' ');
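
       For comparison, the effect of these three rules can be written by
       hand in C. The sketch below is not what lex generates. The name
       filter_text() and the fixed OUT_MAX buffer size are assumptions made
       for this example, and the function returns its result in a static
       buffer instead of writing to standard output:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define OUT_MAX 256                /* assumed limit for this sketch */

/* Return a filtered copy of in: uppercase letters become lowercase,
   spaces at the end of a line are dropped, and other runs of spaces
   collapse to a single space. */
const char *filter_text(const char *in)
{
    static char out[OUT_MAX];
    size_t n = 0;

    assert(in != NULL);
    assert(strlen(in) < OUT_MAX);  /* output is never longer than input */
    while (*in != '\0') {
        if (*in == ' ') {
            while (*in == ' ')     /* [ ]+  : consume the whole run */
                in++;
            if (*in != '\n')       /* [ ]+$ : a run before newline is dropped */
                out[n++] = ' ';
        } else {
            /* [A-Z] is lowercased; any other character is copied as is */
            out[n++] = (char)tolower((unsigned char)*in);
            in++;
        }
    }
    out[n] = '\0';
    return out;
}
```

       For example, the input line "Hello   World  " followed by a newline
       is returned as "hello world" followed by a newline.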







ENVIRONMENT VARIABLES

  The following environment variables affect the behavior of lex:

  LANG
      Provides a default value for the locale category variables that are not
      set or null.

  LC_ALL
      If set, overrides	the values of all other	locale variables.

  LC_COLLATE
      Determines the locale for the behavior of ranges, equivalence classes,
      and multicharacter collating elements within regular expressions.

  LC_CTYPE
      Determines the locale for	the interpretation of byte sequences as	char-
      acters (single-byte or multi-byte) in input parameters and files.

  LC_MESSAGES
      Determines the locale used to affect the format and contents of diag-
      nostic messages displayed	by the command.

  NLSPATH
      Determines the location of message catalogs for the processing of
      LC_MESSAGES.

FILES

  /usr/ccs/lib/libl.a
      Run-time library.

  /usr/ccs/lib/ncform
      Default C	language skeleton finite state machine for lex.

  /usr/ccs/lib/ncpform
      Default C	language skeleton finite state machine for lex,	implemented
      with the pointer definition of yytext.

  /usr/ccs/lib/nrform
      Default RATFOR language skeleton finite state machine for	lex.

SEE ALSO

  Commands:  yacc(1)

  Standards:  standards(5)

  Programming Support Tools