Manual: (OSF1-V5.1-alpha)
flex(1)								      flex(1)



NAME

  flex - Generates a C Language	lexical	analyzer

SYNOPSIS

  flex [-bcdfinpstvFILT8] [-C[efmF]] [-Sskeleton] [file...]

OPTIONS

  -b  Generates	backtracking information to lex.backtrack. This	is a list of
      scanner states that require backtracking and the input characters	on
      which they do so.	 By adding rules you can remove	backtracking states.
      If all backtracking states are eliminated	and -f or -F is	used, the
      generated	scanner	will run faster.

  -d  Makes the generated scanner run in debug mode.  Whenever a pattern is
      recognized and the global yy_flex_debug is nonzero (which is the
      default), the scanner writes to stderr a line of the form:


	   --accepting rule at line 53 ("the matched text")

      The line number refers to the location of the rule in the file defining
      the scanner (the input to flex).  Messages are also generated when the
      scanner backtracks, accepts the default rule, reaches the end of its
      input buffer (or encounters a NUL), or reaches an End-of-File.

  -f  Specifies	full table (no table compression is done). The result is
      large but	fast. This option is equivalent	to -Cf.

  -i  Instructs	flex to	generate a case-insensitive scanner.  The case of
      letters given in the flex	input patterns will be ignored,	and tokens in
      the input	will be	matched	regardless of case.  The matched text given
      in yytext	will have the original case (as	read by	the scanner).

  -p  Generates	a performance report to	stderr.	 This identifies features of
      the flex input file that will cause a loss of performance	in the
      resulting	scanner.

  -s  Causes the default rule (that unmatched scanner input is echoed to
      stdout) to be suppressed.	 If the	scanner	encounters input that does
      not match	any of its rules, it aborts with an error.

  -t  Instructs	flex to	write the scanner it generates to standard output
      instead of lex.yy.c.

  -v  Specifies	that flex should write to stderr a summary of statistics
      regarding	the scanner it generates.

  -F  Specifies	that the fast scanner table representation should be used.
      This representation is about as fast as the full table representation
      (-f), and	for some sets of patterns will be considerably smaller (and
      for others, larger).  This option	is equivalent to -CF.

  -I  Instructs	flex to	generate an interactive	scanner; that is, a scanner
      that stops immediately rather than looking ahead if it knows that	the
      currently	scanned	text cannot be part of a longer	rule's match. Note,
      -I cannot	be used	in conjunction with full or fast tables; that is, the
      -f, -F, -Cf, or -CF options.

  -L  Instructs	flex not to generate #line directives in lex.yy.c. The
      default is to generate such directives so	error messages in the actions
      will be correctly	located	with respect to	the original lex input file.

  -T  Makes flex run in	trace mode.  It	will generate a	lot of messages	to
      stdout concerning	the form of the	input and the resultant	nondeter-
      ministic and deterministic finite	automata.  This	option is mostly for
      use in maintaining flex.

  -8  Instructs	flex to	generate an 8-bit scanner (which is the	default).

  -C[efmF]
      Controls the degree of table compression. The default setting is -Cem,
      which provides the highest degree of table compression.  Faster
      scanners can be obtained at the cost of larger tables; the following
      ordering generally holds:

      Slowest and smallest


	   -Cem
	   -Cm
	   -Ce
	   -C
	   -C{f,F}e
	   -C{f,F}

      Fastest and largest

      The -C options are not cumulative; whenever the option is	encountered,
      the previous -C settings are forgotten.  The -f or -F and	-Cm options
      do not make sense	together; there	is no opportunity for meta-
      equivalence classes if the table is not being compressed.	 Otherwise,
      the options may be freely	mixed.

      -C  A lone -C specifies that the scanner tables should be	compressed
	  and neither equivalence classes nor meta-equivalence classes should
	  be used.

      -Ce Directs flex to construct equivalence classes; that is, sets of
          characters that have identical lexical properties. Equivalence
          classes usually give dramatic reductions in the final table/object
          file sizes (typically a factor of 2 to 5) and are inexpensive
          performance-wise (one array look-up per character scanned).

      -Cm Directs flex to construct meta-equivalence classes, which are	sets
	  of equivalence classes (or characters, if equivalence	classes	are
	  not being used) that are commonly used together.  Meta-equivalence
	  classes are often a big win when using compressed tables, but	they
	  have a moderate performance impact (one or two "if" tests and	one
	  array	look-up	per character scanned).

      -Cf Specifies that the full scanner tables should	be generated; flex
	  should not compress the tables by taking advantage of	similar	tran-
	  sition functions for different states.

      -CF Specifies that the alternative fast scanner representation should
	  be used.

  -Sskeleton_file
      Overrides	the default skeleton file from which flex constructs its
      scanners.	 This is useful	for flex maintenance or	development.

  -c  Specifies	table-compression options.  (Obsolescent)

  -n  Suppresses the statistics	summaries that the -v option typically gen-
      erates.  (Obsolete)
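
  As a rough illustration of the equivalence classes that -Ce builds, the
  following C sketch recognizes the pattern [0-9]+ with a transition table
  indexed by character class rather than by character. The names ec,
  build_classes, next_state, and match_digits are hypothetical, and the
  layout is not the table format that flex actually emits; the sketch only
  shows why the cost is one extra array look-up per character scanned.

```c
#include <stddef.h>
#include <string.h>

#define NCHARS 256

/* Hypothetical class numbers: every non-digit behaves identically. */
enum { EC_OTHER = 0, EC_DIGIT = 1, EC_COUNT = 2 };

/* Maps each 8-bit character to its equivalence class. */
static unsigned char ec[NCHARS];

void build_classes(void)
{
    memset(ec, EC_OTHER, sizeof ec);
    for (int c = '0'; c <= '9'; c++)
        ec[c] = EC_DIGIT;
}

/*
 * Transitions indexed by [state][class]: 2 columns per state instead
 * of 256. State 0 is the start state, state 1 means "inside a number",
 * and -1 means no transition exists.
 */
static const int next_state[2][EC_COUNT] = {
    { -1, 1 },
    { -1, 1 },
};

/* Returns the length of the longest prefix of s matching [0-9]+. */
size_t match_digits(const char *s)
{
    int state = 0;
    size_t n = 0;

    while (s[n] != '\0') {
        int cls = ec[(unsigned char)s[n]];  /* the extra array look-up */
        int nxt = next_state[state][cls];
        if (nxt < 0)
            break;
        state = nxt;
        n++;
    }
    return n;
}
```

  After build_classes() has run, match_digits("123abc") consumes the three
  leading digits and stops at the first character of class EC_OTHER.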

DESCRIPTION

  The flex command is a tool for generating scanners: programs that recognize
  lexical patterns in text. The flex command reads a description of the
  scanner to generate from the given input files, or from standard input if
  no filenames are given or if a file operand is - (dash). The description
  is in the form of pairs of regular expressions and C code, called rules.
  The flex command generates as output a C source file, lex.yy.c, which
  defines a routine yylex(). This file is compiled and linked with the -ll
  library to produce an executable. When the executable is run, it scans its
  input against the regular expressions in its rules, looking for the best
  match (the longest input). When it has selected a rule, it executes the
  associated C code, which has access to the matched input sequence (commonly
  referred to as a token). This process then repeats until input is
  exhausted.

  The flex command treats multiple input files as one.

  Syntax for Input


  This section contains	a description of the flex input	file, which is nor-
  mally	named with a .l	suffix.	 The section provides a	listing	of the spe-
  cial values, macros, and functions recognized	by flex.

  The flex input file consists of three	sections, separated by a line with
  just %% in it:

       [ definitions ]
       %%
       [ rules ]
       [ %%
       [ user functions	]]

  definitions
      Contains declarations to simplify	the scanner specification, and
      declarations of start states which are explained below.

  rules
      Describes	what the scanner is to do.

  user functions
      Contains user-supplied functions that are copied straight through to
      lex.yy.c.

  With the exception of the first %% sequence, all sections are optional.
  The minimal scanner, %%, copies its input to standard output.

  Each line in the definitions section can be:

  name regexp
      Defines name to expand to	regexp.	 name is a word	beginning with a
      letter or	an underscore (_) followed by zero or more letters, digits,
      underscores or dashes (-). In the	regular-expression parts of the	rules
      section, flex substitutes	regexp wherever	you refer to {name} (name
      within braces).

  %x state [ state ... ]

  %s state [ state ... ]
      Defines names for	states used in the rules section. A rule may be	made
      conditionally active based on the	current	scanner	state. Multiple	lines
      defining states can appear, and each can contain multiple	state names,
      separated	by white space.	The name of a state follows the	same syntax
      as that of regexp	names except that dashes ('-') are not permitted.
      Unlike regexp names, state names share the C #define namespace. In the
      rules section states are recognized as <state> (state within angle
      brackets).

      The %x directive names exclusive states.	When a scanner is in an
      exclusive	state, only rules prefixed with	that state are active.
      Inclusive	states are named with the %s directive.

  %{

  %}  When placed on lines by themselves, these	symbols	enclose	C code to be
      passed verbatim into the global definitions of the output	file.  Such
      lines commonly include preprocessor directives and declarations of
      external variables and functions.

  space

  tab Lines beginning with a space or tab in the definitions section are
      passed directly into the lex.yy.c	output file, as	part of	the initial
      global definitions.
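
  The difference between %x and %s states can be sketched in C as a
  predicate that decides whether a rule is active. This is only a model of
  the behavior described above, under the assumption (true of flex in
  general) that rules written without a <state> prefix stay active in
  inclusive states and in the initial state. The names start_state,
  NO_PREFIX, and rule_active are hypothetical and are not part of flex.

```c
#include <stdbool.h>

/* Marks a rule that was written without a <state> prefix. */
#define NO_PREFIX (-1)

/* A scanner state: its number and whether it was declared with %x. */
struct start_state {
    int  id;
    bool exclusive;
};

/*
 * Returns true if a rule prefixed with rule_state (or NO_PREFIX)
 * may match while the scanner is in the state `current`.
 */
bool rule_active(int rule_state, struct start_state current)
{
    /* A rule prefixed with the current state is always active. */
    if (rule_state == current.id)
        return true;

    /* Unprefixed rules are switched off only by exclusive states. */
    if (rule_state == NO_PREFIX)
        return !current.exclusive;

    /* Rules prefixed with some other state are inactive. */
    return false;
}
```

  Under this model the initial state (0) behaves like an inclusive state:
  unprefixed rules are active there, while a state declared with %x sees
  only the rules that name it.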

  The rules section follows the	definitions, separated by a line consisting
  of %%.  The rules section contains rules for matching	input and taking
  actions, in the following format:

  pattern [action]

  The pattern starts in	the first column of the	line and extends until the
  first	non-escaped white space	character. The flex command attempts to	find
  the pattern that matches the longest input sequence and execute the associ-
  ated action. If two or more patterns match the same input the	one which
  appears first	in the rules section is	chosen.	If no action exists the
  matched input	is discarded. If no pattern matches the	input the default is
  to copy it to	standard output.
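
  The selection policy just described (longest match wins, and the earliest
  rule wins a tie) can be sketched in C as follows. The function
  choose_rule and its match_len array are hypothetical; a generated scanner
  reaches the same decision by running a finite automaton rather than by
  comparing lengths. The sketch also assumes that a length of 0 means the
  rule does not match at the current position.

```c
#include <stddef.h>

/*
 * match_len[i] is the number of input characters rule i would match
 * at the current position; 0 means rule i does not match.
 * Returns the index of the chosen rule, or -1 if no rule matches
 * (the case in which the default rule copies input to stdout).
 */
int choose_rule(const size_t *match_len, size_t nrules)
{
    int best = -1;
    size_t best_len = 0;

    for (size_t i = 0; i < nrules; i++) {
        /* Strictly greater: on a tie the earlier rule is kept. */
        if (match_len[i] > best_len) {
            best_len = match_len[i];
            best = (int)i;
        }
    }
    return best;
}
```

  For example, if the rules would match 2, 5, 5, and 0 characters
  respectively, the second rule (index 1) is chosen: it ties with the third
  for the longest match but appears first.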

  All action code is placed in the yylex() function. Text (C code or declara-
  tions) placed	at the beginning of the	rules section is copied	to the begin-
  ning of the yylex() function and may be used in actions. This	text must
  begin	with a space or	a tab (to distinguish it from rules).  In addition,
  any input (beginning with a space or within %{ and %}	delimiter lines)
  appearing at the beginning of	the rules section before any rules are speci-
  fied will be written to lex.yy.c after the declarations of variables for
  the yylex() function and before the first line of code in yylex().

  Elements of each rule	are:

  state
      A pattern may begin with a comma-separated list of state names enclosed
      by angle brackets (<state[,state...]>).  These states are entered
      via the BEGIN statement. If a pattern begins with a state, the scanner
      can only recognize it when in that state.  The initial state is 0
      (zero).

  regexp
      A	regular	expression to match against the	input stream. The regular
      expressions in flex provide a rich character matching syntax.

      The following characters, shown in order of decreasing precedence, have
      special meanings:

      x	  Matches the character	x.

      " " (double quotes)
	  Enclose characters and treat them as literal strings.	 For example,
	  "*+" is treated as the asterisk character followed by	the plus
	  character.

      \str (backslash)
	  If str is one	of the characters a, b,	f, n, r, t, or v, then the
	  ANSI C interpretation	is adopted (for	example, \n is a newline).
	  If str is a string of	octal digits it	is interpreted as a character
	  with octal value str.	If str is a string of hexadecimal digits with
	  a leading x it is interpreted	as a character with that value.	Oth-
	  erwise, it is	interpreted literally with no special meaning. For
	  example, x\*yz represents the	four characters	x*yz.

      [	] (brackets)
	  Represents a character class in the enclosed range ([.-.]) or	the
	  enclosed list	([...]). The dash character is used to define a	range
	  of characters	from the ASCII value or	the 8-bit class	of the char-
	  acter	that comes before it to	the ASCII value	or the 8-bit class of
	  the character	that follows it. For example, [abcx-z] matches a, b,
	  c, x,	y, or z.

          The circumflex, when it appears as the first character in a
          character class, indicates the complement of the set of characters
          within that class.  For example, [^abc] matches any character
          except a, b, or c, including special characters like newline.

      (	) (parentheses)
	  Groups regular expressions. For example, (ab)	will be	considered as
	  a single regular expression.

      {	} (braces)
	  When enclosing numbers, indicates a number of	consecutive
	  occurrences of the expression	that comes before it.  For example,
	  (ab){1,5} indicates a	match for from 1 to 5 occurrences of the
	  string ab.

	  When enclosing a name, the name represents a regular expression
	  defined in the definitions section. For example, {digit} is
	  replaced by the defined regular expression for digit.	Note that the
	  expansion takes place	as if the definition were enclosed in
	  parentheses.

      .	(period)
	  Matches any single character except newline.

      ?	(question mark)
	  Matches zero or one of the preceding expressions. For	example, ab?c
	  matches both ac and abc.

      *	(asterisk)
	  Matches zero or more of the preceding	expressions. For example, a*
	  is zero or more consecutive a	characters.  The utility of matching
	  zero occurrences is more obvious in complicated expressions.	For
	  example, the expression, [A-Za-z][A-Za-z0-9]*	indicates all
	  alphanumeric strings with a leading alphabetic character, including
	  strings that are only	one alphabetic character.

      +	(plus sign)
	  Matches one or more of the preceding expressions. For	example, [a-
	  z]+ is all strings of	lowercase letters.

      xy (concatenation)
	  Matches the expression x followed by the expression y.

      | (vertical bar)
          Matches either the preceding expression or the following
          expression.  For example, ab|cd matches either ab or cd.

      x/y (slash)
	  Matches expression x only if expression y (trailing context)
	  immediately follows it. For example, ab/cd matches the string	ab
	  but only if followed by cd. Only one trailing	context	is permitted
	  per pattern.

      ^	(circumflex)
          When it appears at the beginning of the pattern, matches the
          beginning of a line. For example, ^abc will match the string abc
          if it is found at the beginning of a line.

      $	(dollar	sign)
          When it appears at the end of a pattern, matches the end of a line.
          It is equivalent to /\n. For example, abc$ will match the string
          abc if it is found at the end of a line.

      <<EOF>>
          Matches an End-of-File.

      <x> (angle bracket)
          Identifies a state name (see above) and may only appear at the
          beginning of a pattern. For example, <done><<EOF>> matches an
          End-of-File, but only if it is in state done.

      In addition, the following rules apply for bracket expressions:

      Equivalence class	expressions
          These represent the set of collating elements in an equivalence
          class and are enclosed within bracket-equal delimiters ([= =]). An
          equivalence class generally is designed to deal with
          primary-secondary sorting; that is, for languages like French that
          define groups of characters as sorting to the same primary
          location, and then have a tie-breaking, secondary sort. For
          example, if a, à, and â belong to the same equivalence class, then
          [[=a=]b], [[=à=]b], and [[=â=]b] are each equivalent to [aàâb].

      Character	class expressions
	  These	represent the set of characters	in the current locale belong-
	  ing to the named ctype class.	These are expressed as a ctype class
	  name enclosed	in bracket-colon delimiters ([:	:]).

	  In the C or POSIX locale,  this operating system supports the	fol-
	  lowing character class expressions: [:alpha:], [:upper:],
	  [:lower:], [:digit:],	[:alnum:], [:xdigit:], [:space:], [:print:],
	  [:punct:], [:graph:],	[:cntrl:].

      Other locales may	define additional character classes.

      Letters and digits never have special meanings.  A character such	as ^
      or -, which has a	special	meaning	in particular contexts,	refers simply
      to itself	when found outside that	context.  Spaces and tabs must be
      escaped to appear	in a regular expression; otherwise they	indicate the
      end of the expression.

  action
      Each pattern in a	rule has a corresponding action, which can be any
      arbitrary	C statement. The pattern ends at the first non-escaped white
      space character; the remainder of	the line is its	action.	If the action
      is empty,	then when the pattern is matched the input which matched it
      is discarded.

      If the action contains a {, then the action spans	till the balancing }
      is found,	and the	action may cross multiple lines. Using a return
      statement	in an action returns from yylex().

      An action consisting solely of a vertical bar (|) means "same as the
      action for the next rule."

      The flex variables which can be used within actions are:

      yytext
	  A string (char *) containing the current matched input. It cannot
	  be modified.

      yyleng
	  The length (int) of the current matched input. It cannot be modi-
	  fied.

      yyin
	  A stream (FILE *) that flex reads from (stdin	by default). It	may
	  be changed but because of the	buffering flex uses this makes sense
	  only before scanning begins. Once scanning terminates	because	an
	  End-of-File was seen,	void yyrestart (FILE *new_file)	may be called
	  to point yyin	at a new input file. Alternatively, yyin may be
	  changed whenever a new or different buffer is	selected (see
	  yy_switch_to_buffer()).

      yyout
	  A stream (FILE *) to which ECHO output is written (stdout by
	  default). It can be changed by the user.

      YY_CURRENT_BUFFER
	  Returns the current buffer (YY_BUFFER_STATE) used for	scanner
	  input.

      The flex command macros and functions that may be	used within actions
      are:

      ECHO
	  Copies yytext	to the scanner's output.

      BEGIN state
	  Changes the scanner state to be state.  This affects which rules
	  are active. The state	must be	defined	in a %s, or %x definition.
	  The initial state of the scanner is INITIAL or 0 (zero).

      REJECT
	  Directs the scanner to proceed immediately to	the next best pattern
	  that matches the input (which	may be a prefix	of the current
	  match).  yytext and yyleng are reset appropriately.  Note that
	  REJECT is a particularly expensive feature in	terms of scanner per-
	  formance; if it is used in any of the	scanner's actions, it will
	  slow down all	of the scanner's pattern matching operations.  REJECT
	  cannot be used if flex is invoked with either	-f or -F options.

      yymore()
	  Indicates that the next matched text should be appended to the
	  currently matched text in yytext (rather than	replace	it).

      yyless(n)
	  Returns all but the first n characters of the	current	token back to
	  the input stream, where they will be rescanned when the scanner
	  looks	for the	next match.  yytext and	yyleng are adjusted accord-
	  ingly.

      yywrap()
          Returns 0 (zero) if there is more input to scan or 1 if there is
          not. The default yywrap() always returns 1. Currently it is
          implemented as a macro; in future implementations it may become
          a function.

      yyterminate()
	  Can be used in lieu of a return statement in an action.  It ter-
	  minates the scanner and returns a 0 (zero) to	the scanner's caller.

	  yyterminate()	is automatically called	when an	End-of-File is
	  encountered. It is a macro and may be	redefined.

      yy_create_buffer(file, size)
	  Returns a YY_BUFFER_STATE handle to a	new input buffer large enough
	  to accommodate size characters and associated	with the given file.
	  When in doubt, use YY_BUF_SIZE for the size.

      yy_switch_to_buffer(new_buffer)
	  Switches the scanner's processing to scan for	tokens from the	given
	  buffer, which	must be	a YY_BUFFER_STATE.

      yy_delete_buffer(buffer)
	  Deletes the given buffer.

      YY_NEW_FILE
	  Enables scanning to continue after yyin has been pointed at a	new
	  file to process.

      YY_DECL
	  Controls how the scanning function, yylex() is declared. By
	  default, it is int yylex(), or, if prototypes	are being used,	int
	  yylex(void).	This definition	may be changed by redefining the
	  YY_DECL macro.  This macro is	expanded immediately before the	{...}
	  (braces) that	delimit	the scanner function body.

      YY_INPUT(buf,result,max_size)
	  Controls scanner input. By default, YY_INPUT reads from the file-
	  pointer yyin.	 Its action is to place	up to max_size characters in
	  the character	array buf and return in	the integer variable result
	  either the number of characters read or the constant YY_NULL to
	  indicate EOF.	Following is a sample redefinition of YY_INPUT,	in
	  the definitions section of the input file:


	       %{
	       #undef YY_INPUT
	       #define YY_INPUT(buf,result,max_size)\
		  {\
		      int c = getchar();\
		      result = (c == EOF) ? YY_NULL : (buf[0] =	c, 1);\
		  }
	       %}

	  When the scanner receives an End-of-File indication from YY_INPUT,
	  it checks the	yywrap() function. If yywrap() returns zero, it	is
	  assumed that the yyin	has been set up	to point to another input
	  file,	and scanning continues.	If it returns non-zero,	then the
	  scanner terminates, returning	zero to	its caller.

      YY_USER_ACTION
	  Redefinable to provide an action which is always executed prior to
	  the matched pattern's	action.

      YY_USER_INIT
	  Redefinable to provide an action which is always executed before
	  the first scan.

      YY_BREAK
	  Is used in the scanner to separate different actions.	By default,
	  it is	simply a break,	but may	be redefined if	necessary.
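
  The YY_INPUT contract described above (fill buf with up to max_size
  characters and report how many were stored, or YY_NULL at end of input)
  can be sketched with a plain C function that reads from a string held in
  memory. The names set_source and read_from_string are hypothetical and
  are not part of flex, and the sketch assumes that YY_NULL is 0, as it is
  in common flex skeletons.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory input source; not provided by flex. */
static const char *src = "";
static size_t src_pos = 0;

/* Points the reader at a new NUL-terminated string. */
void set_source(const char *s)
{
    src = s;
    src_pos = 0;
}

/*
 * Copies up to max_size characters into buf and returns the number
 * copied. Returns 0 when the source is exhausted, which matches
 * YY_NULL under the assumption stated above. max_size must be > 0.
 */
int read_from_string(char *buf, int max_size)
{
    size_t left = strlen(src) - src_pos;
    size_t n = left < (size_t)max_size ? left : (size_t)max_size;

    memcpy(buf, src + src_pos, n);
    src_pos += n;
    return (int)n;
}
```

  In a flex input file, such a function might then be wired in with a
  redefinition along the lines of
  #define YY_INPUT(buf,result,max_size) { result = read_from_string(buf, max_size); }
  placed between %{ and %} in the definitions section.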

  The user functions section consists of complete C functions, which are
  passed directly into the lex.yy.c output file (the effect is similar to
  defining the functions in separate .c files and linking them with
  lex.yy.c).  This section is separated from the rules section by the %%
  delimiter.

  Comments, in C syntax, can appear anywhere in	the user functions or defini-
  tions	sections.  In the rules	section, comments can be embedded within
  actions. Empty lines or lines	consisting of white space are ignored.

  The following	macros are not normally	called explicitly within an action,
  but are used internally by flex to handle the	input and output streams.

  input()
      Reads the	next character from the	input stream. You cannot redefine
      input().

  output()
      Writes the next character	to the output stream.

  unput(c)
      Puts the character c back	onto the input stream. It will be the next
      character	scanned. You cannot redefine unput().

  The libl.a library contains default functions to support testing or quick
  use of a flex program without yacc; these functions can be linked in
  through -ll.  They can also be provided by the user.

  main()
      A simple wrapper that calls setlocale() and then calls the
      yylex() function.

  yywrap()
      The function called when the scanner reaches the end of an input
      stream.  The default definition simply returns 1,	which causes the
      scanner in turn to return	0 (zero).

NOTES

    +  Some trailing context patterns cannot be properly matched and generate
       the following warning message:


	    Dangerous trailing context

       These are patterns where	the ending of the first	part of	the rule
       matches the beginning of	the second part, such as zx*/xy*, where	the
       x* matches the x	at the beginning of the	trailing context.

    +  For some trailing context rules, parts that are actually fixed length
       are not recognized as such, which leads to a loss of performance. In
       particular, patterns using {n} (such as test{3}) are always
       considered variable length.

       Combining trailing context with the special | (vertical bar) action
       can result in fixed trailing context being turned into the more expen-
       sive variable trailing context.	This happens in	the following exam-
       ple:


	    %%
	    abc|
	    xyz/def

    +  Use of unput() invalidates the contents of yytext and yyleng within
       the current flex	action.

    +  Use of unput() to push back more	text than was matched can result in
       the pushed-back text matching a beginning-of-line (^) rule even though
       it did not come at the beginning	of the line.

    +  Pattern matching of NULs is substantially slower than matching other
       characters.

    +  The flex	command	does not generate correct #line	directives for code
       internal	to the scanner;	thus, bugs in flex.skel	yield invalid line
       numbers.

    +  Due to both buffering of input and read-ahead, you cannot intermix
       calls to <stdio.h> routines, such as getchar(), with flex rules and
       expect it to work.  Call input() instead.

    +  The total number of table entries listed by the -v option excludes
       the table entries needed to determine what rule was matched.  The
       number of such entries is equal to the number of deterministic
       finite-state automaton (DFA) states if the scanner does not use
       REJECT, and somewhat greater than the number of states if it does.

    +  REJECT cannot be	used with the -f or -F options.

EXAMPLES

   1.  The following command processes the file	lexcommands to produce the
       scanner file lex.yy.c:
	    flex lexcommands

       This is then compiled and linked	by the command:
	    cc -oscanner lex.yy.c -ll

       This produces a program scanner.

   2.  The scanner program converts uppercase to lowercase letters, removes
       spaces at the end of a line, and replaces multiple spaces with single
       spaces. The lexcommands file contains:


	    %%
	    [A-Z]   putchar(tolower(yytext[0]));
	    [ ]+$
	    [ ]+  putchar(' ');
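
       To make the effect of these three rules concrete, the following C
       sketch performs the same transformation by hand on a string. The
       function normalize is hypothetical and is not generated by flex; it
       only mirrors the behavior of the rules above, including the default
       rule that copies unmatched characters through unchanged.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Writes the transformed copy of `in` to `out` and returns its length.
 * `out` must have room for strlen(in) + 1 bytes.
 */
size_t normalize(const char *in, char *out)
{
    size_t i = 0, o = 0;

    while (in[i] != '\0') {
        if (in[i] == ' ') {
            /* Consume the whole run of spaces, as [ ]+ would. */
            while (in[i] == ' ')
                i++;
            /* Like [ ]+$ : drop the run when a newline follows. */
            if (in[i] != '\n')
                out[o++] = ' ';
        } else if (in[i] >= 'A' && in[i] <= 'Z') {
            /* Like [A-Z] : emit the lowercase form. */
            out[o++] = (char)tolower((unsigned char)in[i]);
            i++;
        } else {
            /* Default rule: echo the character unchanged. */
            out[o++] = in[i++];
        }
    }
    out[o] = '\0';
    return o;
}
```

       Given the input "Hello   World  " followed by a newline and "FOO",
       the function produces "hello world", a newline, and "foo".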



FILES

  flex.skel
      Skeleton scanner.

  lex.yy.c
      Generated	scanner	C source.

  lex.backtrack
      Backtracking information generated from -b option.




SEE ALSO

  Commands:  yacc(1), sed(1), awk(1)


  Files:  locale(4)