Unicode(5)							   Unicode(5)


  Unicode, unicode, universal.utf8, UCS-2, UCS-4, UTF-8, UTF-16, UTF-32,
  iso10646 - Support for the Unicode and ISO/IEC 10646 standards


  The operating	system provides	locales	and codeset converters that support
  the following	standards:

    +  The Unicode Standard, Version 3.0, Unicode, Inc., 1999

    +  Information Technology-Universal	Multiple-Octet Coded Character Set,
       ISO/IEC 10646:1993

       The Basic Multilingual Plane defined by this standard is	identical
       with the	main body of Unicode character encoding.

  These	standards define generalized character encoding	rules that can be
  applied to characters	in most	native language	scripts. The Unicode Standard
  specifies a universal	character set (UCS) that contains definitions in Ver-
  sion 3.0 for 49,194 characters and also includes a Private Use Area for
  vendor- or user-defined characters. The following list summarizes the	main
  features of this character set:

    +  All characters are treated as 16-bit units.

    +  Each 16-bit unit	has an abstract	character identity.

    +  Certain sequences of 16-bit characters in a text	stream are
       transformed into	other characters, called composed characters.

    +  Characters have properties, such	as base, numeric, spacing, combina-
       tion, and directionality. The Unicode standard provides rules for ord-
       ering characters	with different properties so that parsing of charac-
       ter sequences is	unambiguous.

    +  The relationship	between	Unicode	characters and the glyphs in the
       native language script that users see, type, or print is	not neces-
       sarily one-to-one. A glyph may be mapped	to a single abstract charac-
       ter or a	composed character. Conversely,	more than one glyph can	be
       mapped to a character.

    +  The ISO 8859-1 character	set occupies the first 256 code	positions
       (and the	ASCII character	set the	first 128 positions) of	the UCS.
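  Two of the features listed above can be checked with a short Python
  sketch. It is illustrative only and uses Python's standard library, not
  the operating system's locales or converters. NFC normalization stands in
  here for the composition of character sequences, which is an assumption
  about how the term maps onto modern terminology. The function names are
  hypothetical.

```python
import unicodedata

def latin1_code_points(data):
    """Decode ISO 8859-1 bytes and return the resulting UCS code points."""
    return [ord(ch) for ch in data.decode("latin-1")]

def compose(text):
    """Combine base + combining sequences into composed characters (NFC)."""
    return unicodedata.normalize("NFC", text)

# Every ISO 8859-1 byte value maps to the UCS code point of the same value,
# and the first 128 positions agree with ASCII.
all_bytes = bytes(range(256))
assert latin1_code_points(all_bytes) == list(range(256))
assert all_bytes[:128].decode("ascii") == all_bytes[:128].decode("latin-1")

# "e" followed by COMBINING ACUTE ACCENT composes to a single character.
decomposed = "e\u0301"
composed = compose(decomposed)
print(len(decomposed), len(composed), hex(ord(composed)))  # 2 1 0xe9
```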

  The ISO/IEC 10646 standard specifies both 16-	and 32-bit units for each
  abstract character defined in the UCS.  The 16-bit character values in
  Unicode are zero-extended through a second 16-bit unit in the	larger encod-
  ing format. The second, or low-surrogate, 16-bit unit	is reserved for
  future use in	both standards.

  The Unicode and ISO/IEC 10646	standards specify a uniform character size
  and allow character units to be processed for	all languages by using the
  same set of rules. Therefore,	system support for the universal character
  set does not need to include multiple	algorithms (one	or more	per language)
  for converting between file code and internal	process	code. However, the
  two different	character sizes	(16-bit	or 32-bit) that	the standards support
  require different parsing schemes for	data input and output. Universal
  character encoding that an implementation parses in 16-bit units (2 octets)
  is known as UCS-2.  This is the canonical Unicode encoding in	wide use on
  PC systems. Universal	character encoding that	an implementation parses in
  32-bit units (4 octets) is known as UCS-4. This is the canonical ISO/IEC
  10646	encoding that is in use	on systems that	can support the	larger data
  unit size.
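  The difference between the two unit sizes can be sketched in Python as
  follows. This is a minimal illustration, assuming big-endian byte order
  and characters from the Basic Multilingual Plane. The helper names
  to_ucs2_be and to_ucs4_be are hypothetical and are not part of the
  operating system.

```python
import struct

def to_ucs2_be(text):
    """Pack each code point as one big-endian 16-bit unit (2 octets)."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError("U+%04X does not fit in one 16-bit unit" % cp)
        units.append(struct.pack(">H", cp))
    return b"".join(units)

def to_ucs4_be(text):
    """Pack each code point as one big-endian 32-bit unit (4 octets)."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)

sample = "A\u00e9"
print(to_ucs2_be(sample))  # b'\x00A\x00\xe9'
print(to_ucs4_be(sample))  # b'\x00\x00\x00A\x00\x00\x00\xe9'
```

  For these characters the 32-bit form is the 16-bit value with two leading
  zero octets, which is the zero extension described earlier.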

  The operating	system supports	UCS-2 with codeset converters and UCS-4	with
  both codeset converters and locales. The locales whose names include the
  string @ucs4 allow use of UCS-4 for internal process code with proprietary
  file encoding	formats.

  The standards	define a number	of transformation formats for the universal
  character set.  For the most part, the following UCS transformation formats
  (UTFs) exist to transform UCS	values into sequences of bytes for handling
  by various byte-oriented protocols:

    +  UTF-8, the standard method for transforming UCS-4 process encoding
       into a sequence of 8-bit	bytes and ensuring interchange transparency
       for characters in C0 code positions (0 to 31), the SPACE	(32) charac-
       ter, and	the DEL	(127) character

       The operating system supports UTF-8 with both codeset converters and
       locales.

    +  UTF-7, an obsolete interchange format for environments that strip the
       eighth bit from each byte

       The operating system does not support UTF-7.

    +  UTF-1, an obsolete interchange format that is similar to	UTF-8 but
       also ensures interchange	transparency of	characters in C1 code posi-
       tions (128 to 159)

       The operating system does not support UTF-1.

    +  UTF-16, which handles the surrogate character extensions	defined	by
       Version 2.0 of the Unicode Standard and represents characters in
       2-byte units

       The surrogate character extensions are characters whose values in
       UCS-4 are outside the range normally allowed by a 16-bit	length res-
       triction.  When data includes these characters, the UTF-16 transforma-
       tion format enables data	exchange between applications using UCS-4 and
       applications that require the data to be	in UCS-2 (2-byte) format.
       Although	UTF-16 does not	support	representation of the entire UCS-4
       code space, it supports all characters (except those in certain
       private-use ranges) that	have been currently defined for	the languages
       covered by both standards.

       Byte orientation	in file	code can differ	and, depending on the plat-
       form on which the file was generated, can be little-endian (LE) or
       big-endian (BE).	 UTF-16	uses a byte order mark (BOM), which is not
       part of the file	text data, to indicate byte orientation. The code
       point of	the BOM	is U+FEFF. The Unicode Standard	also defines UTF-16LE
       and UTF-16BE, which are specific	to the little-endian and big-endian
       orientations, respectively, and do not include a	byte order mark.

       The operating system supports UTF-16, UTF-16LE, and UTF-16BE through
       codeset converters. In terms of codeset converter names,	UTF-16*	is
       recognized as an	alias for UCS-2	but also enables codeset conversion
       of surrogate character extensions.


	 By default, the operating system uses UTF-16 rather than UTF-16LE or
	 UTF-16BE. That	is, in an input	file, the software first looks for a
	 BOM. If a BOM is not found, the converter assumes UTF-16LE. This
	 means that you	must explicitly	specify	UTF-16BE to the	converter
	 (convert files	manually) when UTF-16BE	applies	to an input file. For
	 an output file, the converter automatically inserts a BOM. This
	 means that you	must explicitly	specify	UTF-16LE or UTF-16BE (convert
	 files manually) when you want conversion output to be UTF-16LE	or
	 UTF-16BE rather than UTF-16.

    +  UTF-32, which also supports the surrogate character extensions defined
       by the Unicode Standard but allows character representation in 4-byte
       encoding	units

       In addition, UTF-32 is restricted in values to the range	0 to 10FFFF,
       which precisely matches the range of character values defined in	the
       Unicode Standard. Unlike	UTF-16,	UTF-32 does not	support	private-use
       ranges for character values and therefore promotes interoperability
       among Unicode encoding formats.

       UTF-32 uses a byte order mark to indicate little-endian or big-endian
       byte orientation. The Unicode Standard also defines UTF-32LE and
       UTF-32BE, which are specific to the little-endian and big-endian
       orientations, respectively, and do not include a byte order mark.

       UTF-32 is almost	the same as UCS-4, so you can use UCS-4	codeset	con-
       verters to process UTF-32. However, the UCS-4 converter software	has
       not yet been changed to support UTF-32, UTF-32LE, or UTF-32BE as	alias
       names in the way that the UTF-16* strings are supported by the UCS-2
       converters.
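  The transformation formats described above can be explored with the
  following Python sketch. It relies on Python's own codecs rather than the
  operating system's codeset converters, so it only approximates their
  behavior. The decode_utf16 helper is hypothetical. Its fallback to
  little-endian when no byte order mark is present mirrors the default
  described for UTF-16 input and is not taken from the converter source.

```python
import codecs

def surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def decode_utf16(data):
    """Decode UTF-16 input: honor a BOM if present, otherwise assume LE."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    return data.decode("utf-16-le")  # no BOM found: assumed little-endian

# C0 controls, SPACE, DEL, and the rest of ASCII pass through UTF-8 unchanged.
ascii_bytes = bytes(range(128))
assert ascii_bytes.decode("ascii").encode("utf-8") == ascii_bytes

high, low = surrogate_pair(0x1D11E)
print("%04X %04X" % (high, low))                               # D834 DD1E
print(decode_utf16(b"\xfe\xff\x00A"), decode_utf16(b"A\x00"))  # A A
```

  Big-endian input without a byte order mark would be misread by this
  fallback, which is why such input has to be identified explicitly as
  UTF-16BE.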

  Codeset Conversion

  Codeset converters are available to convert data in all the major encoding
  formats that the operating system supports to	and from UCS-2,	UCS-4, and
  UTF-8.  If the worldwide support subsets are installed on your system, you
  can enter the	following commands to find the names of	these converters:

       % cd /usr/lib/nls/loc/iconv
       % ls | grep UTF
       % ls | grep UCS

  Among	the converters listed, you will	find some that handle conversion of
  data in the code-page	format used on PC systems. See the code_page(5)
  reference page for more information about converting between codeset and
  code-page formats.  All codeset converters can be used with the iconv	com-
  mand and associated library functions.
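  The general idea of converting through a universal encoding can be
  sketched in Python. The example uses Python's built-in codecs as a
  stand-in for the iconv converters, and the codec names shown are Python's
  names, not the operating system's converter names. The convert function
  is hypothetical.

```python
def convert(data, from_codeset, to_codeset):
    """Convert bytes between codesets by way of Unicode text."""
    return data.decode(from_codeset).encode(to_codeset)

latin1_text = b"caf\xe9"  # "cafe" with e-acute in ISO 8859-1
utf8_text = convert(latin1_text, "iso8859-1", "utf-8")
print(utf8_text)                                  # b'caf\xc3\xa9'
print(convert(utf8_text, "utf-8", "iso8859-1"))   # b'caf\xe9'
```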


       There was a change in mapping of	Korean Hangul characters between Ver-
       sion 1.1	and Version 2.0	of the Unicode Standard. By default, UCS-2,
       UCS-4, and UTF-8	conversion assumes Version 2.0 character mapping for
       Hangul characters.  Therefore, if data is in Version 1.1	format,	the
       data must first be converted to Version 2.0 format before converting
       from UCS-2, UCS-4, or UTF-8 to an entirely different format. The	for-
       mat of a	codeset	converter name is from-codeset_to-codeset.  In con-
       verter names, the Version 1.1 codeset formats for UCS-2,	UCS-4, and
       UTF-8 are represented by	UNICODE-1-1, UNICODE-1-1-UCS-4,	and UNICODE-
       1-1-UTF-8, respectively.	The Version 2.0	codeset	names are represented
       by UCS-2, UCS-4,	and UTF-8. For example,	if Korean data is currently
       in UCS-4	Version	1.1 format, the	data must first	be processed by	the
       UNICODE-1-1-UCS-4_UCS-4 converter before	being processed	by the UCS-
       4_deckorean converter.
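  The converter naming rule and the two-step Korean conversion described in
  the preceding note can be sketched as follows. The function names are
  hypothetical. The code only builds converter names and does not check
  that a converter with a given name is installed.

```python
def converter_name(from_codeset, to_codeset):
    """Build a converter name in the from-codeset_to-codeset format."""
    return "%s_%s" % (from_codeset, to_codeset)

def conversion_chain(codesets):
    """Return the converter names needed to step through the given codesets."""
    return [converter_name(a, b) for a, b in zip(codesets, codesets[1:])]

# Korean data in UCS-4 Version 1.1 format is first brought to Version 2.0
# format and only then converted to the target codeset.
print(conversion_chain(["UNICODE-1-1-UCS-4", "UCS-4", "deckorean"]))
# ['UNICODE-1-1-UCS-4_UCS-4', 'UCS-4_deckorean']
```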

  See the iconv_intro(5) reference page for general information on codeset
  conversion.


  The following	locales	use UCS-4 as internal processing code:

    +  universal.UTF-8

       This locale converts data in UTF-8 file format to UCS-4 process code.
       The locale can be used to test any UCS-4	character to determine if it
       is included in one of the following classes defined for the LC_CTYPE
       category: alnum,	alpha, blank, cntrl, digit, graph, lower, print,
       punct, space, upper, or xdigit.

       In the universal.utf8@ucs4 locale, the LC_MESSAGES, LC_MONETARY,
       LC_NUMERIC, and LC_TIME category	definitions match those	for the	POSIX
       (C) locale.

    +  native_locale_name@ucs4

       These locales (for example, fr_FR.ISO8859-1@ucs4) perform the same
       function	as the universal.UTF-8 locale but are different	in the fol-
       lowing ways:

	 -- The	file code is specified by the codeset portion (for example,
	    ISO8859-1) of native_locale_name.

	 -- Classification information is not provided for the full set	of
	    UCS-4 characters, but only for those in a particular native
	    language (for example, French).

	 -- Country-specific data is also available to the application.	 The
	    category definitions match those defined in	native_locale_name.

    +  language_territory.UTF-8

       These locales (for example, fr_FR.UTF-8)	are similar to the @ucs4
       locales in limiting classification information to the characters	in a
       particular native language and making country-specific data available
       to the application. However, the	.UTF-8 locales assume file data	fol-
       lows UTF-8 encoding rules and are the only locales that support the
       euro monetary character (€).
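  The locale naming patterns listed above can be taken apart with a short
  Python sketch. It assumes the general layout
  language_territory.codeset@modifier, and the parse_locale_name helper is
  hypothetical. It inspects the name only and does not query the locales
  installed on a system.

```python
def parse_locale_name(name):
    """Split language_territory.codeset@modifier into its parts."""
    base, _, modifier = name.partition("@")
    lang_terr, _, codeset = base.partition(".")
    language, _, territory = lang_terr.partition("_")
    return {
        "language": language,
        "territory": territory,
        "codeset": codeset,    # the file code, for example ISO8859-1
        "modifier": modifier,  # "ucs4" for the @ucs4 locales
    }

for name in ("fr_FR.ISO8859-1@ucs4", "fr_FR.UTF-8", "universal.UTF-8"):
    parts = parse_locale_name(name)
    print(name, "->", parts["codeset"], parts["modifier"] or "(none)")
```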


	 The X locale database file used by applications running in the
	 universal.UTF-8, en_US.UTF-8, or Asian	locales	(Chinese, Japanese,
	 Korean) contains font definitions that	include	all the	various	fonts
	 used with the operating system. This enables applications under
	 en_US.UTF-8 to	display	all the	font characters	installed with World-
	 wide Language Support (WLS). Applications under the Asian locales
	 display all the font characters installed with	WLS, except for
	 ISO8859-2, -4,	-5, -7,	-8, -9,	and TACTIS.

  CDE desktop users can	select .UTF-8 locales by choosing names	followed by
  (Unicode) from the CDE language menu at session startup. In this case, the
  locale setting applies by default to all applications run during the CDE
  session.

  Unicode Character Database

  For the convenience of programmers, the source file for the Unicode charac-
  ter database (Version	3.0.0) is available online. This source	file is	the
  one used to build the	.UTF-8 locales provided	in optional software subsets
  included with	the operating system product. If the .UTF-8 locales are
  installed on your system, both the Unicode character database	and an asso-
  ciated ReadMe	file are also installed	in the /usr/share/unidata directory.
  The ReadMe file discusses the	character properties supported by Unicode.
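  A record from the character database can be read with a few lines of
  Python. The sketch assumes that the installed source file follows the
  semicolon-separated, 15-field layout of the standard UnicodeData file.
  The file name and its exact layout on a given system are assumptions, so
  the example parses a sample line rather than opening a file. The field
  names are illustrative labels.

```python
from collections import namedtuple

FIELDS = [
    "code", "name", "category", "combining", "bidi", "decomposition",
    "decimal", "digit", "numeric", "mirrored", "old_name", "comment",
    "upper", "lower", "title",
]
Record = namedtuple("Record", FIELDS)

def parse_record(line):
    """Split one semicolon-separated database line into named fields."""
    parts = line.rstrip("\n").split(";")
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    return Record(*parts)

rec = parse_record("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
print(rec.name, rec.category, rec.bidi, rec.lower)
# LATIN CAPITAL LETTER A Lu L 0061
```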

  Font Support

  The operating	system provides	the following types of bitmap fonts for	UCS

    +  Public domain Unicode fonts:


    +  Composite fonts that the	libfr_FGC font renderer	creates	by combining
       fonts available for other codesets

  These	fonts currently	cover only a subset of the characters in UCS.  Each
  of the ETL public domain fonts supports about	1000 characters, but does not
  include any characters for Chinese, Japanese,	or Korean. The composite
  fonts	created	by the font renderer are generated only	from fonts available
  for the ISO 8859-1 (Latin-1) and ISO 8859-15 (Latin-9) codesets.

  Refer	to iso8859-1(5)	and iso8859-15(5) for the names	of fonts available
  for Latin-1 and Latin-9 characters. Note that	the Latin-9 fonts, which
  include glyphs for the euro character, provide the best support for the
  language_territory.UTF-8 locales, which also support this character.

  For information on printer support and converting bitmap font	encoding to
  PostScript, see i18n_printing(5) and wwpsof(8).


  Commands: locale(1), wwpsof(8)

  Others: ascii(5), code_page(5), iso8859-1(5),	iso8859-15(5), i18n_intro(5),
  i18n_printing(5), iconv_intro(5), l10n_intro(5)