unixdev.net


Switch to SpeakEasy.net DSL

The Modular Manual Browser

Home Page
Manual: (OpenBSD-3.6)
Page:
Section:
Apropos / Subsearch:
optional field



Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


NAME
       Locale::Maketext::TPJ13 -- article about software local-
       ization

SYNOPSIS
         # This an article, not a module.

DESCRIPTION
       The following article by Sean M. Burke and Jordan Lachler
       first appeared in The Perl Journal #13 and is copyright
       1999 The Perl Journal. It appears courtesy of Jon Orwant
       and The Perl Journal.  This document may be distributed
       under the same terms as Perl itself.

Localization and Perl: gettext breaks, Maketext fixes
       by Sean M. Burke and Jordan Lachler

       This article points out cases where gettext (a common sys-
       tem for localizing software interfaces -- i.e., making
       them work in the user's language of choice) fails because
       of basic differences between human languages.  This arti-
       cle then describes Maketext, a new system capable of cor-
       rectly treating these differences.

       A Localization Horror Story: It Could Happen To You

           "There are a number of languages spoken by human
           beings in this world."

           -- Harald Tveit Alvestrand, in RFC 1766, "Tags for the
           Identification of Languages"

       Imagine that your task for the day is to localize a piece
       of software -- and luckily for you, the only output the
       program emits is two messages, like this:

         I scanned 12 directories.

         Your query matched 10 files in 4 directories.

       So how hard could that be?  You look at the code that pro-
       duces the first item, and it reads:

         printf("I scanned %g directories.",
                $directory_count);

       You think about that, and realize that it doesn't even
       work right for English, as it can produce this output:

         I scanned 1 directories.

       So you rewrite it to read:





perl v5.8.5                 2002-11-06                          1





Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


         printf("I scanned %g %s.",
                $directory_count,
                $directory_count == 1 ?
                  "directory" : "directories",
         );

       ...which does the Right Thing.  (In case you don't recall,
       "%g" is for locale-specific number interpolation, and "%s"
       is for string interpolation.)

       But you still have to localize it for all the languages
       you're producing this software for, so you pull
       Locale::gettext off of CPAN so you can access the "get-
       text" C functions you've heard are standard for localiza-
       tion tasks.

       And you write:

         printf(gettext("I scanned %g %s."),
                $dir_scan_count,
                $dir_scan_count == 1 ?
                  gettext("directory") : gettext("directories"),
         );

       But you then read in the gettext manual (Drepper, Miller,
       and Pinard 1995) that this is not a good idea, since how a
       single word like "directory" or "directories" is trans-
       lated may depend on context -- and this is true, since in
       a case language like German or Russian, you'd may need
       these words with a different case ending in the first
       instance (where the word is the object of a verb) than in
       the second instance, which you haven't even gotten to yet
       (where the word is the object of a preposition, "in %g
       directories") -- assuming these keep the same syntax when
       translated into those languages.

       So, on the advice of the gettext manual, you rewrite:

         printf( $dir_scan_count == 1 ?
                  gettext("I scanned %g directory.") :
                  gettext("I scanned %g directories."),
                $dir_scan_count );

       So, you email your various translators (the boss decides
       that the languages du jour are Chinese, Arabic, Russian,
       and Italian, so you have one translator for each), asking
       for translations for "I scanned %g directory." and "I
       scanned %g directories.".  When they reply, you'll put
       that in the lexicons for gettext to use when it localizes
       your software, so that when the user is running under the
       "zh" (Chinese) locale, gettext("I scanned %g directory.")
       will return the appropriate Chinese text, with a "%g" in
       there where printf can then interpolate $dir_scan.




perl v5.8.5                 2002-11-06                          2





Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


       Your Chinese translator emails right back -- he says both
       of these phrases translate to the same thing in Chinese,
       because, in linguistic jargon, Chinese "doesn't have num-
       ber as a grammatical category" -- whereas English does.
       That is, English has grammatical rules that refer to "num-
       ber", i.e., whether something is grammatically singular or
       plural; and one of these rules is the one that forces
       nouns to take a plural suffix (generally "s") when in a
       plural context, as they are when they follow a number
       other than "one" (including, oddly enough, "zero").  Chi-
       nese has no such rules, and so has just the one phrase
       where English has two.  But, no problem, you can have this
       one Chinese phrase appear as the translation for the two
       English phrases in the "zh" gettext lexicon for your pro-
       gram.

       Emboldened by this, you dive into the second phrase that
       your software needs to output: "Your query matched 10
       files in 4 directories.".  You notice that if you want to
       treat phrases as indivisible, as the gettext manual wisely
       advises, you need four cases now, instead of two, to cover
       the permutations of singular and plural on the two items,
       $dir_count and $file_count.  So you try this:

         printf( $file_count == 1 ?
           ( $directory_count == 1 ?
            gettext("Your query matched %g file in %g directory.") :
            gettext("Your query matched %g file in %g directories.") ) :
           ( $directory_count == 1 ?
            gettext("Your query matched %g files in %g directory.") :
            gettext("Your query matched %g files in %g directories.") ),
          $file_count, $directory_count,
         );

       (The case of "1 file in 2 [or more] directories" could, I
       suppose, occur in the case of symlinking or something of
       the sort.)

       It occurs to you that this is not the prettiest code
       you've ever written, but this seems the way to go.  You
       mail off to the translators asking for translations for
       these four cases.  The Chinese guy replies with the one
       phrase that these all translate to in Chinese, and that
       phrase has two "%g"s in it, as it should -- but there's a
       problem.  He translates it word-for-word back: "In %g
       directories contains %g files match your query."  The %g
       slots are in an order reverse to what they are in English.
       You wonder how you'll get gettext to handle that.

       But you put it aside for the moment, and optimistically
       hope that the other translators won't have this problem,
       and that their languages will be better behaved -- i.e.,
       that they will be just like English.




perl v5.8.5                 2002-11-06                          3





Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


       But the Arabic translator is the next to write back.
       First off, your code for "I scanned %g directory." or "I
       scanned %g directories."  assumes there's only singular or
       plural.  But, to use linguistic jargon again, Arabic has
       grammatical number, like English (but unlike Chinese), but
       it's a three-term category: singular, dual, and plural.
       In other words, the way you say "directory" depends on
       whether there's one directory, or two of them, or more
       than two of them.  Your test of "($directory == 1)" no
       longer does the job.  And it means that where English's
       grammatical category of number necessitates only the two
       permutations of the first sentence based on "directory
       [singular]" and "directories [plural]", Arabic has three
       -- and, worse, in the second sentence ("Your query matched
       %g file in %g directory."), where English has four, Arabic
       has nine.  You sense an unwelcome, exponential trend tak-
       ing shape.

       Your Italian translator emails you back and says that "I
       searched 0 directories" (a possible English output of your
       program) is stilted, and if you think that's fine English,
       that's your problem, but that just will not do in the lan-
       guage of Dante.  He insists that where $directory_count is
       0, your program should produce the Italian text for "I
       didn't scan any directories.".  And ditto for "I didn't
       match any files in any directories", although he says the
       last part about "in any directories" should probably just
       be left off.

       You wonder how you'll get gettext to handle this; to acco-
       modate the ways Arabic, Chinese, and Italian deal with
       numbers in just these few very simple phrases, you need to
       write code that will ask gettext for different queries
       depending on whether the numerical values in question are
       1, 2, more than 2, or in some cases 0, and you still
       haven't figured out the problem with the different word
       order in Chinese.

       Then your Russian translator calls on the phone, to per-
       sonally tell you the bad news about how really unpleasant
       your life is about to become:

       Russian, like German or Latin, is an inflectional lan-
       guage; that is, nouns and adjectives have to take endings
       that depend on their case (i.e., nominative, accusative,
       genitive, etc...) -- which is roughly a matter of what
       role they have in syntax of the sentence -- as well as on
       the grammatical gender (i.e., masculine, feminine, neuter)
       and number (i.e., singular or plural) of the noun, as well
       as on the declension class of the noun.  But unlike with
       most other inflected languages, putting a number-phrase
       (like "ten" or "forty-three", or their Arabic numeral
       equivalents) in front of noun in Russian can change the
       case and number that noun is, and therefore the endings



perl v5.8.5                 2002-11-06                          4





Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


       you have to put on it.

       He elaborates:  In "I scanned %g directories", you'd
       expect "directories" to be in the accusative case (since
       it is the direct object in the sentnce) and the plural
       number, except where $directory_count is 1, then you'd
       expect the singular, of course.  Just like Latin or Ger-
       man.  But!  Where $directory_count % 10 is 1 ("%" for mod-
       ulo, remember), assuming $directory count is an integer,
       and except where $directory_count % 100 is 11, "directo-
       ries" is forced to become grammatically singular, which
       means it gets the ending for the accusative singular...
       You begin to visualize the code it'd take to test for the
       problem so far, and still work for Chinese and Arabic and
       Italian, and how many gettext items that'd take, but he
       keeps going...  But where $directory_count % 10 is 2, 3,
       or 4 (except where $directory_count % 100 is 12, 13, or
       14), the word for "directories" is forced to be genitive
       singular -- which means another ending... The room begins
       to spin around you, slowly at first...  But with all other
       integer values, since "directory" is an inanimate noun,
       when preceded by a number and in the nominative or
       accusative cases (as it is here, just your luck!), it does
       stay plural, but it is forced into the genitive case --
       yet another ending...  And you never hear him get to the
       part about how you're going to run into similar (but maybe
       subtly different) problems with other Slavic languages
       like Polish, because the floor comes up to meet you, and
       you fade into unconsciousness.

       The above cautionary tale relates how an attempt at local-
       ization can lead from programmer consternation, to program
       obfuscation, to a need for sedation.  But careful evalua-
       tion shows that your choice of tools merely needed further
       consideration.

       The Linguistic View

           "It is more complicated than you think."

           -- The Eighth Networking Truth, from RFC 1925

       The field of Linguistics has expended a great deal of
       effort over the past century trying to find grammatical
       patterns which hold across languages; it's been a constant
       process of people making generalizations that should apply
       to all languages, only to find out that, all too often,
       these generalizations fail -- sometimes failing for just a
       few languages, sometimes whole classes of languages, and
       sometimes nearly every language in the world except
       English.  Broad statistical trends are evident in what the
       "average language" is like as far as what its rules can
       look like, must look like, and cannot look like.  But the
       "average language" is just as unreal a concept as the



perl v5.8.5                 2002-11-06                          5





Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


       "average person" -- it runs up against the fact no lan-
       guage (or person) is, in fact, average.  The wisdom of
       past experience leads us to believe that any given lan-
       guage can do whatever it wants, in any order, with appeal
       to any kind of grammatical categories wants -- case, num-
       ber, tense, real or metaphoric characteristics of the
       things that words refer to, arbitrary or predictable clas-
       sifications of words based on what endings or prefixes
       they can take, degree or means of certainty about the
       truth of statements expressed, and so on, ad infinitum.

       Mercifully, most localization tasks are a matter of find-
       ing ways to translate whole phrases, generally sentences,
       where the context is relatively set, and where the only
       variation in content is usually in a number being
       expressed -- as in the example sentences above.  Translat-
       ing specific, fully-formed sentences is, in practice,
       fairly foolproof -- which is good, because that's what's
       in the phrasebooks that so many tourists rely on.  Now, a
       given phrase (whether in a phrasebook or in a gettext lex-
       icon) in one language might have a greater or lesser
       applicability than that phrase's translation into another
       language -- for example, strictly speaking, in Arabic, the
       "your" in "Your query matched..." would take a different
       form depending on whether the user is male or female; so
       the Arabic translation "your[feminine] query" is applica-
       ble in fewer cases than the corresponding English phrase,
       which doesn't distinguish the user's gender.  (In prac-
       tice, it's not feasable to have a program know the user's
       gender, so the masculine "you" in Arabic is usually used,
       by default.)

       But in general, such surprises are rare when entire sen-
       tences are being translated, especially when the func-
       tional context is restricted to that of a computer inter-
       acting with a user either to convey a fact or to prompt
       for a piece of information.  So, for purposes of localiza-
       tion, translation by phrase (generally by sentence) is
       both the simplest and the least problematic.

       Breaking gettext

           "It Has To Work."

           -- First Networking Truth, RFC 1925

       Consider that sentences in a tourist phrasebook are of two
       types: ones like "How do I get to the marketplace?" that
       don't have any blanks to fill in, and ones like "How much
       do these ___ cost?", where there's one or more blanks to
       fill in (and these are usually linked to a list of words
       that you can put in that blank: "fish", "potatoes", "toma-
       toes", etc.)  The ones with no blanks are no problem, but
       the fill-in-the-blank ones may not be really



perl v5.8.5                 2002-11-06                          6





Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


       straightforward. If it's a Swahili phrasebook, for exam-
       ple, the authors probably didn't bother to tell you the
       complicated ways that the verb "cost" changes its inflec-
       tional prefix depending on the noun you're putting in the
       blank.  The trader in the marketplace will still under-
       stand what you're saying if you say "how much do these
       potatoes cost?" with the wrong inflectional prefix on
       "cost".  After all, you can't speak proper Swahili, you're
       just a tourist.  But while tourists can be stupid, comput-
       ers are supposed to be smart; the computer should be able
       to fill in the blank, and still have the results be gram-
       matical.

       In other words, a phrasebook entry takes some values as
       parameters (the things that you fill in the blank or
       blanks), and provides a value based on these parameters,
       where the way you get that final value from the given val-
       ues can, properly speaking, involve an arbitrarily complex
       series of operations.  (In the case of Chinese, it'd be
       not at all complex, at least in cases like the examples at
       the beginning of this article; whereas in the case of Rus-
       sian it'd be a rather complex series of operations.  And
       in some languages, the complexity could be spread around
       differently: while the act of putting a number-expression
       in front of a noun phrase might not be complex by itself,
       it may change how you have to, for example, inflect a verb
       elsewhere in the sentence.  This is what in syntax is
       called "long-distance dependencies".)

       This talk of parameters and arbitrary complexity is just
       another way to say that an entry in a phrasebook is what
       in a programming language would be called a "function".
       Just so you don't miss it, this is the crux of this arti-
       cle: A phrase is a function; a phrasebook is a bunch of
       functions.

       The reason that using gettext runs into walls (as in the
       above second-person horror story) is that you're trying to
       use a string (or worse, a choice among a bunch of strings)
       to do what you really need a function for -- which is
       futile.  Preforming (s)printf interpolation on the strings
       which you get back from gettext does allow you to do some
       common things passably well... sometimes... sort of; but,
       to paraphrase what some people say about "csh" script pro-
       gramming, "it fools you into thinking you can use it for
       real things, but you can't, and you don't discover this
       until you've already spent too much time trying, and by
       then it's too late."

       Replacing gettext

       So, what needs to replace gettext is a system that sup-
       ports lexicons of functions instead of lexicons of
       strings.  An entry in a lexicon from such a system should



perl v5.8.5                 2002-11-06                          7





Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


       not look like this:

         "J'ai trouv\xE9 %g fichiers dans %g r\xE9pertoires"

       [\xE9 is e-acute in Latin-1.  Some pod renderers would
       scream if I used the actual character here. -- SB]

       but instead like this, bearing in mind that this is just a
       first stab:

         sub I_found_X1_files_in_X2_directories {
           my( $files, $dirs ) = @_[0,1];
           $files = sprintf("%g %s", $files,
             $files == 1 ? 'fichier' : 'fichiers');
           $dirs = sprintf("%g %s", $dirs,
             $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
           return "J'ai trouv\xE9 $files dans $dirs.";
         }

       Now, there's no particularly obvious way to store anything
       but strings in a gettext lexicon; so it looks like we just
       have to start over and make something better, from
       scratch.  I call my shot at a gettext-replacement system
       "Maketext", or, in CPAN terms, Locale::Maketext.

       When designing Maketext, I chose to plan its main features
       in terms of "buzzword compliance".  And here are the buz-
       zwords:

       Buzzwords: Abstraction and Encapsulation

       The complexity of the language you're trying to output a
       phrase in is entirely abstracted inside (and encapsulated
       within) the Maketext module for that interface.  When you
       call:

         print $lang->maketext("You have [quant,_1,piece] of new mail.",
                              scalar(@messages));

       you don't know (and in fact can't easily find out) whether
       this will involve lots of figuring, as in Russian (if
       $lang is a handle to the Russian module), or relatively
       little, as in Chinese.  That kind of abstraction and
       encapsulation may encourage other pleasant buzzwords like
       modularization and stratification, depending on what
       design decisions you make.

       Buzzword: Isomorphism

       "Isomorphism" means "having the same structure or form";
       in discussions of program design, the word takes on the
       special, specific meaning that your implementation of a
       solution to a problem has the same structure as, say, an
       informal verbal description of the solution, or maybe of



perl v5.8.5                 2002-11-06                          8





Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


       the problem itself.  Isomorphism is, all things consid-
       ered, a good thing -- it's what problem-solving (and solu-
       tion-implementing) should look like.

       What's wrong the with gettext-using code like this...

         printf( $file_count == 1 ?
           ( $directory_count == 1 ?
            "Your query matched %g file in %g directory." :
            "Your query matched %g file in %g directories." ) :
           ( $directory_count == 1 ?
            "Your query matched %g files in %g directory." :
            "Your query matched %g files in %g directories." ),
          $file_count, $directory_count,
         );

       is first off that it's not well abstracted -- these ways
       of testing for grammatical number (as in the expressions
       like "foo == 1 ?  singular_form : plural_form") should be
       abstracted to each language module, since how you get
       grammatical number is language-specific.

       But second off, it's not isomorphic -- the "solution"
       (i.e., the phrasebook entries) for Chinese maps from these
       four English phrases to the one Chinese phrase that fits
       for all of them.  In other words, the informal solution
       would be "The way to say what you want in Chinese is with
       the one phrase 'For your question, in Y directories you
       would find X files'" -- and so the implemented solution
       should be, isomorphically, just a straightforward way to
       spit out that one phrase, with numerals properly interpo-
       lated.  It shouldn't have to map from the complexity of
       other languages to the simplicity of this one.

       Buzzword: Inheritance

       There's a great deal of reuse possible for sharing of
       phrases between modules for related dialects, or for shar-
       ing of auxiliary functions between related languages.  (By
       "auxiliary functions", I mean functions that don't produce
       phrase-text, but which, say, return an answer to "does
       this number require a plural noun after it?".  Such auxil-
       iary functions would be used in the internal logic of
       functions that actually do produce phrase-text.)

       In the case of sharing phrases, consider that you have an
       interface already localized for American English (probably
       by having been written with that as the native locale, but
       that's incidental).  Localizing it for UK English should,
       in practical terms, be just a matter of running it past a
       British person with the instructions to indicate what few
       phrases would benefit from a change in spelling or possi-
       bly minor rewording.  In that case, you should be able to
       put in the UK English localization module only those



perl v5.8.5                 2002-11-06                          9





Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


       phrases that are UK-specific, and for all the rest,
       inherit from the American English module.  (And I expect
       this same situation would apply with Brazilian and Conti-
       nental Portugese, possbily with some very closely related
       languages like Czech and Slovak, and possibly with the
       slightly different "versions" of written Mandarin Chinese,
       as I hear exist in Taiwan and mainland China.)

       As to sharing of auxiliary functions, consider the problem
       of Russian numbers from the beginning of this article;
       obviously, you'd want to write only once the hairy code
       that, given a numeric value, would return some specifica-
       tion of which case and number a given quanitified noun
       should use.  But suppose that you discover, while localiz-
       ing an interface for, say, Ukranian (a Slavic language
       related to Russian, spoken by several million people, many
       of whom would be relieved to find that your Web site's or
       software's interface is available in their language), that
       the rules in Ukranian are the same as in Russian for quan-
       tification, and probably for many other grammatical func-
       tions.  While there may well be no phrases in common
       between Russian and Ukranian, you could still choose to
       have the Ukranian module inherit from the Russian module,
       just for the sake of inheriting all the various grammati-
       cal methods.  Or, probably better organizationally, you
       could move those functions to a module called "_E_Slavic"
       or something, which Russian and Ukranian could inherit
       useful functions from, but which would (presumably) pro-
       vide no lexicon.

       Buzzword: Concision

       Okay, concision isn't a buzzword.  But it should be, so I
       decree that as a new buzzword, "concision" means that sim-
       ple common things should be expressible in very few lines
       (or maybe even just a few characters) of code -- call it a
       special case of "making simple things easy and hard things
       possible", and see also the role it played in the
       MIDI::Simple language, discussed elsewhere in this issue
       [TPJ#13].

       Consider our first stab at an entry in our "phrasebook of
       functions":

         sub I_found_X1_files_in_X2_directories {
           my( $files, $dirs ) = @_[0,1];
           $files = sprintf("%g %s", $files,
             $files == 1 ? 'fichier' : 'fichiers');
           $dirs = sprintf("%g %s", $dirs,
             $dirs == 1 ? "r\xE9pertoire" : "r\xE9pertoires");
           return "J'ai trouv\xE9 $files dans $dirs.";
         }

       You may sense that a lexicon (to use a non-committal



perl v5.8.5                 2002-11-06                         10





Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


       catch-all term for a collection of things you know how to
       say, regardless of whether they're phrases or words) con-
       sisting of functions expressed as above would make for
       rather long-winded and repetitive code -- even if you
       wisely rewrote this to have quantification (as we call
       adding a number expression to a noun phrase) be a function
       called like:

         sub I_found_X1_files_in_X2_directories {
           my( $files, $dirs ) = @_[0,1];
           $files = quant($files, "fichier");
           $dirs =  quant($dirs,  "r\xE9pertoire");
           return "J'ai trouv\xE9 $files dans $dirs.";
         }

       And you may also sense that you do not want to bother your
       translators with having to write Perl code -- you'd much
       rather that they spend their very costly time on just
       translation.  And this is to say nothing of the near
       impossibility of finding a commercial translator who would
       know even simple Perl.

       In a first-hack implementation of Maketext, each lan-
       guage-module's lexicon looked like this:

        %Lexicon = (
          "I found %g files in %g directories"
          => sub {
             my( $files, $dirs ) = @_[0,1];
             $files = quant($files, "fichier");
             $dirs =  quant($dirs,  "r\xE9pertoire");
             return "J'ai trouv\xE9 $files dans $dirs.";
           },
         ... and so on with other phrase => sub mappings ...
        );

       but I immediately went looking for some more concise way
       to basically denote the same phrase-function -- a way that
       would also serve to concisely denote most phrase-functions
       in the lexicon for most languages.  After much time and
       even some actual thought, I decided on this system:

       * Where a value in a %Lexicon hash is a contentful string
       instead of an anonymous sub (or, conceivably, a coderef),
       it would be interpreted as a sort of shorthand expression
       of what the sub does.  When accessed for the first time in
       a session, it is parsed, turned into Perl code, and then
       eval'd into an anonymous sub; then that sub replaces the
       original string in that lexicon.  (That way, the work of
       parsing and evaling the shorthand form for a given phrase
       is done no more than once per session.)

       * Calls to "maketext" (as Maketext's main function is
       called) happen thru a "language session handle",



perl v5.8.5                 2002-11-06                         11





Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


       notionally very much like an IO handle, in that you open
       one at the start of the session, and use it for "sending
       signals" to an object in order to have it return the text
       you want.

       So, this:

         $lang->maketext("You have [quant,_1,piece] of new mail.",
                        scalar(@messages));

       basically means this: look in the lexicon for $lang (which
       may inherit from any number of other lexicons), and find
       the function that we happen to associate with the string
       "You have [quant,_1,piece] of new mail" (which is, and
       should be, a functioning "shorthand" for this function in
       the native locale -- English in this case).  If you find
       such a function, call it with $lang as its first parameter
       (as if it were a method), and then a copy of scalar(@mes-
       sages) as its second, and then return that value.  If that
       function was found, but was in string shorthand instead of
       being a fully specified function, parse it and make it
       into a function before calling it the first time.

       * The shorthand uses code in brackets to indicate method
       calls that should be performed.  A full explanation is not
       in order here, but a few examples will suffice:

         "You have [quant,_1,piece] of new mail."

       The above code is shorthand for, and will be interpreted
       as, this:

         sub {
           my $handle = $_[0];
           my(@params) = @_;
           return join '',
             "You have ",
             $handle->quant($params[1], 'piece'),
             "of new mail.";
         }

       where "quant" is the name of a method you're using to
       quantify the noun "piece" with the number $params[0].

       A string with no brackety calls, like this:

         "Your search expression was malformed."

       is somewhat of a degerate case, and just gets turned into:

         sub { return "Your search expression was malformed." }

       However, not everything you can write in Perl code can be
       written in the above shorthand system -- not by a long



perl v5.8.5                 2002-11-06                         12





Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


       shot.  For example, consider the Italian translator from
       the beginning of this article, who wanted the Italian for
       "I didn't find any files" as a special case, instead of "I
       found 0 files".  That couldn't be specified (at least not
       easily or simply) in our shorthand system, and it would
       have to be written out in full, like this:

         sub {  # pretend the English strings are in Italian
           my($handle, $files, $dirs) = @_[0,1,2];
           return "I didn't find any files" unless $files;
           return join '',
             "I found ",
             $handle->quant($files, 'file'),
             " in ",
             $handle->quant($dirs,  'directory'),
             ".";
         }

       Next to a lexicon full of shorthand code, that sort of
       sticks out like a sore thumb -- but this is a special
       case, after all; and at least it's possible, if not as
       concise as usual.

       As to how you'd implement the Russian example from the
       beginning of the article, well, There's More Than One Way
       To Do It, but it could be something like this (using
       English words for Russian, just so you know what's going
       on):

         "I [quant,_1,directory,accusative] scanned."

       This shifts the burden of complexity off to the quant
       method.  That method's parameters are: the numeric value
       it's going to use to quantify something; the Russian word
       it's going to quantify; and the parameter "accusative",
       which you're using to mean that this sentence's syntax
       wants a noun in the accusative case there, although that
       quantification method may have to overrule, for grammati-
       cal reasons you may recall from the beginning of this
       article.

       Now, the Russian quant method here is responsible not only
       for implementing the strange logic necessary for figuring
       out how Russian number-phrases impose case and number on
       their noun-phrases, but also for inflecting the Russian
       word for "directory".  How that inflection is to be car-
       ried out is no small issue, and among the solutions I've
       seen, some (like variations on a simple lookup in a hash
       where all possible forms are provided for all necessary
       words) are straightforward but can become cumbersome when
       you need to inflect more than a few dozen words; and other
       solutions (like using algorithms to model the inflections,
       storing only root forms and irregularities) can involve
       more overhead than is justifiable for all but the largest



perl v5.8.5                 2002-11-06                         13





Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


       lexicons.

       Mercifully, this design decision becomes crucial only in
       the hairiest of inflected languages, of which Russian is
       by no means the worst case scenario, but is worse than
       most.  Most languages have simpler inflection systems; for
       example, in English or Swahili, there are generally no
       more than two possible inflected forms for a given noun
       ("error/errors"; "kosa/makosa"), and the rules for produc-
       ing these forms are fairly simple -- or at least, simple
       rules can be formulated that work for most words, and you
       can then treat the exceptions as just "irregular", at
       least relative to your ad hoc rules.  A simpler inflection
       system (simpler rules, fewer forms) means that design
       decisions are less crucial to maintaining sanity, whereas
       the same decisions could incur overhead-versus-scalability
       problems in languages like Russian.  It may also be likely
       that code (possibly in Perl, as with Lingua::EN::Inflect,
       for English nouns) has already been written for the lan-
       guage in question, whether simple or complex.

       Moreover, a third possibility may even be simpler than
       anything discussed above: "Just require that all possible
       (or at least applicable) forms be provided in the call to
       the given language's quant method, as in:"

         "I found [quant,_1,file,files]."

       That way, quant just has to chose which form it needs,
       without having to look up or generate anything.  While
       possibly not optimal for Russian, this should work well
       for most other languages, where quantification is not as
       complicated an operation.

       The Devil in the Details

       There's plenty more to Maketext than described above --
       for example, there's the details of how language tags
       ("en-US", "i-pwn", "fi", etc.) or locale IDs ("en_US")
       interact with actual module naming ("Bogo-
       Query/Locale/en_us.pm"), and what magic can ensue; there's
       the details of how to record (and possibly negotiate) what
       character encoding Maketext will return text in (UTF8?
       Latin-1? KOI8?).  There's the interesting fact that Make-
       text is for localization, but nowhere actually has a ""use
       locale;"" anywhere in it.  For the curious, there's the
       somewhat frightening details of how I actually implement
       something like data inheritance so that searches across
       modules' %Lexicon hashes can parallel how Perl implements
       method inheritance.

       And, most importantly, there's all the practical details
       of how to actually go about deriving from Maketext so you
       can use it for your interfaces, and the various tools and



perl v5.8.5                 2002-11-06                         14





Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


       conventions for starting out and maintaining individual
       language modules.

       That is all covered in the documentation for Locale::Make-
       text and the modules that come with it, available in CPAN.
       After having read this article, which covers the why's of
       Maketext, the documentation, which covers the how's of it,
       should be quite straightfoward.

       The Proof in the Pudding: Localizing Web Sites

       Maketext and gettext have a notable difference: gettext is
       in C, accessible thru C library calls, whereas Maketext is
       in Perl, and really can't work without a Perl interpreter
       (although I suppose something like it could be written for
       C).  Accidents of history (and not necessarily lucky ones)
       have made C++ the most common language for the implementa-
       tion of applications like word processors, Web browsers,
       and even many in-house applications like custom query sys-
       tems.  Current conditions make it somewhat unlikely that
       the next one of any of these kinds of applications will be
       written in Perl, albeit clearly more for reasons of custom
       and inertia than out of consideration of what is the right
       tool for the job.

       However, other accidents of history have made Perl a well-
       accepted language for design of server-side programs (gen-
       erally in CGI form) for Web site interfaces.  Localization
       of static pages in Web sites is trivial, feasable either
       with simple language-negotiation features in servers like
       Apache, or with some kind of server-side inclusions of
       language-appropriate text into layout templates.  However,
       I think that the localization of Perl-based search systems
       (or other kinds of dynamic content) in Web sites, be they
       public or access-restricted, is where Maketext will see
       the greatest use.

       I presume that it would be only the exceptional Web site
       that gets localized for English and Chinese and Italian
       and Arabic and Russian, to recall the languages from the
       beginning of this article -- to say nothing of German,
       Spanish, French, Japanese, Finnish, and Hindi, to name a
       few languages that benefit from large numbers of program-
       mers or Web viewers or both.

       However, the ever-increasing internationalization of the
       Web (whether measured in terms of amount of content, of
       numbers of content writers or programmers, or of size of
       content audiences) makes it increasingly likely that the
       interface to the average Web-based dynamic content service
       will be localized for two or maybe three languages.  It is
       my hope that Maketext will make that task as simple as
       possible, and will remove previous barriers to localiza-
       tion for languages dissimilar to English.



perl v5.8.5                 2002-11-06                         15





Locale::Maketext:Perl1Programmers RefeLocale::Maketext::TPJ13(3p)


        __END__

       Sean M. Burke (sburkeATcpan.org) has a Master's in linguis-
       tics from Northwestern University; he specializes in lan-
       guage technology.  Jordan Lachler (lachlerATunm.edu) is a
       PhD student in the Department of Linguistics at the Uni-
       versity of New Mexico; he specializes in morphology and
       pedagogy of North American native languages.

       References

       Alvestrand, Harald Tveit.  1995.  RFC 1766: Tags for the
       Identification of Languages.
       "ftp://ftp.isi.edu/in-notes/rfc1766.txt" [Now see RFC
       3066.]

       Callon, Ross, editor.  1996.  RFC 1925: The Twelve Net-
       working Truths.  "ftp://ftp.isi.edu/in-notes/rfc1925.txt"

       Drepper, Ulrich, Peter Miller, and Francois Pinard.
       1995-2001.  GNU "gettext".  Available in
       "ftp://prep.ai.mit.edu/pub/gnu/", with extensive docs in
       the distribution tarball.  [Since I wrote this article in
       1998, I now see that the gettext docs are now trying more
       to come to terms with plurality.  Whether useful conclu-
       sions have come from it is another question altogether. --
       SMB, May 2001]

       Forbes, Nevill.  1964.  Russian Grammar.  Third Edition,
       revised by J. C. Dumbreck.  Oxford University Press.



























perl v5.8.5                 2002-11-06                         16