Translation - Full Support for Multiple Languages

Overview of this Google Summer of Code 2012 Student Project at Outset - May 9, 2012

PL/Vesuvius and Multiple Languages

Mentors: Greg Miernicki, Glenn Pearson (main author of this)
Student: Ramindu Deshapriya

Background - During Last Year’s Google Summer of Code

Project Idea at the Outset: Reintegrate Pootle Translations

Pootle is an independent methodology for enabling and managing the quick, distributed translation of software into multiple languages. It is particularly apt for short prompt words and enumerations, where context is not too important. The codebase used in Vesuvius has drifted away from the use of Pootle, instead working towards “Resource Page” infrastructure for custom pages per-event/per-language. This should work well for high-quality full-page translations, particularly when only a limited number of languages are involved. These two approaches can be seen as complementary. This project thus involves re-invigorating the use of Pootle, and integrating it with the Resource Page system, to benefit content producers and translators.

Basics of Pootle

As noted in Wikipedia, Pootle’s main focus is on localization of applications' graphical user interfaces as opposed to document translation. For PHP projects, the basic Pootle methodology for developers (and students assisting them) involves decorating the strings to be translated in the PHP source code with a _t(…) function call, e.g.:

$prompt = _t("First name") . ": ";

(Such decoration sometimes requires code restructuring, e.g., if the string includes HTML attributes like bolding.) Then, at run time, if a language pack is installed and selected, the _t(…) function looks up and substitutes the translated phrase for the original phrase. On the backend, the _t() function uses a “gettext” library native to Sahana to perform the string lookup and substitution.
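
The lookup-and-substitute behavior can be illustrated with a minimal sketch. This is not Sahana's actual _t() implementation: the in-memory array below is a hypothetical stand-in for an installed language pack, and the Spanish entry is sample data.

```php
<?php
// Hypothetical in-memory catalog standing in for an installed language pack.
$GLOBALS['demo_catalog'] = [
    'First name' => 'Nombre',
];

/**
 * Illustrative stand-in for _t(): return the translated phrase if the
 * catalog has one, otherwise return the original English phrase.
 */
function _t(string $msgid): string
{
    $translated = $GLOBALS['demo_catalog'][$msgid] ?? '';
    return $translated !== '' ? $translated : $msgid;
}

$prompt = _t("First name") . ": ";   // "Nombre: "
$other  = _t("Last name") . ": ";    // no entry, so "Last name: "
```

Because a missing entry simply returns the original string, untranslated phrases degrade to English rather than breaking the page.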

The overall process to deploy a Pootle-based system is as follows:

Whenever there are new phrases to translate, the administrator runs a script over each module of the source code that extracts (using the “xgettext” tool) the strings found within _t(). These go into *.pot files. (Let’s pretend we have just a single module to deal with, so a single .pot file; “pot” stands for PO template.) Then a script that calls “msgmerge” merges any new, not-yet-translated strings with previously translated strings. It updates every appropriate .po file, e.g., “sp.po” for a Spanish translation.

The updated, posted .po files will then appear to the translators of particular languages as not 100% done. They can step through the new phrases and translate them.

When a translation is done (or any time), the administrator runs another script that calls “msgfmt”, to create a “.mo” file (binary Machine Object) from the corresponding .po file. This format is optimized for runtime string substitution. Any missing translation phrases fall back on the English string.
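
As a rough sketch of what the three administrator scripts might invoke, the PHP below builds (but does not run) one command line per step. The file names are hypothetical examples, and the actual project scripts may use different options; only standard GNU gettext flags are used here.

```php
<?php
/**
 * Build the three gettext command lines for a single module.
 * All paths passed in are hypothetical examples.
 */
function gettext_commands(array $phpFiles, string $pot, string $po, string $mo): array
{
    $files = implode(' ', array_map('escapeshellarg', $phpFiles));
    return [
        // 1. Extract strings wrapped in _t() into the .pot template.
        'extract' => 'xgettext --language=PHP --keyword=_t --from-code=UTF-8 -o '
            . escapeshellarg($pot) . ' ' . $files,
        // 2. Merge new template entries into the existing translations.
        'merge'   => 'msgmerge --update ' . escapeshellarg($po) . ' ' . escapeshellarg($pot),
        // 3. Compile the .po file into the binary .mo used at run time.
        'compile' => 'msgfmt -o ' . escapeshellarg($mo) . ' ' . escapeshellarg($po),
    ];
}

$commands = gettext_commands(['mod/ha/main.inc'], 'ha.pot', 'sp.po', 'sp.mo');
foreach ($commands as $step => $cmd) {
    echo $step, ': ', $cmd, PHP_EOL;
}
```

Printing the commands rather than executing them makes it easy to review them before wiring them into a script.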

General features of a Pootle server are described at SourceForge.

A 10-minute video from a translator’s perspective is on YouTube.

GSoC 2011 Work

Student Ramindu Deshapriya worked on this project; see the 2011 project page he created.

There are several ways to use the Pootle architecture without specifically using the traditional stand-alone Pootle server. In particular, there’s a version of Pootle that is integrated into Launchpad, and that seemed potentially advantageous to use in terms of simplified overall project management. The student was asked to investigate both approaches.

It was noticed that one of the existing limitations of Pootle as implemented is that all translations of a given English string are the same across all modules. A better approach would sometimes allow different translations, based on context. The student was asked to explore different ways of expressing context in the code, and test how they worked with Pootle server and Launchpad Pootle environments. The context string has to be visible to the developer and translator, but not explicitly shown in the runtime HTML.
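
One way to express context, sketched below, follows gettext's own msgctxt convention, in which a context-qualified entry is keyed as the context, an EOT character ("\x04"), then the English string. PHP's gettext extension has no built-in pgettext(), so a helper like this is a common workaround. The function name _t_ctx(), the context labels, and the catalog are all hypothetical; the source does not say which mechanism the student adopted.

```php
<?php
// Build the lookup key gettext uses for a context-qualified entry.
function ctx_key(string $context, string $msgid): string
{
    return $context . "\x04" . $msgid;
}

// Hypothetical catalog: one English string, two translations by context.
$GLOBALS['demo_ctx_catalog'] = [
    ctx_key('Hospital Administration - Button', 'Open') => 'Abrir',
    ctx_key('Hospital Administration - Status', 'Open') => 'Abierto',
];

/**
 * Illustrative context-aware lookup. On a miss it returns only the
 * English string, so the context never reaches the runtime HTML.
 */
function _t_ctx(string $context, string $msgid): string
{
    return $GLOBALS['demo_ctx_catalog'][ctx_key($context, $msgid)] ?? $msgid;
}

$button = _t_ctx('Hospital Administration - Button', 'Open');  // "Abrir"
$status = _t_ctx('Hospital Administration - Status', 'Open');  // "Abierto"
$miss   = _t_ctx('Some Other Module - Title', 'Open');         // "Open"
```

Returning the bare English string on a miss is the design point: the context stays visible to developers and translators but is never rendered.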

While the original Sahana code used _t(…) functions, new code created by NLM largely neglected this. Adding _t(…) to a string varies in complexity from trivial to complicated (e.g., when HTML attributes are involved). To get a working system going, effort was concentrated on only one module, Hospital Administration.

When the translation for a string is missing, instead of falling back to English, the Google Translate API could be called.

PL’s Resource Pages are more like documents, and so could lend themselves (at least for certain languages) to a higher-quality translation that takes the full text into account. Some initial consideration of how this might be done was attempted (at a time when the Resource Page editing functionality was being added, with contents moved into the database).

Subsequent Change to the Terms of Service of the Google Translate API

In Dec. 2011, the Google Translate API changed from a no-cost to a paid service, metered at $20 per 1 million characters.
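
To give a feel for what that rate means, here is a back-of-the-envelope estimate. The workload figures (string count, average length, number of languages) are hypothetical and not measured from Vesuvius.

```php
<?php
// Estimate the cost of machine-translating a given number of characters.
function translate_cost_usd(int $characters, float $usdPerMillion = 20.0): float
{
    return $characters / 1000000 * $usdPerMillion;
}

// Hypothetical workload: 5,000 UI strings averaging 30 characters each,
// translated into 10 languages.
$characters = 5000 * 30 * 10;            // 1,500,000 characters
echo translate_cost_usd($characters);    // prints 30
```

A one-off pass over the UI strings is therefore cheap under these assumptions; the cost concern is mainly repeated queries for the same strings at run time.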

This Year’s Google Summer of Code Follow-on Work

Project Idea at the Outset: Full Support for Multiple Languages

The codebase from which Vesuvius evolved originally supported Pootle, an independent methodology for managing the quick, distributed translation of software into multiple languages. Vesuvius now uses a “Resource Page” infrastructure for custom pages per-event/per-language. Pootle works well for short prompt words and enumerations where context is not too important, and the “Resource Page” works well for high-quality, full-page translations that involve a limited number of languages. These two approaches are complementary, and both can use the Google Translate API to “fill in the blanks” for missing string translations (pending NLM obtaining a paid license to access this API).

During GSoC 2011, significant progress was made in re-integrating Pootle into the current Resource Page system. An experienced PHP programmer can help us make these mechanisms production-worthy and extend them to handle smart versioning. Another PHP task requiring less PHP experience is to add the required code markups (i.e., _t() functions) to translatable strings in the remaining modules, which sometimes involves code restructuring.

Student Proposal

Ramindu Deshapriya will continue the follow-on work and has written an extensive proposal, which will be summarized here. Principal proposed deliverables are:

  • A version of Vesuvius where Pootle-based translations will be fully implemented along with contexts which differentiate UI strings based on the needs of each module/situation. (That is, there will be no technical limitations to the participation of translators and incorporation of their work across the breadth of user-viewable GUI elements.)
  • For administrators, a translation module that runs on the back-end of Vesuvius, to:
    • check for available translation updates and files from the Launchpad repository;
    • alternatively configure a Pootle server to poll for updates, instead of Launchpad;
    • run the script to convert *.po files into *.mo files;
    • set a default locale (e.g., per install; per disaster event)
  • A revised Resource Pages system that will preferentially use high-quality pre-translated versions for certain languages, stored in the Vesuvius database. Only when a needed specific-language version is missing will it fall back on the GSoC 2011 approach of using Google Translate (pending license acquisition).
  • A caching scheme for the output of the Google Translate API. Caching can minimize the number of API queries and thus potential cost.
  • Streamlining of the translation flow, with more locale features available in the Vesuvius User Interface.
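
The proposed Resource Page preference order might look roughly like the sketch below. The array stands in for rows in the Vesuvius database, and the translator is a placeholder closure rather than a real Google Translate call; both are assumptions made for illustration.

```php
<?php
/**
 * Pick the content for a Resource Page.
 * $stored maps language code => pre-translated page text (a hypothetical
 * stand-in for database rows). $machineTranslate is any callable taking
 * (text, language) and returning translated text.
 */
function resource_page(array $stored, string $lang, callable $machineTranslate, string $sourceLang = 'en'): array
{
    if (isset($stored[$lang])) {
        // A high-quality pre-translated version exists: prefer it.
        return ['text' => $stored[$lang], 'source' => 'human'];
    }
    // Otherwise fall back to machine translation of the source-language page.
    return ['text' => $machineTranslate($stored[$sourceLang], $lang), 'source' => 'machine'];
}

$stored = [
    'en' => 'How to report a missing person',
    'es' => 'Cómo reportar a una persona desaparecida',
];
$fakeTranslator = function (string $text, string $lang): string {
    return "[$lang] $text";   // placeholder, not a real API call
};

$es = resource_page($stored, 'es', $fakeTranslator);  // source: human
$fr = resource_page($stored, 'fr', $fakeTranslator);  // source: machine
```

Returning the source alongside the text would let the page indicate when a reader is seeing a machine translation.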

Recent Developments (up to early May)
  • Ramindu fixed a bug left over from GSoC 2011 that was causing context strings to appear in the PL/Vesuvius GUI.
  • The proposal contains an initial timeline and milestones, which are now being refined, e.g., the order of adding _t() functions to modules is being established.
  • A convention for context strings, of “Module Name – Usage Type – (Optional) Purpose”, was just agreed on, and a draft enumeration was started of the standardized module names and usage types (e.g., button, title, body text).
  • Possible cache implementations were discussed, e.g., to have two versions of each .po file, one done by humans and possibly incomplete, the other by the Google Translate API; the latter would serve as the cache.
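
The two-catalog cache idea from the last item can be sketched as follows, assuming both catalogs are simple arrays keyed by the English string. The translator closure is a hypothetical placeholder for the Google Translate API, and the function name is made up for illustration; the discussed design had not been finalized.

```php
<?php
/**
 * Look up $msgid in the human catalog first, then in the machine catalog
 * (the cache), and only then ask the translator callable.
 * $translator is hypothetical: it takes a string and returns a
 * translation, or null if no machine translation is available.
 */
function translate_cached(string $msgid, array $human, array &$machineCache, callable $translator): string
{
    if (($human[$msgid] ?? '') !== '') {
        return $human[$msgid];               // human translation wins
    }
    if (!isset($machineCache[$msgid])) {
        $result = $translator($msgid);       // the only metered step
        if ($result === null) {
            return $msgid;                   // fall back to English, cache nothing
        }
        $machineCache[$msgid] = $result;
    }
    return $machineCache[$msgid];
}

$calls = 0;
$fakeApi = function (string $msgid) use (&$calls): ?string {
    $calls++;
    return "[mt] $msgid";                    // placeholder, not a real API call
};
$human = ['Save' => 'Guardar', 'Cancel' => ''];   // 'Cancel' not yet translated
$machineCache = [];

$a = translate_cached('Save', $human, $machineCache, $fakeApi);    // "Guardar"
$b = translate_cached('Cancel', $human, $machineCache, $fakeApi);  // "[mt] Cancel"
$c = translate_cached('Cancel', $human, $machineCache, $fakeApi);  // cached, no new call
```

Each untranslated string triggers at most one API query; once a human translation arrives, it takes precedence over the cached entry without any cache invalidation.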
