I have been playing about with LiquidJS and looking at why so few sites beyond the major verticals are ever translated into enough locales, and why many are not translated at all.
I am still proofing my argument and thought this might be a useful place to get some more feedback.
This is a link to a PDF on Dropbox:
[https://tinyurl.com/y288dgru]
I have copied the text below; it's just slightly prettier in the PDF.
Pangolin: A Practical Translation Architecture
Preface
This isn't a business proposal. It's a technical discussion, including working code that can be viewed as a prototype.
We aren’t translating anything like enough websites into anywhere near enough languages and I wanted to investigate how we might solve that.
Let me open with an undeniable fact: it's not working. And then an admission: nobody cares that it's not working.
The internationalisation strategy we as an industry have been applying is not working. We are not translating enough websites into enough languages. That is just a fact. Review just about any transport booking site on the web: few of the sites I visit are translated into more than a handful of languages, and most are not translated at all. Barely any are ever translated into even the five UN-designated world languages. It is frequently the case that a national rail provider has not even translated its website into all the official native languages of its own country, let alone the adjacent countries' languages, which you might regard as a minimum requirement for any such site.
This failure is having significant cultural and political impact. As an industry we are failing. i18n/l10n are two terms that sum the situation up. Terse in the extreme, they lend credence to a spray-on, compartmentalised way of doing this. The approach is expensive and, taken as a whole, blatantly inadequate.
No Easy Way
There isn't really any alternative, in the process of internationalising your software, to marking up the text that you want translating. You may be able to use tools to scan legacy code where the engineers forgot to do this and insert the markup for you. You will still have to check manually that it didn't screw up, because it will. There's really no magic solution to this. The business of marking up isn't really that arduous. Ideally it is done at the outset, when the code, which will typically be in the form of templates, is first produced. Templates are still code in this context, even though we try to get the "code" out of them as much as we can.
There is a trivial principle implemented in the solution I present: I treat these translation lookups as part of the application data. The resolution of the translations needs to be performed on demand, as part of the production-level execution architecture, and then cached for further use. We want it executed on demand, if it hasn't been already, so we can service a brand new locale without any release process of any kind. This requires improved templating. Enter LiquidJS. This is the first and simplest half of what I propose.
The Status Quo Needs To Change
The internationalisation business currently consists of a legion of translation service providers. They are each collating a proprietary translation database which they essentially sell access to. Some use natural language processing to some degree to populate their databases, but all of them provide a manual override; it just won't work properly otherwise.
The prospect of having to pay a third party for translation services is so onerous to most start-up software projects that nobody ever considers doing it any time soon. This is a key reason engineers do not address internationalisation in the same way they might approach other aspects of their application architecture. We need to give even the smallest of projects the motivation to apply the mark-up in their product at the outset. The data needs to be open access (free), at least for small users.
I am proposing that we liberate this translation database in the interests of minority-language readers across the globe. This would provide an open resource that all developers can use as the default option of choice. I would hope that this delivers a compelling rationale for making your systems translatable from the outset.
There will still always be a role for commercial, product-specific translation services and linguists/translators. In the current model, the people who do these translations do not get paid a royalty for any subsequent use of their work; it is lost into a proprietary system belonging to their agent. If we can deploy a Stack Exchange review model, where translators gain points for popular translations, we might provide a better compensation scheme for these translators by way of the kudos they receive and subsequent assignments for translation work.
I also think we need to pursue a crowd-sourcing approach whereby we can direct users to the translation database interface directly from the target product. If you aren't prepared to indulge the idea of making products publicly editable, or if you don't really trust open source projects anyway, you won't like this.
Two things are currently missing:
1. There is no open data infrastructure to share these translations,
2. nor adequate open source software to edit them with.
There does not exist a public translation database, and there does not exist open technology to run such a thing. We need a centralised hub, a "Translation Exchange"; that's point 1. But we also want developers to be able to run their own DB instance, point 2, for two reasons.
Firstly, their own production requirements. As I have argued above, we require this data to be part of the production-level architecture. We need to be able to translate a page into a new language on demand, on the fly, the first time it is requested.
Secondly, as part of any crowd-sourcing scheme, we need to run a version of the dictionary editing application deployed just for this site. We could conceivably use the centralised hub, the "Translation Exchange", but that might overheat eventually. We also might want to enforce editorial standards different from those of the global public database. We need our own copy of the DB and the editing application.
LiquidJS Rocks
Finally, much of this essay is advocacy for the outstanding LiquidJS.
We all use templating engines. They are light and generally require that the data to be interpolated is loaded prior to execution. This is a serious problem. Essentially, they do not support asynchronous database callbacks from within the template itself: you can't request fresh data to be loaded in the middle of the execution. This is a painful restriction and results in cumbersome frameworks to manage our way out of the mess, because we have to work out all the data a template might need before we execute it. LiquidJS supports asynchronous callbacks (synchronous or blocking callbacks would cripple the server). Hence you can extend it to call your database API from inside your template, using parameters passed to the engine at render time or read from earlier data calls in the same execution.
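To make that concrete, here is a minimal sketch of an asynchronous lookup being resolved mid-render. The db object and the product_name filter are made-up names for illustration, not part of LiquidJS.

const { Liquid } = require('liquidjs');
const engine = new Liquid();

// Stand-in for any asynchronous data source; a real one would hit your DB or API.
const db = {
  async findProduct(id) {
    return { id, name: 'Overnight sleeper to Chiang Mai' };
  }
};

// An async filter: LiquidJS awaits the returned promise during rendering.
engine.registerFilter('product_name', async (id) => (await db.findProduct(id)).name);

// The template itself triggers the lookup, using a parameter passed at render time.
engine
  .parseAndRender('You are booking {{ productId | product_name }}.', { productId: 7 })
  .then(console.log);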
The Liquid project is of the highest standard. It is TypeScript compatible, heavily peer reviewed and has undergone multiple major version releases. It is easily extensible: the markup is adaptable, plug-ins and block operators are easily implemented, and there is a clear API to the internal tokeniser. It also supports the concept of filters on interpolation mark-up. If you take nothing else from this essay, take LiquidJS.
If I have to sell Node to you then you can skip the rest of this paper.
Why are we so bad at this?
The core issue is that we don't have a publicly maintained dictionary we can all use. My discussion is about how we can set about building one. We don't even have public, open-source database technology for running the process. In order to drive the population of this public database I am proposing that we crowd-source the data and get the users to compile our translations.
There are two issues that need to be addressed.
1. Architecture
The first is architectural. It's a really simple issue, as mentioned in the preface. In any existing implementation we perform these translations in a totally separate process from the operation of the site. Typically we apply the translations to a source base of files, be they source code or template files, and generate a set of files for a specific target locale, using a publication process executed once only for any release. This model might be sold differently, but the point is it's not done as part of the delivery of the page/resource.
This compartmentalisation dislocates the operation of resolving these translations from the normal serving of the page. We do not want to be hitting a database for every translation on every page for every HTTP request. It is also a very convenient spray-on solution that managers might find appealing: they can convince themselves that they've factored out the internationalisation. In truth, the labour-intensive tasks of mark-up and translation still exist; we've just hampered ourselves with this cumbersome model.
The downside of doing this is that we are treating these translations essentially as part of the code. They are application data and should be treated as such. The consequence is that we have an inherent overhead for every language we choose to support. We have to execute what is effectively a release for every supported language. Worse still, we have to do this every time the dictionary is updated. The result, in my opinion, is that we are not supporting anything like the number of languages we reasonably should be, and even those that we do support are not as accurate as they could be.
These translations should be viewed as part of your application's data source. We need an architecture in which, upon the first request for a given resource or file in a specific target locale, the translations are performed and the results are stored, in a file naturally, for all subsequent requests. Irrespective of the crowd sourcing that is coming up in the next section, adopting this kind of caching strategy for i18n duties will reduce the overhead and increase the potential for supporting a greater spread of languages. It's a trivial and powerful thing to do in itself.
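Here is a minimal sketch of that caching step. The renderTranslations(file, locale) function and the cache location are hypothetical stand-ins for whatever actually resolves the marked-up phrases.

const fs = require('fs/promises');
const path = require('path');

const CACHE_DIR = '.l10n-cache';   // illustrative location for the per-locale output

async function localisedResource(file, locale, renderTranslations) {
  const cached = path.join(CACHE_DIR, locale, file);
  try {
    // Every request after the first is just a file read.
    return await fs.readFile(cached, 'utf8');
  } catch {
    // First request in this locale: resolve the translations now and store the result.
    const output = await renderTranslations(file, locale);
    await fs.mkdir(path.dirname(cached), { recursive: true });
    await fs.writeFile(cached, output, 'utf8');
    return output;
  }
}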
2. Open Data
The second, and admittedly far more difficult, issue we need to tackle is the absence of a public and essentially open and free set of these translations for all languages. Even if we can eradicate the overhead of supporting extra locales, as discussed above, we still need to get hold of this data. It needs to be publicly and freely available. This becomes increasingly important as we proceed down to less populated languages. For any of the major global languages it surely cannot be justified that this data remain proprietary, as it should cost so little that it would be impossible to charge for. And if there is a universal, free-to-access global dictionary available, it is reasonable to expect minority-language communities to provide the translations at no cost.
It may be feasible to crowd-source this data from a product's own user base. We will need to develop a web application to allow the public editing and arbitration of translations. This would include technology for allowing the public to interact with a version of the site where they can actually point and shoot at things that need translating. The crowd-sourcing public need to be able to see their own translation submissions in the context for which they are intended.
The business of modifying your product to allow the user to actually click on phrases and be warped off to the instance of the dictionary editing application is easy enough. There are of course many caveats; we can't easily provide this functionality automatically for every part of the UI on every kind of platform, but we can do enough to engage the user.
The really hard part, which I believe is going to need a significant amount of investment and community engineering, is the dictionary editing application itself. It needs to support the arbitration model needed for any genuine crowd-sourcing technology. It also ultimately needs to support a release control model. As I said, I think it's essential that we allow the crowd-sourcing public to witness, and indeed review, their translations in the context of the application itself. But we don't necessarily want to give them the power to edit the page for everyone else without a review cycle before publishing their edits to the public at large.
Implementation
Here is some detail. It's all very simple, and a bit boring unless you actually do this stuff for a living.
There are two principal components to Pangolin:
1. Foxbat
A LiquidJS extension to support multi-phase template execution. Essentially, it makes it possible to execute and resolve translation markup on a per-locale basis, once only per "resource" (generally a file in truth).
2. Marmot
A translation dictionary database implementation. This supports input and output of industry-standard gettext portable object files (.po files).
Foxbat
The Foxbat code itself is tiny, and indeed can be further simplified. It is nevertheless conceptually important. Foxbat is a trivial concept, not really a tool; it's a handful of lines. As long as your templating engine supports asynchronous callbacks you could easily implement this idea yourself.
You can see the code here:
(I deleted this link as a new user; I have added it in a comment below, or you can get it from the PDF.)
It creates a parent Liquid instance which overrides the renderFile method to perform the following operation based on the file name and the locale. It also instantiates different markup for this phase, so you can code operations to be executed once only per locale, and still perform CGI-time and/or client-side templating. (A minimal end-to-end sketch follows step 3 below.)
1. Determine the intermediary target file.
Convert the file path
<path>/<file>
to
<path>/.foxbat/<locale>/<file>
2. Generate the intermediate file
If that file does not exist then Foxbat generates it using the following syntax.
The parent instance uses different markup syntax. The standard Liquid markup is {{variable}} for interpolation of variables, the principal purpose of a templating engine, and
{%tagname … %} as the extension interface for function callbacks, which can be user specified.
Foxbat uses
{? variable ?}
and
{!tagname …!}
as markup for this "once only per locale" execution. The call to Marmot, described below, is "translate", so all the to-be-translated strings in your source files need to be marked up thus:
{!translate "a phrase I want translating" !}
Most other templating engines, including the standard JS template engine, require data to be loaded prior to calling the engine. These engines are intended to operate on a static data structure. This is painfully restrictive: when writing CGI HTML generation it is often far more natural to make a database call at some point within the execution of the template. LiquidJS supports asynchronous callbacks, and hence I/O calls, initiated from within the templates.
3. Perform second phase execution
Once this "once only per locale" execution has been performed, if necessary, the output file, which typically will already exist and not need to be compiled, is then executed again. In order that {{ and {% can be preserved for client-side templating, a further markup is defined for this "CGI time" execution, specifically
{ variable }
and
{@ tagname … @}
As mentioned, Liquid supports asynchronous callbacks, essential for writing non-blocking, performant server-side code. So, beyond the translation capabilities I am specifically demonstrating, Liquid supports the very natural pattern of making a database call mid-template, using parameters either passed to the engine in the initial data structure or themselves read from a previous DB call. This obviates the need for a lot of unnecessary framework on the server side. Liquid is good.
The resource we have produced could itself be an executable file in another language such as PHP. You may not need this second-phase execution; I am just demonstrating how you can support three different phases of Liquid execution ("once", "every" and "client"), and specifically the need for this "once only per locale" execution.
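To tie the three steps together, here is a minimal end-to-end sketch of the idea, not the actual Foxbat code: a Liquid instance carrying the "once only per locale" delimiters described above, a hypothetical lookupTranslation() standing in for the Marmot call, and the per-locale intermediate file acting as the cache.

const { Liquid } = require('liquidjs');
const fs = require('fs/promises');
const path = require('path');

// Hypothetical Marmot lookup; here it just returns the phrase untranslated.
async function lookupTranslation(phrase, locale) { return phrase; }

// "Once only per locale" instance, using the Foxbat delimiters so that
// {{ }} and {% %} pass through untouched for the later phases.
const oncePerLocale = new Liquid({
  root: 'templates',
  outputDelimiterLeft: '{?', outputDelimiterRight: '?}',
  tagDelimiterLeft: '{!', tagDelimiterRight: '!}'
});

oncePerLocale.registerTag('translate', {
  parse(tagToken) { this.args = tagToken.args; },
  async render(ctx) {
    const phrase = await this.liquid.evalValue(this.args, ctx);
    const locale = await this.liquid.evalValue('locale', ctx);
    return lookupTranslation(phrase, locale);
  }
});

// "CGI time" instance for the second phase. Foxbat remaps its delimiters to
// { … } and {@ … @} as described above; this sketch keeps the LiquidJS defaults
// so the example stays small and self-contained.
const everyRequest = new Liquid();

async function renderLocalised(file, locale, data) {
  const intermediate = path.join('templates', '.foxbat', locale, file);
  let compiled;
  try {
    compiled = await fs.readFile(intermediate, 'utf8');
  } catch {
    // First hit for this locale: resolve the translate markup and cache the result.
    compiled = await oncePerLocale.renderFile(file, { locale });
    await fs.mkdir(path.dirname(intermediate), { recursive: true });
    await fs.writeFile(intermediate, compiled, 'utf8');
  }
  // Second phase: per-request interpolation of the cached, already-translated file.
  return everyRequest.parseAndRender(compiled, data);
}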
Marmot
The standard translation library, in use for a quarter of a century, is called gettext. It:
- is designed to be executed in an "offline" manner
- expects the entire dictionary to be loaded at start time
- does not support inline updates of the dictionary
There are some complications with phrase mapping/translating. There can of course be ambiguities, with a phrase in English having different translations based on context. Most practical implementations take the file or resource name being translated as a context value, as does Marmot. Another feature of gettext is parametrised phrases. These are sentences that have a numeric variable in them. Curiously, none of the gettext translation dictionaries I have yet seen supports an empty or "zeroth" case. You are expected to print something like
“You have 0 unread mails”.
In no language I am aware of is this grammatically correct. We would normally need to say something like “You have no unread emails”.
Both of these issues, the ambiguities and the parametrised phrases, are in practice minor or relatively rare, and have workarounds. I have nevertheless provided extensive support for both of them.
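For illustration, a standard gettext .po entry carrying both a context and plural forms looks something like this (the phrase and the German translation are made up). Note there is no slot for a "zero" form.

msgctxt "inbox/header"
msgid "You have one unread mail"
msgid_plural "You have %d unread mails"
msgstr[0] "Sie haben eine ungelesene E-Mail"
msgstr[1] "Sie haben %d ungelesene E-Mails"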
Marmot is a database implementation of a gettext-like interface. The code currently stands at 374 lines. Not as trivial as the base Foxbat code, but it's still quite light. The Marmot code and database schema definition are currently held within the Foxbat project at
(link deleted; added as a comment below, or you can get it from the PDF)
It should be hived off as a genuinely independent piece of code during the next refactoring pass.
One obvious reason for building a database implementation is so that we can subsequently write tools to edit this data. This ability, for the production team of a specific project to manage their translation database directly using a web application, is a critical piece of the translation/i18n process that is currently missing. We need an open source, publicly maintained application so we can all edit our translations effectively, and ultimately recruit our own user base to help out supporting languages we didn't previously know we even had exposure to.
A database approach also allows really large aggregated databases, including many phrases you might not be using or languages nobody ever requests on your application, to be loaded into your database. That simplifies the distribution, and subsequent sharing, of this data. It doesn't matter that you aren't using most of what is in your DB; it's not going to unexpectedly bloat your application or slow it down, as long as we do it right.
There are Load() and Dump() methods implemented which read and write the industry-standard portable object files, using publicly maintained code for converting a JavaScript gettext runtime structure to and from the .po file format. These .po files themselves have been overloaded, with magic comment syntax being used to extend the original functionality. This gives us an industry-accepted standard for data exchange. Commercial translation sources use these files, so we have a ready-made interface for uploading and downloading data through those platforms.
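As a sketch of what Load() and Dump() boil down to, assuming a library of the kind described (for example the npm package gettext-parser, which I offer only as an illustration, not as the one Marmot actually uses):

const fs = require('fs');
const gettextParser = require('gettext-parser');

// Load: parse a .po file into a plain JS structure, keyed by context and then msgid.
const po = gettextParser.po.parse(fs.readFileSync('de.po'));
// e.g. po.translations['inbox/header']['You have one unread mail'].msgstr

// Dump: compile the (possibly edited) structure back out to .po.
fs.writeFileSync('de.out.po', gettextParser.po.compile(po));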
Marmot supports implicit creation (insertion) of previously unseen phrases. This allows data to be passed through the translation mechanism at a higher level; for example, place names and sports team names can be translated at CGI delivery time from data within the content DB. Language editors can then see new phrases in the dictionary and, if required, translate say Bayern München to Bayern Munich, using this syntax
{@translate team_name_variable @}
That will call Marmot at CGI delivery time, i.e. upon every request of the page, using the value in team_name_variable that has presumably been fetched from a content database. We could even move the translation process to the front end and have lookups performed every time a page or view is rendered within the browser. That would presumably result in a remote REST call being made, which would typically be quite expensive, so it's not a pattern to be used routinely.
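Here is a sketch of how such a translate call might behave, assuming a Sequelize-style Phrase model; the model and column names are illustrative, not Marmot's actual schema.

const { Phrase } = require('./models');   // hypothetical Sequelize model

async function translate(phrase, locale, context) {
  // Look the phrase up; if it has never been seen, record it so language editors find it.
  const [row] = await Phrase.findOrCreate({
    where: { source: phrase, locale, context },
    defaults: { target: null }
  });
  // Fall back to the untranslated source text until an editor supplies a translation.
  return row.target || phrase;
}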
This CGI-time lookup is not the same as the parametrised phrase resolution of runtime variables; for that, Marmot just bakes the multiple forms required into the served template itself, using syntax like
{%transform empty="you have no emails" singular="you have one email" plural="you have {{count}} emails" control="count" %}
(You can extend your Marmot translations to cope with such things as Slavonic rank cases, just as you can in gettext. Nobody ever seems to translate their websites into these languages, but if you want to be able to translate "you have 3 of something" differently from "you have 5 of something", we can do that, and we can cope with further extensions such as gender support.)
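For example, the baked-out forms can then be resolved at render time with nothing more than a count check. This helper is illustrative, not part of Marmot.

// Pick the right baked-out form for a given count.
function transform(forms, count) {
  if (count === 0 && forms.empty) return forms.empty;
  if (count === 1) return forms.singular;
  return forms.plural.replace('{{count}}', String(count));
}

transform({ empty: 'you have no emails', singular: 'you have one email',
            plural: 'you have {{count}} emails' }, 3);   // "you have 3 emails"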
Marmot Editor
This is perhaps the most important part of all this. We aren't going to get to a crowd-sourcing panacea without such an application.
It is undoubtedly the heaviest part of the development work, and quite impractical to complete without broader support for the project. I include a brief video of the application to date. A simple trick, shown in the demonstration, is the matter of marking up a development version of your application so that a user can interact with the target application itself. It's really quite trivial: we just get the translation substitution routine to stick a link around each translated phrase that takes the user to the corresponding entry in the dictionary editor.
A mobile implementation would obviously work slightly differently, both in the interaction behaviour and in the edit interface, perhaps with a pane listing all currently viewable phrases as presented in the dictionary.
I have implemented the editor using the Backbone & Marionette MVC framework. It is important that the application is adopted and developed by a succession of engineers and engineering groups, and alas Backbone has been abandoned by the engineering world at large, so I think we need to review that choice.
What Next
We have to provide a technology whereby anyone can internationalise their website just as easily as they can use GitHub or publish to NPM.
Whether we can establish an open source editor, and a public/open data source as I propose, is highly debatable. Right now, having sketched this out and being able to sit back and reassess, I don't feel massively confident. Most people just don't accept that there is an issue, or think there is some magic that solves all this.
It might be a great project for interns, especially in all those countries that never get translated into. I live in Asia and will attempt to engage Computer Science departments here in Thailand and Malaysia. I would be delighted to speak to anyone who wants to discuss this. If we’re going to write anything it’s going to have to be on Node. Sequelize and Liquid, and Express obviously, come immediately after you’ve made that call.
I am going to review what I am doing with the client. Vue seems the people's choice there. It's not critical what client-side templating engine is used, but you know my views on this. I certainly don't want <%…%>, which is quite horrid, especially if it's within an HTML attribute.
Feedback
The response below illustrates the problem we face. He believes that other people use "tremendous" solutions to do this. No they don't. They find a way of farming out a set of source files for each locale they wish to support. There are plenty of companies that provide glossy-looking tools to do this, boasting integration with all manner of platforms. Some may even support the model I propose. But they all come at a cost, and all still obey this process of performing a one-time-only pass of markup over your code or templates. They sell access to a proprietary translation database; it is their unifying business model, and nobody generates locales on demand. You pay by the letter.
Whatever your view, it is simply a fact that we aren't translating what we should be, and it's quite plain why this is so. We need a public database, and the modest architecture needed to cope with dynamic updates to arbitrary locales.
There is nothing particularly specific to console products in what I present. The output markup is the business of the client to process; mobile apps will just manage that behaviour differently, as described.
“Whoever wrote this doc is a subpar and obsolete engineer who takes himself too seriously. Internationalization (sometimes shortened to “i18n” , meaning “i - eighteen letters -n”) has been a long existing problem in software industry and we already have a tremendous amount of solutions. What he proposes is not “next generation” but re-inventing the wheels of 1990s for console applications only. Adopting his “foxbat” and “marmot” tool in web or desktop applications will bring pain with no gain.”
Mark Lester: mc_lester@yahoo.co.uk