
One other interesting problem with localization involves the use of printf. Even if you're looking up strings by ID in another file (which is a good pattern), sometimes you'll need to reorder things depending on the language. For example, in right-to-left languages you might put the number before the string, and in left-to-right languages after it, or the other way around ("%d %s" vs "%s %d").

The way that we got around this was adding another level of indirection, and putting printf format strings also as localized data.



The format strings have to be localized data in any case, because they usually contain literal text, not just placeholders. The real problem here is that you need to change the order of arguments in a printf call - if the string changes from %s%d to %d%s, the order of arguments in the call must change, as well.

If you're on POSIX, you can use positional arguments for that:

   printf("%1$d %2$s", d, s);
   printf("%2$s %1$d", d, s);
Because it's not standard C, VC++ does not support it directly in printf, but it offers _printf_p with such support, and you can always #define printf _printf_p.
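A minimal sketch of how this fits together with localized format strings, assuming a POSIX printf family. The table, the locale keys ("number-first", "label-first") and the function names are all made up for illustration; a real project would load the formats from its translation catalog. Every conversion in a format must be numbered, since mixing numbered and unnumbered conversions is undefined.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical catalog: one format string per locale key.
   Both entries use numbered arguments, so the call site never changes. */
struct count_label_format {
    const char *locale;
    const char *format;
};

static const struct count_label_format COUNT_LABEL_FORMATS[] = {
    { "number-first", "%1$d %2$s" },
    { "label-first",  "%2$s %1$d" },
};

/* Returns the format for the given key, falling back to the first entry. */
const char *lookup_count_label_format(const char *locale)
{
    size_t n = sizeof COUNT_LABEL_FORMATS / sizeof COUNT_LABEL_FORMATS[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(COUNT_LABEL_FORMATS[i].locale, locale) == 0)
            return COUNT_LABEL_FORMATS[i].format;
    }
    return COUNT_LABEL_FORMATS[0].format;
}

/* The arguments are always passed as (count, label); only the
   looked-up format decides the order in which they are printed. */
int format_count_label(char *buf, size_t size, const char *locale,
                       int count, const char *label)
{
    return snprintf(buf, size, lookup_count_label_format(locale),
                    count, label);
}
```

Because the format is no longer a literal, the compiler can't check it against the arguments, so the translated strings need to be validated some other way.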


Or you can use GNU Gettext, which provides featureful replacements for printf() functions.


I came to say this. Gettext _is_ the right answer, this is a solved problem.


> Even if you're looking up strings based on IDs in another file (which is a good pattern) […] putting printf format strings also as localized data.

AFAIK localising "formatting literals" is the more normal method: it avoids redundancy, as you don't need two different systems (IDs and format strings), and it provides more flexibility with respect to e.g. cardinalities. Most ID-based systems bundle formatting support as well; if you're using an ID-based system you basically shouldn't call the language's string formatting functions.

Furthermore translating literal sections individually (without formatting context) will often yield an incorrect result as the entire phrase needs to be shuffled around, or words need to be inflected, or a literal translation suitable for "standalone" expressions does not work for the entire phrase.

More granular is generally worse for translations.


> AFAIK localising "formatting literals" is the more normal method, it avoids redundancies as you don't need two different systems

I never understood why people think this is a good idea. The exact same sequence of letters in an English phrase, which you would like to use instead of IDs, can mean two different things in two different places, and those two different meanings could have different forms in other languages. Denormalizing the translation database like that seems semantically incorrect (and strikes me as programmer laziness).

I agree that in general, more granular is worse for translations - there's too much risk your split will pierce the contextual whole that's required for some translations.


I'd expect this to happen when there is a split in the original case as well. E.g. you need to translate the template and the names. They are split in the original already, so they are translated as a split.

The same contextual issue then arises in the original, but apparently the designers thought the trade-off was worth it.


> I never understood why people think this is a good idea.

It makes for a more readable, comprehensible, and searchable source. It's also often possible to dispatch on location in addition to text (PO files store both).


The Unicode CLDR has a whole database of formatted strings for each locale, for more or less the reason you describe. Formatting dates or big numbers (12345 -> 12.3k) is impossible to achieve without a generic formatting language.

Pluralization is another nightmare of its own. Look into how Russian and similar languages pluralize. It depends on the value of the number modulo 10 (with exceptions for the teens, so modulo 100 matters too), similar to English ordinals.

CLDR: http://cldr.unicode.org/
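To make the modulo-based rule mentioned above concrete, here is a rough sketch of Russian plural selection for non-negative integers, written the way I understand the CLDR integer rules (one / few / many). The enum and function names are made up for illustration, and fractional values, which CLDR puts in a separate category, are not handled.

```c
#include <string.h>

/* Plural categories for Russian integers (names chosen for this sketch). */
enum plural_form { PLURAL_ONE = 0, PLURAL_FEW = 1, PLURAL_MANY = 2 };

/* Picks the category from the last digit, with the teens (11-14)
   carved out as exceptions via the last two digits. */
enum plural_form russian_plural_form(unsigned long n)
{
    unsigned long mod10 = n % 10;
    unsigned long mod100 = n % 100;

    if (mod10 == 1 && mod100 != 11)
        return PLURAL_ONE;                       /* 1, 21, 101, ... */
    if (mod10 >= 2 && mod10 <= 4 && (mod100 < 12 || mod100 > 14))
        return PLURAL_FEW;                       /* 2-4, 22-24, ... */
    return PLURAL_MANY;                          /* 0, 5-20, 25-30, ... */
}

/* Example noun ("file") in its three forms, indexed by category. */
const char *russian_file_noun(unsigned long n)
{
    static const char *const forms[] = { "файл", "файла", "файлов" };
    return forms[russian_plural_form(n)];
}
```

A hand-written function like this only covers one language; the point of the CLDR tables is that each locale gets its own rule of this kind.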


I went to a presentation by a company that dealt with translation. The presenter mentioned this issue, and his recommendation was to simply not try to be smart with it and to have separate strings where pluralisation is done properly in each one.


That works when there are a small number of possible numerical values. But CLDR has a table of plural rules for each of these languages, and it's not bad to solve the problem for arbitrary integers.


What about gettext _N? I have used a library with a similar interface, and we didn't even get complaints for Slavic languages with very complex pluralisation rules.
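Assuming `_N` here means a wrapper around gettext's `ngettext` (that's a guess on my part), a minimal C sketch looks like this. The function name and message strings are made up; with no message catalog bound, `ngettext` falls back to the English-style rule and returns the first string only when the count is exactly 1.

```c
#include <libintl.h>
#include <stdio.h>

/* Formats "<count> file(s)". ngettext picks the plural form; with a
   translated catalog loaded, the catalog's own plural rule decides
   which translated form is returned. */
int format_file_count(char *buf, size_t size, unsigned long count)
{
    const char *fmt = ngettext("%lu file", "%lu files", count);
    return snprintf(buf, size, fmt, count);
}
```

The per-language rule lives in the catalog's Plural-Forms header, so the calling code stays the same for languages with three or more forms.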



It's the language's grammar that may require a particular order of formatting directives.

LTR vs RTL is about rendering text and unrelated to this.



