Ticket #2257 (closed enhancement: migrated)

Opened 5 years ago

Last modified 3 years ago

.po files : same msgid records could be "deduplicated"

Reported by: rejoc
Owned by: rejoc
Priority: normal
Milestone: 1.1.x bugfix
Component: I18n
Version: 1.1 HEAD
Severity: normal
Keywords: i18n
Cc:

Description

When collecting/merging i18n, some msgids are found in multiple files/places. Then, when you translate the messages in the .po file, only the last translation (empty strings do not count) of the records with the same msgid will be compiled.

It would be useful to merge the records with the same msgid so instead of having something like

#: demo/templates/master.html:75
msgid "TurboGears"
msgstr ""

#: demo/templates/welcome.html:25
msgid "TurboGears"
msgstr ""

#: demo/templates/welcome.html:33
msgid "TurboGears"
msgstr ""

you could have

#: demo/templates/master.html:75
#: demo/templates/welcome.html:25
#: demo/templates/welcome.html:33
msgid "TurboGears"
msgstr ""

which is much easier to deal with...
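
The merge requested above could be sketched roughly as follows. This is only an illustration, not TurboGears or Gettext code: the function names `parse_records` and `merge_duplicates` are hypothetical, and the parser assumes single-line msgid/msgstr values, blank-line-separated records, no msgctxt or plural forms, and no spaces in file names. When several records carry a translation, it keeps the last non-empty msgstr, mirroring what the description says compile does.

```python
import re


def parse_records(po_text):
    """Yield (locations, msgid, msgstr) for each blank-line-separated record.

    msgid and msgstr are kept in their quoted form, so '""' means "empty".
    """
    for block in re.split(r"\n\s*\n", po_text.strip()):
        locations, msgid, msgstr = [], None, '""'
        for line in block.splitlines():
            line = line.strip()
            if line.startswith("#:"):
                locations.extend(line[2:].split())
            elif line.startswith("msgid "):
                msgid = line[len("msgid "):]
            elif line.startswith("msgstr "):
                msgstr = line[len("msgstr "):]
        if msgid is not None:
            yield locations, msgid, msgstr


def merge_duplicates(po_text):
    """Return .po text with one record per msgid and all locations kept."""
    merged = {}  # plain dicts keep insertion order on Python 3.7+
    for locations, msgid, msgstr in parse_records(po_text):
        entry = merged.setdefault(msgid, {"locations": [], "msgstr": '""'})
        entry["locations"].extend(locations)
        if msgstr != '""':
            entry["msgstr"] = msgstr  # the last non-empty translation wins
    records = []
    for msgid, entry in merged.items():
        lines = ["#: " + loc for loc in entry["locations"]]
        lines.append("msgid " + msgid)
        lines.append("msgstr " + entry["msgstr"])
        records.append("\n".join(lines))
    return "\n\n".join(records) + "\n"


SAMPLE = """\
#: demo/templates/master.html:75
msgid "TurboGears"
msgstr ""

#: demo/templates/welcome.html:25
msgid "TurboGears"
msgstr ""

#: demo/templates/welcome.html:33
msgid "TurboGears"
msgstr ""
"""

print(merge_duplicates(SAMPLE))
```

Silently collapsing records this way hides the case where two occurrences really need different translations, so a real tool would probably want to warn when it sees two different non-empty msgstr values for the same msgid.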

Attachments

make-po.sh (12.2 KB) - added by Gustavo 5 years ago.
Rather old & ugly & old & ugly script to remove duplicate msgids from individual web site sections. Note: It's old & ugly

Change History

comment:1 Changed 5 years ago by Gustavo

That's how Gettext works and should continue working by default.

The example you're providing isn't valid: "TurboGears" is a proper noun. But this is the default behavior because the same message is often translated differently depending on the context.

comment:2 follow-up: ↓ 3 Changed 5 years ago by rejoc

Well... I could have chosen any msgid that gets duplicated...

Within a single .po file, a given msgid, even if it is duplicated, will be translated by a single msgstr (it seems to be the last non-empty msgstr).

As you say, the same message can be translated differently, and if you don't pay attention, you won't see that the message you are currently translating is used somewhere else in the same .po file but with another meaning (very likely to happen as your .pot file grows). Here, the context is the .po file and there is only one translation.

So having the msgids deduplicated (while keeping a trace of their different locations) looked like a good way to detect/avoid bad translations.

And it does not break anything in the logic of the collect/add|merge/compile process.

But there is another issue (bug?):

If you have a .po file containing

#: demo/templates/master.html:75
msgid "some text"
msgstr "un petit texte"

#: demo/templates/welcome.html:25
msgid "some text"
msgstr ""

it compiles and gives the expected translation "un petit texte" for all the occurrences of "some text".

Now you do a "tg-admin i18n merge" (usually after a collect because you modified something) and... you get:

#: demo/templates/master.html:75
msgid "some text"
msgstr ""

#: demo/templates/welcome.html:25
msgid "some text"
msgstr ""

You lost your translation because merge uses the last msgstr it finds in the .po for a given msgid.

(I have included a patch for that one. It keeps the last non-empty message, like compile does.)
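
To make the difference between the two rules concrete, here is a minimal sketch. It is not the attached patch and not the actual tg-admin code; the function names and the (msgid, msgstr) tuple representation are assumptions chosen for illustration.

```python
def lookup_last_wins(records):
    """Rule described for merge: the last record for a msgid always wins,
    so a later empty msgstr overwrites an earlier translation."""
    lookup = {}
    for msgid, msgstr in records:
        lookup[msgid] = msgstr
    return lookup


def lookup_last_non_empty(records):
    """Rule described for compile: the last non-empty msgstr wins.
    An empty msgstr is only stored when nothing is known yet."""
    lookup = {}
    for msgid, msgstr in records:
        if msgstr or msgid not in lookup:
            lookup[msgid] = msgstr
    return lookup


# The two records from the example above, in file order.
records = [
    ("some text", "un petit texte"),
    ("some text", ""),
]

print(lookup_last_wins(records))       # translation is lost
print(lookup_last_non_empty(records))  # translation is kept
```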

comment:3 in reply to: ↑ 2 ; follow-up: ↓ 5 Changed 5 years ago by Gustavo

rejoc:

Everything you're describing is the intended behavior of Gettext. That's how things work in every single Gettext-powered application (that is, nearly all the applications you're using). And there's no bug in that behavior.

If you don't like this behavior and prefer to risk getting side effects in the future, then you can write your own script which adapts the PO template as you want. But this should not go in TG by default.

A long time ago I wrote a bash script to merge duplicate messages in a web site, using a very conservative approach: having one POT per web site section and then merging the duplicate msgids in the individual POTs -- only in the individual POTs. This avoids context problems in different languages. I'll search for that old script and if I find it, I'll attach it.

Keep in mind that the Gettext folks are not ignorant. It's a very mature piece of software which is used in most free software projects, so everything you see is expected. At this point we can hardly find a bug.

comment:4 Changed 5 years ago by rejoc

The specific issue about losing translations when merging is reported in ticket #2258. I should not have mixed both subjects.

Changed 5 years ago by Gustavo

Rather old & ugly & old & ugly script to remove duplicate msgids from individual web site sections. Note: It's old & ugly

comment:5 in reply to: ↑ 3 ; follow-up: ↓ 6 Changed 5 years ago by rejoc

Replying to Gustavo:

> rejoc:
>
> Everything you're describing is the intended behavior of Gettext. That's how things work in every single Gettext-powered application (that is, nearly all the applications you're using). And there's no bug in that behavior.

We are not using the plain Gettext tools but their specific implementation in TG.

The Gettext documentation specifies that having duplicated msgids in a .po file "is invalid input for other programs like msgfmt, msgmerge or msgcat" (9.5 Invoking the msguniq Program).

tg-admin i18n collect generates such duplicates. Is that the correct behavior, given that the generated .pot is supposed to be the input to "tg-admin i18n merge" (emulating msgmerge) and "tg-admin i18n compile" (msgfmt)?

comment:6 in reply to: ↑ 5 ; follow-up: ↓ 7 Changed 5 years ago by Gustavo

Replying to rejoc:

> Replying to Gustavo:
>
> > rejoc:
> >
> > Everything you're describing is the intended behavior of Gettext. That's how things work in every single Gettext-powered application (that is, nearly all the applications you're using). And there's no bug in that behavior.
>
> We are not using the plain Gettext tools but their specific implementation in TG.

Gettext is not only a set of tools, it also defines a standard. So every implementation should comply with the standard.

(It's good to be aware that TG1 had its own implementation; I didn't know that)

> The Gettext documentation specifies that having duplicated msgids in a .po file "is invalid input for other programs like msgfmt, msgmerge or msgcat" (9.5 Invoking the msguniq Program).

Exactly. Did you see how I handled that in the script I attached? That's not a bug, it's the intended behavior.

> tg-admin i18n collect generates such duplicates. Is that the correct behavior, given that the generated .pot is supposed to be the input to "tg-admin i18n merge" (emulating msgmerge) and "tg-admin i18n compile" (msgfmt)?

I don't know how TG1 handles that, since you say it has its own implementation. What I do know is that xgettext (or equivalent) must *not* do anything with duplicate messages. Duplicate messages must be handled with msguniq and/or msgcomm, or equivalents.

Again, there are no such bugs. It's all part of the intended and documented behavior.

comment:7 in reply to: ↑ 6 Changed 5 years ago by rejoc

Replying to Gustavo:

> I don't know how TG1 handles that, since you say it has its own implementation. What I do know is that xgettext (or equivalent) must *not* do anything with duplicate messages. Duplicate messages must be handled with msguniq and/or msgcomm, or equivalents.

I did a few more tests:

tg-admin i18n collect uses pygettext.py to handle the .py files and some other tools (regex parsing for genshi templates, ...) for other (.kid, .html, .js) files.

If there are duplicated messages in .py files, you get a single msgid with a location comment listing the multiple lines, like this:

#: demo/test.py:20 demo/test.py:25 demo/test.py:27
msgid "some duplicated text"
msgstr ""

This is what Gettext generates, so it actually does something with duplicate messages. (I did the same analysis with xgettext and it gives exactly the same result.)

The collect command starts by analysing .py files. This gives the first part of the .pot file.

Then it analyses successively and independently the .kid, .html and .js files, appending the collected messages to the .pot file without any further (global) processing.

This is why (regex analysis + appending results from different sources) we get duplicated msgids. And again, this is not what merge or compile expect.
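
For illustration, a collect step that avoids these duplicates could feed the results of every extractor into one shared catalog before writing the .pot file, instead of appending each extractor's output separately. The sketch below is hypothetical and is not how tg-admin is implemented: `build_pot` and the (msgid, location) tuples are assumed names and shapes, and msgids are assumed to contain no quotes or backslashes that would need escaping.

```python
def build_pot(*extractor_results):
    """Combine (msgid, location) pairs from several extractors into .pot text.

    Every msgid is written once, with all of its locations on a single
    "#:" line, in the same layout as the pygettext output shown above.
    """
    catalog = {}  # msgid -> list of locations, in first-seen order
    for results in extractor_results:
        for msgid, location in results:
            catalog.setdefault(msgid, []).append(location)

    records = []
    for msgid, locations in catalog.items():
        # Assumption: msgid needs no escaping (no quotes or backslashes).
        records.append(
            '#: %s\nmsgid "%s"\nmsgstr ""\n' % (" ".join(locations), msgid)
        )
    return "\n".join(records)


# Hypothetical extractor outputs for .py files and for templates.
py_results = [
    ("some duplicated text", "demo/test.py:20"),
    ("some duplicated text", "demo/test.py:25"),
]
template_results = [
    ("some duplicated text", "demo/templates/welcome.html:25"),
    ("TurboGears", "demo/templates/master.html:75"),
]

print(build_pot(py_results, template_results))
```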

comment:8 Changed 5 years ago by Chris Arndt

  • Milestone changed from 1.1b4 to 1.1

comment:9 Changed 5 years ago by Chris Arndt

  • Milestone changed from 1.1 to 1.1.x bugfix

Moving to 1.1.x bugfix release in preparation for 1.1rc1 release.

comment:10 Changed 5 years ago by chrisz

Just for the record, TG2 (Babel) behaves as suggested in this ticket, by merging all translations with the same message id.

comment:11 Changed 3 years ago by chrisz

  • Status changed from new to closed
  • Resolution set to migrated

Moved to https://sourceforge.net/p/turbogears1/tickets/18/.

May also be solved by using Babel (see #2042).
