GlobalSight – How to exclude text from extraction in Word documents

There are two ways to handle text that does not need (or should not receive) translation. You can either have GlobalSight ignore it during extraction, or have it treated as a tag that translators cannot touch. This article covers the first approach.

A typical use case is subtitle content. Let's say you have a table with two columns in a Word document. The first column contains the timestamps, while the second has the translatable text:

00:01:32 Lorel ipsum gremo tenar
00:01:38 Gnore toneri lokumo bursanata khmo

You could upload this file directly to GlobalSight and tell your translators to ignore the timestamps. That would not require a blog post, however, so we will choose a more elegant solution.

Exclude text from extraction

In GlobalSight, you can define paragraph and character styles that should be excluded from extraction. To do this, log in to GlobalSight as an administrator and go to Data Sources > Filter Configuration. Click on the Office 2010 filter and check DONOTTRANSLATE_para and DONOTTRANSLATE_char under Unextractable styles. Click Save and exit.

[Image: GlobalSight does not extract text]

By doing this, you tell GlobalSight to ignore any text that uses one of these styles during extraction. For our subtitle content, you would select the first column of the table and apply the character style DONOTTRANSLATE_char. When you upload the file to GlobalSight afterwards, translators will only see the text from the second column. Much nicer than asking them to ignore the timestamps.

Applying styles programmatically

For the example above, it is easy to select the first column and apply the style to all timestamps. What if your document is formatted in a way that does not make this possible? That’s why we have macros!

You can define a macro which finds text in a certain format using wildcard characters and applies the character style. Here is a simple example:

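Sub ApplyNonTranslatabletoDigits()
'
' ApplyNonTranslatabletoDigits Macro
' Applies the DONOTTRANSLATE_char character style to every hh:mm:ss timestamp.
'
    ' Reset any formatting left over from earlier Find/Replace operations
    Selection.Find.ClearFormatting
    Selection.Find.Replacement.ClearFormatting
    ' The replacement only applies a style; the style must already exist in the document
    Selection.Find.Replacement.Style = ActiveDocument.Styles( _
        "DONOTTRANSLATE_char")
    With Selection.Find
        .Text = "[0-9][0-9]:[0-9][0-9]:[0-9][0-9]"   ' wildcard pattern for hh:mm:ss
        .Replacement.Text = ""                       ' empty text plus a style: matches are restyled, not deleted
        .Forward = True
        .Wrap = wdFindContinue
        .Format = True
        .MatchCase = False
        .MatchWholeWord = False
        .MatchAllWordForms = False
        .MatchSoundsLike = False
        .MatchWildcards = True                       ' treat .Text as a wildcard pattern
    End With
    Selection.Find.Execute Replace:=wdReplaceAll
End Sub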

Using this macro, you can apply the DONOTTRANSLATE_char character style to all timecodes from our example with a single click.
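One thing to watch for: the macro stops with a run-time error if the document has no style named DONOTTRANSLATE_char. The variant below is a minimal sketch of one way around that, not a tested drop-in. The names EnsureCharacterStyle, ApplyNonTranslatableToTimecodes and STYLE_NAME are made up for this example. It also assumes that GlobalSight matches styles by name only, so a style created on the fly with the right name would be picked up by the filter. It works on a Range instead of the Selection, so the cursor stays where it was.

```vba
Const STYLE_NAME As String = "DONOTTRANSLATE_char"

' Hypothetical helper: creates the character style if the document lacks it.
Sub EnsureCharacterStyle(ByVal styleName As String)
    Dim s As Style
    On Error Resume Next              ' looking up a missing style raises an error
    Set s = ActiveDocument.Styles(styleName)
    On Error GoTo 0
    If s Is Nothing Then
        ActiveDocument.Styles.Add Name:=styleName, Type:=wdStyleTypeCharacter
    End If
End Sub

Sub ApplyNonTranslatableToTimecodes()
    Dim rng As Range

    EnsureCharacterStyle STYLE_NAME

    ' Main document body only; headers, footers and text boxes are not covered
    Set rng = ActiveDocument.Content
    With rng.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Replacement.Style = ActiveDocument.Styles(STYLE_NAME)
        .Text = "[0-9][0-9]:[0-9][0-9]:[0-9][0-9]"   ' same hh:mm:ss wildcard pattern
        .Replacement.Text = ""                       ' restyle matches, keep their text
        .Forward = True
        .Wrap = wdFindStop                           ' the range already spans the whole body
        .Format = True
        .MatchWildcards = True
        .Execute Replace:=wdReplaceAll
    End With
End Sub
```

If your timecodes also carry frames or milliseconds (for example 00:01:32,500), the wildcard pattern would need to be extended accordingly.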

