AI translations: I18next & Deepl

14 June 2023

At Bitstillery we are working on an international spirits portal. Our customers come from all over the world; not all of them are comfortable with English, and some have a strong cultural preference for their native language. That's why we aim to support at least six of the most popular languages in our products: English, German, French, Italian, Spanish, and Dutch. We have native speakers for all of these languages who can provide translations or check them for accuracy. However, we have only recently finished converting most of the text to i18n strings, so translators would have to translate a lot of text at once. Also, because we change text and functionality in our products at a fast pace, it would be very difficult to keep the translation effort in sync across multiple languages and translators.

So what we need is a way to automatically produce good-quality translations, and an easy way for translators to correct the few that turn out to be wrong. With advancements in machine learning, automatic translation has become much better at picking the most likely context in which a phrase is used. For our use case, we tried Deepl. Our setup looks like this:

  • An English base file en.json and similar files for the other languages (fr.json, de.json, …)
"checkout": {
    "booked": "Booked",
    "cart": "Cart",
    "comment_add": "Add Comment",
    "comment_add_tip": "Add a comment about this product",
    "comment_delete": "Delete Comment",
    "comment_delete_tip": "Delete Comment",
    "comment_title": "Comment",
    "comment_update": "Update Comment",
    "comment_update_delete_tip": "Update or delete your comment",
    "delivery": {
        "asap": "As soon as possible",
        "date": "On a specific date"
    }
}
  • An i18next config that looks like:
i18next.init({
    debug: process.env.NODE_ENV === 'development',
    fallbackLng: 'en',
    lng: 'en',
    resources: {en: enJson, fr: frJson},
})

// Convention: call translations through $t, e.g. $t('checkout.cart')
const $t = i18next.t

To generate translations, we use a custom task-based build configuration. A translation task uses the English JSON as its source file and compares other languages for redundant or missing keys. To make translation easier to manage, we check for:

  • Keys in the target language that are not in en.json; these are removed from the target language.
  • Keys in en.json that are not in the target language; these get translated.
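
As a rough illustration of these two checks, here is a minimal sketch that compares two nested i18n objects. The helper names flattenKeys and diffKeys are made up for this example and are not part of our build task; the sketch also assumes that key names themselves never contain a dot.

```javascript
// Collect every leaf key of a nested i18n object as a dotted path,
// e.g. {checkout: {cart: 'Cart'}} -> ['checkout.cart'].
function flattenKeys(node, prefix = '') {
    const keys = []
    for (const [key, value] of Object.entries(node)) {
        const path = prefix ? `${prefix}.${key}` : key
        if (value !== null && typeof value === 'object') {
            keys.push(...flattenKeys(value, path))
        } else {
            keys.push(path)
        }
    }
    return keys
}

// Compare a target language against the source (hypothetical helper).
function diffKeys(source, target) {
    const sourceKeys = new Set(flattenKeys(source))
    const targetKeys = new Set(flattenKeys(target))
    return {
        // In en.json but not in the target: these need a translation.
        missing: [...sourceKeys].filter((key) => !targetKeys.has(key)),
        // In the target but not in en.json: these can be removed.
        obsolete: [...targetKeys].filter((key) => !sourceKeys.has(key)),
    }
}

const en = {checkout: {cart: 'Cart', booked: 'Booked', delivery: {asap: 'As soon as possible'}}}
const fr = {checkout: {cart: 'Panier', basket: 'Panier'}}
console.log(diffKeys(en, fr))
```

With these sample objects, checkout.booked and checkout.delivery.asap are reported as missing, and checkout.basket as obsolete.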

Another optimization would be to detect moved translations, which are common when regrouping keys in en.json, but we haven't gotten to that yet. The translation API for Deepl is quite simple to use. All we had to do was wrap an XML element around the {{}} string substitution placeholders, so that Deepl knows which parts to ignore when translating. This is what the translation code looks like:

function keyMod(reference, apply, refPath) {
    if (!refPath) refPath = []
    for (const key of Object.keys(reference)) {
        if (typeof reference[key] === 'object') {
            refPath.push(key)
            keyMod(reference[key], apply, refPath)
        } else {
            apply(reference, key, refPath)
        }
    }
    refPath.pop()
}

export async function translate(task, settings, packageName, targetLanguage, overwrite = false) {
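    // NOTE: sourcePath and targetPath are never assigned in this excerpt; they
    // are presumably derived from settings and packageName (not shown here).
    let sourcePath, targetPath, targetI18n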
    const actions = {remove: [], update: []}
    const sourceI18n = JSON.parse(await fs.readFile(sourcePath, 'utf8'))
    const targetExists = await fs.pathExists(targetPath)

    if (targetExists && !overwrite) {
        targetI18n = JSON.parse(await fs.readFile(targetPath, 'utf8'))
        keyMod(targetI18n, (_, key, refPath) => {
            // The key in the target i18n does not exist in the source (e.g. obsolete).
            // Also guard against a parent group that is missing from the source.
            const sourceRef = keyPath(sourceI18n, refPath)
            if (!sourceRef || !sourceRef[key]) {
                actions.remove.push([[...refPath], key])
            }
        })
    } else {
        // Use a copy of the en i18n scheme as blueprint for the new scheme.
        targetI18n = JSON.parse(JSON.stringify(sourceI18n))
    }

    const placeholderRegex = /{{[\w]*}}/g
    // Show a rough estimate of deepl translation costs...
    const stats = {total: {chars: 0, keys: 0}, costs: {chars: 0, keys: 0}}
    keyMod(sourceI18n, (sourceRef, key, refPath) => {
        const targetRef = keyPath(targetI18n, refPath)
        stats.total.keys += 1
        // Use xml tags to indicate placeholders for deepl.
        const preppedSource = sourceRef[key].replaceAll(placeholderRegex, (res) => {
            return res.replace('{{', '<x>').replace('}}', '</x>')
        })
        stats.total.chars += preppedSource.length
        if (overwrite || !targetRef || !targetRef[key]) {
            stats.costs.chars += preppedSource.length
            stats.costs.keys += 1
            actions.update.push([[...refPath], key, preppedSource])
        }
    })

    const costs = `(update: ${stats.costs.keys}/${stats.costs.chars})`
    const total = `(total: ${stats.total.keys}/${stats.total.chars})`

    for (const removeAction of actions.remove) {
        const targetRef = keyPath(targetI18n, removeAction[0])
        delete targetRef[removeAction[1]]
    }

    if (actions.update.length) {
        const authKey = process.env.TRANSLATOR_KEY
        if (!authKey) throw new Error('Deepl translator key required for auto-translate (process.env.TRANSLATOR_KEY)')

        const translator = new deepl.Translator(authKey)
        let res = await translator.translateText(actions.update.map((i) => i[2]), null, targetLanguage, {
            formality: 'prefer_less',
            ignoreTags: ['x'],
            tagHandling: 'xml',
        })

        const ignoreTagRegex = /<x>[\w]*<\/x>/g
        for (const [i, translated] of res.entries()) {
            // The results come back in the same order as they were submitted.
            // Restore the xml placeholders to the i18n format being used.
            const transformedText = translated.text.replaceAll(ignoreTagRegex, (res) => res.replace('<x>', '{{').replace('</x>', '}}'))
            const targetRef = keyPath(targetI18n, actions.update[i][0], true)
            // Deepl escapes html tags (e.g. < becomes &lt; and > becomes &gt;).
            // We don't want Deepl to ignore those tags, because their content
            // must be translated as well. Instead, decode the html entities afterwards.
            targetRef[actions.update[i][1]] = decode(transformedText)
        }
    } 

    if (actions.update.length || actions.remove.length) {
        await fs.writeFile(targetPath, JSON.stringify(targetI18n, null, 4))
    }
}
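
The placeholder handling is the part of this listing that is easiest to get wrong, so here it is in isolation as a minimal sketch. The function names protectPlaceholders and restorePlaceholders are made up for this example, and the French string is hand-written to stand in for an API response; no call to Deepl is made.

```javascript
// Matches i18next placeholders such as {{name}} and captures the name.
const placeholderRegex = /\{\{(\w*)\}\}/g
// Matches the temporary xml form <x>name</x> and captures the name.
const ignoreTagRegex = /<x>(\w*)<\/x>/g

// {{name}} -> <x>name</x>, so the translator can be told to leave <x> content alone.
function protectPlaceholders(text) {
    return text.replaceAll(placeholderRegex, '<x>$1</x>')
}

// <x>name</x> -> {{name}}, restoring the i18next interpolation syntax.
function restorePlaceholders(text) {
    return text.replaceAll(ignoreTagRegex, '{{$1}}')
}

const source = 'Hello {{name}}, you have {{count}} items in your cart'
const prepped = protectPlaceholders(source)
// A real run would send `prepped` to the translation API. This hand-written
// string only stands in for what a response could look like.
const standInResponse = 'Bonjour <x>name</x>, vous avez <x>count</x> articles dans votre panier'

console.log(prepped)
console.log(restorePlaceholders(standInResponse))
```

Because both directions use a capture group, a placeholder survives the round trip unchanged as long as the translator returns the tag and its content untouched.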

These Deepl translations are working quite well so far! Translating hundreds of untranslated sentences for 6 target languages has become a matter of seconds instead of weeks. Also, our workflow for adding a new translation string has become much easier. Just add a new string to en.json and run the translation task, for example:

TRANSLATOR_KEY=<API_KEY> pnpm run i18n

It's important to tweak the English source text first, because that improves the chance that the target translation is produced in the right context. "Cart" could mean a vehicle you drive, or the place where you collect items before finishing your order. We also added tests to make sure we don't have untranslated or redundant strings compared to the source language. An additional test looks through the source files and tries to match $t occurrences against the base en.json file. This doesn't work for $t(variable_name), but it is a welcome check for translations that may be missing from en.json. This looks like:

test('missing $t tags in base i18n file', async() => {
    const baseDir = path.resolve(path.join(path.dirname(new URL(import.meta.url).pathname), '..'))
    const missingKeys = []

    const translationMatch = /\$t\([\s]*'([a-zA-Z0-9_\s{}.,!?%\-:;"]+)'[(),)?]/g
    let globPattern = `${path.join(baseDir, 'src', 'code', '**', '{*.ts,*.tsx}')}`

    const files = await glob(globPattern)
    for (const filename of files) {
        const data = (await fs.readFile(filename)).toString('utf8')
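        // String.replace is used here only to iterate over all matches;
        // its return value is discarded.
        data.replace(translationMatch, function(pattern:any, $t:string) {
            // `unescape` is not shown in this excerpt (presumably a pattern
            // defined elsewhere in the test file).
            let path = $t.replace(unescape, '')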
            if (typeof path === 'string') {
                let i18nReference = keyPath(localeEn, path)
                if (!i18nReference) {
                    // Do a check whether this is a plural key
                    const oneTerm = keyPath(localeEn, `${path}_one`)
                    const otherTerm = keyPath(localeEn, `${path}_other`)
                    const pluralReference = oneTerm || otherTerm
                    if (!pluralReference) {
                        missingKeys.push(path)
                    }
                }
            }
        })
    }

    assert.equal(missingKeys.length, 0, `$t translations not in en scheme yet: ${missingKeys.join(' ')}`)
})
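
To make the idea behind this test easier to follow, here is a stripped-down sketch that works on a string instead of files. It uses a simpler pattern than the test above (only single-quoted keys made of word characters and dots), and the helpers lookup, hasKey and findMissingKeys are made up for this example; in particular, lookup is not the keyPath helper from our code base.

```javascript
// Resolve a dotted key such as 'checkout.delivery.asap' in a nested object.
// Returns undefined as soon as a segment is missing.
function lookup(locale, dottedKey) {
    let node = locale
    for (const segment of dottedKey.split('.')) {
        if (node === null || typeof node !== 'object' || !(segment in node)) return undefined
        node = node[segment]
    }
    return node
}

// A key counts as present if it exists directly or in a plural form.
function hasKey(locale, dottedKey) {
    return [dottedKey, `${dottedKey}_one`, `${dottedKey}_other`]
        .some((candidate) => lookup(locale, candidate) !== undefined)
}

// Collect keys passed as string literals to $t(...) that the locale lacks.
function findMissingKeys(sourceText, locale) {
    const missing = []
    for (const match of sourceText.matchAll(/\$t\(\s*'([\w.]+)'/g)) {
        if (!hasKey(locale, match[1])) missing.push(match[1])
    }
    return missing
}

const localeEn = {checkout: {cart: 'Cart', item_one: '{{count}} item', item_other: '{{count}} items'}}
const code = `
    $t('checkout.cart')
    $t('checkout.item', {count: 2})
    $t('checkout.total')
    $t(dynamicKey)
`
console.log(findMissingKeys(code, localeEn))
```

Here only checkout.total is reported: checkout.item is covered by its plural forms, and $t(dynamicKey) is skipped because the key is not a string literal.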

To run our tests, we use the test runner built into Node.js, along with a TypeScript loader. The test script:

"scripts": {
    "test": "node --no-warnings --loader=ts-node/esm --test ./test/i18n.ts"
}

With an automated way to manage translations and tests to keep them in sync, all that is left is a workflow for translators to curate translations. Our next post will describe the options human translators have for modifying automatically generated translations with tools like Weblate, and how to integrate both workflows.

Copyright © 2024 Bitstillery - All Rights Reserved