iPify has developed a unique word counting methodology that aims to accurately reflect the effort required to translate a patent.
Context
Throughout the intellectual property industry, word count is a factor that drives the price of translation. Traditionally, this word count has been calculated using word processing tools such as Microsoft Word.
Tools like Word, however, do not consider the fact that there can be large parts of a patent that do not require translation, such as numbers, formulas and special characters. Word considers them all as individual words.
Because many patents contain such elements, for example number-heavy NMR formulas and chemical equations, the word count produced by Microsoft Word can paint a very different picture of the effort actually required to translate them.
The iPify solution
To address this, we differentiate between translatable and non-translatable words in a patent.
All translatable words are considered full words, as they require translation effort, while non-translatable words are given a weighting percentage in line with the complexity of the patent.
We apply a weighting percentage because, while non-translatable words do not require the same translation effort as translatable ones, they still need to be carefully handled, especially in more complex patents such as those including sequence listings, tables and long chemical formulas. This results in a new non-translatable word count that more realistically reflects the effort required to transfer them into the target text.
We then add that new non-translatable word count to the translatable word count we initially found to generate the final word count for that patent.
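The combination described above can be written as a short Python sketch. The function name and the `weight` parameter are illustrative assumptions; the actual weighting percentages iPify applies for a given level of patent complexity are not stated here.

```python
def final_word_count(translatable: int, non_translatable: int, weight: float) -> float:
    """Add the weighted non-translatable count to the translatable count.

    `weight` is a hypothetical fraction between 0 and 1 standing in for the
    complexity-dependent weighting percentage; real values are not published here.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    # Translatable words count in full; non-translatable words count partially.
    return translatable + weight * non_translatable
```

For instance, with an assumed weight of 25%, a text with 100 translatable and 40 non-translatable words would yield `final_word_count(100, 40, 0.25)`, that is 110 words.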
More detailed explanation
- Our algorithm detects the number of ‘translatable’ words in a text, that is, words that require a direct translation in the target language. We define this using the following rules:
  - Any string containing only letters that is 3 characters or longer (excluding most strings that are fully in uppercase). This can include substrings (strings that are part of larger strings, such as chemical names).
  - Strings that are a 1 or 2-letter word in the source language.
  - Strings that match the above rules and are in translatable parts of sequence listings.
  - Non-translatable words that are between two translatable words (e.g. '2' in the phrase 'Figure 2 refers to...').
- Our algorithm then detects the ‘total’ number of words in the text.
- We subtract the ‘translatable’ words from the ‘total’ words to get the number of words that are ‘non-translatable’, i.e. that do not require translation effort.
- We apply a weighting percentage to the non-translatable word count depending on the complexity of the patent.
- The newly weighted non-translatable word count is added to the translatable word count to produce the final word count.
| Phrase | Number of translatable words | Weighted number of non-translatable words | Final iPify word count | Total word count in Microsoft Word |
| --- | --- | --- | --- | --- |
| Consisting of 1,4-dimercaptobutane- 2,3-diol and tris(2-carboxylethyl)phosphine | 8 | 2 | 10 | 6 |
| 1 mixture of diastereomers δ 8.40 - 8.28 (m, 2H), 8.26 (dd,J=5.3, 3.7 Hz, 2H) | 3 | 7 | 10 | 15 |
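The first two detection rules can be approximated with a minimal Python sketch. This is not iPify's algorithm: the `SHORT_WORDS` set is a hypothetical, deliberately tiny stand-in for a source-language dictionary, every fully uppercase string is excluded (the rule only says "most"), and the rules for sequence listings and for non-translatable words between two translatable words are left out.

```python
import re

# Hypothetical, incomplete list of 1- and 2-letter English words (assumption).
SHORT_WORDS = {"a", "an", "as", "at", "by", "in", "is", "it", "of", "on", "or", "to"}

# A run of letters only; digits, punctuation and underscores split the runs,
# so substrings of chemical names are picked up individually.
LETTER_RUN = re.compile(r"[^\W\d_]+")

def count_translatable(text: str) -> int:
    """Count letter runs that look translatable under the simplified rules."""
    count = 0
    for run in LETTER_RUN.findall(text):
        if len(run) >= 3:
            # Simplification: skip every fully uppercase run.
            if not run.isupper():
                count += 1
        elif run.lower() in SHORT_WORDS:
            # 1- or 2-letter strings count only if they are known words.
            count += 1
    return count
```

Applied to the two example phrases in the table, this simplified sketch returns 8 and 3 translatable words respectively. Other texts may diverge from iPify's counts because of the omitted rules.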
For words in drawings, we extract any text from the drawings using dedicated software, analyze the words in it, and provide a word count for translatable words only (as non-translatable words in the drawings would not be touched).
The iPify word count calculations do not currently take into account any CAT tool translation memory matches and/or internal repetitions.
For sequence listings, iPify's word count only counts words in translatable sequences, excluding any non-translatable ones (as per WIPO guidelines). In addition, any repeated translatable sequences only count once towards the word count, with further duplicate sequences being ignored.
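The "count once" behaviour for repeated sequences can be illustrated with a small Python sketch. The function name, the whitespace-only normalization, and the use of a plain whitespace split as the word counter are all assumptions made for illustration; how iPify decides that two sequences are duplicates is not specified here.

```python
from typing import Iterable

def count_unique_sequence_words(sequences: Iterable[str]) -> int:
    """Sum word counts over translatable sequences, counting duplicates once."""
    seen = set()
    total = 0
    for text in sequences:
        # Assumption: two sequences are duplicates if they match after
        # collapsing runs of whitespace.
        key = " ".join(text.split())
        if key in seen:
            continue  # later duplicates are ignored
        seen.add(key)
        # Stand-in word counter: whitespace-separated tokens.
        total += len(key.split())
    return total
```

Given the hypothetical inputs `["synthetic construct", "synthetic construct", "signal peptide"]`, the repeated entry is skipped and the function returns 4.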