Most people working with data believe that the entire data workflow starts with data extraction. That may be so, but I believe the first and most crucial step is understanding the data. Quite literally. Numerical values are written in the universal language of mathematics, but what should we do when the text values are in a language other than English or Polish? Of course, we can embark on an exciting adventure and enroll in a language course, but if we don’t have several months to spare, a simple Python script that translates a given column can be useful:
import pandas as pd
from googletrans import Translator
# read xlsx
df = pd.read_excel('input.xlsx')
# create Translator() obj
translator = Translator()
# translate column 'label' for all rows
for index, row in df.iterrows():
    original_text = row['label']
    translation = translator.translate(original_text, src='de', dest='en')
    # keep the original text if the translation comes back empty
    if translation.text:
        translated_text = translation.text
    else:
        translated_text = original_text
    df.at[index, 'label'] = translated_text
# writing all down to new xlsx
df.to_excel('output.xlsx', index=False)
The code above reads the source file, selects the target column (“label”), and translates it row by row, keeping the original text whenever the translation comes back empty. It is worth mentioning that calling the translate method without a target language (dest) translates the text into the default language, which is English. Likewise, leaving out src lets the library try to detect the source language on its own.
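One thing the loop does not handle is cells that hold no text at all: pandas reads empty Excel cells as NaN (a float), and passing that to the translator would most likely fail. A minimal sketch of a guard is shown here. The names translate_cell and translate_fn are hypothetical helpers, not part of googletrans; translate_fn stands for any function that takes a string and returns the translated string.

```python
def translate_cell(value, translate_fn):
    """Translate one cell, returning it unchanged when it is not usable text.

    translate_fn is a hypothetical callable: str -> str.
    """
    # Empty Excel cells arrive from pandas as float NaN; skip them,
    # along with any other non-string or blank values.
    if not isinstance(value, str) or not value.strip():
        return value
    translated = translate_fn(value)
    # Same fallback as the loop: keep the original if nothing came back.
    return translated if translated else value
```

Inside the loop this could be called as translate_cell(row['label'], lambda t: translator.translate(t, src='de', dest='en').text), assuming, as the scripts in this post do, that translate returns its result directly.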
Of course, there is always room for improvement. For example, the whole translation loop from the first script can be “shortened” to a single lambda function passed to apply:
import pandas as pd
from googletrans import Translator
df = pd.read_excel('input.xlsx')
translator = Translator()
df['label'] = df['label'].apply(lambda x: translator.translate(x, src='de', dest='en').text)
df.to_excel('output.xlsx', index=False)
Will this speed up the script? I doubt it, since apply still calls the translator once for every row. But hey! The script was supposed to be simple, not fast :)
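If speed ever does matter, one option that stays simple is to translate each distinct value only once and reuse the result for repeats. The sketch below assumes the column contains repeated labels; translate_unique and translate_fn are hypothetical names, and translate_fn is again any function that maps a string to its translation.

```python
def translate_unique(values, translate_fn):
    """Translate each distinct text value once and reuse it for repeats.

    translate_fn is a hypothetical callable: str -> str.
    """
    cache = {}
    result = []
    for value in values:
        # Leave non-text cells (numbers, NaN from empty cells) untouched.
        if not isinstance(value, str):
            result.append(value)
            continue
        if value not in cache:
            cache[value] = translate_fn(value)  # one call per distinct string
        result.append(cache[value])
    return result
```

With the DataFrame from the scripts above, this could be wired in as df['label'] = translate_unique(df['label'], lambda t: translator.translate(t, src='de', dest='en').text). How much it helps depends entirely on how often the labels repeat.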
Happy translating!