As modern neural machine translation (NMT) systems are now widely deployed, their security vulnerabilities require close scrutiny. NMT systems have recently been found to be vulnerable to targeted attacks, which cause them to produce specific, unsolicited, and even harmful translations. These attacks are typically mounted through white-box analysis of a known target system, in which adversarial inputs that trigger the targeted translations are discovered. However, this approach is less viable when the target system is a black box whose internals are unknown to the public (e.g., secured commercial systems). In this paper, we show that targeted attacks on black-box NMT systems are feasible by poisoning a small fraction of their parallel training data. We show that this attack can be achieved simply through targeted corruption of web documents that are crawled to form the system’s training data. We then analyse the effectiveness of poisoning in two common NMT training scenarios: one-off training and the pre-train & fine-tune paradigm. Our findings are alarming: even on state-of-the-art systems trained with massive parallel data (tens of millions), the attacks still succeed (over 50% success rate) with a poisoning rate of only 0.006%. Lastly, we discuss available defences to counter such attacks.
