Existing research on cross-lingual retrieval cannot take full advantage of large-scale pretrained language models such as multilingual BERT and XLM. In this paper, we hypothesize that the absence of cross-lingual passage-level relevance data for finetuning and the lack of query-document style pretraining are key factors. We propose to directly finetune language models on the evaluation collection by making Transformers capable of accepting longer sequences. We introduce two novel retrieval-oriented pretraining tasks to further pretrain cross-lingual language models for downstream retrieval tasks, such as cross-lingual ad-hoc retrieval (CLIR) and cross-lingual question answering (CLQA). We construct distant supervision data from multilingual Wikipedia using section alignment to support retrieval-oriented language model pretraining. Experiments on multiple benchmark datasets show that our proposed model significantly improves upon general multilingual language models in both the cross-lingual retrieval and cross-lingual transfer settings. We make our pretraining implementation and checkpoints publicly available for future research.
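To make the finetuning setup concrete, the sketch below shows a generic cross-encoder-style relevance scorer built on a multilingual pretrained encoder. It is a minimal illustration, not the paper's exact implementation: the model name, sequence length, and single-logit relevance head are assumptions, and the paper's retrieval-oriented pretraining tasks and long-sequence mechanism are not reproduced here.

```python
# Minimal sketch (assumed setup, not the paper's exact method): score a
# cross-lingual (query, passage) pair with a multilingual encoder used as a
# relevance classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "xlm-roberta-base"  # assumption: any multilingual encoder could be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.eval()

query = "What is the capital of France?"           # English query
passage = "Paris ist die Hauptstadt Frankreichs."  # German passage

# Encode the pair as a single sequence; a larger max_length reflects the goal of
# accepting longer, passage-level inputs during finetuning.
inputs = tokenizer(query, passage, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    relevance_score = model(**inputs).logits.squeeze(-1)

print(f"Relevance score: {relevance_score.item():.4f}")
```

In practice such a scorer would be finetuned on passage-level relevance labels and used to rerank candidate passages across languages.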
