{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,8]],"date-time":"2026-04-08T15:41:47Z","timestamp":1775662907836,"version":"3.50.1"},"reference-count":49,"publisher":"MIT Press","license":[{"start":{"date-parts":[[2023,9,6]],"date-time":"2023-09-06T00:00:00Z","timestamp":1693958400000},"content-version":"vor","delay-in-days":248,"URL":"https:\/\/linproxy.fan.workers.dev:443\/https\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":["direct.mit.edu"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2023,9,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n               <jats:p>MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively encompass over three billion native speakers around the world. This resource is designed to support monolingual retrieval tasks, where the queries and the corpora are in the same language. In total, we have gathered over 726k high-quality relevance judgments for 78k queries over Wikipedia in these languages, where all annotations have been performed by native speakers hired by our team. MIRACL covers languages that are both typologically close as well as distant from 10 language families and 13 sub-families, associated with varying amounts of publicly available resources. Extensive automatic heuristic verification and manual assessments were performed during the annotation process to control data quality. In total, MIRACL represents an investment of around five person-years of human annotator effort. Our goal is to spur research on improving retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have traditionally been underserved. 
MIRACL is available at https:\/\/linproxy.fan.workers.dev:443\/http\/miracl.ai\/.<\/jats:p>","DOI":"10.1162\/tacl_a_00595","type":"journal-article","created":{"date-parts":[[2023,9,6]],"date-time":"2023-09-06T14:08:06Z","timestamp":1694009286000},"page":"1114-1131","update-policy":"https:\/\/linproxy.fan.workers.dev:443\/https\/doi.org\/10.1162\/mitpressjournals.corrections.policy","source":"Crossref","is-referenced-by-count":34,"title":["<b>MIRACL<\/b>: A Multilingual Retrieval Dataset Covering 18 Diverse Languages"],"prefix":"10.1162","volume":"11","author":[{"given":"Xinyu","family":"Zhang","sequence":"first","affiliation":[{"name":"David R. Cheriton School of Computer Science, University of Waterloo, Canada"}]},{"given":"Nandan","family":"Thakur","sequence":"additional","affiliation":[{"name":"David R. Cheriton School of Computer Science, University of Waterloo, Canada"}]},{"given":"Odunayo","family":"Ogundepo","sequence":"additional","affiliation":[{"name":"David R. Cheriton School of Computer Science, University of Waterloo, Canada"}]},{"given":"Ehsan","family":"Kamalloo","sequence":"additional","affiliation":[{"name":"David R. Cheriton School of Computer Science, University of Waterloo, Canada"}]},{"given":"David","family":"Alfonso-Hermelo","sequence":"additional","affiliation":[{"name":"Huawei Noah\u2019s Ark Lab, Canada"}]},{"given":"Xiaoguang","family":"Li","sequence":"additional","affiliation":[{"name":"Huawei Noah\u2019s Ark Lab, China"}]},{"given":"Qun","family":"Liu","sequence":"additional","affiliation":[{"name":"Huawei Noah\u2019s Ark Lab, China"}]},{"given":"Mehdi","family":"Rezagholizadeh","sequence":"additional","affiliation":[{"name":"Huawei Noah\u2019s Ark Lab, Canada"}]},{"given":"Jimmy","family":"Lin","sequence":"additional","affiliation":[{"name":"David R. 
Cheriton School of Computer Science, University of Waterloo, Canada"}]}],"member":"281","published-online":{"date-parts":[[2023,9,1]]},"reference":[{"key":"2023090614075107500_bib1","doi-asserted-by":"publisher","first-page":"547","DOI":"10.18653\/v1\/2021.naacl-main.46","article-title":"XOR QA: Cross-lingual open-retrieval question answering","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Asai","year":"2021"},{"issue":"1","key":"2023090614075107500_bib2","doi-asserted-by":"publisher","first-page":"30","DOI":"10.1093\/llc\/fqu047","article-title":"Identifying translationese at the word and sub-word level","volume":"31","author":"Avner","year":"2016","journal-title":"Digital Scholarship in the Humanities"},{"key":"2023090614075107500_bib3","article-title":"MS MARCO: A human generated MAchine Reading COmprehension dataset","author":"Bajaj","year":"2018","journal-title":"arXiv:1611.09268v3"},{"key":"2023090614075107500_bib4","article-title":"mMARCO: A multilingual version of the MS MARCO passage ranking dataset","author":"Bonifacio","year":"2021","journal-title":"arXiv:2108.13897"},{"key":"2023090614075107500_bib5","doi-asserted-by":"publisher","first-page":"454","DOI":"10.1162\/tacl_a_00317","article-title":"TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages","volume":"8","author":"Clark","year":"2020","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2023090614075107500_bib6","doi-asserted-by":"publisher","first-page":"1566","DOI":"10.1145\/3404835.3462804","article-title":"MS MARCO: Benchmarking ranking models in the large-data regime","volume-title":"Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 
2021)","author":"Craswell","year":"2021"},{"key":"2023090614075107500_bib7","doi-asserted-by":"publisher","DOI":"10.26818\/9780814252703","article-title":"Morphological types of languages","volume-title":"Language Files: Materials for an Introduction to Language and Linguistics, 12th Edition","author":"Dawson","year":"2016"},{"key":"2023090614075107500_bib8","doi-asserted-by":"publisher","first-page":"159","DOI":"10.3115\/v1\/D14-1018","article-title":"Asymmetric features of human generated translation","volume-title":"Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Eetemadi","year":"2014"},{"key":"2023090614075107500_bib9","doi-asserted-by":"crossref","first-page":"2288","DOI":"10.1145\/3404835.3463098","article-title":"SPLADE: Sparse lexical and expansion model for first stage ranking","volume-title":"Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Formal","year":"2021"},{"key":"2023090614075107500_bib10","doi-asserted-by":"crossref","first-page":"3120","DOI":"10.1145\/3539618.3591805","article-title":"Tevatron: An efficient and flexible toolkit for Neural Retrieval","volume-title":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","author":"Gao","year":"2023"},{"key":"2023090614075107500_bib11","doi-asserted-by":"publisher","first-page":"316","DOI":"10.18653\/v1\/D18-1029","article-title":"On the relation between linguistic typology and (limitations of) multilingual language modeling","volume-title":"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing","author":"Gerz","year":"2018"},{"key":"2023090614075107500_bib12","doi-asserted-by":"publisher","first-page":"178","DOI":"10.1086\/464575","article-title":"A quantitative approach to the morphological typology of 
language","volume":"26","author":"Greenberg","year":"1960","journal-title":"International Journal of American Linguistics"},{"key":"2023090614075107500_bib13","first-page":"4411","article-title":"XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation","volume-title":"Proceedings of the 37th International Conference on Machine Learning","author":"Hu","year":"2020"},{"key":"2023090614075107500_bib14","article-title":"Unsupervised dense information retrieval with contrastive learning","author":"Izacard","year":"2022","journal-title":"Transactions on Machine Learning Research"},{"key":"2023090614075107500_bib15","doi-asserted-by":"publisher","first-page":"5833","DOI":"10.18653\/v1\/2021.emnlp-main.471","article-title":"A massively multilingual analysis of cross-linguality in shared embedding space","author":"Jones","year":"2021","journal-title":"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing"},{"key":"2023090614075107500_bib16","doi-asserted-by":"publisher","first-page":"1601","DOI":"10.18653\/v1\/P17-1147","article-title":"TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension","volume-title":"Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)","author":"Joshi","year":"2017"},{"key":"2023090614075107500_bib17","doi-asserted-by":"publisher","first-page":"6282","DOI":"10.18653\/v1\/2020.acl-main.560","article-title":"The state and fate of linguistic diversity and inclusion in the NLP world","volume-title":"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics","author":"Joshi","year":"2020"},{"key":"2023090614075107500_bib18","doi-asserted-by":"publisher","first-page":"6769","DOI":"10.18653\/v1\/2020.emnlp-main.550","article-title":"Dense passage retrieval for open-domain question answering","volume-title":"Proceedings of the 2020 Conference on Empirical 
Methods in Natural Language Processing (EMNLP)","author":"Karpukhin","year":"2020"},{"key":"2023090614075107500_bib19","doi-asserted-by":"publisher","first-page":"39","DOI":"10.1145\/3397271.3401075","article-title":"ColBERT: Efficient and effective passage search via contextualized late interaction over BERT","volume-title":"Proceedings of the 43rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020)","author":"Khattab","year":"2020"},{"key":"2023090614075107500_bib20","doi-asserted-by":"publisher","first-page":"452","DOI":"10.1162\/tacl_a_00276","article-title":"Natural Questions: A benchmark for question answering research","volume":"7","author":"Kwiatkowski","year":"2019","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2023090614075107500_bib21","article-title":"Overview of the TREC 2022 NeuCLIR track","volume-title":"Proceedings of the 31st Text REtrieval Conference","author":"Lawrie","year":"2023"},{"issue":"4","key":"2023090614075107500_bib22","doi-asserted-by":"publisher","first-page":"799","DOI":"10.1162\/COLI_a_00111","article-title":"Language models for machine translation: Original vs. 
translated texts","volume":"38","author":"Lembersky","year":"2012","journal-title":"Computational Linguistics"},{"key":"2023090614075107500_bib23","doi-asserted-by":"publisher","first-page":"2939","DOI":"10.1145\/3477495.3531725","article-title":"Fostering coopetition while plugging leaks: The design and implementation of the MS MARCO leaderboards","volume-title":"Proceedings of the 45th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2022)","author":"Lin","year":"2022"},{"key":"2023090614075107500_bib24","doi-asserted-by":"publisher","first-page":"2356","DOI":"10.1145\/3404835.3463238","article-title":"Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations","volume-title":"Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021)","author":"Lin","year":"2021"},{"key":"2023090614075107500_bib25","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-031-02181-7","volume-title":"Pretrained Transformers for Text Ranking: BERT and Beyond","author":"Lin","year":"2021"},{"key":"2023090614075107500_bib26","doi-asserted-by":"publisher","first-page":"1389","DOI":"10.1162\/tacl_a_00433","article-title":"MKQA: A linguistically diverse benchmark for multilingual open domain question answering","volume":"9","author":"Longpre","year":"2021","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2023090614075107500_bib27","doi-asserted-by":"publisher","first-page":"246","DOI":"10.1007\/978-3-030-45442-5_31","article-title":"Teaching a new dog old tricks: Resurrecting multilingual retrieval using zero-shot learning","volume-title":"Proceedings of the 42nd European Conference on Information Retrieval, Part II (ECIR 
2020)","author":"MacAvaney","year":"2020"},{"key":"2023090614075107500_bib28","doi-asserted-by":"publisher","first-page":"382","DOI":"10.1007\/978-3-030-99736-6_26","article-title":"Transfer learning approaches for building cross-language dense retrieval models","volume-title":"Proceedings of the 44th European Conference on Information Retrieval (ECIR 2022), Part I","author":"Nair","year":"2022"},{"key":"2023090614075107500_bib29","article-title":"Passage re-ranking with BERT","author":"Nogueira","year":"2019","journal-title":"arXiv:1901.04085"},{"key":"2023090614075107500_bib30","doi-asserted-by":"publisher","first-page":"708","DOI":"10.18653\/v1\/2020.findings-emnlp.63","article-title":"Document ranking with a pretrained sequence-to-sequence model","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2020","author":"Nogueira","year":"2020"},{"key":"2023090614075107500_bib31","doi-asserted-by":"publisher","DOI":"10.1017\/9781108378291.011","article-title":"Inflectional morphology","volume-title":"The Cambridge Handbook of Germanic Linguistics","author":"N\u00fcbling","year":"2020"},{"key":"2023090614075107500_bib32","doi-asserted-by":"publisher","first-page":"4996\u20135001","DOI":"10.18653\/v1\/P19-1493","article-title":"How multilingual is multilingual BERT?","volume-title":"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics","author":"Pires","year":"2019"},{"key":"2023090614075107500_bib33","doi-asserted-by":"publisher","first-page":"279","DOI":"10.1515\/lity.1999.3.3.279","article-title":"Split morphology: How agglutination and flexion mix","volume":"3","author":"Plank","year":"1999","journal-title":"Linguistic Typology"},{"issue":"3","key":"2023090614075107500_bib34","doi-asserted-by":"publisher","first-page":"559","DOI":"10.1162\/coli_a_00357","article-title":"Modeling language variation and universals: A survey on typological linguistics for natural language 
processing","volume":"45","author":"Ponti","year":"2019","journal-title":"Computational Linguistics"},{"key":"2023090614075107500_bib35","first-page":"5835","article-title":"RocketQA: An optimized training approach to dense passage retrieval for open-domain question answering","volume-title":"Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies","author":"Qu","year":"2021"},{"key":"2023090614075107500_bib36","doi-asserted-by":"publisher","first-page":"419","DOI":"10.1162\/tacl_a_00148","article-title":"Unsupervised identification of translationese","volume":"3","author":"Rabinovich","year":"2015","journal-title":"Transactions of the Association for Computational Linguistics"},{"key":"2023090614075107500_bib37","doi-asserted-by":"publisher","first-page":"2383","DOI":"10.18653\/v1\/D16-1264","article-title":"SQuAD: 100,000+ questions for machine comprehension of text","volume-title":"Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing","author":"Rajpurkar","year":"2016"},{"issue":"4","key":"2023090614075107500_bib38","doi-asserted-by":"publisher","first-page":"333","DOI":"10.1561\/1500000019","article-title":"The probabilistic relevance framework: BM25 and beyond","volume":"3","author":"Robertson","year":"2009","journal-title":"Foundations and Trends in Information Retrieval"},{"key":"2023090614075107500_bib39","doi-asserted-by":"publisher","first-page":"3715","DOI":"10.18653\/v1\/2022.naacl-main.272","article-title":"ColBERTv2: Effective and efficient retrieval via lightweight late interaction","author":"Santhanam","year":"2021","journal-title":"Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language 
Technologies"},{"key":"2023090614075107500_bib40","doi-asserted-by":"publisher","first-page":"2768","DOI":"10.18653\/v1\/2020.findings-emnlp.249","article-title":"Cross-lingual training of neural models for document ranking","volume-title":"Findings of the Association for Computational Linguistics: EMNLP 2020","author":"Shi","year":"2020"},{"key":"2023090614075107500_bib41","doi-asserted-by":"publisher","first-page":"4160","DOI":"10.18653\/v1\/2020.emnlp-main.340","article-title":"CLIRMatrix: A massively large collection of bilingual and multilingual datasets for cross-lingual information retrieval","volume-title":"Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)","author":"Sun","year":"2020"},{"key":"2023090614075107500_bib42","article-title":"BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models","volume-title":"Neural Information Processing Systems: Datasets and Benchmarks Track","author":"Thakur","year":"2021"},{"issue":"1","key":"2023090614075107500_bib43","doi-asserted-by":"publisher","first-page":"98","DOI":"10.1093\/llc\/fqt031","article-title":"On the features of translationese","volume":"30","author":"Volansky","year":"2015","journal-title":"Digital Scholarship in the Humanities"},{"key":"2023090614075107500_bib44","doi-asserted-by":"publisher","first-page":"315","DOI":"10.1145\/290941.291017","article-title":"Variations in relevance judgments and the measurement of retrieval effectiveness","volume-title":"Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1998)","author":"Voorhees","year":"1998"},{"key":"2023090614075107500_bib45","first-page":"4003","article-title":"CCNet: Extracting high quality monolingual datasets from web crawl data","volume-title":"Proceedings of the Twelfth Language Resources and Evaluation 
Conference","author":"Wenzek","year":"2020"},{"key":"2023090614075107500_bib46","article-title":"Approximate nearest neighbor negative contrastive learning for dense text retrieval","volume-title":"Proceedings of the 9th International Conference on Learning Representations (ICLR 2021)","author":"Xiong","year":"2021"},{"issue":"4","key":"2023090614075107500_bib47","doi-asserted-by":"publisher","first-page":"Article 16","DOI":"10.1145\/3239571","article-title":"Anserini: Reproducible ranking baselines using Lucene","volume":"10","author":"Yang","year":"2018","journal-title":"Journal of Data and Information Quality"},{"key":"2023090614075107500_bib48","doi-asserted-by":"publisher","first-page":"127","DOI":"10.18653\/v1\/2021.mrl-1.12","article-title":"Mr. TyDi: A multi-lingual benchmark for dense retrieval","volume-title":"Proceedings of the 1st Workshop on Multilingual Representation Learning","author":"Zhang","year":"2021"},{"key":"2023090614075107500_bib49","article-title":"Towards best practices for training multilingual dense retrieval models","author":"Zhang","year":"2022","journal-title":"arXiv:2204.02363"}],"container-title":["Transactions of the Association for Computational 
Linguistics"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/linproxy.fan.workers.dev:443\/https\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00595\/2157340\/tacl_a_00595.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/linproxy.fan.workers.dev:443\/https\/direct.mit.edu\/tacl\/article-pdf\/doi\/10.1162\/tacl_a_00595\/2157340\/tacl_a_00595.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,6]],"date-time":"2023-09-06T14:08:15Z","timestamp":1694009295000},"score":1,"resource":{"primary":{"URL":"https:\/\/linproxy.fan.workers.dev:443\/https\/direct.mit.edu\/tacl\/article\/doi\/10.1162\/tacl_a_00595\/117438\/MIRACL-A-Multilingual-Retrieval-Dataset-Covering"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023]]},"references-count":49,"URL":"https:\/\/linproxy.fan.workers.dev:443\/https\/doi.org\/10.1162\/tacl_a_00595","relation":{},"ISSN":["2307-387X"],"issn-type":[{"value":"2307-387X","type":"electronic"}],"subject":[],"published-other":{"date-parts":[[2023]]},"published":{"date-parts":[[2023]]}}}