Improve Knowledge Integrity

Overview

The strategic direction of "Knowledge as a Service" envisions a world in which platforms and tools are available to allies and partners to "organize and exchange free, trusted knowledge beyond Wikimedia". Achieving this goal requires not only new infrastructure for representing, curating, linking, and disseminating knowledge, but also efficient and scalable strategies to preserve the reliability and integrity of this knowledge. Technology platforms across the web are looking at Wikipedia as the neutral arbiter of information, but as Wikimedia aspires to extend its scope and scale, the possibility that parties with special interests will manipulate content, or that bias will go undetected, becomes material.

We have been leading projects to help our communities represent, curate, and understand information provenance in Wikimedia projects more efficiently. We are conducting novel research on why editors source information and how readers access sources; developing algorithms to identify statements in need of sources and gaps in information provenance; and designing data structures to represent, annotate, and analyze source metadata in machine-readable formats, as well as tools to monitor, in real time, changes made to references across the Wikimedia ecosystem.

More information can be found in our white paper.

Recent updates

Graph-Linguistic Fusion: Using Language Models for Wikidata Vandalism Detection (Jul 2025)

A new vandalism detection system for Wikidata using graph-linguistic fusion (Wikidata Revert Risk).

A Comparative Study of Reference Reliability in Multiple Language Editions of Wikipedia (Oct 2023)

Quantifies the cross-lingual patterns of the perennial sources list, a collection of reliability labels for web domains identified and agreed upon by Wikipedia editors.

Fair multilingual vandalism detection system for Wikipedia (Jun 2023)

The next generation of ML tools for Knowledge Integrity, providing a fair multilingual vandalism detection system, now in production.

Templates and Trust-o-meters: Towards a widely deployable indicator of trust in Wikipedia (May 2022)

A study on designing widely deployable trust indicators for readers of Wikipedia.

Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia (May 2021)

A dataset of articles with reliability concerns on English Wikipedia for training language models to detect content reliability issues.

Tracking Knowledge Propagation Across Wikipedia Languages (Mar 2021)

A dataset of inter-language knowledge propagation in Wikipedia.

Resources and links

Research pages

Slides

Videos

WikiCite: Wikidata as a structured repository of bibliographic data

Publications

Mykola Trokhymovych, Lydia Pintscher, Ricardo Baeza-Yates, and Diego Saez-Trumper. 2025. Graph-Linguistic Fusion: Using Language Models for Wikidata Vandalism Detection. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL '25 Industry).
Aitolkyn Baigutanova, Diego Saez-Trumper, Miriam Redi, Meeyoung Cha, and Pablo Aragón. 2023. A Comparative Study of Reference Reliability in Multiple Language Editions of Wikipedia. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM '23). https://doi.org/10.1145/3583780.3615254
Mykola Trokhymovych, Muniza Aslam, Ai-Jou Chou, Ricardo Baeza-Yates, and Diego Saez-Trumper. 2023. Fair multilingual vandalism detection system for Wikipedia. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '23).
Aitolkyn Baigutanova, Jaehyeon Myung, Diego Saez-Trumper, Ai-Jou Chou, Miriam Redi, Changwook Jung, and Meeyoung Cha. 2023. Longitudinal Assessment of Reference Quality on Wikipedia. In Proceedings of The Web Conference 2023 (WWW '23). https://doi.org/10.1145/3543507.3583218
Andrew Kuznetsov, Margeigh Novotny, Jessica Klein, Diego Saez-Trumper, and Aniket Kittur. 2022. Templates and Trust-o-meters: Towards a widely deployable indicator of trust in Wikipedia. CHI '22: CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3491102.3517523
KayYen Wong, Miriam Redi, and Diego Saez-Trumper. 2021. Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia. SIGIR '21. https://doi.org/10.1145/3404835.3463253
Rodolfo Valentim, Giovanni Comarela, Souneil Park, and Diego Saez-Trumper. 2021. Tracking Knowledge Propagation Across Wikipedia Languages. Proceedings of the Fifteenth International AAAI Conference on Web and Social Media (ICWSM '21).
Mykola Trokhymovych and Diego Saez-Trumper. 2021. WikiCheck: An end-to-end open source Automatic Fact-Checking API based on Wikipedia. 30th ACM International Conference on Information and Knowledge Management (CIKM '21).
Pablo Aragón and Diego Sáez-Trumper. 2021. A preliminary approach to knowledge integrity risk assessment in Wikipedia projects. MIS2'21: Misinformation and Misbehavior Mining on the Web Workshop held in conjunction with KDD 2021.
Tiziano Piccardi, Miriam Redi, Giovanni Colavizza, and Robert West. 2020. Quantifying Engagement with Citations on Wikipedia. In Proceedings of The Web Conference 2020 (WWW '20). https://doi.org/10.1145/3366423.3380300
Diego Saez-Trumper. 2019. Online Disinformation and the Role of Wikipedia.
Miriam Redi, Besnik Fetahu, Jonathan Morgan, and Dario Taraborelli. 2019. Citation Needed: A Taxonomy and Algorithmic Assessment of Wikipedia's Verifiability. In Proceedings of The Web Conference 2019 (WWW '19). https://doi.org/10.1145/3308558.3313618
Dario Taraborelli, Lydia Pintscher, Daniel Mietchen, and Sarah Rodlund. 2017. WikiCite 2017 Report. figshare. https://doi.org/10.6084/m9.figshare.5648233
Dario Taraborelli, Jonathan Dugan, Lydia Pintscher, Daniel Mietchen, and Cameron Neylon. 2016. WikiCite 2016 Report. figshare. https://doi.org/10.6084/m9.figshare.4042530
