Skip to content

Research Report N° 11

The eleventh in a series of biannual reports from the Wikimedia Foundation Research team.

March 2025

Executive Summary

Hi. Welcome to our bi-annual Research Report. July 2024 marked the beginning of the fiscal year in the Wikimedia Foundation and in this report we share with you what our team contributed to during the first six months of the year.

A mixed method approach to research, in action.

In 2023, the Research team integrated survey methods as well as user and design research expertise (Read more in Research Reports No. 8 and No. 9). In 2024, the collaboration between research science, research engineering, survey methods, and design research expertise in the team enabled us to tackle strategic open research questions that we were not equipped to address before (see administrator trends and causes and who are moderators?). We are proud of the impact we have had through mixed method research and we look forward to continuing bringing this type of impact to the Wikimedia Movement.

Reducing the cost and risk of experimentation with new technologies in platforms, products and features.

One of the important functions of a Research team in a product and technology-led organization is to be bold in experimentation, absorb some of the risks of experimentation with new technologies, and provide recommendations about what new technologies can be used in the Wikimedia Foundation’s platforms, products and features at the service of the Movement. We experimented with using LLMs for building models to detect peacock language, informing further development for a new Edit Check feature; We prototyped a LLM for generating simple article summaries that is now under product development for a pilot reading experience; We developed and tested a framework for testing AI models developed by the research and developer communities outside of Wikimedia Foundation for product and features that the Wikimedia Foundation develops and offers the communities.

Informing and supporting important decisions by Wikimedia’s organized groups and the Wikimedia Foundation.

It is part of our team’s mission to inform important conversations, deliberations, and decisions by organized groups in the Wikimedia Movement with research models and insights, and to raise awareness about important research findings and recommendations. To this end, we investigated the potential causes of attrition amongst the Wikipedia administrator community and learned about the potential causes of attrition as well as barriers that potential administrators face for becoming administrators; We published a privacy and research whitepaper to support Wikipedia volunteers and researchers to make more informed trade-off decisions when navigating the need for research transparency and volunteer privacy; We wrote about the challenges of data contamination in the age of more widespread use of AI to raise awareness within the broader research and scientific communities about the severity of the issue and its implications.

Advancing the understanding of the Wikimedia projects.

Millions of readers worldwide use Wikipedia for learning. This makes Wikipedia a unique resource on the internet to learn about architectural styles of human curiosity. Through the use of Wikipedia readership data in the mobile apps and by conducting a study in 14 languages of Wikipedia in 50 countries, we replicated the existence of "busybody" and "hunter" curiosity styles, and for the first time we identified quantitative evidence for the existence of the "dancer" curiosity styles, previously predicted by a historico-philosophical examination of texts over two millennia.

The remainder of this report is organized as follows: In the Projects section you can learn more about more than 40 projects and initiatives that we have led or significantly contributed to; In The people on the team, you can learn about changes in our team; In Events, you can learn about upcoming events that you may be interested to participate in; In Donors, we thank the donors whose support make our work possible; and in Keep in touch with us, we share information about how you can stay in touch with us.

Projects

The large majority of the work of our team is done in the open and reported here. We also provide consultation to other teams at the Wikimedia Foundation (WMF) or external researchers and not all of that work has public outputs. We strive to share what we can and will include these additional projects as they become publicly available in future Research Reports.

Address Knowledge Gaps

We develop models and insights using scientific methods to identify, measure, and bridge Wikimedia’s knowledge gaps

LLMs for text simplification

We established a technical direction for text simplification by extending our previous explorations to more languages via Aya-Expanse-32B model and supporting experiments by the Web Team to surface simple summaries to readers. Learn more

A model for article-country detection, deployed

We developed and (in collaboration with the ML team) deployed a model for inferring what countries are associated with Wikipedia articles in any language using Wikidata, categories, and article hyperlinks. Learn more Try it out

We published a paper in EMNLP '24 about the challenge of recommending where to add new links to Wikipedia articles when the relevant sentence does not yet exist in the article. Paper

A model for readability in Wikipedia

We published and presented a paper in ACL '24 about scoring how difficult it is to read and understand Wikipedia articles across languages. Paper

Metrics to measure Wikipedia’s language gap

Continuing our work to expand content gap metrics, we proposed several potential language gap metrics that could capture overall language access as well as coverage within languages of impactful topics. Task

Deeper understanding of reader curiosity

We published a paper in Science Advances about how reader curiosity expresses itself across 14 languages and 50 countries. Paper

Readers Survey 2024

We launched the 2024 Readers Survey in 11 languages covering over 90% of Wikipedia readership to better understand reader demographics and behavior. Task

Improve Knowledge Integrity

We develop models and insights using scientific methods to support the technology and policy needs of the Wikimedia projects in the areas of misinformation, disinformation, and content integrity.

Through interviews, surveys and log analysis, we investigated the patterns and potential causes behind challenges in Wikipedia admin community’s recruitment and attrition. Read more

A model for moderation detection

We developed a draft working definition and model for detecting Wikipedia moderation activity to be used by WMF teams to better understand moderation activity and measure the impact of interventions. Task

A deployed model for Reference Risk

We collaborated with the ML team to optimize and deploy a model that we developed for estimating the likelihood that the references in a Wikipedia article will survive future edits as a proxy for source reliability in any language edition. Task

A deployed model for Reference Need

We collaborated with the ML team to optimize and deploy a model that we developed for estimating what proportion of sentences in an article are missing a citation as a proxy for assessing Verifiability in 10 Wikipedia language editions. Task

Election analysis

We analyzed how the US Presidential Election impacted reader and editor behavior on key election-related pages. Learn more

Learning about patroller work habits

We surveyed 83 editors in 4 Wikipedia language editions to understand how they balance patrolling with editing and their awareness of automated anti-vandalism tools. Learn more

Towards models to better detect unique devices

We supported the Trust and Safety Product team by analyzing web request traffic data to more accurately identify abusive IP addresses Task. Separately, we conducted literature reviews for unique browser identification methodologies Task.

Building the foundations

Wikimedia projects are created and maintained by a vast network of individual contributors and organizations. We focus part of our efforts on strengthening part of this network: the Wikimedia research community.

Existing external AI models for WMF product use cases

We conducted an extensive analysis of product needs, existing external LLMs, and current internal infrastructure to provide a set of recommendations to the Wikimedia Foundation for the adoption of external models based on our core values for AI. Learn more

A paper about Wikimedia Data for AI

We published a paper at WikiNLP '24 (as part of EMNLP 2024) on opportunities to align usage of Wikimedia data by natural language researchers with the needs of the Wikimedia contributor community. Paper

A paper about using Wikipedia without data contamination

We published a position paper at EvalEval '24 (NeurIPS) on concerns related to the suitability of common language model benchmarks for evaluating models on Wikipedia-related tasks. Paper

AI Strategy for editors

We co-developed a strategy to guide WMF teams’ work to research, develop, and enable AI powered features for editors. Task

Online volunteer archetypes

We tested the stability of the six Volunteer Archetypes (V1) and expanded the online volunteer archetypes to include Wikimedia and non-Wikimedia archetypes. Read more

A whitepaper for Wikipedia research and privacy recommendations

We concluded the research and writing of a whitepaper that provides recommendations to Wikipedia researchers and Wikipedians about how to navigate the intersection of research and privacy on Wikipedia. Learn more

Results and insights from the Community Insights survey

We published the Community Insights 2024 Report about experiences and demographics within the Wikimedia contributor community. Learn more

Developer Satisfaction Survey

We launched the 2025 Developer Satisfaction Survey to track the needs, experiences, and demographics of Wikimedia's developer community. Task

Community Safety Survey

We launched the Community Safety Survey in six languages to understand how contributor's sense of safety on the Wikimedia projects has changed over time. Task

Learning more about editor collaborations

We learned what editors value about collaboration on-wiki via a survey of 147 Wikimedians about WikiProjects. Learn more

Clickstream dataset in more languages

We advocated for and saw through the increase in coverage of the Wikipedia clickstream dataset from 11 to 40 languages. Task Explore the data

Model quantization

We performed model quantization for post-training to improve model inference efficiency without compromising model quality (a building block for constructing a quantization workflow and metrics analysis pipeline). Task

Strengthening the research communities

Presentations and Keynotes

We engaged with research audiences through the following presentations and keynotes during the past six months.

We attended Wikimania 2024 in-person and virtually, and presented two talks. The first talk was titled 10 Research Findings and How You Can Use Them in Your Work and shared key insights and practical applications for Wikimedia contributors watch here. The second talk was titled State of Language Technology and Onboarding at Wikimedia and explored language tools that support Wikimedia sites as well as the current status and future direction of languages onboarding at Wikimedia watch here.

We presented at ACL 2024, contributing research on multilingual systems for scoring readability of Wikipedia. Paper Video

We locally co-chaired ACM KDD 2024 and participated in a panel as part of the PhD Consortium to showcase the great opportunities of Wikimedia's free open data ecosystem for computational research in the post-API era.

We presented at the Wikimedia CEE Meeting 2024 on Research to Support Multilingual Knowledge on Wikipedia, discussing how research can enhance language diversity and knowledge equity across Wikimedia projects. Session Slides Video

At Celtic Knot 2024, we introduced Wikimedia’s new language metrics as a resource for insights on minority languages slides videos submission and explored the future of language incubation in Wikimedia projects, discussing challenges and opportunities for new language communities slides submission.

We co-presented at AOIR 2024, sharing insights on user research and product development at Wikimedia in a talk titled Automoderator As An Example Of Community Driven Product Design. Paper Slides

We co-organized the WikiNLP 2024 workshop at EMNLP 2024, creating a space to celebrate Wikimedia's contributions to the NLP community and explore sustainable approaches for maintaining this relationship in the future.

We moderated a panel on the Impact of LLMs on Wikipedia at WikiNLP 2024 workshop, a discussion on the opportunities and challenges large language models present for Wikipedia and the broader Wikimedia ecosystem.

We gave an invited talk at North Carolina State University Linguistics Program titled Wikimedia, Wikipedia, and Language. Slides

We presented at EvalEval at NeurIPS '24 on challenges in properly evaluating the impact of Generative AI. Paper Poster

We joined Wikipodcast Morocco to speak with the Arabic-speaking community about our research and work. Audio

We gave a research seminar titled A Tour through Source Reliability in Wikipedia for the course on Wikipedia Data in Computational Social Science at the University of Konstanz.

Mentorship through Outreachy

We continue to participate in Outreachy and are excited to have had Mahima Agarwal as an intern. She built a data visualization tool for the evolution of Wikipedia articles maintained by WikiProjects. She worked with Pablo Aragón, Isaac Johnson, and Caroline Myrick.

Office Hour Series

We held over 20 one-on-one office hours to support researchers by answering questions about proposed or ongoing projects, guiding dataset access and analysis, sharing insights on Wikimedia Research team initiatives, exploring potential collaborations, and more. Learn more

Research Showcases

Over the past six months, our Research Showcases featured diverse presentations on AI in Wikimedia, political and election analysis using Wikipedia, factors influencing language edition growth, and the curation of Wikimedia AI datasets, fostering discussions at the intersection of research and Wikimedia projects. Learn more

Wiki Workshop

We began organizing the 12th edition of Wiki Workshop and published the call for contributions. Learn more

Research Community Roadmap

We have completed writing the Wikimedia Research Community Roadmap, outlining an updated multi-year strategy to align our team’s community engagement with our mission, improve impact, and clarify priorities for this line of work. Task

The people on the Research team

In January 2025, we promoted Miriam Redi to become our team’s Senior Manager. Miriam joined us in September 2017 as a research scientist and since then has contributed as an individual contributor as well as a manager in our team. Over the years Miriam has stepped up to take more people management and technical responsibilities. Among many of Miriam’s achievements is successfully leading a project to develop an assessment framework for evaluating and using existing AI models on WMF infrastructure and for product and feature applications. Since 2019 and in her volunteer time, Miriam has consistently contributed to the WikiResearch handle on social media, keeping us all updated about the latest research on the Wikimedia projects. Congratulations, Miriam!

New members

Tanja Anđić joined us in July 2024 as a Senior Survey Specialist. Tanja is a sociologist by training (PhD, Department of Sociology, University of Minnesota), and her doctoral research focused on discourses of migration and young people’s conceptualizations of the future in contemporary Serbia. As a graduate student, she worked for the Minnesota Population Center, where she developed an interest in global censuses, cross-national surveys, and population studies. Methodologically, she has published academic research using ethnographic, interview, and quantitative social science methods, while substantively her interests range from population estimates and survey methodologies, to social theory, inequality, life course studies, and sentiments. Most recently, she has been fascinated by all aspects of Wikimedians, who they are, and why they contribute to free knowledge. Tanja first joined the Foundation in 2021 as a Research Fellow on the Global Data & Insights team, and has led the Community Insights surveys of active editors since 2022. Within the Research team, she will continue to work on large-scale survey research across the Foundation, and support Wikimedia Affiliates and the research community with their survey work.

Collaborations

The Research team's work has been made possible through the contributions of our past and present formal collaborators. To inquire about becoming a formal collaborator, please tell us more about yourself and your interests.

Events

Donors

Funds for the Research team are provided by donors who give to the Wikimedia Foundation in different ways. Thank you!

Keep in touch with us

The Wikimedia Foundation's Research team is part of a global network of researchers who advance our understanding of the Wikimedia projects. We invite everyone who wants to stay in touch with this network to join the public wiki-research-l mailing list.

In addition, our team offers one-on-one conversation times. You are welcome to schedule one of our upcoming Public Office Hours.

We look forward to connecting with you.