Research Report N° 11

Executive Summary

Hi. Welcome to our bi-annual Research Report. July 2024 marked the beginning of the fiscal year in the Wikimedia Foundation and in this report we share with you what our team contributed to during the first six months of the year.

A mixed method approach to research, in action.

In 2023, the Research team integrated survey methods as well as user and design research expertise (Read more in Research Reports No. 8 and No. 9). In 2024, the collaboration between research science, research engineering, survey methods, and design research expertise in the team enabled us to tackle strategic open research questions that we were not equipped to address before (see administrator trends and causes and who are moderators?). We are proud of the impact we have had through mixed method research and we look forward to continuing bringing this type of impact to the Wikimedia Movement.

Reducing the cost and risk of experimentation with new technologies in platforms, products and features.

One of the important functions of a Research team in a product and technology-led organization is to be bold in experimentation, absorb some of the risks of experimentation with new technologies, and provide recommendations about what new technologies can be used in the Wikimedia Foundation’s platforms, products and features at the service of the Movement. We experimented with using LLMs for building models to detect peacock language, informing further development for a new Edit Check feature; We prototyped a LLM for generating simple article summaries that is now under product development for a pilot reading experience; We developed and tested a framework for testing AI models developed by the research and developer communities outside of Wikimedia Foundation for product and features that the Wikimedia Foundation develops and offers the communities.

Informing and supporting important decisions by Wikimedia’s organized groups and the Wikimedia Foundation.

It is part of our team’s mission to inform important conversations, deliberations, and decisions by organized groups in the Wikimedia Movement with research models and insights, and to raise awareness about important research findings and recommendations. To this end, we investigated the potential causes of attrition amongst the Wikipedia administrator community and learned about the potential causes of attrition as well as barriers that potential administrators face for becoming administrators; We published a privacy and research whitepaper to support Wikipedia volunteers and researchers to make more informed trade-off decisions when navigating the need for research transparency and volunteer privacy; We wrote about the challenges of data contamination in the age of more widespread use of AI to raise awareness within the broader research and scientific communities about the severity of the issue and its implications.

Advancing the understanding of the Wikimedia projects.

Millions of readers worldwide use Wikipedia for learning. This makes Wikipedia a unique resource on the internet to learn about architectural styles of human curiosity. Through the use of Wikipedia readership data in the mobile apps and by conducting a study in 14 languages of Wikipedia in 50 countries, we replicated the existence of "busybody" and "hunter" curiosity styles, and for the first time we identified quantitative evidence for the existence of the "dancer" curiosity styles, previously predicted by a historico-philosophical examination of texts over two millennia.

The remainder of this report is organized as follows: In the Projects section you can learn more about more than 40 projects and initiatives that we have led or significantly contributed to; In The people on the team, you can learn about changes in our team; In Events, you can learn about upcoming events that you may be interested to participate in; In Donors, we thank the donors whose support make our work possible; and in Keep in touch with us, we share information about how you can stay in touch with us.

Projects

The large majority of the work of our team is done in the open and reported here. We also provide consultation to other teams at the Wikimedia Foundation (WMF) or external researchers and not all of that work has public outputs. We strive to share what we can and will include these additional projects as they become publicly available in future Research Reports.

Address Knowledge Gaps

We develop models and insights using scientific methods to identify, measure, and bridge Wikimedia’s knowledge gaps.

LLMs for text simplification

We established a technical direction for text simplification by extending our previous explorations to more languages via Aya-Expanse-32B model and supporting experiments by the Web Team to surface simple summaries to readers. Learn more

A model for article-country detection, deployed

We developed and (in collaboration with the ML team) deployed a model for inferring what countries are associated with Wikipedia articles in any language using Wikidata, categories, and article hyperlinks. Learn more Try it out

A model for link insertion in Wikipedia

We published a paper in EMNLP '24 about the challenge of recommending where to add new links to Wikipedia articles when the relevant sentence does not yet exist in the article. Paper

A model for readability in Wikipedia

We published and presented a paper in ACL '24 about scoring how difficult it is to read and understand Wikipedia articles across languages. Paper

Metrics to measure Wikipedia’s language gap

Continuing our work to expand content gap metrics, we proposed several potential language gap metrics that could capture overall language access as well as coverage within languages of impactful topics. Task

Deeper understanding of reader curiosity

We published a paper in Science Advances about how reader curiosity expresses itself across 14 languages and 50 countries. Paper

Readers Survey 2024

We launched the 2024 Readers Survey in 11 languages covering over 90% of Wikipedia readership to better understand reader demographics and behavior. Task

Improve Knowledge Integrity

We develop models and insights using scientific methods to support the technology and policy needs of the Wikimedia projects in the areas of misinformation, disinformation, and content integrity.

Admin trends and potential causes

Through interviews, surveys and log analysis, we investigated the patterns and potential causes behind challenges in Wikipedia admin community’s recruitment and attrition. Read more

A model for moderation detection

We developed a draft working definition and model for detecting Wikipedia moderation activity to be used by WMF teams to better understand moderation activity and measure the impact of interventions. Task

A deployed model for Reference Risk

We collaborated with the ML team to optimize and deploy a model that we developed for estimating the likelihood that the references in a Wikipedia article will survive future edits as a proxy for source reliability in any language edition. Task

A deployed model for Reference Need

We collaborated with the ML team to optimize and deploy a model that we developed for estimating what proportion of sentences in an article are missing a citation as a proxy for assessing Verifiability in 10 Wikipedia language editions. Task

Election analysis

We analyzed how the US Presidential Election impacted reader and editor behavior on key election-related pages. Learn more

Learning about patroller work habits

We surveyed 83 editors in 4 Wikipedia language editions to understand how they balance patrolling with editing and their awareness of automated anti-vandalism tools. Learn more

Towards models to better detect unique devices

We supported the Trust and Safety Product team by analyzing web request traffic data to more accurately identify abusive IP addresses Task. Separately, we conducted literature reviews for unique browser identification methodologies Task.

Building the foundations

Wikimedia projects are created and maintained by a vast network of individual contributors and organizations. We focus part of our efforts on strengthening part of this network: the Wikimedia research community.

Existing external AI models for WMF product use cases

We conducted an extensive analysis of product needs, existing external LLMs, and current internal infrastructure to provide a set of recommendations to the Wikimedia Foundation for the adoption of external models based on our core values for AI. Learn more

A paper about Wikimedia Data for AI

We published a paper at WikiNLP '24 (as part of EMNLP 2024) on opportunities to align usage of Wikimedia data by natural language researchers with the needs of the Wikimedia contributor community. Paper

A paper about using Wikipedia without data contamination

We published a position paper at EvalEval '24 (NeurIPS) on concerns related to the suitability of common language model benchmarks for evaluating models on Wikipedia-related tasks.

AI Strategy for editors

We co-developed a strategy to guide WMF teams’ work to research, develop, and enable AI powered features for editors. Task

Online volunteer archetypes

We tested the stability of the six Volunteer Archetypes (V1) and expanded the online volunteer archetypes to include Wikimedia and non-Wikimedia archetypes. Read more

A whitepaper for Wikipedia research and privacy recommendations

We concluded the research and writing of a whitepaper that provides recommendations to Wikipedia researchers and Wikipedians about how to navigate the intersection of research and privacy on Wikipedia. Learn more

Results and insights from the Community Insights survey

We published the Community Insights 2024 Report about experiences and demographics within the Wikimedia contributor community. Learn more

Developer Satisfaction Survey

We launched the 2025 Developer Satisfaction Survey to track the needs, experiences, and demographics of Wikimedia's developer community. Task

Community Safety Survey

We launched the Community Safety Survey in six languages to understand how contributor's sense of safety on the Wikimedia projects has changed over time. Task

Learning more about editor collaborations

We learned what editors value about collaboration on-wiki via a survey of 147 Wikimedians about WikiProjects. Learn more

Clickstream dataset in more languages

We advocated for and saw through the increase in coverage of the Wikipedia clickstream dataset from 11 to 40 languages. Task Explore the data

Model quantization

We performed model quantization for post-training to improve model inference efficiency without compromising model quality (a building block for constructing a quantization workflow and metrics analysis pipeline). Task

Strengthening the research communities

Presentations and Keynotes

We engaged with research audiences through the following presentations and keynotes during the past six months.

We attended Wikimania 2024 in-person and virtually, and presented two talks. The first talk was titled 10 Research Findings and How You Can Use Them in Your Work and shared key insights and practical applications for Wikimedia contributors watch here. The second talk was titled State of Language Technology and Onboarding at Wikimedia and explored language tools that support Wikimedia sites as well as the current status and future direction of languages onboarding at Wikimedia watch here.

We presented at ACL 2024, contributing research on multilingual systems for scoring readability of Wikipedia. Paper Video

We locally co-chaired ACM KDD 2024 and participated in a panel as part of the PhD Consortium to showcase the great opportunities of Wikimedia's free open data ecosystem for computational research in the post-API era.

We presented at the Wikimedia CEE Meeting 2024 on Research to Support Multilingual Knowledge on Wikipedia, discussing how research can enhance language diversity and knowledge equity across Wikimedia projects. Session Slides Video

At Celtic Knot 2024, we introduced Wikimedia’s new language metrics as a resource for insights on minority languages slides videos submission and explored the future of language incubation in Wikimedia projects, discussing challenges and opportunities for new language communities slides submission.

We co-presented at AOIR 2024, sharing insights on user research and product development at Wikimedia in a talk titled Automoderator As An Example Of Community Driven Product Design. Paper Slides

We co-organized the WikiNLP 2024 workshop at EMNLP 2024, creating a space to celebrate Wikimedia's contributions to the NLP community and explore sustainable approaches for maintaining this relationship in the future.

We moderated a panel on the Impact of LLMs on Wikipedia at WikiNLP 2024 workshop, a discussion on the opportunities and challenges large language models present for Wikipedia and the broader Wikimedia ecosystem.

We gave an invited talk at North Carolina State University Linguistics Program titled Wikimedia, Wikipedia, and Language. Slides

We presented at EvalEval at NeurIPS '24 on challenges in properly evaluating the impact of Generative AI. Poster

We joined Wikipodcast Morocco to speak with the Arabic-speaking community about our research and work. Audio

We gave a research seminar titled A Tour through Source Reliability in Wikipedia for the course on Wikipedia Data in Computational Social Science at the University of Konstanz.

Mentorship through Outreachy

We continue to participate in Outreachy and are excited to have had Mahima Agarwal as an intern. She built a data visualization tool for the evolution of Wikipedia articles maintained by WikiProjects. She worked with Pablo Aragón, Isaac Johnson, and Caroline Myrick.

Office Hour Series

We held over 20 one-on-one office hours to support researchers by answering questions about proposed or ongoing projects, guiding dataset access and analysis, sharing insights on Wikimedia Research team initiatives, exploring potential collaborations, and more. Learn more

Research Showcases

Over the past six months, our Research Showcases featured diverse presentations on AI in Wikimedia, political and election analysis using Wikipedia, factors influencing language edition growth, and the curation of Wikimedia AI datasets, fostering discussions at the intersection of research and Wikimedia projects. Learn more

Wiki Workshop

We began organizing the 12th edition of Wiki Workshop and published the call for contributions. Learn more

Research Community Roadmap

We have completed writing the Wikimedia Research Community Roadmap, outlining an updated multi-year strategy to align our team’s community engagement with our mission, improve impact, and clarify priorities for this line of work. Task

The people on the Research team

In January 2025, we promoted Miriam Redi to become our team’s Senior Manager. Miriam joined us in September 2017 as a research scientist and since then has contributed as an individual contributor as well as a manager in our team. Over the years Miriam has stepped up to take more people management and technical responsibilities. Among many of Miriam’s achievements is successfully leading a project to develop an assessment framework for evaluating and using existing AI models on WMF infrastructure and for product and feature applications. Since 2019 and in her volunteer time, Miriam has consistently contributed to the WikiResearch handle on social media, keeping us all updated about the latest research on the Wikimedia projects. Congratulations, Miriam!

New members

Tanja Anđić

Senior Survey Specialist

Tanja Anđić joined us in July 2024 as a Senior Survey Specialist. Tanja is a sociologist by training (PhD, Department of Sociology, University of Minnesota), and her doctoral research focused on discourses of migration and young people’s conceptualizations of the future in contemporary Serbia. As a graduate student, she worked for the Minnesota Population Center, where she developed an interest in global censuses, cross-national surveys, and population studies. Methodologically, she has published academic research using ethnographic, interview, and quantitative social science methods, while substantively her interests range from population estimates and survey methodologies, to social theory, inequality, life course studies, and sentiments. Most recently, she has been fascinated by all aspects of Wikimedians, who they are, and why they contribute to free knowledge. Tanja first joined the Foundation in 2021 as a Research Fellow on the Global Data & Insights team, and has led the Community Insights surveys of active editors since 2022. Within the Research team, she will continue to work on large-scale survey research across the Foundation, and support Wikimedia Affiliates and the research community with their survey work.

Collaborations

The Research team's work has been made possible through the contributions of our past and present formal collaborators. To inquire about becoming a formal collaborator, please tell us more about yourself and your interests.

Events

WikiNLP

Between July 27-August 1, 2025 · Vienna, Austria + hybrid

We are co-organizing the 2nd edition of the WikiNLP workshop as part of ACL 2025. The event will create a space for showcasing Wikimedia's contributions to the NLP community and highlighting approaches to ensure the sustainability of this relationship for years to come.

Wiki Workshop

May 21-22, 2025 · virtual

Join us for the 12th edition of Wiki Workshop, the annual forum bringing together researchers and practitioners exploring all aspects of Wikimedia projects.

Research Showcases

Every 3rd Wednesday of the month · virtual

Join us for Wikimedia-related research presentations and discussions. The showcases are great entry points into the world of Wikimedia research and for connecting with other Wikimedia researchers.

Office Hours

Throughout the month · virtual

You can book a 1:1 consultation session with a member of the Research team to seek advice on your data or research related questions. All are welcome!

Wikimania

August 6-9, 2025 · Nairobi, Kenya

Wikimania, the largest annual conference for the Wikimedia projects, will turn 20 in August. If you are a Wikimedia researcher and you have never been to Wikimania, we encourage you to consider attending and contributing to this annual forum.

Donors

Funds for the Research team are provided by donors who give to the Wikimedia Foundation in different ways. Thank you!

Keep in touch with us

The Wikimedia Foundation's Research team is part of a global network of researchers who advance our understanding of the Wikimedia projects. We invite everyone who wants to stay in touch with this network to join the public wiki-research-l mailing list.

In addition, our team offers one-on-one conversation times. You are welcome to schedule one of our upcoming Public Office Hours.

We look forward to connecting with you.

Executive Summary ​

A mixed method approach to research, in action. ​

Reducing the cost and risk of experimentation with new technologies in platforms, products and features. ​

Informing and supporting important decisions by Wikimedia’s organized groups and the Wikimedia Foundation. ​

Advancing the understanding of the Wikimedia projects. ​

Projects ​

Address Knowledge Gaps ​

LLMs for text simplification ​

A model for article-country detection, deployed ​

A model for link insertion in Wikipedia ​

A model for readability in Wikipedia ​

Metrics to measure Wikipedia’s language gap ​

Deeper understanding of reader curiosity ​

Readers Survey 2024 ​

Improve Knowledge Integrity ​

Admin trends and potential causes ​

A model for moderation detection ​

A deployed model for Reference Risk ​

A deployed model for Reference Need ​

Election analysis ​

Learning about patroller work habits ​

Towards models to better detect unique devices ​

Building the foundations ​

Existing external AI models for WMF product use cases ​

A paper about Wikimedia Data for AI ​

A paper about using Wikipedia without data contamination ​

AI Strategy for editors ​

Online volunteer archetypes ​

A whitepaper for Wikipedia research and privacy recommendations ​

Results and insights from the Community Insights survey ​

Developer Satisfaction Survey ​

Community Safety Survey ​

Learning more about editor collaborations ​

Clickstream dataset in more languages ​

Model quantization ​

Strengthening the research communities ​

Presentations and Keynotes ​

Mentorship through Outreachy ​

Office Hour Series ​

Research Showcases ​

Wiki Workshop ​

Research Community Roadmap ​

The people on the Research team ​

New members ​

Tanja Anđić

Collaborations ​

Events ​

Donors ​

Keep in touch with us ​

Executive Summary

A mixed method approach to research, in action.

Reducing the cost and risk of experimentation with new technologies in platforms, products and features.

Informing and supporting important decisions by Wikimedia’s organized groups and the Wikimedia Foundation.

Advancing the understanding of the Wikimedia projects.

Projects

Address Knowledge Gaps

LLMs for text simplification

A model for article-country detection, deployed

A model for link insertion in Wikipedia

A model for readability in Wikipedia

Metrics to measure Wikipedia’s language gap

Deeper understanding of reader curiosity

Readers Survey 2024

Improve Knowledge Integrity

Admin trends and potential causes

A model for moderation detection

A deployed model for Reference Risk

A deployed model for Reference Need

Election analysis

Learning about patroller work habits

Towards models to better detect unique devices

Building the foundations

Existing external AI models for WMF product use cases

A paper about Wikimedia Data for AI

A paper about using Wikipedia without data contamination

AI Strategy for editors

Online volunteer archetypes

A whitepaper for Wikipedia research and privacy recommendations

Results and insights from the Community Insights survey

Developer Satisfaction Survey

Community Safety Survey

Learning more about editor collaborations

Clickstream dataset in more languages

Model quantization

Strengthening the research communities

Presentations and Keynotes

Mentorship through Outreachy

Office Hour Series

Research Showcases

Wiki Workshop

Research Community Roadmap

The people on the Research team

New members

Collaborations

Events

Donors

Keep in touch with us