
Analyzing the Wikisphere


Authors

Jeff Stuckman and James Purtilo
Dept. of Computer Science, University of Maryland, College Park, Maryland USA.

Abstract

Due to the inherent difficulty in obtaining experimental data from wikis, past quantitative wiki research has largely been focused on Wikipedia, limiting the degree to which its findings can be generalized. We developed WikiCrawler, a tool that automatically downloads and analyzes wikis, and studied 151 popular wikis running MediaWiki (none of them Wikipedias). We found that our studied wikis displayed signs of collaborative authorship, validating them as objects of study. We also discovered that, as in Wikipedia, the relative contribution levels of users in the studied wikis were highly unequal, with a small number of users contributing a disproportionate amount of work. In addition, power-law distributions were successfully fitted to the contribution levels of most of the studied wikis, and the parameters of the fitted distributions largely predicted the high inequality that was found. Along with demonstrating our methodology of analyzing wikis from diverse sources, the discovered similarities between wikis suggest that most wikis accumulate edits through a similar underlying mechanism, which could motivate a model of user activity that is applicable to wikis in general.

Website

View the authors' website to gain access to the collected data: here (external link)

Studying Wikipedias and non-Wikipedias

Quantitative research on Wikipedias is comparatively easy because database dumps are freely available. However, researching non-Wikipedias makes it possible to study hundreds of wikis with diverse administration and policies simultaneously, which helps identify which phenomena generalize across wikis and which depend on policies and other external factors.

The following papers involve quantitative research on non-Wikipedias:

Identifying wikis to study

When studying many wikis at once, the study population must be selected carefully to avoid obtaining a biased selection. There are several wiki index sites which act as convenient lists of wikis, but any selection bias present in these sites will also be present in the analysis. Search engines such as Google can also be used to collect a study population.

A list of wiki index sites:

Search engine queries that are likely to produce a list of wikis:

Modeling user behavior in wikis

Some researchers have discovered that certain phenomena in Wikipedia (such as the number of aggregated edits over articles or users) can be described by a well-known probability distribution, such as a Pareto or log-normal distribution. This provides a new perspective from which to discuss current issues in Wikipedia, such as the relative power of occasional and frequent editors of Wikipedia, or the degree to which the growth of Wikipedia is accelerating or leveling off.
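As a rough illustration of how the balance between occasional and frequent editors might be quantified, the sketch below computes two common inequality summaries from a list of per-user edit counts. The Gini coefficient and the "share of edits made by the top fraction of users" are choices made here for illustration and are not necessarily the measures used in the papers discussed on this page; the function names and the sample counts are hypothetical.

```python
def gini(counts):
    """Gini coefficient of non-negative counts: 0 means equal, values near 1 mean concentrated."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted formula over the ascending-sorted values.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


def top_share(counts, fraction=0.1):
    """Fraction of all edits made by the most active `fraction` of users."""
    xs = sorted(counts, reverse=True)
    k = max(1, int(round(len(xs) * fraction)))
    return sum(xs[:k]) / sum(xs)


if __name__ == "__main__":
    # Hypothetical per-user edit counts for a small wiki (made-up numbers).
    edits_per_user = [120, 30, 8, 5, 3, 2, 1, 1, 1, 1]
    print("Gini coefficient:", gini(edits_per_user))
    print("Share of edits by top 10% of users:", top_share(edits_per_user, 0.1))
```

Summaries like these are descriptive only: they indicate how unequal the contributions are, but they say nothing about which probability distribution produced them.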

Being able to fit a probability distribution over the relevant data can also lead to the development of a mathematical model that suggests a pattern of individual behaviors that led to the aggregate distribution seen. The following paper provides an overview of how such models have traditionally been developed: Mitzenmacher: A brief history of generative models for power law and lognormal distributions (external link)
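To make the idea of a generative model concrete, here is a minimal sketch of one classic mechanism, preferential attachment, in which each new edit is either made by a brand-new user or by an existing user chosen in proportion to the edits that user has already made. This is only one of several mechanisms known to produce heavy-tailed distributions, and nothing here claims it is the mechanism operating in real wikis; the function name, the `new_user_prob` parameter, and its default value are assumptions made for illustration.

```python
import random


def simulate_edits(total_edits, new_user_prob=0.1, seed=0):
    """Simulate per-user edit counts under a simple preferential-attachment rule."""
    rng = random.Random(seed)
    edit_owners = [0]  # one entry per edit, holding the id of the user who made it
    user_count = 1
    for _ in range(total_edits - 1):
        if rng.random() < new_user_prob:
            # A new user arrives and makes their first edit.
            edit_owners.append(user_count)
            user_count += 1
        else:
            # Picking a past edit uniformly at random selects an existing
            # user with probability proportional to their edit count.
            edit_owners.append(rng.choice(edit_owners))
    counts = [0] * user_count
    for user in edit_owners:
        counts[user] += 1
    return counts


if __name__ == "__main__":
    counts = sorted(simulate_edits(5000), reverse=True)
    print("users:", len(counts))
    print("five most active users:", counts[:5])
    print("median edits per user:", counts[len(counts) // 2])
```

In runs of this sketch a handful of early users accumulate most of the edits while the typical user makes very few, which is the qualitative pattern that a generative model of this kind is meant to explain.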

The following papers have used probability distributions to describe phenomena in Wikipedia:

Fitting probability distributions

The use of inferential statistics to test the fit of a probability distribution is often overlooked; practitioners instead rely on descriptive goodness-of-fit statistics or visual charting to assert that the phenomenon in question fits.

Note that real-life datasets will very rarely fit a probability distribution perfectly; instead, we can say that a dataset is consistent with a probability distribution if, with a certain probability, the dataset is a "better" fit than a random sample from the distribution in question.
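The notion of consistency described above can be sketched as a bootstrap test: fit the distribution, measure how far the data lies from the fit, then see how often random samples drawn from the fitted distribution lie at least as far from their own fits. The sketch below makes several simplifying assumptions that are not taken from this page: it treats the data as a continuous power law with a known lower bound `xmin` (edit counts are really discrete), uses the Kolmogorov-Smirnov distance as the measure of fit, and uses 200 bootstrap trials. All function names are hypothetical.

```python
import math
import random


def fit_alpha(xs, xmin):
    """Maximum-likelihood exponent for a continuous power law p(x) ~ x^(-alpha), x >= xmin.

    Assumes at least one value is strictly greater than xmin.
    """
    return 1.0 + len(xs) / sum(math.log(x / xmin) for x in xs)


def ks_distance(xs, xmin, alpha):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    xs = sorted(xs)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst


def sample_power_law(n, xmin, alpha, rng):
    """Draw n values from the fitted power law by inverse-transform sampling."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def bootstrap_p_value(xs, xmin=1.0, trials=200, seed=0):
    """Fraction of synthetic samples that fit no better than the observed data."""
    rng = random.Random(seed)
    alpha = fit_alpha(xs, xmin)
    observed = ks_distance(xs, xmin, alpha)
    no_better = 0
    for _ in range(trials):
        synthetic = sample_power_law(len(xs), xmin, alpha, rng)
        # Refit each synthetic sample so it is judged on the same terms as the data.
        synthetic_alpha = fit_alpha(synthetic, xmin)
        if ks_distance(synthetic, xmin, synthetic_alpha) >= observed:
            no_better += 1
    return no_better / trials


if __name__ == "__main__":
    rng = random.Random(1)
    heavy_tailed = sample_power_law(500, 1.0, 2.5, rng)
    flat = [rng.uniform(1.0, 2.0) for _ in range(500)]
    print("p-value, power-law data:", bootstrap_p_value(heavy_tailed))
    print("p-value, uniform data:  ", bootstrap_p_value(flat))
```

A small p-value means that almost no sample drawn from the fitted distribution fits as poorly as the data does, so the power-law hypothesis would be rejected; a large p-value means only that the data is consistent with the distribution, not that the distribution is the true one.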

The following resources discuss the use of inferential statistics in probability distributions:

The following pages on Wikipedia discuss goodness-of-fit tests: