Analyzing the Wikisphere
Authors
Jeff Stuckman and James Purtilo, Dept. of Computer Science, University of Maryland, College Park, Maryland, USA.
Abstract
Due to the inherent difficulty of obtaining experimental data from wikis, past quantitative wiki research has largely focused on Wikipedia, limiting the degree to which its findings can be generalized. We developed WikiCrawler, a tool that automatically downloads and analyzes wikis, and studied 151 popular wikis running MediaWiki (none of them Wikipedias). We found that the studied wikis displayed signs of collaborative authorship, validating them as objects of study. We also discovered that, as in Wikipedia, the relative contribution levels of users in the studied wikis were highly unequal, with a small number of users contributing a disproportionate amount of work. In addition, power-law distributions were successfully fitted to the contribution levels of most of the studied wikis, and the parameters of the fitted distributions largely predicted the high inequality that was found. Along with demonstrating our methodology of analyzing wikis from diverse sources, the discovered similarities between wikis suggest that most wikis accumulate edits through a similar underlying mechanism, which could motivate a model of user activity that is applicable to wikis in general.
Website
View the authors' website to gain access to the collected data: here
Studying Wikipedias and non-Wikipedias
Quantitative research on Wikipedias is inherently easy because database dumps are freely available. However, researching non-Wikipedias makes it possible to study hundreds of wikis with diverse administration and policies at the same time, which helps identify which phenomena generalize across wikis and which depend on policies and other external factors.
The following papers involve quantitative research on non-Wikipedias:
- Taraborelli, Roth, Gilbert: Measuring Wiki Viability (2)
- Roth, Taraborelli, Gilbert: Measuring Wiki Viability
- Roth: Viable Wikis
- (add more links to papers here)
Identifying wikis to study
When studying many wikis at once, the study population must be selected carefully to avoid a biased sample. There are several wiki index sites which act as convenient lists of wikis, but any selection bias present in these sites will also be present in the analysis. Search engines such as Google can also be used to collect a study population.
A list of wiki index sites:
- http://s23.org/wikistats/
- http://www.wikiindex.org/Welcome
- (add links to wiki index sites here)
Search engine queries that are likely to produce a list of wikis:
- http://www.google.com/#hl=en&q=inurl%3A"Main_Page"
- (add more queries here)
Modeling user behavior in wikis
Some researchers have discovered that certain phenomena in Wikipedia (such as the number of aggregated edits over articles or users) can be described by a well-known probability distribution, such as a Pareto or log-normal distribution. This provides a new perspective from which to discuss current issues in Wikipedia, such as the relative power of occasional and frequent editors, or the degree to which the growth of Wikipedia is accelerating or leveling off.
Being able to fit a probability distribution to the relevant data can also lead to the development of a mathematical model that suggests a pattern of individual behaviors that produced the aggregate distribution seen. The following paper provides an overview of how such models have traditionally been developed: Mitzenmacher: A brief history of generative models for power law and lognormal distributions
The following papers have used probability distributions to describe phenomena in Wikipedia:
- Ortega, Gonzalez-Barahona, Robles: On the Inequality of Contributions to Wikipedia
- Voß: Measuring Wikipedia
- Wilkinson, Huberman: Assessing the value of cooperation in Wikipedia
- Kittur, Chi, Pendleton, Suh, Mytkowicz: Power of the Few vs. Wisdom of the Crowd: Wikipedia and the Rise of the Bourgeoisie
- (add more papers here)
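As a rough illustration of how a pattern of individual behaviors can produce a heavy-tailed aggregate distribution, the sketch below simulates a simple preferential-attachment process of the kind surveyed by Mitzenmacher. It is not the model used in any of the papers listed above. The function names, the parameter `p_new_user`, and the default values are hypothetical and chosen only for illustration.

```python
import random

def simulate_edits(n_edits, p_new_user=0.1, seed=0):
    """Simulate per-user edit counts under a preferential-attachment assumption.

    Each edit is made by a brand-new user with probability p_new_user;
    otherwise it is made by an existing user chosen with probability
    proportional to the number of edits that user has already made.
    """
    rng = random.Random(seed)
    edit_log = [0]  # author id of every edit so far; user 0 makes the first edit
    n_users = 1
    for _ in range(n_edits - 1):
        if rng.random() < p_new_user:
            edit_log.append(n_users)  # a newcomer makes their first edit
            n_users += 1
        else:
            # Picking a past edit uniformly at random selects its author
            # with probability proportional to that author's edit count.
            edit_log.append(rng.choice(edit_log))
    counts = [0] * n_users
    for user in edit_log:
        counts[user] += 1
    return counts

def top_share(counts, fraction=0.1):
    """Share of all edits made by the most active `fraction` of users."""
    ranked = sorted(counts, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)

if __name__ == "__main__":
    counts = simulate_edits(20000)
    print("users:", len(counts))
    print("share of edits by top 10% of users:", round(top_share(counts), 3))
```

Under these assumptions a small group of early, active users ends up with most of the edits, which is the kind of inequality the papers above report. Whether real wikis follow this mechanism is a question for the data, not something the simulation establishes.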
Fitting probability distributions
The use of inferential statistics to test the fit of a probability distribution is often overlooked; practitioners instead rely on descriptive goodness-of-fit statistics or visual charting to assert that the phenomenon in question fits.
Note that real-life datasets will very rarely fit a probability distribution perfectly; instead, we can say that a dataset is consistent with a probability distribution if, with a certain probability, the dataset is a "better" fit than a random sample from the distribution in question.
The following resources discuss the use of inferential statistics when fitting probability distributions:
- http://www.santafe.edu/~aaronc/powerlaws/ (an overview page discussing power-law distribution fitting, along with papers and sample code)
- (add more websites here)
The following pages on Wikipedia discuss goodness-of-fit tests:
- http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test
- http://en.wikipedia.org/wiki/Chi-square_goodness-of-fit_test
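The idea of calling a dataset consistent with a distribution when it fits at least as well as random samples from that distribution can be sketched as a bootstrap test built on the Kolmogorov-Smirnov distance. The code below is a minimal sketch, not the sample code from the resources above. It assumes a continuous power law with a known lower cutoff `xmin`, whereas edit counts are discrete and the cutoff usually has to be estimated as well. All function names are hypothetical.

```python
import math
import random

def fit_alpha(xs, xmin):
    """Maximum-likelihood exponent for p(x) ~ x**-alpha with x >= xmin (continuous case)."""
    return 1.0 + len(xs) / sum(math.log(x / xmin) for x in xs)

def ks_distance(xs, xmin, alpha):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    xs = sorted(xs)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # Compare against the empirical CDF just before and just after its step at x.
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst

def sample_power_law(n, xmin, alpha, rng):
    """Draw n values from the continuous power law by inverse-transform sampling."""
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]

def bootstrap_p_value(xs, xmin, trials=200, seed=0):
    """Return (alpha, observed KS distance, fraction of synthetic samples fitting no better)."""
    rng = random.Random(seed)
    alpha = fit_alpha(xs, xmin)
    observed = ks_distance(xs, xmin, alpha)
    no_better = 0
    for _ in range(trials):
        synth = sample_power_law(len(xs), xmin, alpha, rng)
        # Refit each synthetic sample so it is treated the same way as the real data.
        if ks_distance(synth, xmin, fit_alpha(synth, xmin)) >= observed:
            no_better += 1
    return alpha, observed, no_better / trials

if __name__ == "__main__":
    rng = random.Random(42)
    data = sample_power_law(1000, 1.0, 2.5, rng)
    print(bootstrap_p_value(data, 1.0))
```

A small returned fraction means that almost every genuine power-law sample fits better than the data, so the power-law hypothesis is implausible; a large fraction means the data cannot be distinguished from a power-law sample by this test. The threshold used to decide between the two is a judgment call left to the analyst.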
