[wiki-research] Fwd: [Wiki-research-l] Wikimedia Research,
Quantitative Analysis, General User Survey and more
dirk at riehle.org
Fri Aug 11 10:19:27 CEST 2006
Here is a long and interesting email by Erik Zachte
regarding Wikipedia research coming out of
Wikimania. Some of it is relevant for general wiki research as well. --Dirk
>I am very much looking forward to a transcript or at least speaker
>notes and/or personal observations of several presentations.
>Foremost among them James' Research about Wikimedia: A workshop 
>I also hope that James as Chief Research Officer could give us a sense of
>direction and timing: the mission of the Wikimedia Research Network  is
>lofty, the number of Wikimedians that subscribed large, but the current
>status for most activities seems to be 'idle'? Also, is there any
>coordination with external research groups, like mentioned on  and
>elsewhere?
>Would it be useful to divide Wikimedia Research Network activities in
>A Quantitative Analysis
>B Social Research Collaborations 
>C Other Activities
>and coordinate these separately?
>C would still cover 50%+ of the WRN mission statement, like: identify the
>needs of the individual Wikimedia projects, make recommendations for
>targeted development, guide and motivate outside developers, assist in the
>study of new project proposals.
>I expect that at Wikimania most social science sessions presented relevant
>material and either used or added to quantitative research. So there is
>synergy between A and B.
>There was no IRC meeting of the Research Team after December 2005. There are
>pretty active Wikimedia researchers outside the team though. For me
>Wikimania 2006 confirmed that more exchange of ideas would be helpful.
>I'm not sure more IRC discussions are a panacea. Personally I prefer
>discussion via wiki and mailing list: it is less spontaneous, but one can
>more easily formulate a coherent proposal or comment on it in a thoughtful
>manner, and, no less important, it is much easier to follow for others who
>read the discussion later.
>Part of the information flow is now on meta, some of it on the research
>mailing list (which is largely dormant, though recent posts are very
>useful). And some of it on the freelogy list  and probably elsewhere.
>What about making the Wikimedia research list the central forum for all
>broad and conceptual discussions and link from there to meta for detailed
>discussions? I will post this mail there anyway, of course without the
>I personally very much enjoyed the session Can Visualization Help?
>= IBM researcher Fernanda Viégas talked about the famous Wikipedia
>History Flow tool, which was recently extended, announced a free
>edition, and said that Tim Starling had pledged to reinstate the relevant
>export function so that we can use the tool on our projects.
>= IBM researcher Martin Wattenberg  showed his newest toy where one can
>see all contributions of one single Wikimedia editor, presented as an
>association cloud (titles grouped per namespace and sorted by number of
>edits, font size varied per title to express relative number of edits). It
>is somewhat scary though: I feel a quantitative improvement (exposing data
>that are already online in a much more efficient manner) can lead to a
>qualitative setback (exposing one's character and interests in a way that
>was never expected). People may after all regret that they edited under their
>real name. Although personally I will happily continue to do so, it is a
>matter of responsibility towards the community to at least discuss whether
>we should actively promote such a tool. I know I'm partially guilty in this
>respect myself with mailing list stats but feel that did not cross the
>= Visualization guru Ben Shneiderman made a case for more advanced
>data visualisation tools to spice up wikistats. I am a long time admirer of
>several of his UI inventions and happy to take up the challenge.
>General User Survey
>One promising but sleeping WRT project, that I initiated myself, is the
>'General User Survey' . A few Wikimania participants interested in
>wikistats gathered ad hoc at lunch time on Saturday (others interested in
>the project, Cormaggio, Piotrus were at the conference, but not in the
>vicinity at that moment). Kevin Gamble, associate director of 75 Land-Grant
>Universities, expressed his continued interest and said he might be able to
>offer programming support.
>A project definition plus rationale  and a mockup questionnaire form
> have been created and discussed for more than a year. I started the
>transition towards technical design and with Kevin's support and
>resources coding might follow later this year. Once we have a proof of
>concept in e.g. English and German (at least two languages to show
>multilingual aspects) I'm sure more people will start to take notice, and
>help to discuss and fine-tune the questionnaire. At a later stage, before
>going live with a multilingual golden edition, we will probably have to
>discuss matters with the board (Anthere already stated her support) in order
>to make this an official survey, hopefully with coverage on the project
>pages themselves (banner announcement?). Mind you, the implementation is
>not exactly trivial, lots of issues involved that require critical
>discussion, code and coordination. I invite everyone to comment on tech
>notes, especially of course Kevin, and hope to learn from him whether coding
>this project fits within his budget.
>Saturday I met Jeremy Tobacman. We had a long and very interesting
>discussion, mainly on new initiatives centered around the freelogy servers.
>Jeremy proposed to hold an impromptu lunch meeting on Sunday and gathered a
>room full of people.
>Several mails have already been written about this, but to a smaller
>audience. So here are a few highlights.
>Issues that were discussed:
>The two tool servers are very crowded and insufficient for all stats
>jobs we might want to run. The tool servers run a mirror of the live
>database, so well-behaved SQL queries are possible. Well-behaved means they
>should not try to emulate the xml dump process, where extracting the English
>Wikipedia (all revisions) already takes a full week.
>Alexander Wait (Sasha) has access to huge hardware resources, enough to
>calculate how many parallel universes it takes to find at least one zebra
>couple where a black-and-white mother and a white-and-black father have
>exactly mirrored patterns and thus produce offspring that is either all
>black or all white (mind you, albinos are false positives).
>Since in reality Sasha is merely interested in unraveling the secrets of DNA
>he has some cpu cycles to spare. Upon request virtual machines can be
>catered for. The freelogy-discuss mailing list archives have information
>about hardware availability.
>By the way, Jeremy and Erik Tobacman have a server at The National Bureau of
>Economic Research (NBER) for quantitative research on Wikipedia.
>Also I am urged by the Communications Subcommittee to spend more of my time
>on publishable stats (in time spent, the TomeRaider offline edition of Wikipedia
>easily dominated, but the time for offline browsing is nearly over) and they
>want me to have a dedicated server. I would like it to be well utilised, but
>of course it should produce timely wikistats in the first place, as that is
>what it is offered for. To be discussed.
>2 Real time data collection / Performance / Storage
>It would be useful to learn when a page is being slashdotted or otherwise in
>the news, at the moment of the actual event, so that vandal patrols
>can be summoned in time and article improvement can commence right away.
>Major performance issues need to be addressed.
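One way to picture the "is this page being slashdotted right now" check is a comparison of the latest hit count against a recent baseline. The following is only a minimal sketch under assumptions that are not in the mail: hits arrive as per-interval counts per page, and the class name `SpikeDetector` and its `window`, `factor` and `min_hits` parameters are hypothetical and would need tuning.

```python
from collections import deque


class SpikeDetector:
    """Flag a page when its latest hit count far exceeds its recent average.

    All thresholds here are illustrative assumptions, not measured values.
    """

    def __init__(self, window=24, factor=10.0, min_hits=100):
        self.window = window      # number of past intervals in the baseline
        self.factor = factor      # how many times the baseline counts as a spike
        self.min_hits = min_hits  # ignore pages with very little traffic
        self.history = {}         # page -> deque of recent per-interval counts

    def observe(self, page, hits):
        """Record one interval's hit count; return True if it looks like a spike."""
        past = self.history.setdefault(page, deque(maxlen=self.window))
        baseline = sum(past) / len(past) if past else None
        past.append(hits)
        if baseline is None:
            return False  # no history yet, so nothing to compare against
        return hits >= self.min_hits and hits > self.factor * max(baseline, 1.0)
```

Only one short deque per page is kept, so memory stays bounded, though with millions of pages even that may be too much without sampling or aggregation.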
>Do we gather and keep every page hit? Hardly practicable. Wikimedia visitor
>stats were not disabled for no reason. It seems we are getting switches that
>can log accesses stochastically (e.g. every 100th access, plus for a
>selected subset of IP addresses all hits to monitor navigation patterns).
>There might be a need to store data in aggregated (condensed) form, as
>volumes will be huge. At least tapping from switches directly puts no burden
>on squids (=web proxies/caches).
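The sampling and aggregation idea above can be sketched roughly as follows. This is an illustration only: the class `SampledHitLog`, its fields, and the per-hour aggregation key are assumptions made for the example, and it uses simple every-Nth (systematic) sampling rather than true random sampling.

```python
from collections import Counter


class SampledHitLog:
    """Keep every Nth hit, plus every hit from a watched set of IP addresses."""

    def __init__(self, sample_every=100, watched_ips=()):
        self.sample_every = sample_every
        self.watched_ips = set(watched_ips)
        self.seen = 0
        self.sampled_counts = Counter()  # (hour, page) -> number of sampled hits
        self.watched_trails = {}         # ip -> ordered list of pages visited

    def record(self, ip, page, hour):
        """Process one access; store it only if it is sampled or watched."""
        self.seen += 1
        if ip in self.watched_ips:
            # Full trail for the selected subset, to study navigation patterns.
            self.watched_trails.setdefault(ip, []).append(page)
        if self.seen % self.sample_every == 0:
            # Stored in aggregated form: one counter per (hour, page).
            self.sampled_counts[(hour, page)] += 1

    def estimated_hits(self, hour, page):
        """Scale the sampled count back up to an estimate of total hits."""
        return self.sampled_counts[(hour, page)] * self.sample_every
```

Storing counters per hour and page instead of raw log lines also means no individual reader's access pattern is kept, except for the explicitly watched subset.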
>Brion will be asked to drop bz2 compression on the xml dump job, as it is so
>much slower and compresses so much less than 7zip. Brion had to develop a
>distributed version of bzip to get it working at all on the 800 Gb enwiki
>dump file. Format bz2 is however supported on more platforms, so Brion may
>Specifically about wikistats: I explained why I always process the full
>historic dump instead of doing incremental steps: new functionality in
>wikistats means processing it all anyway. Data for older months are not
>really static due to frequent deletions and moves. Could I speed up the counts
>section of wikistats by splitting the job over several servers? I'll have to
>look into it.
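As a rough illustration of what splitting the counts job could look like (this is not how wikistats works; the function names and the `(title, month)` input shape are hypothetical), each server could own the page titles that hash to its shard, count them independently, and the partial results could be summed afterwards:

```python
import zlib
from collections import Counter


def shard_for(title, num_shards):
    """Stable shard index for a page title (crc32 gives the same value on every machine)."""
    return zlib.crc32(title.encode("utf-8")) % num_shards


def count_shard(revisions, shard, num_shards):
    """Count revisions per (month, title) for the titles owned by one shard.

    `revisions` is an iterable of (title, month) pairs, an assumed input format.
    """
    counts = Counter()
    for title, month in revisions:
        if shard_for(title, num_shards) == shard:
            counts[(month, title)] += 1
    return counts


def merge(partials):
    """Sum the per-shard counters into one overall result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

Plain sums merge cleanly this way. Counts of distinct things, such as unique contributors per month, would not: the same user can appear in several shards, so those would need a different partitioning or a merge of user sets rather than of numbers.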
>3 Data publishing
>We should be careful not to publish very granular data for outside
>inspection. It is a well known fact that China wants complete control over
>its citizens. Less known is that they have the latest technology (mainly
>bought in the US) and lots of it, and about 30.000 IT professionals
>(estimate by Reporters without Borders/Reporters sans Frontières) working on
>concealment of internet resources, redirection of internet requests and
>spying on internet usage patterns in general. They would love to see our raw
>access logs. Cathy, will you attend the Chinese Wikimania? If you happen
>to hear about these things, I hope you will blog about it. See also 
>See also the well-timed scoop about the AOL privacy disaster.
>4 Measuring quality quantitatively
>It may be impossible to define quality, let alone measure it, but it will be
>fun to zoom in on it and see how far we can get. Spurred by Jimbo's
>excellent Wikimania kick off speech, where he stressed we will need more
>attention to quality, I started a project to extend wikistats. Brian offered
>lots of ideas and hopefully will prove me wrong in my belief that adding
>spelling, grammar and readability assessments is not to be taken too lightly
>in a multilingual environment  
> http://wikimania2006.wikimedia.org/wiki/Proceedings:CM1 mp3 audio
>2.html (registration needed:
> http://wikimania2006.wikimedia.org/wiki/User:Roadrunner (I wonder if he
>is the person who gave a smashing full hour speech on this at 20c3 Berlin)
>(data were anonymized but some users had searched for their own name several
>times and were easily recognized, lots of very embarrassing stuff was
>By the way, Angela Beesley and Jakob Voss will give a workshop on Wikipedia
>research at WikiSym 2006.
>Regards, Erik Zachte
>Wiki-research-l mailing list
>Wiki-research-l at Wikimedia.org