[wiki-standards] how to get raw material for stats?

spir denis.spir at free.fr
Mon Oct 20 21:50:53 CEST 2008


Hello everybody,

I'm new here; this is my first post on the list. I'm not a professional 
researcher, programmer, or web designer, only a hobbyist, and a lover 
of all kinds of languages.

I'd like to gather some statistics on the actual use of wiki language 
features: which ones are used most, which ones could be left aside, 
how dominant the most-used ones are, etc. I'm also really interested in 
seeing how cultural/linguistic background may influence that.
To do this, I plan to parse hundreds of wiki pages from Wikipedia or any 
other wiki that exists in several language versions. My problem is 
how to get the raw material: I have no clue. Could you help me with 
that? The resulting dataset could then be made available for further 
use. I would prefer random pages.
I can cope with the parsing (my code will not comply with any industry 
standard, but it will be clear and do the job ;-)). The data may be 
the full page's HTML, the wiki document's HTML, or the wiki source 
text. It could also be database extracts, if I can get the format 
needed to decode them, but I strongly prefer human-readable data.

Thank you for your attention,
Denis


