[wiki-standards] how to get raw material for stats?
spir
denis.spir at free.fr
Mon Oct 20 21:50:53 CEST 2008
Hello everybody,
I'm new here; this is my first post on the list. I'm no professional
researcher, programmer, or web designer, only a hobbyist, and a lover
of all kinds of languages.
I'd like to do some statistics about the actual use of wiki language
features: which ones are the most used, which ones could be left aside,
how predominant the most used ones are, etc. I'm also really interested
in seeing how cultural or linguistic background may influence that.
To do this, I plan to parse hundreds of wiki pages from Wikipedia or any
other wiki that exists in multilingual versions. The issue for me is:
how do I get the raw material? I have no clue. Could you help me with
that? The resulting dataset could then be made available for further
use. I would prefer random pages.
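One possible way to collect random pages, as a minimal sketch: it assumes the target wiki runs MediaWiki and exposes the standard `api.php` (with `list=random`) and `index.php?action=raw` entry points, as Wikipedia does. The function names, the `wiki-stats-sketch` user agent string, and the `<lang>.wikipedia.org` host pattern are illustrative assumptions, not something taken from this post.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def random_titles_url(lang, count=10):
    """Build a MediaWiki API URL that lists `count` random article titles."""
    params = {
        "action": "query",
        "list": "random",
        "rnnamespace": 0,  # main (article) namespace only
        "rnlimit": count,
        "format": "json",
    }
    # Host pattern is an assumption: one Wikipedia per language code.
    return "https://%s.wikipedia.org/w/api.php?%s" % (lang, urlencode(params))


def raw_source_url(lang, title):
    """Build the URL that returns the wiki source text of one page."""
    params = {"title": title, "action": "raw"}
    return "https://%s.wikipedia.org/w/index.php?%s" % (lang, urlencode(params))


def titles_from_response(body):
    """Extract page titles from the JSON body of a list=random query."""
    data = json.loads(body)
    return [page["title"] for page in data["query"]["random"]]


def fetch(url):
    """Download a URL as text (hypothetical user agent; needs network access)."""
    req = Request(url, headers={"User-Agent": "wiki-stats-sketch/0.1"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

A run for one language would call `fetch(random_titles_url("fr", 10))`, pass the result to `titles_from_response`, and then fetch `raw_source_url("fr", title)` for each title. Repeating this per language code would give comparable samples.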
I can cope with the parsing (my code will not comply with any industry
standard, but it will be clear and do the job ;-)). The data may be
the full page's HTML, the wiki document's HTML, or the wiki source
text. It may also be database extracts, provided I can get the format
needed to decode them, but I strongly prefer human-readable data.
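For the wiki source text option, the feature counting could be sketched roughly as follows. This assumes MediaWiki-style markup, and the regular expressions are deliberately simplified illustrations (they ignore nesting, `<nowiki>` sections, and many edge cases). The feature names and the `count_features` helper are hypothetical.

```python
import re
from collections import Counter

# Simplified patterns for a few MediaWiki markup features (assumed syntax).
PATTERNS = {
    "heading": re.compile(r"^=+[^=\n].*?=+[ \t]*$", re.MULTILINE),
    "internal_link": re.compile(r"\[\[[^\]\n]+\]\]"),
    "external_link": re.compile(r"(?<!\[)\[(?:https?|ftp)://[^\]\n]+\]"),
    "bold": re.compile(r"(?<!')'''(?!')[^'\n]+'''(?!')"),
    "italic": re.compile(r"(?<!')''(?!')[^'\n]+''(?!')"),
    "list_item": re.compile(r"^[*#]+", re.MULTILINE),
    "template": re.compile(r"\{\{[^{}]+\}\}"),
}


def count_features(source):
    """Count how often each markup feature occurs in one page's source."""
    return Counter({name: len(p.findall(source)) for name, p in PATTERNS.items()})


sample = (
    "== Title ==\n"
    "'''Bold''' and ''italic'' text with [[a link]] "
    "and [http://example.org site].\n"
    "* one\n"
    "* two\n"
    "{{stub}}\n"
)
counts = count_features(sample)
```

Summing the per-page counters for each language (`Counter` objects add with `+`) would give the per-language totals to compare.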
Thank you for your attention,
Denis