2 Apr 2012 14:21

### Wikipedia mathematical search engine

Hi,

I want to introduce a *mathematical* search engine that works over a dump of the English Wikipedia. The key advantage is simple: *it works* ;).
A real demo is better than a nice speech, and it can be found here:

If you are interested or just want to share your thoughts, do not hesitate to contact me.

Best regards,
Jozef Misutka
Charles University in Prague,
Department of Software Engineering,

3 Apr 2012 00:15

### Re: Wikipedia mathematical search engine

Hi Jozef,

I just played around a bit and liked what I saw, though I didn't see
much, as the site was very slow.

How did you strip the dump of the non-mathematical articles? I am
asking because one of the major uses that I have in mind for a good
mathematical search engine would be to identify areas around topic A
(say, theoretical biology) that use the same concepts as those in
topic B (say, economics). Very often such distant fields are only
weakly connected, but solutions or approaches that work in one of them
are not infrequently transferable. In order to be useful for such
purposes, your corpus would still have to contain the economics/
theoretical biology articles (at least those that use equations), but
I couldn't find evidence for that.

Daniel

4 Apr 2012 00:53

### Opinions Needed: Why Do People Contribute to Wikipedia?

Dear Wikipedia contributors,

Your valuable opinions are needed regarding users' motivations to contribute to Wikipedia. This topic is currently being investigated by Audrey Abeyta, an undergraduate student at the University of California, Santa Barbara. You can read a more detailed description of the project here: http://meta.wikimedia.org/wiki/Research:Motivations_to_Contribute_to_Wikipedia

Those willing to participate in this study will complete a brief online questionnaire, which is completely anonymous and will take approximately ten minutes. The questionnaire can be accessed here: https://us1.us.qualtrics.com/SE/?SID=SV_8ixU9RkozemzC4s

4 Apr 2012 01:22

### Re: Opinions Needed: Why Do People Contribute to Wikipedia?

Dear Wikipedia contributors,

Your valuable opinions are needed regarding users' motivations to contribute to Wikipedia. This topic is currently being investigated by Audrey Abeyta, an undergraduate student at the University of California, Santa Barbara. You can read a more detailed description of the project here: http://meta.wikimedia.org/wiki/Research:Motivations_to_Contribute_to_Wikipedia

Looking at this survey...

Is this intended for USAians only?  I ask because of the following question:

What is your annual income?

• $0 - $8,700
• $8,701 - $35,350
• $35,351 - $85,650
• $85,651 - $178,650
• $178,651 - $388,351
• More than $388,351

I assume the currency referenced is USD?  If so, could you explicitly state this?  I live in Australia, which is also on the dollar.  New Zealand and Canada and Jamaica also use the dollar.  The dollar is the name of their national currency, but they are not USD.  Also, what are these numbers based on?  Do they correlate to Cost of Living categories for AUD$, JAM$, NZD$, CAD$?  I do not see a question on the page with this question which asks which country and metro area a respondent is from.  My Cost of Living in Canberra will be much different than that of, say, someone in Rockford, Illinois.

If there are no controls for this, the question should probably be jettisoned as completely useless garbage data.  I would also ask a question regarding native language speaking, WHAT Wikipedia a person contributes to, etc.  A person from Canada who is bilingual and primarily edits French Wikipedia is probably going to be very different than an Indian from New Delhi who has a third language of English and is editing English Wikipedia... but the survey questions allow no distinguishing of this.

What sort of controls have been taken to ensure the data represents a cross section of the Wikipedia population?  And that, say, you don't have an overrepresentation of women, when they make up only 10% of English Wikipedia's contributor base?  Ditto with regular contributors who make over 1,000 edits a month: how will they be controlled for and not overrepresented when compared to users who make fewer than 100 edits a month?

--
mobile: 0412183663
twitter: purplepopple
blog: ozziesport.com

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l <at> lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

4 Apr 2012 01:26

### Re: Opinions Needed: Why Do People Contribute to Wikipedia?

Thanks for your thoughtful reply to this survey, Laura. I support your inquiries.

On Tue, Apr 3, 2012 at 4:22 PM, Laura Hale wrote:

On Wed, Apr 4, 2012 at 8:53 AM, Audrey Abeyta wrote:

Dear Wikipedia contributors,

Your valuable opinions are needed regarding users' motivations to contribute to Wikipedia. This topic is currently being investigated by Audrey Abeyta, an undergraduate student at the University of California, Santa Barbara. You can read a more detailed description of the project here: http://meta.wikimedia.org/wiki/Research:Motivations_to_Contribute_to_Wikipedia

Looking at this survey...

Is this intended for USAians only?  I ask because of the following question:

What is your annual income?

• $0 - $8,700
• $8,701 - $35,350
• $35,351 - $85,650
• $85,651 - $178,650
• $178,651 - $388,351
• More than $388,351

I assume the currency referenced is USD?  If so, could you explicitly state this?  I live in Australia, which is also on the dollar.  New Zealand and Canada and Jamaica also use the dollar.  The dollar is the name of their national currency, but they are not USD.  Also, what are these numbers based on?  Do they correlate to Cost of Living categories for AUD$, JAM$, NZD$, CAD$?  I do not see a question on the page with this question which asks which country and metro area a respondent is from.  My Cost of Living in Canberra will be much different than that of, say, someone in Rockford, Illinois.

If there are no controls for this, the question should probably be jettisoned as completely useless garbage data.  I would also ask a question regarding native language speaking, WHAT Wikipedia a person contributes to, etc.  A person from Canada who is bilingual and primarily edits French Wikipedia is probably going to be very different than an Indian from New Delhi who has a third language of English and is editing English Wikipedia... but the survey questions allow no distinguishing of this.

What sort of controls have been taken to ensure the data represents a cross section of the Wikipedia population?  And that, say, you don't have an overrepresentation of women, when they make up only 10% of English Wikipedia's contributor base?  Ditto with regular contributors who make over 1,000 edits a month: how will they be controlled for and not overrepresented when compared to users who make fewer than 100 edits a month?

Lane Rasberry

4 Apr 2012 01:36

### Re: Opinions Needed: Why Do People Contribute to Wikipedia?

Hi Laura,

Thank you for your feedback. You're absolutely correct: I should have specified that currency is in US dollars (I have now specified the currency in the question text). I do, however, have a question that asks about the respondent's country of residence. The questions in this questionnaire were adapted from Hars & Ou (2001), so I tried to deviate from their structure as little as possible.

Your concerns regarding the over/underrepresentation of certain segments of the Wikipedia population are also well-founded. Because respondents are volunteers, I am aware that there may be a large sampling bias, which I will do my best to correct for during statistical analysis. Additionally, I will acknowledge this limitation in the discussion section of my thesis.

Audrey

4 Apr 2012 02:33

### Re: Opinions Needed: Why Do People Contribute to Wikipedia?

On Wed, Apr 4, 2012 at 9:36 AM, Audrey Abeyta wrote:

Hi Laura,

Thank you for your feedback. You're absolutely correct: I should have specified that currency is in US dollars (I have now specified the currency in the question text). I do, however, have a question that asks about the respondent's country of residence. The questions in this questionnaire were adapted from Hars & Ou (2001), so I tried to deviate from their structure as little as possible.

I haven't read Hars & Ou.  My research background is probably best described as education, marketing and sociology based.  (My dissertation topic is actually fundamentally about online research methods.)

When Hars & Ou did their work in 2001, were they conducting research in online communities?  And were they dealing in global populations?

By not asking language, country and metro area, and by not allowing the expression of income in a local sense, you are creating a junk survey that will not be repeatable.  If you look at the cost of living in Texas and compare it to Chicago, Illinois, there is a huge gulf.  The cost of housing, of petrol, the local taxes, the cost of medical care, and the local commodities in terms of food and clothing mean that $8,000 will go much, much further in Texas than it will in Chicago.  In turn, the cost of living in Chicago will be cheap when compared to Sydney and Canberra.  These will look a bit more reasonable when you compare the cost of living to, say, Tokyo or Moscow.  $8,000 USD does not go very far in Chicago, Sydney, Canberra, Tokyo or Moscow when compared to Texas.

I would STRONGLY urge you to put in a question that asks country and metro area, and then correct for this by adjusting for cost of living when doing your final results.  If you can't do that, I would STRONGLY urge you to remove the question because the data will be completely meaningless.  (Minimum wage in my territory is $17.78 USD.)

Your concerns regarding the over/underrepresentation of certain segments of the Wikipedia population are also well-founded. Because respondents are volunteers, I am aware that there may be a large sampling bias, which I will do my best to correct for during statistical analysis. Additionally, I will acknowledge this limitation in the discussion section of my thesis.

How will you do sampling correction?  I don't see a language connection for one.  The survey just says "Wikipedia", not "English Wikipedia", so I assume you're talking about all Wikipedias.  If not, you will want to consider that my own response included experiences with Simple Wikipedia.  You asked about time spent editing Wikipedia, but did not ask about the type of work done on the site, nor the volume of edits done, nor the status on Wikipedia.  How are you going to correct for an overrepresentation of English Wikipedia contributors, female contributors, the admin core, and power contributors?

This is hugely important.  If you don't have questions allowing for those connections, if you don't deliberately seek out minority responses but instead advertise to a self-selecting population, your results will be fundamentally flawed and not repeatable.  Given your research questions, I suspect that if we both advertised this survey, we would get answers that are extremely different, with STATISTICALLY significant differences.

The research design here just looks very, very poor and like there is very little done to correct for groups that may have an incentive to contribute versus occasional contributors who have less of an incentive to contribute and complete your survey.
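
One common way to correct for the kind of overrepresentation raised above is post-stratification weighting, where each respondent group is weighted by its population share divided by its sample share. The sketch below is only an illustration of that idea, not the method used in the study: the function name is hypothetical, the 10% population share is the figure quoted in this thread, and the sample shares are made-up values chosen to show an overrepresented group.

```python
def poststratification_weights(population_share, sample_share):
    """Return one weight per group: population share / sample share."""
    return {
        group: population_share[group] / sample_share[group]
        for group in population_share
    }


# Population share of women: the 10% figure quoted in this thread.
population = {"women": 0.10, "other": 0.90}
# Sample shares: HYPOTHETICAL values, assumed only for illustration.
sample = {"women": 0.30, "other": 0.70}

weights = poststratification_weights(population, sample)

# After weighting, each group's share of the sample matches the population.
weighted_share = {g: sample[g] * weights[g] for g in sample}
print(weights)
print(weighted_share)
```

Such weights only help for attributes the questionnaire actually records, which is why the missing questions on language, edit volume and status matter.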

4 Apr 2012 02:51

### Re: Opinions Needed: Why Do People Contribute to Wikipedia?

Hi Audrey

As you have already seen, there is a lot of literature on contribution. I have co-authored a study on English Wikipedia contributors' motivations, especially with regard to gender, or what is often referred to as the Wikipedia gender gap. We will present our study, including a long interview with Sue Gardner and over 50 interviews with contributors about the topic, at the ICA annual conference in Phoenix, AZ at the end of May. In case you attend, our presentation is May 27: http://convention2.allacademic.com/one/ica/ica12/index.php?click_key=1&cmd=Multi+Search+Load+Person&people_id=2842720&PHPSESSID=f468b6e61a451737642b8d0890930768

Stine Eckert

4 Apr 2012 03:03

### Re: Wikipedia mathematical search engine

Hi Daniel,

Hi Jozef,

I just played around a bit and liked what I saw, though I didn't see
much, as the site was very slow.

It was a hardware failure (RAID5, I think...). Anyway, it was fixed several hours ago.

How did you strip the dump of the non-mathematical articles?

Very simply: a mathematical article is an article which contains "&lt;/math" inside.

I do not claim to have a perfect Wikipedia tag parser, but the vast majority of the formulae in Wikipedia are typeset using the standard Wikipedia rules and sit directly inside the text, which is fine.
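
As a rough illustration of the filter described above (not the engine's actual code), the check amounts to a substring test. The sketch assumes it runs on raw page text from the XML dump, where `<` is escaped as `&lt;`; the function name and the two sample articles are hypothetical.

```python
# Marker string exactly as quoted in the message above.
MATH_MARKER = "&lt;/math"


def is_mathematical(raw_page_text: str) -> bool:
    """Return True if the raw (XML-escaped) page text contains the marker."""
    # Plain substring test; no wikitext parsing is attempted.
    return MATH_MARKER in raw_page_text


# HYPOTHETICAL sample pages (title -> escaped wikitext), for illustration only.
pages = {
    "Vasicek model": "The model is &lt;math&gt;dr_t = a(b-r_t)\\,dt&lt;/math&gt;.",
    "Prague": "Prague is the capital of the Czech Republic.",
}

math_titles = [title for title, text in pages.items() if is_mathematical(text)]
print(math_titles)
```

Because the test looks only for the closing tag, an article from any field (economics, biology, and so on) is kept as long as it contains at least one formula.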

I am
asking because one of the major uses that I have in mind for a good
mathematical search engine would be to identify areas around topic A
(say, theoretical biology) that use the same concepts as those in
topic B (say, economics). Very often such distant fields are only
weakly connected, but solutions or approaches that work in one of them
are not infrequently transferable.

That is exactly one of the interesting applications for a mathematical search engine.

I wanted to reply to you with something interesting, so I called my friend and asked him about interesting formulae from economics. He told me about the Vasicek model, so I tried to search for the formula
dr_t = a(b-r_t) dt + \sigma dW_t
which resulted in 2 hits at no abstraction level - no big deal. But then I tried to abstract it and another hit came up, which is imho interesting (different variables used but the same formula).
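
To give a feel for what "abstracting" a formula can mean, here is a minimal sketch that renames every variable to a placeholder in order of first appearance, so that formulas differing only in variable names compare equal. This is an assumption about the general idea, not a description of how the search engine works; the tokenizer is deliberately naive (it treats every LaTeX command made of letters, such as `\sigma`, as a variable), and all names are hypothetical.

```python
import re

# Tokens: LaTeX commands (\sigma), escaped symbols (\,), single letters,
# numbers, or any other non-space character.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|[A-Za-z]|\d+|\S")
SPACING = {r"\,", r"\;", r"\:", r"\!", r"\ "}


def abstract_formula(latex):
    """Replace each variable with v0, v1, ... in order of first appearance."""
    placeholders = {}
    result = []
    for token in TOKEN.findall(latex):
        if token in SPACING:
            continue  # spacing commands carry no meaning
        is_variable = token.isalpha() or (
            token.startswith("\\") and token[1:].isalpha()
        )
        if is_variable:
            if token not in placeholders:
                placeholders[token] = "v%d" % len(placeholders)
            result.append(placeholders[token])
        else:
            result.append(token)  # operators, brackets, digits stay as-is
    return result


vasicek = r"dr_t = a(b-r_t) dt + \sigma dW_t"
# Same equation written with different variable names.
renamed = r"dx_t = \theta (\mu-x_t)\,dt + \sigma\, dW_t"

same = abstract_formula(vasicek) == abstract_formula(renamed)
print(same)
```

A real system would need a proper formula parser (for example, to tell functions and operators apart from variables), but the renaming step is what lets a query match hits that use different symbols.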

In order to be useful for such
purposes, your corpus would still have to contain the economics/
theoretical biology articles (at least those that use equations), but
I couldn't find evidence for that.

See the number of documents (and categories) when you search for simple text e.g.,
economy
biology

Jozef

4 Apr 2012 04:35

### Re: Wikipedia mathematical search engine

Dear Jozef,

A good example - abstractions like that from
dr_t = a(b-r_t) dt + \sigma dW_t
in
http://en.wikipedia.org/wiki/Vasicek_model
to
dx_t = \theta (\mu-x_t)\,dt + \sigma\, dW_t
in
http://en.wikipedia.org/wiki/Ornstein%E2%80%93Uhlenbeck_process
are indeed very useful and an invitation to play.

Thank you!

Daniel

