[wiki-standards] Free-standing URL PCRE - how not to match
Filippo A. Salustri
salustri at ryerson.ca
Fri Jul 25 14:06:01 CEST 2008
True. Okay, checking my own code, I'm doing this (translated into relevant
terms for this thread) in Perl:
$SAIDCHARS = q/\"\'\)\}\]\%\.\,\;\:\?/; # or whatever else.
$URL = ...; # however you define a URL.
$markA = ...; # a special char particular to my code; irrelevant here.
$Rurl = qr/($URL)(?=[$SAIDCHARS]*(?:$|\s|$markA))/so;
I use the 's' modifier cuz I process multiple newline-separated chunks at once.
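So I match the URL, then look ahead for /optional/ punctuation characters (or
whatever), then a /required/ end-of-string, or space, or magic stuff particular
to your code. Since the punctuation sits inside the (?=...) lookahead, it is
checked but never captured as part of the URL.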
It seems to work for me.
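A minimal, self-contained sketch of the above, under stated assumptions: the
$URL pattern here is a deliberately crude placeholder (not the one from my code
or from the thread), $markA is a hypothetical stand-in character, and
find_urls() is a helper name made up for illustration. Note the non-greedy
quantifier in $URL; with a greedy path the trailing punctuation would be
swallowed by the URL before the lookahead ever sees it.

```perl
use strict;
use warnings;

# Assumption: a crude, non-greedy URL pattern, for illustration only.
my $URL       = q{https?://\S+?};
my $SAIDCHARS = q/\"\'\)\}\]\%\.\,\;\:\?/;   # same list as above
my $markA     = q/\x01/;                     # hypothetical marker character

# Capture the URL; the lookahead checks for optional trailing punctuation
# followed by end-of-string, whitespace, or the marker, without consuming it.
my $Rurl = qr/($URL)(?=[$SAIDCHARS]*(?:$|\s|$markA))/s;

# Hypothetical helper: return every URL found in a chunk of text.
sub find_urls {
    my ($text) = @_;
    my @urls = $text =~ /$Rurl/g;
    return @urls;
}

for my $s ('Please visit http://www.yahoo.com/index.html.',
           'http://www.yahoo.com, end',
           'http://www.yahoo.com end') {
    print join(' ', find_urls($s)), "\n";
}
# prints:
#   http://www.yahoo.com/index.html
#   http://www.yahoo.com
#   http://www.yahoo.com
```

All three cases from the thread come out without the trailing '.' or ','. If
your URL pattern is greedy, or its path part can itself end in one of the said
chars, you would need to adjust it for the lookahead to have anything to do.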
Does that help?
Michael B Allen wrote:
> On Thu, Jul 24, 2008 at 10:38 PM, Filippo A. Salustri
> <salustri at ryerson.ca> wrote:
>> Could you just define the regexp for a URL as a string ending with a
>> alphanumeric? Then, I should think any non-alphanum, including space,
>> newline, and your 'said chars' should terminate the match.
> Using such a method with the Wiki text:
> Please visit http://www.yahoo.com/index.html.
> would result in:
> Please visit <a
>> Michael B Allen wrote:
>>> The Creole 1.0 standard says:
>>> Free-standing URLs should be detected and turned into links. Single
>>> punctuation characters (,.?!:;"') at the end of URLs should not be
>>> considered part of the URL.
>>> The problem is I can't seem to come up with a regex that does NOT
>>> match the optional (,.?!:;"') (herein abbreviated "said chars") at the
>>> end of a link.
>>> This is my regex:
>>> [Note that each backslash is escaped with an extra backslash because
>>> this is a PHP string literal.]
>>> The problem is this bit on the end:
>>> This matches one said char followed by white space or the end of the
>>> subject string.
>>> This works with a URL like:
>>> http://www.yahoo.com, end
>>> where the end of the link does not include the ','.
>>> But with:
>>> http://www.yahoo.com end
>>> it doesn't match the URL at all since it doesn't have one of the said
>>> chars.
>>> If I make the entire trailing clause optional, it won't match one of
>>> the said chars because the said chars will be matched in the path part
>>> of the regex. Meaning this:
>>> will include the ',' in the link because it was matched as part of the
>>> path expression.
>>> Can someone recommend a suitable regex for this?
>> Filippo A. Salustri, Ph.D., P.Eng.
>> Department of Mechanical and Industrial Engineering
>> Ryerson University
>> 350 Victoria St, Toronto, ON, M5B 2K3, Canada
>> Tel: 416/979-5000 ext 7749
>> Fax: 416/979-5265
>> Email: salustri at ryerson.ca
Prof. Filippo A. Salustri, Ph.D., P.Eng.
Department of Mechanical and Industrial Engineering
Ryerson University Tel: 416/979-5000 x7749
350 Victoria St. Fax: 416/979-5265
Toronto, ON email: salustri at ryerson.ca
M5B 2K3 Canada http://deseng.ryerson.ca/~fil/