[wiki-standards] Free-standing URL PCRE - how not to match
trailing punctuation?
Filippo A. Salustri
salustri at ryerson.ca
Fri Jul 25 14:06:01 CEST 2008
True. Okay, checking my own code, I'm doing this (translated into the
relevant terms for this thread) in Perl:
$SAIDCHARS = q/\"\'\)\}\]\%\.\,\;\:\?/; # or whatever else.
$URL = ...; # however you define a URL.
$markA = ...; # a special char particular to my code; irrelevant here.
$Rurl = qr/($URL)(?=[$SAIDCHARS]*(?:$|\s|$markA))/so;
I use the 's' modifier because I process multiple newline-separated chunks at once.
So I match the URL and then, inside a lookahead so that none of it becomes part
of the match, /optional/ punctuation characters (or whatever) followed by
a /required/ end-of-string, or space, or marker particular to your code.
It seems to work for me.
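For anyone following along outside Perl, here is a minimal sketch of the same
lookahead idea in Python. Everything named in it is an assumption for
illustration only: the URL pattern is deliberately loose and stands in for
however you define a URL, SAID_CHARS is Creole's set plus a few closing
brackets, and linkify is a hypothetical helper. One caveat the sketch makes
explicit: the URL part is lazy (\S+?), so it stops at the shortest run that the
lookahead accepts. If the URL pattern greedily accepts the said chars, the
lookahead can succeed with zero punctuation, and the trailing comma or period
ends up inside the link anyway.

```python
import re

# Trailing punctuation that should not become part of a link.
# Assumption: Creole's set (,.?!:;"') plus a few closing brackets.
SAID_CHARS = re.escape(",.?!:;\"')]}")

# Hypothetical, deliberately loose URL pattern: a short scheme, "://",
# then a LAZY run of non-space characters. Laziness matters: the match
# grows one character at a time until the lookahead below succeeds.
URL = r"[a-zA-Z0-9]{1,10}://\S+?"

# Capture the URL, then assert (without consuming) that only optional
# said chars separate it from whitespace or the end of the string.
FREE_URL = re.compile(rf"({URL})(?=[{SAID_CHARS}]*(?:\s|$))")


def linkify(text):
    """Wrap each free-standing URL in an anchor tag."""
    return FREE_URL.sub(r'<a href="\1">\1</a>', text)


if __name__ == "__main__":
    print(linkify("Please visit http://www.yahoo.com/index.html."))
    print(linkify("http://www.yahoo.com, end"))
    print(linkify("http://www.yahoo.com end"))
```

In this sketch the period after "index.html" and the comma after "yahoo.com"
stay outside the anchor, while the interior dots are kept because they are not
followed by whitespace or end-of-string.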
Does that help?
Cheers.
Fil
Michael B Allen wrote:
> On Thu, Jul 24, 2008 at 10:38 PM, Filippo A. Salustri
> <salustri at ryerson.ca> wrote:
>> Could you just define the regexp for a URL as a string ending with an
>> alphanumeric? Then, I should think any non-alphanum, including space,
>> newline, and your 'said chars' should terminate the match.
>
> Using such a method with the Wiki text:
>
> Please visit http://www.yahoo.com/index.html.
>
> would result in:
>
> Please visit <a
> href="http://www.yahoo.com/index">http://www.yahoo.com/index</a>.html.
>
> Mike
>
>> Michael B Allen wrote:
>>> Hi,
>>>
>>> The Creole 1.0 standard says:
>>>
>>> Free-standing URLs should be detected and turned into links. Single
>>> punctuation characters (,.?!:;"') at the end of URLs should not be
>>> considered part of the URL.
>>>
>>> The problem is I can't seem to come up with a regex that does NOT
>>> match the optional (,.?!:;"') (herein abbreviated "said chars") at the
>>> end of a link.
>>>
>>> This is my regex:
>>>
>>>
>>> ([a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[\p{L}0-9"!#$%&\\()+,\\./:;=?\\@\\\\^_{}~-]*)(?:[,\\.?!:;"\'](?:\\s|$))
>>>
>>> [Note that each backslash is escaped with an extra backslash because
>>> this is a PHP string literal.]
>>>
>>> The problem is this bit on the end:
>>>
>>> (?:[,\\.?!:;"\'](?:\\s|$))
>>>
>>> This matches one said char followed by white space or the end of the
>>> subject string.
>>>
>>> This works with a URL like:
>>>
>>> http://www.yahoo.com, end
>>>
>>> where the end of the link does not include the ','.
>>>
>>> But with:
>>>
>>> http://www.yahoo.com end
>>>
>>> it doesn't match the URL at all since it doesn't have one of the said
>>> chars.
>>>
>>> If I make the entire trailing clause optional, it won't match one of
>>> the said chars because the said chars will be matched in the path part
>>> of the regex. Meaning this:
>>>
>>> http://www.yahoo.com,
>>>
>>> will include the ',' in the link because it was matched as part of the
>>> path expression.
>>>
>>> Can someone recommend a suitable regex for this?
>>>
>>> Mike
>>>
>> --
>> Filippo A. Salustri, Ph.D., P.Eng.
>> Department of Mechanical and Industrial Engineering
>> Ryerson University
>> 350 Victoria St, Toronto, ON, M5B 2K3, Canada
>> Tel: 416/979-5000 ext 7749
>> Fax: 416/979-5265
>> Email: salustri at ryerson.ca
>> http://deseng.ryerson.ca/~fil/
--
Prof. Filippo A. Salustri, Ph.D., P.Eng.
Department of Mechanical and Industrial Engineering
Ryerson University Tel: 416/979-5000 x7749
350 Victoria St. Fax: 416/979-5265
Toronto, ON email: salustri at ryerson.ca
M5B 2K3 Canada http://deseng.ryerson.ca/~fil/