[wiki-standards] Free-standing URL PCRE - how not to match trailing punctuation?

Filippo A. Salustri salustri at ryerson.ca
Fri Jul 25 04:38:34 CEST 2008


Could you just define the regexp for a URL as a string ending with an 
alphanumeric character?  Then, I should think, any non-alphanumeric, including 
space, newline, and your 'said chars', would terminate the match.
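
Here is a rough sketch of what I mean. It is written in Python rather than
PHP, so it is an illustration under assumptions, not a drop-in PCRE: Python's
re module has no \p{L}, so \w stands in for it, and the names PATH_CHARS,
ALNUM, URL_RE and find_urls are made up for the example. The character
classes loosely follow the ones in your regex.

```python
import re

# Characters allowed inside the path, loosely following the class in the
# quoted regex. \w (letters, digits, underscore) stands in for PCRE's \p{L}.
PATH_CHARS = r'[\w"!#$%&()+,./:;=?@\\^{}~-]'
ALNUM = r'[^\W_]'  # any letter or digit, but not underscore

URL_RE = re.compile(
    r'[a-zA-Z0-9]{1,10}://'                      # scheme
    r'[a-zA-Z0-9.-]*[a-zA-Z0-9]'                 # host, must end on an alphanumeric
    r'(?:' + PATH_CHARS + r'*' + ALNUM + r')?'   # optional path, same rule
)


def find_urls(text):
    """Return every free-standing URL found in text (hypothetical helper)."""
    return URL_RE.findall(text)


if __name__ == "__main__":
    for sample in ("http://www.yahoo.com, end",
                   "http://www.yahoo.com end",
                   "http://www.yahoo.com,"):
        print(repr(sample), "->", find_urls(sample))
```

The path part is still greedy, but because the match has to finish on an
alphanumeric the engine backtracks past any trailing said chars. The catch is
that it also drops a legitimate trailing '/' or ')', and it strips more than
one trailing punctuation character, which goes further than Creole asks for.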

Just an idea.
Cheers.
Fil Salustri

Michael B Allen wrote:
> Hi,
> 
> The Creole 1.0 standard says:
> 
>   Free-standing URLs should be detected and turned into links. Single
> punctuation characters (,.?!:;"') at the end of URLs should not be
> considered part of the URL.
> 
> The problem is I can't seem to come up with a regex that does NOT
> match the optional (,.?!:;"') (herein abbreviated "said chars") at the
> end of a link.
> 
> This is my regex:
> 
>   ([a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[\p{L}0-9"!#$%&\\()+,\\./:;=?\\@\\\\^_{}~-]*)(?:[,\\.?!:;"\'](?:\\s|$))
> 
> [Note that each backslash is escaped with an extra backslash because
> this is a PHP string literal.]
> 
> The problem is this bit on the end:
> 
>   (?:[,\\.?!:;"\'](?:\\s|$))
> 
> This matches one said char followed by white space or the end of the
> subject string.
> 
> This works with a URL like:
> 
>   http://www.yahoo.com, end
> 
> where the end of the link does not include the ','.
> 
> But with:
> 
>   http://www.yahoo.com end
> 
> it doesn't match the URL at all since it doesn't have one of the said chars.
> 
> If I make the entire trailing clause optional, it won't match one of
> the said chars because the said chars will already have been matched
> by the path part of the regex. Meaning this:
> 
>   http://www.yahoo.com,
> 
> will include the ',' in the link because it was matched as part of the
> path expression.
> 
> Can someone recommend a suitable regex for this?
> 
> Mike
> 

-- 
Filippo A. Salustri, Ph.D., P.Eng.
Department of Mechanical and Industrial Engineering
Ryerson University
350 Victoria St, Toronto, ON, M5B 2K3, Canada
Tel: 416/979-5000 ext 7749
Fax: 416/979-5265
Email: salustri at ryerson.ca
http://deseng.ryerson.ca/~fil/

