[wiki-standards] Free-standing URL PCRE - how not to match trailing punctuation?
Michael B Allen
ioplex at gmail.com
Fri Jul 25 04:34:41 CEST 2008
Hi,
The Creole 1.0 standard says:
Free-standing URLs should be detected and turned into links. Single
punctuation characters (,.?!:;"') at the end of URLs should not be
considered part of the URL.
The problem is I can't seem to come up with a regex that does NOT
match the optional (,.?!:;"') (herein abbreviated "said chars") at the
end of a link.
This is my regex:
([a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[\p{L}0-9"!#$%&\\()+,\\./:;=?\\@\\\\^_{}~-]*)(?:[,\\.?!:;"\'](?:\\s|$))
[Note that each backslash is escaped with an extra backslash because
this is a PHP string literal.]
The problem is this bit on the end:
(?:[,\\.?!:;"\'](?:\\s|$))
This matches one said char followed by white space or the end of the
subject string.
This works with a URL like:
http://www.yahoo.com, end
where the end of the link does not include the ','.
But with:
http://www.yahoo.com end
it doesn't match the URL at all since it doesn't have one of the said chars.
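The two cases above can be reproduced with a small sketch. It is written in Python rather than PHP, and it swaps \p{L} for ASCII letters because Python's re module does not support \p{L}; both are assumptions of the sketch, and the names URL, SAID and strict_link are made up for illustration.

```python
import re

# ASCII-only simplification of the regex above: a-zA-Z stands in for \p{L}.
URL = r'[a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[a-zA-Z0-9"!#$%&()+,./:;=?@\\^_{}~-]*'
SAID = "[,.?!:;\"']"  # the "said chars"

# Mandatory trailing clause: one said char, then whitespace or end of string.
strict = re.compile('(' + URL + ')(?:' + SAID + r'(?:\s|$))')

def strict_link(text):
    """Return the captured URL, or None when the pattern does not match."""
    m = strict.search(text)
    return m.group(1) if m else None

print(strict_link("http://www.yahoo.com, end"))  # http://www.yahoo.com
print(strict_link("http://www.yahoo.com end"))   # None
```

The second call returns None because the trailing clause is mandatory and the URL is followed directly by a space.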
If I make the entire trailing clause optional, it never matches one of
the said chars, because they have already been matched by the path part
of the regex. Meaning this:
http://www.yahoo.com,
will include the ',' in the link because it was matched as part of the
path expression.
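The optional variant can be sketched the same way. Again this is Python rather than PHP, with ASCII letters standing in for \p{L} (Python's re module has no \p{L}); the names URL, SAID and loose_link are hypothetical.

```python
import re

# ASCII-only simplification of the regex above: a-zA-Z stands in for \p{L}.
URL = r'[a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[a-zA-Z0-9"!#$%&()+,./:;=?@\\^_{}~-]*'
SAID = "[,.?!:;\"']"  # the "said chars"

# Same trailing clause, but made optional with a trailing "?".
loose = re.compile('(' + URL + ')(?:' + SAID + r'(?:\s|$))?')

def loose_link(text):
    """Return the captured URL, or None when the pattern does not match."""
    m = loose.search(text)
    return m.group(1) if m else None

print(loose_link("http://www.yahoo.com,"))     # http://www.yahoo.com,
print(loose_link("http://www.yahoo.com end"))  # http://www.yahoo.com
```

The greedy path class consumes the ',' first, and the optional clause is then satisfied by matching nothing, so the engine never has a reason to backtrack and give the ',' up.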
Can someone recommend a suitable regex for this?
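For reference, one direction that might work is to make the host and path quantifiers lazy and to turn the trailing clause into a lookahead with an optional said char, so the link stops at the first position that is followed by at most one said char and then white space or the end of the string. The sketch below is Python, not PHP, and uses ASCII letters in place of \p{L}; it only relies on lazy quantifiers and lookahead, which PCRE also supports, but I have not verified it against the PHP version. The name find_link is made up for illustration.

```python
import re

SAID = "[,.?!:;\"']"  # the "said chars"

# Lazy host (+?) and lazy path (*?), then a lookahead that allows at most
# one said char before whitespace or end of string. The said char is only
# inspected by the lookahead, so it never becomes part of group 1.
candidate = re.compile(
    r'([a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+?'
    r'[a-zA-Z0-9"!#$%&()+,./:;=?@\\^_{}~-]*?)'
    r'(?=' + SAID + r'?(?:\s|$))'
)

def find_link(text):
    """Return the first URL found, without one trailing said char."""
    m = candidate.search(text)
    return m.group(1) if m else None

print(find_link("http://www.yahoo.com, end"))         # http://www.yahoo.com
print(find_link("http://www.yahoo.com end"))          # http://www.yahoo.com
print(find_link("See http://example.com/a,b.html."))  # http://example.com/a,b.html
```

Said chars inside the path (the ',' and '.' in the last example) are kept, because the lookahead only succeeds where the said char is followed by white space or the end of the string. A URL followed directly by some other character that is neither in the path class nor white space would not match at all, which is the same limitation the original regex has.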
Mike
--
Michael B Allen
PHP Active Directory SPNEGO SSO
http://www.ioplex.com/