[wiki-standards] Free-standing URL PCRE - how not to match trailing punctuation?

Michael B Allen ioplex at gmail.com
Fri Jul 25 04:34:41 CEST 2008


Hi,

The Creole 1.0 standard says:

  Free-standing URLs should be detected and turned into links. Single
punctuation characters (,.?!:;"') at the end of URLs should not be
considered part of the URL.

The problem is I can't seem to come up with a regex that does NOT
match the optional (,.?!:;"') (hereafter abbreviated "said chars") at
the end of a link.

This is my regex:

  ([a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[\p{L}0-9"!#$%&\\()+,\\./:;=?\\@\\\\^_{}~-]*)(?:[,\\.?!:;"\'](?:\\s|$))

[Note that each backslash is escaped with an extra backslash because
this is a PHP string literal.]

The problem is this bit on the end:

  (?:[,\\.?!:;"\'](?:\\s|$))

This matches one said char followed by white space or the end of the
subject string.

This works with a URL like:

  http://www.yahoo.com, end

where the end of the link does not include the ','.

But with:

  http://www.yahoo.com end

it doesn't match the URL at all since it doesn't have one of the said chars.
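
A minimal sketch for reproducing these two cases, written in Python
rather than the PHP the regex above comes from. The names URL, TRAILER
and link_in are hypothetical, and because Python's re module has no
\p{L}, \w stands in for the letter part of the path class, so this
only approximates the original pattern:

```python
import re

# Approximation of the pattern above. Assumption: \w replaces \p{L}0-9_
# because Python's re module does not support \p{L}.
URL = r'[a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[\w"!#$%&()+,./:;=?@\\^{}~-]*'

# One said char followed by white space or the end of the subject string.
TRAILER = r'''(?:[,.?!:;"'](?:\s|$))'''

# The trailing clause is mandatory here, as in the regex above.
mandatory = re.compile('(' + URL + ')' + TRAILER)

def link_in(text):
    """Return the captured link (group 1), or None if nothing matches."""
    m = mandatory.search(text)
    return m.group(1) if m else None

print(link_in('http://www.yahoo.com, end'))  # http://www.yahoo.com
print(link_in('http://www.yahoo.com end'))   # None
```

The first call backtracks the path part off the ',' so the trailing
clause can take it. The second has no said char anywhere, so the
mandatory clause makes the whole match fail.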

If I make the entire trailing clause optional, it never gets to match
one of the said chars, because the greedy path part of the regex has
already consumed it. Meaning this:

  http://www.yahoo.com,

will include the ',' in the link because it was matched as part of the
path expression.
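
The optional variant can be reproduced with a similar Python sketch
(again not the original PHP). URL, TRAILER and link_in are hypothetical
names, and \w is an assumed stand-in for \p{L}, which Python's re
module lacks:

```python
import re

# Approximation of the pattern above. Assumption: \w replaces \p{L}0-9_
# because Python's re module does not support \p{L}.
URL = r'[a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[\w"!#$%&()+,./:;=?@\\^{}~-]*'
TRAILER = r'''(?:[,.?!:;"'](?:\s|$))'''

# Same pattern, but the whole trailing clause is now optional.
optional = re.compile('(' + URL + ')' + TRAILER + '?')

def link_in(text):
    """Return the captured link (group 1), or None if nothing matches."""
    m = optional.search(text)
    return m.group(1) if m else None

print(link_in('http://www.yahoo.com end'))  # http://www.yahoo.com
print(link_in('http://www.yahoo.com,'))     # http://www.yahoo.com,
```

The greedy path part takes the ',' first, and since the trailing
clause may now match nothing, the engine has no reason to backtrack
and give the ',' back.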

Can someone recommend a suitable regex for this?

Mike

-- 
Michael B Allen
PHP Active Directory SPNEGO SSO
http://www.ioplex.com/

