[wiki-standards] Free-standing URL PCRE - how not to match trailing punctuation?

Michael B Allen ioplex at gmail.com
Fri Jul 25 08:23:16 CEST 2008


On Fri, Jul 25, 2008 at 12:06 AM, Sunir Shah <sunir at sunir.org> wrote:
> Heya,
>
> Some Perl I just wrote that does the trick. The essential trick is to match
> the last character as everything *except* punctuation and space.
>
> my $UrlCharacter = "[A-Za-z0-9\;/\?\:\@\&\=\+\$\,\-\_\.\!\~\*\'\(\)\%\#\|]";
> my $UrlProtocols = "http|https|ftp|news|mailto|telnet|gopher"; # Alternatively, you can just use \w+
> my $UrlRegexp = qr<((?:$UrlProtocols):$UrlCharacter+[^,\\.?!:;"\'\s])>;

Bingo!

Negating the character class indeed does the trick, although I also
added a ']' to the negated class so that the regex does not steal a
square bracket used to end a link. Otherwise Wiki text like
[[http://www.yahoo.com]] would match 'http://www.yahoo.com]'.
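
For anyone who wants to try the idea outside Perl, here is a minimal
sketch in Python. It is a port of the quoted pattern, not the exact code
used here: the names URL_CHAR, URL_PROTOCOLS, LAST_CHAR and find_urls are
made up for illustration, and Python's re module is assumed to behave
like PCRE for these simple classes.

```python
import re

# Characters allowed inside a URL, ported from the quoted $UrlCharacter.
URL_CHAR = r"[A-Za-z0-9;/?:@&=+$,\-_.!~*'()%#|]"
URL_PROTOCOLS = r"http|https|ftp|news|mailto|telnet|gopher"

# The last character must NOT be trailing punctuation, whitespace,
# or ']' (the extra character added so [[...]] links are not swallowed).
LAST_CHAR = r"""[^,\\.?!:;"'\]\s]"""

URL_RE = re.compile(rf"((?:{URL_PROTOCOLS}):{URL_CHAR}+{LAST_CHAR})")


def find_urls(text):
    """Return every free-standing URL found in text (hypothetical helper)."""
    return [m.group(1) for m in URL_RE.finditer(text)]


print(find_urls("See http://www.yahoo.com, end"))
print(find_urls("[[http://www.yahoo.com]]"))
```

Both calls should report http://www.yahoo.com without the trailing ','
or ']', because the greedy URL_CHAR+ backtracks until the final
character satisfies the negated class.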

Thanks,
Mike

> Subject: [wiki-standards] Free-standing URL PCRE - how not to match
> trailing punctuation?
>
> Hi,
>
> The Creole 1.0 standard says:
>
>  Free-standing URLs should be detected and turned into links. Single
> punctuation characters (,.?!:;"') at the end of URLs should not be
> considered part of the URL.
>
> The problem is I can't seem to come up with a regex that does NOT match the
> optional (,.?!:;"') (herein abbreviated "said chars") at the end of a link.
>
> This is my regex:
>
>
> ([a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[\p{L}0-9"!#$%&\\()+,\\./:;=?\\@\\\\^_{}~
> -]*)(?:[,\\.?!:;"\'](?:\\s|$))
>
> [Note that each backslash is escaped with an extra backslash because this is
> a PHP string literal.]
>
> The problem is this bit on the end:
>
>  (?:[,\\.?!:;"\'](?:\\s|$))
>
> This matches one said char followed by white space or the end of the subject
> string.
>
> This works with a URL like:
>
>  http://www.yahoo.com, end
>
> where the end of the link does not include the ','.
>
> But with:
>
>  http://www.yahoo.com end
>
> it doesn't match the URL at all since it doesn't have one of the said chars.
>
> If I make the entire trailing clause optional, it won't match one of the
> said chars because the said chars will be matched in the path part of the
> regex. Meaning this:
>
>  http://www.yahoo.com,
>
> will include the ',' in the link because it was matched as part of the path
> expression.
>
> Can someone recommend a suitable regex for this?
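
The two failure modes described in the quoted question can be reproduced
with a rough Python sketch. This is a simplification, not the original
PHP pattern: \w stands in for \p{L}0-9_ because Python's re module has
no \p{L}, and the names PATH, SAID, MANDATORY, OPTIONAL and matched_url
are hypothetical.

```python
import re

# Scheme, host and path, loosely following the quoted PHP regex.
PATH = r"""[a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[\w"!#$%&()+,./:;=?@\\^{}~-]*"""
# One of the "said chars".
SAID = r"""[,.?!:;"']"""

# Trailing clause required: a said char followed by whitespace or end.
MANDATORY = re.compile(rf"({PATH})(?:{SAID}(?:\s|$))")
# Trailing clause optional: the greedy path part takes the said char first.
OPTIONAL = re.compile(rf"({PATH})(?:{SAID}(?:\s|$))?")


def matched_url(pattern, text):
    """Return the captured URL, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(1) if m else None


print(matched_url(MANDATORY, "http://www.yahoo.com, end"))  # comma excluded
print(matched_url(MANDATORY, "http://www.yahoo.com end"))   # no match at all
print(matched_url(OPTIONAL, "http://www.yahoo.com,"))       # comma included
```

The mandatory form only succeeds when a said char is actually present,
and the optional form succeeds without ever backtracking out of the
path class, which is why constraining the last character with a negated
class is the simpler fix.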

-- 
Michael B Allen
PHP Active Directory SPNEGO SSO
http://www.ioplex.com/

