[wiki-standards] Free-standing URL PCRE - how not to match trailing punctuation?
Alex
kensanata at gmail.com
Fri Jul 25 08:22:18 CEST 2008
This is what Oddmuse is using, similar to Sunir's idea:
$UrlProtocols =
'http|https|ftp|afs|news|nntp|mid|cid|mailto|wais|prospero|telnet|gopher|irc|feed';
$UrlProtocols .= '|file' if $NetworkFile;
my $UrlChars = '[-a-zA-Z0-9/@=+$_~*.,;:?!\'"()&#%]'; # see RFC 2396
my $EndChars = '[-a-zA-Z0-9/@=+$_~*]'; # no punctuation at the end of the URL
$UrlPattern = "((?:$UrlProtocols):$UrlChars+$EndChars)";
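A minimal sketch of how this end-character pattern behaves on Sunir's test strings, next to a lookahead variant that is not from this thread and is shown only for comparison. The names first_url, $EndCharPattern and $LookaheadPattern are hypothetical, and the protocol list is shortened for the example.

```perl
use strict;
use warnings;

# Classes copied from the message above. Single quotes keep '$_' and '@'
# from being interpolated by Perl.
my $UrlProtocols = 'http|https|ftp|news|mailto|telnet|gopher';  # shortened list
my $UrlChars     = '[-a-zA-Z0-9/@=+$_~*.,;:?!\'"()&#%]';
my $EndChars     = '[-a-zA-Z0-9/@=+$_~*]';

# Oddmuse style: greedy body, then one mandatory non-punctuation character.
my $EndCharPattern = qr/((?:$UrlProtocols):$UrlChars+$EndChars)/;

# Hypothetical variant: lazy body, then a lookahead that allows at most one
# trailing punctuation character before whitespace or the end of the string.
my $LookaheadPattern =
    qr/((?:$UrlProtocols):$UrlChars+?)(?=[,.?!:;"']?(?:\s|\z))/;

# Return the first URL found in $text, or undef if there is none.
sub first_url {
    my ($pattern, $text) = @_;
    return $text =~ $pattern ? $1 : undef;
}

for my $text ('http://www.yahoo.com', 'http://www.yahoo.com, end',
              'http://www.yahoo.com,', 'http://www.yahoo.com end',
              'http://www.yahoo.com?!') {
    printf "%-28s end-char: %-24s lookahead: %s\n", "'$text'",
        first_url($EndCharPattern, $text)   // 'no match',
        first_url($LookaheadPattern, $text) // 'no match';
}
```

Both patterns should give http://www.yahoo.com for the first four strings. They differ on the last one: the end-character pattern drops every trailing punctuation character, while the lookahead variant drops only one, which is closer to the "single punctuation characters" wording in Creole 1.0. Note that $EndChars also excludes ")", "#", "%" and "&", so a URL that legitimately ends in one of those loses its last character.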
On Fri, Jul 25, 2008 at 6:06 AM, Sunir Shah <sunir at sunir.org> wrote:
> Heya,
>
> Some Perl I just wrote that does the trick. The essential trick is to match
> the last character as everything *except* punctuation and space.
>
> # Single-quoted, with "-" last, so that ",-_" is not read as a range.
> my $UrlCharacter =
> '[A-Za-z0-9;/?:@&=+$,_.!~*\'()%#|-]';
> my $UrlProtocols = "http|https|ftp|news|mailto|telnet|gopher"; # Alternatively, you can just use \w+
> my $UrlRegexp = qr<((?:$UrlProtocols):$UrlCharacter+[^,\\.?!:;"\'\s])>;
>
> my @strings = ( "http://www.yahoo.com", "http://www.yahoo.com, end",
> "http://www.yahoo.com,", "http://www.yahoo.com end" );
>
> foreach my $string (@strings) {
> print "\n$string\n";
> if( $string =~ /$UrlRegexp/ ) {
> print "$1\n";
> } else {
> print "no match";
> }
> }
>
> Cheers,
> Sunir
>
> -----Original Message-----
> From: wiki-standards-bounces at wikisym.org
> [mailto:wiki-standards-bounces at wikisym.org] On Behalf Of Michael B Allen
> Sent: July 24, 2008 10:35 PM
> To: wiki-standards at wikisym.org
> Subject: [wiki-standards] Free-standing URL PCRE - how not to match trailing punctuation?
>
> Hi,
>
> The Creole 1.0 standard says:
>
> Free-standing URLs should be detected and turned into links. Single
> punctuation characters (,.?!:;"') at the end of URLs should not be
> considered part of the URL.
>
> The problem is I can't seem to come up with a regex that does NOT match the
> optional (,.?!:;"') (herein abbreviated "said chars") at the end of a link.
>
> This is my regex:
>
>
>
> ([a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[\p{L}0-9"!#$%&\\()+,\\./:;=?\\@\\\\^_{}~-]*)(?:[,\\.?!:;"\'](?:\\s|$))
>
> [Note that each backslash is escaped with an extra backslash because this
> is a PHP string literal.]
>
> The problem is this bit on the end:
>
> (?:[,\\.?!:;"\'](?:\\s|$))
>
> This matches one said char followed by white space or the end of the
> subject string.
>
> This works with a URL like:
>
> http://www.yahoo.com, end
>
> where the end of the link does not include the ','.
>
> But with:
>
> http://www.yahoo.com end
>
> it doesn't match the URL at all since it doesn't have one of the said chars.
>
> If I make the entire trailing clause optional, it no longer excludes the
> said chars, because they will already have been matched by the path part
> of the regex. Meaning this:
>
> http://www.yahoo.com,
>
> will include the ',' in the link because it was matched as part of the path
> expression.
>
> Can someone recommend a suitable regex for this?
>
> Mike
>
> --
> Michael B Allen
> PHP Active Directory SPNEGO SSO
> http://www.ioplex.com/
> _______________________________________________
>
> wiki-standards mailing list. wiki-standards at wikisym.org
> http://www.wikisym.org/cgi-bin/mailman/listinfo/wiki-standards
>
> For the wiki-research, wiki-standards, wikisym-announce mailing lists,
> please see:
> http://www.wikisym.org/cgi-bin/mailman/listinfo