[wiki-standards] Free-standing URL PCRE - how not to match trailing punctuation?

Alex kensanata at gmail.com
Fri Jul 25 08:22:18 CEST 2008


This is what Oddmuse is using, similar to Sunir's idea:

  $UrlProtocols =
    'http|https|ftp|afs|news|nntp|mid|cid|mailto|wais|prospero|telnet|gopher|irc|feed';
  $UrlProtocols .= '|file'  if $NetworkFile;
  my $UrlChars = '[-a-zA-Z0-9/@=+$_~*.,;:?!\'"()&#%]'; # see RFC 2396
  my $EndChars = '[-a-zA-Z0-9/@=+$_~*]'; # no punctuation at the end of the url.
  $UrlPattern = "((?:$UrlProtocols):$UrlChars+$EndChars)";
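
A minimal sketch of how that pattern behaves, for anyone who wants to try it
outside Oddmuse. The three pattern variables are copied from the snippet above
(the optional $NetworkFile branch is left out). The first_url() helper and the
example.org string are assumptions made up for this illustration and are not
part of Oddmuse; the yahoo.com strings are the ones from Sunir's test.

```perl
use strict;
use warnings;

# Copied from the Oddmuse snippet above; the $NetworkFile branch is omitted.
my $UrlProtocols =
  'http|https|ftp|afs|news|nntp|mid|cid|mailto|wais|prospero|telnet|gopher|irc|feed';
my $UrlChars = '[-a-zA-Z0-9/@=+$_~*.,;:?!\'"()&#%]';  # allowed inside a URL
my $EndChars = '[-a-zA-Z0-9/@=+$_~*]';                # allowed as last character
my $UrlPattern = "((?:$UrlProtocols):$UrlChars+$EndChars)";

# Hypothetical helper (not an Oddmuse function): first URL in $text, or undef.
sub first_url {
    my ($text) = @_;
    return $text =~ /$UrlPattern/ ? $1 : undef;
}

my @strings = (
    'http://www.yahoo.com',
    'http://www.yahoo.com, end',
    'http://www.yahoo.com,',
    'http://www.yahoo.com end',
    'see http://example.org/a?b=1.',   # made-up URL with inner punctuation
);

for my $string (@strings) {
    my $url = first_url($string);
    printf "%-30s => %s\n", $string, defined $url ? $url : 'no match';
}
```

$UrlChars+ is greedy and first swallows the trailing comma or period, but the
match can only finish on a character from $EndChars, so the engine backtracks
until the last character is not punctuation. All four yahoo.com strings yield
http://www.yahoo.com, and the "?" inside the last string is kept. Because
$EndChars is a whitelist, it also drops a trailing ")" and any run of several
punctuation characters, which goes further than the single character Creole
asks for.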


On Fri, Jul 25, 2008 at 6:06 AM, Sunir Shah <sunir at sunir.org> wrote:

> Heya,
>
> Some Perl I just wrote that does the trick. The essential trick is to match
> the last character as everything *except* punctuation and space.
>
> my $UrlCharacter =
> '[A-Za-z0-9;/?:@&=+$,_.!~*\'()%#|-]'; # "-" last, so it cannot form a range
> my $UrlProtocols = "http|https|ftp|news|mailto|telnet|gopher"; # Alternatively, you can just use \w+
> my $UrlRegexp = qr<((?:$UrlProtocols):$UrlCharacter+[^,\\.?!:;"\'\s])>;
>
> my @strings = ( "http://www.yahoo.com", "http://www.yahoo.com, end",
> "http://www.yahoo.com,", "http://www.yahoo.com end" );
>
> foreach my $string (@strings) {
>   print "\n$string\n";
>   if( $string =~ /$UrlRegexp/ ) {
>       print "$1\n";
>   } else {
>       print "no match";
>   }
> }
>
> Cheers,
> Sunir
>
> -----Original Message-----
> From: wiki-standards-bounces at wikisym.org
> [mailto:wiki-standards-bounces at wikisym.org] On Behalf Of Michael B Allen
> Sent: July 24, 2008 10:35 PM
> To: wiki-standards at wikisym.org
> Subject: [wiki-standards] Free-standing URL PCRE - how not to match
> trailing punctuation?
>
> Hi,
>
> The Creole 1.0 standard says:
>
>  Free-standing URLs should be detected and turned into links. Single
> punctuation characters (,.?!:;"') at the end of URLs should not be
> considered part of the URL.
>
> The problem is I can't seem to come up with a regex that does NOT match the
> optional (,.?!:;"') (herein abbreviated "said chars") at the end of a link.
>
> This is my regex:
>
>
>
> ([a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[\p{L}0-9"!#$%&\\()+,\\./:;=?\\@\\\\^_{}~-]*)(?:[,\\.?!:;"\'](?:\\s|$))
>
> [Note that each backslash is escaped with an extra backslash because this is
> a PHP string literal.]
>
> The problem is this bit on the end:
>
>  (?:[,\\.?!:;"\'](?:\\s|$))
>
> This matches one said char followed by white space or the end of the subject
> string.
>
> This works with a URL like:
>
>  http://www.yahoo.com, end
>
> where the end of the link does not include the ','.
>
> But with:
>
>  http://www.yahoo.com end
>
> it doesn't match the URL at all since it doesn't have one of the said chars.
>
> If I make the entire trailing clause optional, it never matches one of the
> said chars, because they are already consumed by the path part of the
> regex. Meaning this:
>
>  http://www.yahoo.com,
>
> will include the ',' in the link because it was matched as part of the path
> expression.
>
> Can someone recommend a suitable regex for this?
>
> Mike
>
> --
> Michael B Allen
> PHP Active Directory SPNEGO SSO
> http://www.ioplex.com/
> _______________________________________________
>
> wiki-standards mailing list. wiki-standards at wikisym.org
> http://www.wikisym.org/cgi-bin/mailman/listinfo/wiki-standards
>
> For the wiki-research, wiki-standards, wikisym-announce mailing lists,
> please see:
> http://www.wikisym.org/cgi-bin/mailman/listinfo