[wiki-standards] Free-standing URL PCRE - how not to match trailing punctuation?

Sunir Shah sunir at sunir.org
Fri Jul 25 06:06:11 CEST 2008


Heya,

Here is some Perl I just wrote that does the trick. The essential idea is to
require the last character of the match to be anything *except* punctuation
or whitespace, so the greedy match backs off any trailing punctuation.

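use strict;
use warnings;

# Single quotes keep the backslash in \-, so the hyphen stays literal instead
# of forming the range ",-_" as it would in a double-quoted string.
my $UrlCharacter = '[A-Za-z0-9;/?:@&=+$,\-_.!~*\'()%#|]';
# Alternatively, you can just use \w+
my $UrlProtocols = 'http|https|ftp|news|mailto|telnet|gopher';
# The final class is the trick: the last character must not be trailing
# punctuation or whitespace.
my $UrlRegexp = qr<((?:$UrlProtocols):$UrlCharacter+[^,.?!:;"'\s])>;

my @strings = ( "http://www.yahoo.com", "http://www.yahoo.com, end",
                "http://www.yahoo.com,", "http://www.yahoo.com end" );

foreach my $string (@strings) {
    print "\n$string\n";
    if ( $string =~ /$UrlRegexp/ ) {
        print "$1\n";
    } else {
        print "no match\n";
    }
}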
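
One caveat: the final negated class accepts any character that is not
punctuation or whitespace, including ones outside $UrlCharacter, so
"{http://www.yahoo.com}" would keep the closing brace. A possible variant,
sketched here in Python purely for illustration, keeps every character inside
the URL class and uses a negative lookbehind to reject a match that ends in
punctuation. The names URL_CHAR, URL_PROTOCOLS and find_url are assumptions
made for this sketch.

```python
import re

# Same character set and protocol list as the Perl above, hyphen escaped.
URL_CHAR = r"[A-Za-z0-9;/?:@&=+$,\-_.!~*'()%#|]"
URL_PROTOCOLS = r"(?:http|https|ftp|news|mailto|telnet|gopher)"

# The lookbehind fails when the character just consumed is trailing
# punctuation, so the greedy URL_CHAR+ backtracks until it ends elsewhere.
NOT_PUNCT_BEHIND = r"""(?<![,.?!:;"'])"""

URL_RE = re.compile(URL_PROTOCOLS + ":" + URL_CHAR + "+" + NOT_PUNCT_BEHIND)


def find_url(text):
    """Return the first free-standing URL in text, or None."""
    match = URL_RE.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    for s in ("http://www.yahoo.com", "http://www.yahoo.com, end",
              "http://www.yahoo.com,", "{http://www.yahoo.com}"):
        print(repr(s), "->", find_url(s))
```

Like the Perl version, this backs off every trailing punctuation character,
not just one.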

Cheers,
Sunir

-----Original Message-----
From: wiki-standards-bounces at wikisym.org
[mailto:wiki-standards-bounces at wikisym.org] On Behalf Of Michael B Allen
Sent: July 24, 2008 10:35 PM
To: wiki-standards at wikisym.org
Subject: [wiki-standards] Free-standing URL PCRE - how not to match
trailing punctuation?

Hi,

The Creole 1.0 standard says:

  Free-standing URLs should be detected and turned into links. Single
punctuation characters (,.?!:;"') at the end of URLs should not be
considered part of the URL.
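
Read literally, the rule can also be applied as a post-processing step: match
a candidate greedily, then drop one trailing punctuation character if present.
A minimal Python sketch of that reading follows. The candidate pattern (a
scheme, "://", then any run of non-whitespace) and the name
free_standing_urls are assumptions for illustration and are not taken from
the Creole specification.

```python
import re

# Assumed, deliberately loose candidate: scheme, "://", then non-whitespace.
CANDIDATE_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://\S+")

# The punctuation characters listed in the Creole 1.0 rule.
TRAILING_PUNCTUATION = ",.?!:;\"'"


def free_standing_urls(text):
    """Return candidate URLs with one trailing punctuation char removed."""
    urls = []
    for match in CANDIDATE_RE.finditer(text):
        url = match.group(0)
        if url[-1] in TRAILING_PUNCTUATION:
            url = url[:-1]  # strip a single character only, per the rule
        urls.append(url)
    return urls


if __name__ == "__main__":
    print(free_standing_urls("Try http://www.yahoo.com, or http://example.org/a?"))
```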

The problem is I can't seem to come up with a regex that does NOT match the
optional (,.?!:;"') (herein abbreviated "said chars") at the end of a link.

This is my regex:

 
  ([a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[\p{L}0-9"!#$%&\\()+,\\./:;=?\\@\\\\^_{}~-]*)(?:[,\\.?!:;"\'](?:\\s|$))

[Note that each backslash is escaped with an extra backslash because this is
a PHP string literal.]

The problem is this bit on the end:

  (?:[,\\.?!:;"\'](?:\\s|$))

This matches one said char followed by white space or the end of the subject
string.

This works with a URL like:

  http://www.yahoo.com, end

where the end of the link does not include the ','.

But with:

  http://www.yahoo.com end

it doesn't match the URL at all since it doesn't have one of the said chars.
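
The behaviour described above can be reproduced with a short Python sketch.
It is an approximation, not the original PHP: Python's re module does not
support \p{L}, so ASCII letters stand in for it, and the doubled backslashes
of the PHP string literal are removed. The names MANDATORY_RE and link are
assumptions for this sketch.

```python
import re

# The pattern with a mandatory trailing clause: one of the said chars,
# followed by whitespace or the end of the subject string.
MANDATORY_RE = re.compile(
    r"""([a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[a-zA-Z0-9"!#$%&()+,./:;=?@\\^_{}~-]*)"""
    r"""(?:[,.?!:;"'](?:\s|$))"""
)


def link(text):
    """Return the captured link, or None when the pattern does not match."""
    match = MANDATORY_RE.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(link("http://www.yahoo.com, end"))  # comma is left out of the link
    print(link("http://www.yahoo.com end"))   # no said char, so no match
```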

If I make the entire trailing clause optional, it never gets to match one of
the said chars, because the greedy path part of the regex has already
consumed it. Meaning this:

  http://www.yahoo.com,

will include the ',' in the link because it was matched as part of the path
expression.

Can someone recommend a suitable regex for this?
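
One possible direction, sketched in Python under the same approximations
(ASCII letters instead of \p{L}, no PHP string escaping): make the host and
path quantifiers lazy and move the trailing clause into a lookahead, so that
the link stops at the first position followed by an optional said char and
then whitespace or the end of the string. OPTIONAL_RE reproduces the problem
with the optional clause; LAZY_RE is the hypothetical fix. Both names are
assumptions for this sketch.

```python
import re

SCHEME = r"[a-zA-Z0-9]{1,10}://"
HOST = r"[a-zA-Z0-9.-]"
PATH = r"""[a-zA-Z0-9"!#$%&()+,./:;=?@\\^_{}~-]"""
SAID = r"""[,.?!:;"']"""

# Greedy path plus optional trailing clause: the path swallows the comma.
OPTIONAL_RE = re.compile(
    "(" + SCHEME + HOST + "+" + PATH + "*)(?:" + SAID + r"(?:\s|$))?"
)

# Lazy host and path, with the trailing clause as a lookahead that consumes
# nothing: the capture ends just before an optional said char + space/end.
LAZY_RE = re.compile(
    "(" + SCHEME + HOST + "+?" + PATH + "*?)(?=" + SAID + r"?(?:\s|$))"
)

if __name__ == "__main__":
    print(OPTIONAL_RE.search("http://www.yahoo.com,").group(1))
    for s in ("http://www.yahoo.com", "http://www.yahoo.com, end",
              "http://www.yahoo.com,", "http://www.yahoo.com end"):
        print(repr(s), "->", LAZY_RE.search(s).group(1))
```

Because the lookahead allows at most one said char, only a single trailing
punctuation character is excluded, which is what the Creole wording asks for.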

Mike

--
Michael B Allen
PHP Active Directory SPNEGO SSO
http://www.ioplex.com/