[wiki-standards] Free-standing URL PCRE - how not to match trailing punctuation?
Sunir Shah
sunir at sunir.org
Fri Jul 25 06:06:11 CEST 2008
Heya,
Here is some Perl I just wrote that does the trick. The essential trick is to
require the last character of the match to be anything *except* punctuation
or whitespace.
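use strict;
use warnings;

# Characters allowed inside a URL. Single quotes keep "\-" as a literal
# hyphen; in a double-quoted string the backslash is dropped and ",-_"
# silently becomes a character range.
my $UrlCharacter = '[A-Za-z0-9;/?:@&=+$,\-_.!~*\'()%#|]';
# Alternatively, you can just use \w+
my $UrlProtocols = "http|https|ftp|news|mailto|telnet|gopher";
# The last character of the match must not be punctuation or whitespace.
my $UrlRegexp = qr<((?:$UrlProtocols):$UrlCharacter+[^,\\.?!:;"\'\s])>;

my @strings = ( "http://www.yahoo.com", "http://www.yahoo.com, end",
                "http://www.yahoo.com,", "http://www.yahoo.com end" );

foreach my $string (@strings) {
    print "\n$string\n";
    if ( $string =~ /$UrlRegexp/ ) {
        print "$1\n";
    } else {
        print "no match\n";
    }
}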
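
The same last-character trick also works when scanning a longer piece of
text for every URL rather than testing one string at a time. A minimal
sketch, assuming a shortened protocol list and made-up sample text (the
example.org addresses and the names $url_char, $url_re and @urls are only
for illustration):

```perl
use strict;
use warnings;

# Assumption: same character class idea as above, trimmed protocol list.
my $url_char = '[A-Za-z0-9;/?:@&=+$,\-_.!~*\'()%#|]';
# The final character may be anything except whitespace or punctuation.
my $url_re   = qr<((?:https?|ftp):$url_char+[^,.?!:;"'\s])>;

# Hypothetical sample text with punctuation directly after each URL.
my $text = 'See http://www.yahoo.com, then ftp://example.org/pub/file.txt. '
         . 'Is http://example.org/a?b=1 right?';

# In list context, /g returns the captured group of every match.
my @urls = $text =~ /$url_re/g;
print "$_\n" for @urls;
# http://www.yahoo.com
# ftp://example.org/pub/file.txt
# http://example.org/a?b=1
```

Note that this strips any run of trailing punctuation, not just a single
character, because the match simply backs up until its last character is
acceptable.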
Cheers,
Sunir
-----Original Message-----
From: wiki-standards-bounces at wikisym.org
[mailto:wiki-standards-bounces at wikisym.org] On Behalf Of Michael B Allen
Sent: July 24, 2008 10:35 PM
To: wiki-standards at wikisym.org
Subject: [wiki-standards] Free-standing URL PCRE - how not to match trailing punctuation?
Hi,
The Creole 1.0 standard says:
Free-standing URLs should be detected and turned into links. Single
punctuation characters (,.?!:;"') at the end of URLs should not be
considered part of the URL.
The problem is I can't seem to come up with a regex that does NOT match the
optional (,.?!:;"') (herein abbreviated "said chars") at the end of a link.
This is my regex:
([a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[\p{L}0-9"!#$%&\\()+,\\./:;=?\\@\\\\^_{}~
-]*)(?:[,\\.?!:;"\'](?:\\s|$))
[Note that each backslash is escaped with an extra backslash because this is
a PHP string literal.]
The problem is this bit on the end:
(?:[,\\.?!:;"\'](?:\\s|$))
This matches one said char followed by white space or the end of the subject
string.
This works with a URL like:
http://www.yahoo.com, end
where the end of the link does not include the ','.
But with:
http://www.yahoo.com end
it doesn't match the URL at all since it doesn't have one of the said chars.
If I make the entire trailing clause optional, it never gets to match one of
the said chars, because they have already been consumed by the path part of
the regex. Meaning this:
http://www.yahoo.com,
will include the ',' in the link because it was matched as part of the path
expression.
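
Both failure modes described above can be reproduced in a few lines. The
following is only a sketch in Perl rather than PHP, and it assumes a
shortened stand-in for the pattern (the names $head, $tail, $required,
$optional and capture are made up for the illustration):

```perl
use strict;
use warnings;

# Assumption: a shortened stand-in for the pattern in the post. As there,
# the greedy path class itself contains the punctuation characters.
my $head = qr<[a-zA-Z0-9]{1,10}://[a-zA-Z0-9.-]+[\w/,.?!:;"'=&%-]*>;
# One punctuation character followed by whitespace or end of string.
my $tail = qr<[,.?!:;"'](?:\s|\z)>;

my $required = qr<($head)(?:$tail)>;    # trailing clause mandatory
my $optional = qr<($head)(?:$tail)?>;   # trailing clause optional

# Return the captured URL, or undef when the pattern does not match.
sub capture {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : undef;
}

for my $text ('http://www.yahoo.com, end',
              'http://www.yahoo.com end',
              'http://www.yahoo.com,') {
    my $req = capture($required, $text) // 'no match';
    my $opt = capture($optional, $text) // 'no match';
    print "$text\n  required tail: $req\n  optional tail: $opt\n";
}
```

With the mandatory clause the second string yields no match; with the
optional clause the third string is captured with its trailing comma,
because the greedy path class takes the comma first and the optional clause
is then simply skipped.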
Can someone recommend a suitable regex for this?
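
One possible shape for such a regex, sketched in Perl and not taken from
the thread, is a lazy URL body followed by a zero-width lookahead that
allows at most one of the said chars before whitespace or the end of the
string. The character class below is a simplified assumption, and $body,
$url_re and first_url are hypothetical names:

```perl
use strict;
use warnings;

# Assumption: a simplified URL character class, not the full one above.
my $body   = q<[\w/,.?!:;"'=&%#~+-]>;
# Lazy body, then look ahead (without consuming) for an optional single
# punctuation character followed by whitespace or end of string.
my $url_re = qr<([a-zA-Z0-9]{1,10}://$body+?)(?=[,.?!:;"']?(?:\s|\z))>;

# Return the first URL found in $text, or undef if there is none.
sub first_url {
    my ($text) = @_;
    return $text =~ $url_re ? $1 : undef;
}

for my $text ('http://www.yahoo.com', 'http://www.yahoo.com, end',
              'http://www.yahoo.com,', 'http://www.yahoo.com end') {
    print first_url($text) // 'no match', "\n";   # http://www.yahoo.com
}
```

Because the lookahead permits only one punctuation character, a URL
followed by two of them keeps the first: 'http://www.yahoo.com?!' yields
'http://www.yahoo.com?'. That follows the "single punctuation characters"
wording literally, which may or may not be what is wanted.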
Mike
--
Michael B Allen
PHP Active Directory SPNEGO SSO
http://www.ioplex.com/