[wiki-standards] Re: Better Creole Tests?
Radomir Dopieralski
wikistandards at sheep.art.pl
Sun Jul 27 11:37:50 CEST 2008
Sat, Jul 26, 2008 at 02:02:44PM -0400:
> On Sat, Jul 26, 2008 at 12:40 PM, Radomir Dopieralski
> >> I need a regex that just matches the last occurrence of three closing
> >> curly braces in a sequence of more than three.
> > Um... that's pretty straightforward no? Exactly how you describe:
> > }}}(?!})
>
> Indeed that would be perfect.
>
> Unfortunately because my tokenizer uses a single regex with an OR
> condition for each token, the image closing token '}}' matches before
> the look-ahead expression. So it seems a look-ahead expression doesn't
> follow the precedence expected.
>
> Meaning an expression and input like:
>
> $expr = '/(}}}(?!}))|(}})/';
> $input = 'foo}}}}bar';
>
> will always match the image closing markup '}}' before the look-ahead
> expression.
That's really straightforward too, isn't it?
(?<!})}}(?!})
I recommend reading the documentation on the regular expressions of
whatever programming language you use. I'm sure you will find a lot of
convenient and useful features.
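
To make the difference concrete, here is a minimal sketch in Python (an
assumption on my part; the lookahead and lookbehind syntax is the same in
PCRE-style engines). The group names nowiki_close and image_close and the
helper first_token are made up for illustration and are not taken from
any real tokenizer.

```python
import re

# Original alternation from the quoted message: '}}' can match too early.
NAIVE = re.compile(r"(?P<nowiki_close>\}\}\}(?!\}))|(?P<image_close>\}\})")

# Same alternation, but '}}' only matches when it is exactly two braces:
# not preceded by '}' (lookbehind) and not followed by '}' (lookahead).
FIXED = re.compile(
    r"(?P<nowiki_close>\}\}\}(?!\}))|(?P<image_close>(?<!\})\}\}(?!\}))"
)

def first_token(pattern, text):
    """Return (token name, start offset, matched text) of the first match."""
    m = pattern.search(text)
    return (m.lastgroup, m.start(), m.group()) if m else None

text = "foo}}}}bar"
naive_result = first_token(NAIVE, text)  # image_close wins at offset 3
fixed_result = first_token(FIXED, text)  # nowiki_close wins at offset 4
```

With the second pattern the first brace of the run of four is simply not
part of any token, so the tokenizer can pass it through as plain text.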
> > I'm not sure that using tokens in the way you describe is a good
> > idea in the specific case of wiki markup -- it surely makes a lot
> > of things more complicated.
> It's not complicated. When I get a '}}}}' token, I output one '}',
> unget '}}}' and continue at the top of the loop. It's 4 lines of code
> and is completely isolated.
If you think that four lines of extra code per token, in addition to
who-knows-how-many lines of code needed to distinguish the special case
and allow injecting tokens back into the stream, is a small price, then
there is something wrong. I mean, each and every special case makes it
a little harder to comprehend and manage your code -- and it makes no
difference how elaborate the code for that special case is. If you are
going to write your own token-recognition code simply because you can't
understand regular expressions, then why use regular expressions in it
at all?
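
For comparison, the "output one '}', unget '}}}'" special case described
in the quoted message might look roughly like the following Python
sketch. Everything in it (the TOKEN pattern, the pending queue, the
function name) is hypothetical; it is not the poster's actual code, only
an illustration of how much machinery the push-back needs.

```python
import re
from collections import deque

# Hypothetical token pattern: brace runs of 4, 3 or 2, other text, lone brace.
TOKEN = re.compile(r"\}\}\}\}|\}\}\}|\}\}|[^}]+|\}")

def tokenize_with_unget(text):
    """Tokenize text, splitting '}}}}' into '}' plus a pushed-back '}}}'."""
    pending = deque()  # tokens that were "ungot" and must be re-read first
    pos = 0
    out = []
    while pending or pos < len(text):
        if pending:
            tok = pending.popleft()
        else:
            m = TOKEN.match(text, pos)  # always matches: pattern covers any char
            tok = m.group()
            pos = m.end()
        if tok == "}}}}":
            # The special case: emit one brace, push the rest back, loop again.
            out.append("}")
            pending.appendleft("}}}")
            continue
        out.append(tok)
    return out

tokens = tokenize_with_unget("foo}}}}bar")
```

The special case itself is only a few lines, but it depends on the
pending queue and the extra branch at the top of the loop, both of which
exist only to support pushing tokens back.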
> Of course it would be better if it could be handled with only regex
> but these sort of clauses are common for a tokenizer + loop model.
Not really, they are a sure sign of a failure at some point in the
development: either a failure to prepare a proper grammar and parsing
flow for the specific language, or a failure of the parsing technique
used (like trying to parse a context-free grammar with regexps). Of
course, they can sometimes simplify the parser greatly, so the ugliness
might be a cheap price, but you shouldn't take them so lightly.
In particular, they can increase the computational complexity of your
parser -- and even make it never stop in some cases.
--
Radomir `The Sheep' Dopieralski <http://sheep.art.pl>
On and on until we change / Everything remains the same
On and on until we learn / On and on the wheels will turn