[wiki-standards] Re: Better Creole Tests?

Radomir Dopieralski wikistandards at sheep.art.pl
Sun Jul 27 11:37:50 CEST 2008


Sat, Jul 26, 2008 at 02:02:44PM -0400: 
> On Sat, Jul 26, 2008 at 12:40 PM, Radomir Dopieralski

> >> I need a regex that just matches the last occurrence of three closing
> >> curly braces in a sequence of more than three.

> > Um... that's pretty straightforward no? Exactly how you describe:

> >   }}}(?!})
> 
> Indeed that would be perfect.
> 
> Unfortunately because my tokenizer uses a single regex with an OR
> condition for each token, the image closing token '}}' matches before
> the look-ahead expression. So it seems a look-ahead expression doesn't
> follow the precedence expected.
> 
> Meaning an expression and input like:
> 
> $expr = '/(}}}(?!}))|(}})/';
> $input = 'foo}}}}bar';
> 
> will always match the image closing markup '}}' before the look-ahead
> expression.

That's really straightforward too, isn't it? Guard the '}}' alternative
on both sides, so that it only matches a run of exactly two braces:

   (?<!})}}(?!})

I recommend reading the documentation on regular expressions for
whatever programming language you use. I'm sure you will find a lot of
convenient and useful features.
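
To make the difference concrete, here is a minimal sketch using Python's
`re` module. Python is an assumption made for illustration only; the
lookahead and lookbehind syntax is the same in PCRE. The helper name
`first_match` is made up for this example, and the braces are escaped
just for readability.

```python
import re

# The alternation from the quoted message: '}}' can match too early.
naive = re.compile(r'(\}\}\}(?!\}))|(\}\})')

# The same alternation with the '}}' token guarded on both sides.
guarded = re.compile(r'(\}\}\}(?!\}))|((?<!\})\}\}(?!\}))')


def first_match(pattern, text):
    """Return (start offset, matched text) of the first match in text."""
    m = pattern.search(text)
    return (m.start(), m.group(0))


text = 'foo}}}}bar'
naive_result = first_match(naive, text)      # '}}' wins at offset 3
guarded_result = first_match(guarded, text)  # '}}}' wins at offset 4
```

With the guarded version, the first brace of the run of four is left
unmatched by both alternatives, so it can fall through to whatever rule
handles plain text.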

> > I'm not sure that using tokens in the way you describe is a good
> > idea in the specific case of wiki markup -- it surely makes a lot
> > of things more complicated.
 
> It's not complicated. When I get a '}}}}' token, I output one '}',
> unget '}}}' and continue at the top of the loop. It's 4 lines of code
> and is completely isolated.

If you think that four lines of extra code per token, in addition to
who-knows-how-many lines of code needed to distinguish the special case
and allow injecting tokens back into the stream, is a small price, then
there is something wrong. I mean, each and every special case makes it
a little harder to comprehend and manage your code -- and it makes no
difference how elaborate the code for that special case is. If you are
going to write your own token-recognition code simply because you can't
understand regular expressions, then why use regular expressions in it
at all?
 
> Of course it would be better if it could be handled with only regex
> but these sort of clauses are common for a tokenizer + loop model.

Not really, they are a sure sign of a failure at some point in the
development: either a failure to prepare a proper grammar and parsing
flow for the specific language, or a failure of the parsing technique
used (like trying to parse a context-free grammar with regexps). Of
course, they can sometimes simplify the parser greatly, so the ugliness
might be a cheap price, but you shouldn't take them so lightly.

In particular, they can increase the computational complexity of your
parser -- and even make it never stop in some cases.
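
For comparison, here is a rough sketch of a tokenizer loop that never
pushes anything back, again in Python as an assumption. The token names
(`nowiki_close`, `image_close`, `text`) and the function `tokenize` are
hypothetical and not taken from any existing parser. Every match
consumes at least one character, so the loop makes a single pass over
the input and always stops.

```python
import re

# Hypothetical token table; the group names are illustrative only.
TOKEN = re.compile(r'''
    (?P<nowiki_close>\}\}\}(?!\}))      # last three braces of a run
  | (?P<image_close>(?<!\})\}\}(?!\}))  # a run of exactly two braces
  | (?P<text>[^}]+|\})                  # other text, or one stray brace
''', re.VERBOSE)


def tokenize(source):
    """Split source into (token name, token text) pairs in one pass."""
    # The 'text' alternative accepts any single character, so finditer
    # yields contiguous matches and nothing needs to be ungot.
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(source)]


tokens = tokenize('foo}}}}bar')
# [('text', 'foo'), ('text', '}'), ('nowiki_close', '}}}'), ('text', 'bar')]
```

The extra '}' comes out as ordinary text before the closing token, which
is the same result as the output-one-and-unget approach, but without a
special case in the loop.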

-- 
Radomir `The Sheep' Dopieralski <http://sheep.art.pl>
 On and on until we change / Everything remains the same
 On and on until we learn / On and on the wheels will turn

