[wiki-standards] Re: Better Creole Tests?

Michael B Allen ioplex at gmail.com
Sun Jul 27 20:57:16 CEST 2008


On Sun, Jul 27, 2008 at 1:30 PM, Radomir Dopieralski
<wikistandards at sheep.art.pl> wrote:
> Sun, Jul 27, 2008 at 12:39:17PM -0400:
>> On Sun, Jul 27, 2008 at 5:37 AM, Radomir Dopieralski
>> <wikistandards at sheep.art.pl> wrote:
>> > Sat, Jul 26, 2008 at 02:02:44PM -0400:
>> >> On Sat, Jul 26, 2008 at 12:40 PM, Radomir Dopieralski
>
>> >> Unfortunately because my tokenizer uses a single regex with an OR
>> >> condition for each token, the image closing token '}}' matches before
>> >> the look-ahead expression. So it seems a look-ahead expression doesn't
>> >> follow the precedence expected.
>
> [...]
>
>> > That's really straightforward too, isn't it?
>> >
>> >   (?<!})}}(?!})
>>
>> Apparently not because this doesn't work.
>
>
> <?php
> preg_match_all('/(?<!})}}(?!})|}}}(?!})/', "{{{foo}}}}} {{bar}} baz", $matches);
> print_r($matches);
> ?>
>
> gives output:
>
> Array ( [0] => Array ( [0] => }}} [1] => }} ) )
>
> which looks about right to me, but I'm not that experienced with PHP.
> Of course, you have to add some basic cases for handling beginning and
> ends of the lines -- but this is left as an exercise for the reader.

This regex by itself has the same problem as your original suggestion,
but there is a fix for the issue. More below.

> [...]
>> So, I use a regex tokenizer that grabs two tokens at a time (stuff
>> that doesn't match and a token). Then I use a regular state machine to
>> handle the grammer.
>
> Really, I'd advice you to use a regular expression that recognizes all
> the tokens at once, because if you do it the way you describe here, you
> are in fact reading the input log n times, where n is the number of
> tokens. This produces O(n*log n) complexity of the parser, which is not
> really too hot.

There you go jumping to conclusions again.

I *do* use a regex that recognizes all of the tokens at once. I have
one regex like '@(tokenexpr1)|(tokenexpr2)|(exprN)|...@'.

This is why the regex you have been suggesting isn't sufficient by
itself: the token expression for the closing image tag '}}' matches
at an earlier position than your nowiki tag expression does. For
look-ahead (and apparently look-behind) to work, the stream apparently
must be positioned after where other expressions would have already
matched.
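
A minimal sketch of that effect, assuming a cut-down two-token
expression (the real tokenizer has more alternatives, and the group
numbering here is only illustrative). The engine tries every
alternative at each starting position before moving on, so an
unguarded '}}' wins one character too early:

```php
<?php
// Hypothetical cut-down tokenizer:
//   group 1 = nowiki close '}}}' (guarded by a look-ahead)
//   group 2 = image close '}}'   (unguarded)
$naive = '/(}}}(?!}))|(}})/';

// One literal '}' followed by the real nowiki close '}}}'.
$input = 'a}}}}b';

// At offset 1 the nowiki alternative is rejected by its look-ahead,
// but the plain '}}' alternative succeeds at that same offset, so the
// '}}}' starting at offset 2 is never considered.
preg_match($naive, $input, $m, PREG_OFFSET_CAPTURE);
$first = $m[0];   // [matched text, byte offset]
print_r($first);  // '}}' at offset 1
```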

However, if I also modify the closing image tag expression to use a
similar look-ahead expression:

  }}(?!})

then this effectively cancels out the undesirable effect of the other
expression. Now I can use your original look-ahead expression with the
above like:

<?php

$expr = '/(}}}(?!}))|(}}(?!}))/';

$input = 'foo}}}}}}}}}}bar';

$ret = preg_match($expr, $input, $matches, PREG_OFFSET_CAPTURE);

echo "ret=$ret\n";
print_r($matches);

The above works. I've removed the clause I had and everything is working great.
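
For the archive, here is a rough sketch of how the same expression
could be run over a whole string with preg_match_all, using the input
from Radomir's example. The token names and the way the token kind is
derived from the capture groups are assumptions for illustration, not
the actual tokenizer, and the short array syntax needs PHP 7.1 or
later:

```php
<?php
// Both closing tokens carry the same look-ahead guard.
$expr  = '/(}}}(?!}))|(}}(?!}))/';
$input = '{{{foo}}}}} {{bar}} baz';

// One entry per match; each capture is [text, byte offset].
preg_match_all($expr, $input, $sets, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

$tokens = [];
foreach ($sets as $set) {
    [$text, $offset] = $set[0];
    // Group 2 is present with a real offset only when the image
    // alternative matched; otherwise the nowiki alternative did.
    $kind = (isset($set[2]) && $set[2][1] >= 0)
        ? 'image_close'   // hypothetical token name
        : 'nowiki_close'; // hypothetical token name
    $tokens[] = [$kind, $text, $offset];
}

print_r($tokens);
// nowiki_close '}}}' at offset 8, image_close '}}' at offset 17
```

The two extra braces after "foo" are skipped because both alternatives
refuse to match while another '}' follows.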

Thanks,
Mike

