[wiki-standards] Re: Better Creole Tests?
Michael B Allen
ioplex at gmail.com
Sun Jul 27 18:39:17 CEST 2008
On Sun, Jul 27, 2008 at 5:37 AM, Radomir Dopieralski
<wikistandards at sheep.art.pl> wrote:
> Sat, Jul 26, 2008 at 02:02:44PM -0400:
>> On Sat, Jul 26, 2008 at 12:40 PM, Radomir Dopieralski
>
>> >> I need a regex that just matches the last occurrence of three closing
>> >> curly braces in a sequence of more than three.
>
>> > Um... that's pretty straightforward no? Exactly how you describe:
>
>> > }}}(?!})
>>
>> Indeed that would be perfect.
>>
>> Unfortunately because my tokenizer uses a single regex with an OR
>> condition for each token, the image closing token '}}' matches before
>> the look-ahead expression. So it seems a look-ahead expression doesn't
>> follow the precedence expected.
>>
>> Meaning an expression and input like:
>>
>> $expr = '/(}}}(?!}))|(}})/';
>> $input = 'foo}}}}bar';
>>
>> will always match the image closing markup '}}' before the look-ahead
>> expression.
>
> That's really straightforward too, isn't it?
>
> (?<!})}}(?!})
Apparently not because this doesn't work.
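
For reference, the precedence problem with the original alternation quoted above can be reproduced in a few lines. This is only a sketch: it uses Python's re module instead of PHP's preg functions (an assumption made so the example is easy to run), and the variable names are illustrative.

```python
import re

# The alternation quoted above, minus the PHP '/' delimiters.
expr = re.compile(r'(}}}(?!}))|(}})')
text = 'foo}}}}bar'

# Collect (offset, matched text) for every match, in scan order.
matches = [(m.start(), m.group()) for m in expr.finditer(text)]
print(matches)
```

At offset 3 the three-brace alternative is rejected by its look-ahead, so the engine falls through to '}}' at that same offset. It never gets to try offset 4, which is where '}}}(?!})' would have matched.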
>> > I'm not sure that using tokens in the way you describe is a good
>> > idea in the specific case of wiki markup -- it surely makes a lot
>> > of things more complicated.
>
>> It's not complicated. When I get a '}}}}' token, I output one '}',
>> unget '}}}' and continue at the top of the loop. It's 4 lines of code
>> and is completely isolated.
>
> If you think that four lines of extra code per token, in addition to
> who-knows-how-many lines of code needed to distinguish the special case
> and allow injecting tokens back into the stream, is a small price, then
> there is something wrong.
My Creole 1.0 implementation is ~600 lines, I process the stream only
once and it has at most two tokens in memory at the same time. There's
nothing wrong.
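
The '}}}}' special case described above can be sketched roughly as follows. This is not the actual implementation: it is written in Python rather than PHP, the token set is cut down to the three brace tokens from this thread, and the names TOKEN and tokenize are hypothetical.

```python
import re

# Hypothetical token set: only the brace tokens discussed in this thread.
# Longer alternatives come first so '}}}}' wins over '}}}' and '}}'.
TOKEN = re.compile(r'(?P<four>}}}})|(?P<nowiki_close>}}})|(?P<image_close>}})')

def tokenize(text):
    """Yield (kind, value) pairs; kind is 'text' or a token name."""
    pos = 0
    while pos < len(text):
        m = TOKEN.search(text, pos)
        if m is None:
            yield ('text', text[pos:])
            return
        if m.start() > pos:
            yield ('text', text[pos:m.start()])
        if m.lastgroup == 'four':
            # Special case: output one literal '}' and "unget" the other
            # three by resuming the scan one character later.
            yield ('text', '}')
            pos = m.start() + 1
            continue
        yield (m.lastgroup, m.group())
        pos = m.end()

print(list(tokenize('foo}}}}bar')))
```

With this arrangement a run of more than three braces is emitted as literal '}' characters until exactly three remain, and those three become the closing token.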
> I mean, each and every special case makes it
> a little harder to comprehend and manage your code -- and it makes no
> difference how elaborate the code for that special case is. If you are
> going to write your own token-recognition code simply because you can't
> understand regular expressions, then why use regular expressions in it
> at all?
That's a good question. If I were doing C, my tokenizer would iterate
over each character to collect each token. But practice has shown that
doing that sort of thing in PHP is actually slower than using regex
because the regex is using an optimized C library.
So, I use a regex tokenizer that grabs two tokens at a time (stuff
that doesn't match and a token). Then I use a regular state machine to
handle the grammar.
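
As a rough illustration of that shape (not the real code), here is a sketch in Python rather than PHP. It assumes a hypothetical token subset of just bold ('**') and italic ('//'); the names PAIR, pairs and render are made up for the example, and the "state machine" is reduced to two on/off flags.

```python
import re

# Hypothetical token subset: bold ('**') and italic ('//') markers only.
# Group 1 is the text before the token, group 2 is the token itself,
# or the empty string when the end of the input is reached.
PAIR = re.compile(r'(.*?)(\*\*|//|\Z)', re.DOTALL)

def pairs(text):
    """Yield (plain_text, token) pairs, one regex match per pair."""
    pos = 0
    while pos < len(text):
        m = PAIR.match(text, pos)
        yield m.group(1), m.group(2)
        pos = m.end()

def render(text):
    """Toggle one open/closed flag per token and emit matching tags."""
    is_open = {'**': False, '//': False}
    tags = {'**': 'strong', '//': 'em'}
    out = []
    for plain, token in pairs(text):
        out.append(plain)
        if token:
            is_open[token] = not is_open[token]
            slash = '' if is_open[token] else '/'
            out.append('<%s%s>' % (slash, tags[token]))
    return ''.join(out)

print(render('a **b** c'))
```

Each pass through the loop consumes one stretch of plain text plus at most one token, so the input is scanned once. The sketch does no HTML escaping and would misread the '//' in a URL, which a real grammar has to handle.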
>> Of course it would be better if it could be handled with only regex
>> but these sort of clauses are common for a tokenizer + loop model.
>
> Not really, they are a sure sign of a failure at some point in the
> development: either a failure of preparing proper grammar and parsing
> flow for the specific language, or a failure of the parsing technique
> used (like trying to parse a context free grammar with regexps). Of
> course, they can sometimes simplify the parser greatly, so the ugliness
> might be a cheap price, but you shouldn't take them so lightly.
>
> In particular, they can increase the computational complexity of your
> parser -- and even make it never stop in some cases.
My implementation handles every case that I've managed to think up.
Here's my sample page:
http://www.ioplex.com/~miallen/CreoleTest.html
And it's as fast as I think it could ever be for PHP.
I'm not using regex to iteratively transform the entire input 50 times
like some implementations are doing.
You're jumping to conclusions to ring your own bell at my expense. And
I would care if you provided a working answer to my question but you
didn't even do that.
Thanks for nothin'
Mike