Re: Stack overflow in regexp matcher
Stefan Monnier <monnier <at> iro.umontreal.ca>
2007-04-01 18:50:47 GMT
> The regexp in question looks like: ".*some text.*". When used with
> string-match, removing both ".*"s shouldn't affect the matching
> semantically, right?
Differences (off the top of my head):
- match-beginning (as well as the return value of string-match)
- .* only matches chars other than \n
- .* matches greedily, so (string-match "foo" "foofoo") will only match the
first "foo", whereas (string-match ".*foo" "foofoo") will match the whole
string (the .* matches the first "foo").
> If so, would it be wise to add a regexp simplification step that, for
> example, could remove leading and trailing ".*"s in regular expressions
> passed to string-match?
In the code that calls string-match, yes, but within string-match it would
be terribly difficult to figure out whether the optimization is harmless.
> I realise now is not the time to talk about feature additions, but it
> would certainly make my life simpler if any regular expression that *can*
> be compiled to a DFA is.
I don't think you'll find much disagreement here.
> (I also wonder where "regular expressions" that are more than regular
> came from. Just because they are useful?)
No idea. Originally, only the \N backref was non-regular, but PCRE
introduced many others. Note that using a DFA to implement unix
regexp-matching is actually pretty difficult because of the precise