Discussion
RE#: how we built the world's fastest regex engine in F#
nanoxide: Calling this RE# (resharp), when there is a much more popular and established product already named R# (ReSharper, by JetBrains) in the .NET world, will probably hurt your SEO and/or could potentially cause some legal grief.
FrustratedMonky: F# is one of the biggest 'what could have beens'. Great language that just didn't hit at the right time, or reach critical mass in the community.
nbevans: It uses far fewer tokens than C#, so watch this space...
nodra: Care to explain? Pattern matching, type inference, etc.?
Nelkins: Various investigations have found it to be one of the most token-efficient statically typed programming languages: https://martinalderson.com/posts/which-programming-languages...
masfuerte: This is very impressive.

> how does RE# find the leftmost-longest match efficiently? remember the bidirectional scanning we mentioned earlier - run the DFA right to left to find all possible match starts, then run a reversed DFA left to right to find the ends. the leftmost start paired with the rightmost end gives you leftmost-longest. two linear DFA scans, no backtracking, no ambiguity.

I'm pretty sure that should say "the leftmost start paired with the leftmost end".

This also implies that the algorithm has to scan the entire input to find the first match, and the article goes on to confirm this. So the algorithm is a poor choice if you just want the first match in a very long text. But if you want all matches it is very good.
mananaysiempre: > I'm pretty sure that should say "the leftmost start paired with the leftmost end".

I’m pretty sure it shouldn’t; that would give you the leftmost-shortest match instead of leftmost-longest.
masfuerte: As originally written, doesn't it go from the start of the first match to the end of the last match? I feel like I'm missing something.
ieviev: It goes from the start of the first match to the longest "alive" end; in practice it will go to a dead state and return after finding the match end.

There's an implicit `.*` in front of the first pass, but I felt it would've been a long tangent so I didn't want to get into it.

So, given input 'aabbcc' and pattern `b+`:

The first reverse pass (using `.*b+`) marks the match starts: 'aa|b|bcc'

The forward pass starts from the first match start: 'aa->b|b|cc', marking 2 ends, then enters a dead state after the first 'c' and returns the longest end: aa|bb|cc

I hope this explains it better.
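The walkthrough above can be sketched in code. This is an illustrative toy in Python, not RE#'s actual F# implementation: the "DFAs" for the fixed pattern `b+` are reduced to a character test, which is just enough to show the reverse pass (collect match starts) and the forward pass (extend to the longest alive end).

```python
# Two-pass leftmost-longest matching, sketched for the fixed pattern `b+`
# over the input 'aabbcc'. All names here are illustrative.

def match_starts(text):
    """Reverse pass: scan right to left, collecting every position where
    a match of b+ can begin (i.e. the character there is 'b')."""
    starts = []
    for i in range(len(text) - 1, -1, -1):
        if text[i] == 'b':
            starts.append(i)
    starts.reverse()
    return starts

def longest_end(text, start):
    """Forward pass from a match start: extend while the automaton is
    alive (still reading 'b'), remembering the last accepting position,
    and stop at the first dead state."""
    end = start
    i = start
    while i < len(text) and text[i] == 'b':
        i += 1
        end = i          # b+ accepts after every additional 'b'
    return end

text = 'aabbcc'
starts = match_starts(text)      # [2, 3] -- both 'b' positions
first = starts[0]                # leftmost start
end = longest_end(text, first)   # longest alive end: 4
print(text[first:end])           # -> 'bb'
```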
masfuerte: Cheers. I was more confused by how you were doing multiple matches. So I read the paper, which describes the AllEnds algorithm. If I understand correctly, the reverse pass captures all of the match starts and these need to be remembered for the forward pass. Which is what you were showing above, but I didn't follow it.

So, once it gets going, a traditional engine can produce matches iteratively with no further allocation, but RE# requires allocation proportional to the total number of matches. And in return, it's very much faster and much easier to use (with intersection and complement).
klibertp: If you claim it's the fastest, how does it compare to one-more-re-nightmare?

- https://github.com/telekons/one-more-re-nightmare
- https://applied-langua.ge/posts/omrn-compiler.html

OMRN is a regex compiler that leverages Common Lisp's compiler to produce optimized assembly to match the given regex. It's incredibly fast. It does omit some features to achieve that, though.
mananaysiempre: As a potential user (not the author), what jumps out at me about the two is:

OMRN: no lookaround, eager compilation, can output first match

RE#: no submatches, lazy compilation, must accumulate all matches

Both lookaround and submatch extraction are hard problems, but for practical purposes the lack of lazy compilation feels like it would be the most consequential, as it essentially disqualifies the engine from potentially adversarial REs (or I guess not with the state limit, but then it’s questionable whether it actually counts as a full RE engine in such an application).
ieviev: Yes, exactly correct.

It's also beneficial to merge some of the matching locations into ranges where possible, so when `a*` matches a long sequence of '|a|a|a|a|a|', it can be represented as the range (0,5). We do this to keep the lookaround internal states smaller in the engine.
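The range-merging idea above is simple enough to sketch. This is an illustrative Python toy (not RE# code): sorted match positions that are adjacent collapse into a single half-open range.

```python
# Collapse sorted match positions into half-open (start, end) ranges,
# so that `a*` matching every position of 'aaaaa' becomes one (0, 5)
# range instead of five separate locations.
def to_ranges(positions):
    ranges = []
    for p in positions:
        if ranges and ranges[-1][1] == p:
            ranges[-1] = (ranges[-1][0], p + 1)   # extend current range
        else:
            ranges.append((p, p + 1))             # start a new range
    return ranges

print(to_ranges([0, 1, 2, 3, 4]))   # -> [(0, 5)]
print(to_ranges([0, 1, 3, 4]))      # -> [(0, 2), (3, 5)]
```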
mananaysiempre: Worth mentioning (haven’t checked if the paper talks about this) that while the industry mostly forgot about derivatives and extended REs (i.e. REs with intersection and negation), academia did not. Unfortunately, there have been some pretty discouraging results: the DFA for an extended RE (including a lazy DFA implemented using derivatives, as here) is worst-case doubly exponential in the length of the expression[1], not just exponential as for normal REs. So there is a potential reason not to support intersections in one’s RE matcher, even if they are enticingly easy to implement in terms of derivatives (and even if I personally like to see experimentation in this direction).[1] https://www.sciencedirect.com/science/article/pii/S030439751...
someplaceguy: > the DFA for an extended RE (including a lazy DFA implemented using derivatives, as here) is worst-case doubly exponential in the length of the expression

The authors seem to claim linear complexity:

> the result is RE#, the first general-purpose regex engine to support intersection and complement with linear-time guarantees, and also the overall fastest regex engine on a large set of benchmarks
mananaysiempre: These claims are compatible. For instance, lex, re2c, ragel, etc. are exponential in the length of the automaton description, but the resulting lexers work in linear time[1] in the length of the string. Here the situation is even better, because the DFA is constructed lazily, at most one state per input character, so to observe its full enormity relative to the size of the needle you need an equally enormous haystack.

“One new DFA state per input character” somewhat understates things, because producing this state requires a non-constant[2] amount of work, and storing the new derivative’s syntax tree a non-constant amount of space; but with hash-consing and memoization it’s not bad. Either way, derivative-based matching with a lazy DFA takes something like O(needle + f(needle) × haystack) time, where I’m guessing f(n) has to be O(n log n) at least (to bring large ORs into normal form by sorting) but in practice is closer to constant. Space consumption is less of a constraint because at any point you can flush your cache of derivatives (= DFA) and restart from scratch if it gets too bloated.

[1] Kind of, unless you hit ambiguities that need to be resolved with the maximal-munch rule; anyway, that’s irrelevant to a single-RE matcher.

[2] In particular, introductions to Brzozowski’s approach usually omit—but his original paper does mention—that you need to do some degree of syntax-tree simplification for the derivatives to stay bounded in size (thus finite in number) and the matcher to stay linear in the haystack.
sourcegrift: I've had nothing but great experiences with F#. If it weren't associated with Microsoft, it'd be more popular than Haskell.
lynx97: You realize that Microsoft Research employed Simon for many, many years?
ieviev: We refer to this in the paper as well. The standard way to do intersection/complementation of regexes with NFAs requires determinization, which causes a huge blowup, whereas for us this is the cost of a derivative.

It is true that we cannot avoid enormous DFA sizes; a simple case would be (.*a.*)&(.*b.*)&(.*c.*)&(.*d.*)..., which has 2^4 states, and every additional intersection adds +1 to the exponent.

How we get around this in the real world is that we create at most one state per input character, so even if the full DFA size is 1 million, you need an input that is at least 1 million characters long to reach it.

The real question for complexity is how expensive taking a lazy derivative can get. The first time you use the engine with a unique input and states, it is not linear - the worst case is creating a new state for each character. The second time the same (or similar) input is used, these states are already created and it is linear. So, as said in the article, it is a bit foggy - lazy DFAs are not linear but appear as such for practical cases.
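The "one state per input character" point above is easy to demonstrate. A hedged Python toy (not RE# code): for the intersection (.*a.*)&(.*b.*)&(.*c.*)&(.*d.*), each DFA state corresponds to the subset of {a,b,c,d} already seen, so the full DFA has 2^4 = 16 states, but a lazy construction only materializes the states the input actually visits.

```python
# Lazily materialize DFA states for (.*a.*)&(.*b.*)&(.*c.*)&(.*d.*):
# a state is the frozenset of required letters seen so far.
letters = set('abcd')
states = {frozenset(): 0}        # seen-subset -> state id, built on demand

def step(seen, c):
    nxt = seen | ({c} if c in letters else set())
    key = frozenset(nxt)
    if key not in states:        # lazy: create the state on first visit
        states[key] = len(states)
    return nxt

seen = set()
for c in 'aabba':                # 5-character input
    seen = step(seen, c)

# Only {}, {a}, {a,b} were ever visited: 3 of the 16 possible states.
print(len(states))               # -> 3
```

An input can create at most one new state per character, so reaching all 16 states requires an input that actually exercises them.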
btown: > The second time the same (or similar) input is used these states are already created and it is linear.Does this imply that the DFA for a regex, as an internal cache, is mutable and persisted between inputs? Could this lead to subtle denial-of-service attacks, where inputs are chosen by an attacker to steadily increase the cached complexity - are there eviction techniques to guard against this? And how might this work in a multi-threaded environment?
ieviev: Yes, most (I think all) lazy DFA engines have a mutable DFA behind a lock internally that grows during matching.

Multithreading is generally a non-issue; you just wrap the function that creates the state behind a lock/mutex. This is usually the default.

The subtle denial-of-service part is interesting; I haven't thought of it before. Yes, this is possible. For security-critical uses I would compile the full DFA ahead of time - the memory cost may be painful, but this completely removes the chance of anything going wrong.

There are valid arguments to switch from a DFA to an NFA with large state spaces, but RE# intentionally does not switch to an NFA and capitalizes on reducing the DFA memory costs instead (e.g. minterm compression in the post, algebraic simplifications in the paper).

The problem with going from a DFA to an NFA for large state spaces is that this makes the match-time performance fall off a cliff - something like going from 1GB/s to 1KB/s, as we also show in the benchmarks in the paper.

As for eviction techniques, I have not researched this; the simplest thing to do is just completely reset the instance and rebuild past a certain size, but likely there is a better way.
noelwelsh: I love regular expression derivatives. One neat thing about them is that they are continuation-passing style for regular expressions: the derivative is "what to do next" after seeing a character, which is the continuation of the RE. It's a nice conceptual connection if you're into programming language theory.

Low-key hate the lack of capitalization on the blog, which made me stumble over every sentence start. Great blog post, a bit marred by unnecessary divergence from standard written English.
spankalee: It's so uncomfortable to read. Why do people do this? They capitalize names, so clearly their shift key works. Do they do it to feel special, or like some sort of rebel?
trashface: Maybe they drafted it on a phone where capitalization is harder. My guess is the all-lowercase world is mostly people who do most of their text creation on phones and similar, not keyboards.
layer8: > Multithreading is generally a non-issue, you just wrap the function that creates the state behind a lock/mutex, this is usually the default.But you also have to lock when reading the state, not just when writing/creating it. Wouldn’t that cause lock contention with sufficiently concurrent use?
ieviev: No, we do not lock reading the state; we only lock the creation side, and the transition table reference stays valid during matching even if it is outdated. Only when a nonexistent state is encountered during matching does it enter the locked region.
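The locking discipline described above can be sketched as follows. This is an illustrative Python toy, not RE#'s F# code (and in .NET the real engine keeps an old transition-table reference valid for readers rather than relying on a GIL): the hot path reads the table without locking, and only a cache miss takes the lock, re-checking after acquiring it.

```python
# Lock-on-miss lazy DFA cache: unlocked reads, locked state creation.
import threading

class LazyDfa:
    def __init__(self, build_state):
        self.build_state = build_state      # expensive derivative step
        self.table = {}                     # (state, char) -> next state
        self.lock = threading.Lock()

    def next_state(self, state, char):
        nxt = self.table.get((state, char)) # hot path: unlocked read
        if nxt is None:
            with self.lock:                 # slow path: create under lock
                nxt = self.table.get((state, char))  # re-check after lock
                if nxt is None:
                    nxt = self.build_state(state, char)
                    self.table[(state, char)] = nxt
        return nxt

dfa = LazyDfa(lambda state, char: (state, char))  # toy state builder
dfa.next_state('q0', 'a')   # miss: builds the state under the lock
dfa.next_state('q0', 'a')   # hit: pure unlocked read, no lock taken
```

The double-checked `get` after acquiring the lock avoids building the same state twice when two threads miss simultaneously.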