Say I have a string like this, where {{ is an opening tag and }} is a closing tag (disregard comments). I want to separate the string into topmost-level groups, in this case into "first group" and "second group". I'll be using JS and I'd prefer to use RegExp or the fastest possible way.
{{//first group
somestuff
{{//group inside first group
somemorestuff
}}//end of group inside first group
}}//end of first group
{{//second group
}}//end of second group
I can abstract this into that I want to match the smallest possible strings with an equal amount of opening and closing tags. The tags won't always be on different lines. What's the best possible way to do this?
I asked this on Stack Overflow here, but I received no answers. I'm hoping that Hashnode is better :).
Any time that you want to deal with nested elements, rather than regular expressions, the right tool for the job is a context-free grammar. A CFG has all of the power of a regular grammar—it itself is a superset of a regular grammar—but you can also define symbols and production rules on how symbols are combined into larger strings. CFG are more difficult than regular expressions to wrap your head around but also more powerful.
PegJS is the best JavaScript tool that I've encountered so far to work with CFG parsers.
I created a runnable solution on RunKit using your example: runkit.com/mattstrom/pegjs-example. I threw it together quickly so it might not cover every scenario you're looking for. The code along with the grammar looks like this:
const pegjs = require("pegjs");
const str = `
{{
somestuff
{{
somemorestuff
}}
}}
{{
}}
`;
const grammar = `
{
class Tag {
constructor(inner) {
this.inner = inner;
}
}
}
start
= _ first:node _ second:node _ { return [first, second]; }
/ _ node:node _ { return node; }
node
= nested
/ tag
/ content
nested
= '{{' _ inner:tag _ '}}' { return new Tag(inner); }
/ '{{' left:content _ inner:tag _ '}}' { return [left, new Tag(inner)]; }
/ '{{' inner:tag _ right:content _ '}}' { return [new Tag(inner), right]; }
tag
= '{{' _ inner:content _ '}}' { return new Tag(inner); }
content
= first:token __ rest:content { return first + ' ' + rest; }
/ first:token { return first; }
token
= chars:char* { return chars.join(''); }
char = [A-z0-9_]
_ "optional whitespace"
= [ \\t\\r\\n]*
__ "mandatory whitespace"
= [ \\t\\r\\n]+
`;
const parser = pegjs.generate(grammar);
parser.parse(str);
This doesn't look like a regular language), so it is impossible with unextended regexes, and likely also with common extensions. See this famous StackOverflow answer: stackoverflow.com/a/1732454/723090
As for what you might use instead, perhaps if you found the indices of all open and close tags (you can use regex for that), you can use a simple counter or stack to find the indices of top level blocks?
edit: Just saw you got the same link on SO. It's a similar problem to html because there are also nested strucetures, and regexes have no memory to remember the nesting level.
In some kind of pseudocode:
depth = 0 open_index = -1 for index, tag in find_all("{{", "}}"): if tag == "{{": if depth == 0: open_index = index depth += 1 else: depth -= 1 assert depth >= 0: "unbalanced block" if depth == 0: yield "block from $open_index to $index"