Today I was looking for a way to do nested expression matching with regular expressions, and pretty much came up empty. Then after a trip to the bookstore to pick up Mastering Regular Expressions by Jeffrey Friedl, I finally found it.
Interestingly, even now that I know what to search for :-), I can’t find a single reference to this on the net or on MSDN.
With the .NET regular expression evaluator, there are (?<DEPTH>) and (?<-DEPTH>) constructs that you can use to match nested expressions; for example, if you want to find matching parentheses, or matching HTML tags. Here’s a “simple” example that will match nested <div> tags:
<div> (?><div (?<DEPTH>) | </div (?<-DEPTH>) | .? )* (?(DEPTH)(?!))</div>
Which will match the part in red below:
<div>this is some <div>red</div>text</div></div></div>
This is pretty cool, I’ve got to say. I really can’t do this justice; if you’re interested, I recommend you pick up the book!
The only mention I found of balancing groups on MSDN is the brief one at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpgenref/html/cpcongroupingconstructs.asp
The better discussion is, as you mention, in Mastering Regex. The .NET chapter is available online actually, which is good because I’m fairly certain there are more than a few first editions out there on people’s bookshelves.
http://www.oreilly.com/catalog/regex2/chapter/ch09.pdf
http://www.oreilly.com/catalog/regex2/chapter/index.html.
Good stuff, Greg.
Actually, that MSDN page does not cover the DEPTH construct, which was the whole point of what I was trying to do.
But I wish I knew that chapter was online – would have saved me an hour or so today going to and from the book store!
Mmm…I stand corrected; it does talk about it. :-)
http://weitz.de/regex-coach
Unless regex-coach uses the .NET regular expression parser, it wouldn’t be of any use – the particular feature I was talking about is a MS extension.
Greg,
Expresso (www.ultrapico.com) does use .NET regular expressions, so you might want to give it a try. Thanks for the reference on balancing groups and Friedl’s book. There is no “DEPTH” construct, per se. That is simply the name that Friedl chose for his capture group, any other name works as well. The key concept is “balancing groups”.
Any idea if this can be used to easily get the data between non nested items. For example, if I wanted to get all the data between a <div>
and </div>
tag and I know there is no nested <div>
tags… I tried this but it doesn’t work.
Thanks for the post- this is working like a champ! Exactly what I needed. I am having trouble getting this to work with multiple lines of html though. If there are any newlines between the tags, this seems to break.. am I doing something wrong? Thanks, Paul
Note that you can do this even without the .NET-exclusive balancing group feature, as long as you know in advance the maximum levels of nesting you need to support. See http://badassery.blogspot.com/2006/03/regex-recursion-without-balancing.html for details.
(Moderator, please remove my previous comment)
The code as posted here (and everywhere I’ve seen it) for matching balancing constructs does not work.
The problem lies in the meat of the pattern:
(?>
<div (?<DEPTH>) | </div (?<-DEPTH>) | .? )* (?(DEPTH)(?!))
)
let’s say input is: “Beforedivs<div>.stuff</div>more stuff</div>Afterdivs”
The pattern matches:<div>.stuff</div>more stuff</div>
That’s not a balanced set of divs. Here’s why: the <div> is matched and the DEPTH stack is pushed. The first </div> is matched and the stack is popped. Now the (?(DEPTH)(?!) comes in and tests if there’s anything on the stack and there isn’t so the pattern continues greedily. The second </div> is finally matched, the EMTPY stack is popped and again the test sees nothing on the stack so the pattern does not fail. The pattern is forced to backtrack and allow the last </div> in the pattern to match, and we get the erroneous result. The pattern does not account for unmatched </div>’s.
The good news is that solution is easy. Make it:
(?>
<div (?) | </div (?) | .? )*? (?(DEPTH)(?!)
)
I made the whole parentheses subexpression NON-GREEDY with the ? mark. Now the pattern stops after every <div>,</div>, or single ‘anything’ character (.?) matched and sees if it can match that last </div>, and if so it stops. This new pattern now matches:
<div>stuff</div>
sZweaver2112@gmail.com (86 the z)
Just to follow up what I said yesterday, having read the post at an msdn blog on the subject:It is best to think of a Group as a Stack of captures. Where the top of the stack is the last capture made. (?\)) Matches to “)” and pops a capture off of the Open group’s capture stack. This match can only be successful if and only if the Open group’s capture stack is not empty. This is a fancy way of saying that for every match of this group there must be a match of the group Open.So a “pop” *should* fail on an empty stack and make the pattern fail, but in my recent experience this was just not the case.
sZweaver@gmail.com (86 the z)
This post covers some interesting uses for balancing groups: Fun With .NET’s Regex Balancing Groups
Does anyone have an idea how to do the same regular expression but with PERL or PHP instead of .NET?
Hello for nice article
But this approach has a problem with \n. in my project I have \n in my text and this code does not recognize reg expression. can you help me about my problem