Nested Constructs in Regular Expressions

Today I was looking for a way to do nested expression matching with regular expressions, and pretty much came up empty. Then after a trip to the bookstore to pick up Mastering Regular Expressions by Jeffrey Friedl, I finally found it.

Interestingly, even now that I know what to search for :-), I can’t find a single reference to this on the net or on MSDN.

With the .NET regular expression evaluator, there are (?<DEPTH>) and (?<-DEPTH>) constructs that you can use to match nested expressions; for example, if you want to find matching parentheses, or matching HTML tags. Here’s a “simple” example that will match nested <div> tags:

<div> (?><div (?<DEPTH>) | </div (?<-DEPTH>) | .? )* (?(DEPTH)(?!))</div>

Which will match the part in red below:

<div>this is some <div>red</div>text</div></div></div>

This is pretty cool, I’ve got to say. I really can’t do this justice; if you’re interested, I recommend you pick up the book!

14 thoughts on “Nested Constructs in Regular Expressions

  1. Grant Carpenter

    The only mention I found of balancing groups on MSDN is the brief one at http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpgenref/html/cpcongroupingconstructs.asp

    The better discussion is, as you mention, in Mastering Regex. The .NET chapter is available online actually, which is good because I’m fairly certain there are more than a few first editions out there on people’s bookshelves.

    http://www.oreilly.com/catalog/regex2/chapter/ch09.pdf
    http://www.oreilly.com/catalog/regex2/chapter/index.html.

    Good stuff, Greg.

    Reply
  2. Greg Reinacker

    Actually, that MSDN page does not cover the DEPTH construct, which was the whole point of what I was trying to do.

    But I wish I knew that chapter was online – would have saved me an hour or so today going to and from the book store!

    Reply
  3. Jim Hollenhorst

    Greg,

    Expresso (www.ultrapico.com) does use .NET regular expressions, so you might want to give it a try. Thanks for the reference on balancing groups and Friedl’s book. There is no “DEPTH” construct, per se. That is simply the name that Friedl chose for his capture group, any other name works as well. The key concept is “balancing groups”.

    Reply
  4. Todd Michael

    Any idea if this can be used to easily get the data between non nested items. For example, if I wanted to get all the data between a <div>

    and </div>

    tag and I know there is no nested <div>

    tags… I tried this but it doesn’t work.

    Reply
  5. Paul Warner

    Thanks for the post- this is working like a champ! Exactly what I needed. I am having trouble getting this to work with multiple lines of html though. If there are any newlines between the tags, this seems to break.. am I doing something wrong? Thanks, Paul

    Reply
  6. Scott Weaver

    (Moderator, please remove my previous comment)

    The code as posted here (and everywhere I’ve seen it) for matching balancing constructs does not work.

    The problem lies in the meat of the pattern:

    (?>
    <div (?<DEPTH>) | </div (?<-DEPTH>) | .? )* (?(DEPTH)(?!))
    )

    let’s say input is: “Beforedivs<div>.stuff</div>more stuff</div>Afterdivs”

    The pattern matches:<div>.stuff</div>more stuff</div>

    That’s not a balanced set of divs. Here’s why: the <div> is matched and the DEPTH stack is pushed. The first </div> is matched and the stack is popped. Now the (?(DEPTH)(?!) comes in and tests if there’s anything on the stack and there isn’t so the pattern continues greedily. The second </div> is finally matched, the EMTPY stack is popped and again the test sees nothing on the stack so the pattern does not fail. The pattern is forced to backtrack and allow the last </div> in the pattern to match, and we get the erroneous result. The pattern does not account for unmatched </div>’s.

    The good news is that solution is easy. Make it:
    (?>
    <div (?) | </div (?) | .? )*? (?(DEPTH)(?!)
    )
    I made the whole parentheses subexpression NON-GREEDY with the ? mark. Now the pattern stops after every <div>,</div>, or single ‘anything’ character (.?) matched and sees if it can match that last </div>, and if so it stops. This new pattern now matches:
    <div>stuff</div>

    sZweaver2112@gmail.com (86 the z)

    Reply
  7. Scott Weaver

    Just to follow up what I said yesterday, having read the post at an msdn blog on the subject:It is best to think of a Group as a Stack of captures. Where the top of the stack is the last capture made. (?\)) Matches to “)” and pops a capture off of the Open group’s capture stack. This match can only be successful if and only if the Open group’s capture stack is not empty. This is a fancy way of saying that for every match of this group there must be a match of the group Open.So a “pop” *should* fail on an empty stack and make the pattern fail, but in my recent experience this was just not the case.

    sZweaver@gmail.com (86 the z)

    Reply
  8. M.Bagheri

    Hello for nice article
    But this approach has a problem with \n. in my project I have \n in my text and this code does not recognize reg expression. can you help me about my problem

    Reply

Leave a Reply