regex puzzle!


From: G. Stewart (galenstewart_at_yahoo.com)
Date: 11/23/04


Date: 22 Nov 2004 20:44:05 -0800

The objective is to extract the first n characters of text from an
HTML block. I wish to preserve all HTML (links, formatting etc.), and
at the same time, extend the size of the block to ensure that all
closing tags are recovered.

For example, simply extracting the first 400 characters of an HTML
block may result in an <i> opening tag being included but its
closing tag being excluded. Or a link may get chopped halfway: [...
blah blah <a href="ht] may be the last few characters of the recovered
phrase.
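
A small Python illustration of the failure, using a made-up sample
string (the names `html` and `naive_truncate` are hypothetical and only
there for the example):

```python
# Hypothetical sample block, invented for illustration only.
html = 'Some <i>italic</i> text and <a href="http://example.com/">a link</a>.'


def naive_truncate(source: str, n: int) -> str:
    """Return the first n characters, ignoring any HTML structure."""
    return source[:n]


# The <i> opening tag is kept, but its closing tag is lost.
print(naive_truncate(html, 9))

# The anchor tag is chopped in the middle of its href attribute.
print(naive_truncate(html, 39))
```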

Ideally, if any HTML opening tag is included in the first n
characters, then any number of extra characters should continue to be
extracted from the source block until all paired closing tags are
found.

We can assume that the source block is well-formed HTML, and every
opening tag has a closing tag (whether optional or not). Furthermore
(if it makes any difference), we can assume that all tags are given in
their simplest forms with no attributes (e.g. <p>, <ul>, <li>, <b>),
except for anchor tags, which have the href attribute of course.
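
To make the desired behaviour concrete, here is a rough Python sketch.
It is not a single regular expression: it uses a regex only to find
tags and a small stack to track which ones are still open. It assumes
that n counts characters of the raw source (markup included), that
every opening tag has a closing tag as stated above, and that no
attribute value contains a ">" character. The names `TAG` and
`truncate_balanced` are made up for the example.

```python
import re

# Matches an opening or closing tag such as <p>, </li> or <a href="...">.
# Assumption: attribute values never contain '>'.
TAG = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>')


def truncate_balanced(source: str, n: int) -> str:
    """Return at least the first n characters of source, extended so
    that no tag is cut in half and every opened tag is closed.

    Assumes well-formed HTML in which every opening tag has a matching
    closing tag (so no void elements such as <br>).
    """
    if n >= len(source):
        return source

    open_tags = []  # stack of tag names that are currently open
    end = n         # current cut position; only ever grows

    for match in TAG.finditer(source):
        # Stop once the next tag lies beyond the cut and nothing is open.
        if match.start() >= end and not open_tags:
            break
        if match.group(1):           # closing tag
            if open_tags:
                open_tags.pop()
        else:                        # opening tag
            open_tags.append(match.group(2).lower())
        # Include the whole tag, even if the cut fell inside it.
        end = max(end, match.end())

    return source[:end]


# Hypothetical sample block, invented for illustration only.
sample = 'Some <i>italic</i> text and <a href="http://example.com/">a link</a>.'

print(truncate_balanced(sample, 9))   # extended to include </i>
print(truncate_balanced(sample, 39))  # extended to include </a>
```

With the sample above, a cut at 9 characters is extended to the end of
`</i>`, and a cut at 39 characters (inside the href) is extended to the
end of `</a>`.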

Can anyone suggest a regular expression to do this?


