Re: regex puzzle!

From: Niki Estner (niki.estner_at_cube.net)
Date: 11/24/04


Date: Wed, 24 Nov 2004 01:13:50 +0100

A regex like this one:
  ^((<[^>]*>)*[^<]){400}
will match the first 400 characters of text in an HTML source, not counting
HTML tags (i.e. ignoring anything between <...>); the tags themselves stay
in the match. I'm not sure about the opening/closing-tag matching, though.
I think it is possible (thanks to certain special features of MS' regex
implementation). However, since a typical HTML page starts with a "body" tag
that spans the entire page, balancing the tags would pull in the whole page,
and I'm not sure that is really what you want.
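
To make the behaviour of that pattern concrete, here is a minimal sketch in Python rather than .NET (an assumption on my part; the pattern itself uses nothing engine-specific). The function name first_n_text_chars and the choice to return the whole input when it holds fewer than n text characters are hypothetical, not part of the original suggestion.

```python
import re


def first_n_text_chars(html: str, n: int) -> str:
    """Return the prefix of html that contains the first n non-tag characters.

    Each repetition of the group consumes any run of tags and then exactly
    one character that is not '<', so tags ride along without being counted.
    """
    pattern = re.compile(r'^((<[^>]*>)*[^<]){%d}' % n)
    match = pattern.match(html)
    # Assumption: with fewer than n text characters there is no match,
    # so the input is returned unchanged.
    return match.group(0) if match else html


sample = 'Hello <b>bold</b> and <i>italic</i> text'
print(first_n_text_chars(sample, 8))  # prints: Hello <b>bo
```

The printed result shows the limitation discussed here: the match stops right after the n-th text character, so the <b> tag is left open.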

Niki

"G. Stewart" <galenstewart@yahoo.com> wrote in
news:258fa3a8.0411222044.6c3b9ec@posting.google.com...
> The objective is to extract the first n characters of text from an
> HTML block. I wish to preserve all HTML (links, formatting etc.), and
> at the same time, extend the size of the block to ensure that all
> closing tags are recovered.
>
> For example, simply extracting the first 400 characters of an HTML
> block may result in an <i> opening tag being included, but its
> closing tag being excluded. Or a link may get chopped halfway - [...
> blah blah <a href="ht] may be the last few characters of the recovered
> phrase.
>
> Ideally, if any html opening tag is included in the first n
> characters, then any number of extra characters should continue to be
> extracted from the source block until all paired closing tags are
> found.
>
> We can assume that the source block is well-formed HTML, and every
> opening tag has a closing tag (whether optional or not). Furthermore
> (if it makes any difference), we can assume that all tags are given in
> their simplest forms with no attributes (e.g. <p>, <ul>, <li>, <b>),
> except for anchor tags, which have the href attribute of course.
>
> Can anyone suggest a regular expression to do this?
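
For comparison, the behaviour asked for above can be sketched without a single regular expression, by scanning the block and keeping a stack of open tags. This is a swapped-in technique, not the regex solution requested, and it leans on the stated assumptions: well-formed HTML, every opening tag has a closing tag, and no attributes other than href. The names TAG and truncate_html are hypothetical.

```python
import re

# Matches one opening or closing tag; group 1 is '/' for a closing tag,
# group 2 is the tag name. Attributes (e.g. href) are skipped by [^>]*.
TAG = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>')


def truncate_html(html: str, n: int) -> str:
    """Keep the first n text characters, then keep going until every
    tag opened so far has been closed."""
    open_tags = []  # stack of currently open tag names
    count = 0       # text characters seen so far (tags are not counted)
    pos = 0
    while pos < len(html) and (count < n or open_tags):
        m = TAG.match(html, pos)
        if m:
            name = m.group(2).lower()
            if m.group(1):
                # Closing tag: pop it if it matches the innermost open tag.
                if open_tags and open_tags[-1] == name:
                    open_tags.pop()
            else:
                open_tags.append(name)
            pos = m.end()
        else:
            # Any character that does not start a tag counts as text.
            count += 1
            pos += 1
    return html[:pos]


sample = 'Hello <b>bold</b> and <i>italic</i> text'
print(truncate_html(sample, 8))  # prints: Hello <b>bold</b>
```

Note that a block wrapped in one outer element, such as '<body><p>ab</p><p>cd</p></body>', is returned in full even for n = 1, because the outer tag is not closed until the very end.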


