Re: regex puzzle!
From: Niki Estner (niki.estner_at_cube.net)
Date: 11/24/04
Date: Wed, 24 Nov 2004 01:13:50 +0100
A regex like this one:
^((<[^>]*>)*[^<]){400}
will match the first 400 text characters of an HTML source, together with
any tags in between; characters inside <...> are not counted. I'm not sure
about the opening/closing tag matching, though. I think it is possible
(thanks to certain special features of MS's regex implementation), but since
a typical HTML page starts with a "body" tag that spans the entire page, I'm
not sure this is really what you want.
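
To make the behaviour of the pattern above concrete, here is a minimal sketch in Python (not the .NET engine the thread is about; the `re` syntax happens to be the same for this pattern). The function name `first_n_text_chars` and the parameter `n` are made up for illustration, and the groups are written as non-capturing, which does not change what is matched.

```python
import re
from typing import Optional


def first_n_text_chars(html: str, n: int) -> Optional[str]:
    """Return the prefix of `html` holding the first n text characters.

    Each repetition consumes any number of tags (<...>) followed by
    exactly one character that is not '<', so tags are carried along
    but never counted. Returns None if there are fewer than n text
    characters.
    """
    pattern = re.compile(r"^(?:(?:<[^>]*>)*[^<]){%d}" % n)
    match = pattern.match(html)
    return match.group(0) if match else None


sample = "<p>Hello <b>world</b>!</p>"
print(first_n_text_chars(sample, 8))   # <p>Hello <b>wo
print(first_n_text_chars(sample, 100)) # None
```

Note that the result for n = 8 ends inside the <b> element, so the closing tags are still missing at this point.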
Niki
"G. Stewart" <galenstewart@yahoo.com> wrote in
news:258fa3a8.0411222044.6c3b9ec@posting.google.com...
> The objective is to extract the first n characters of text from an
> HTML block. I wish to preserve all HTML (links, formatting etc.), and
> at the same time, extend the size of the block to ensure that all
> closing tags are recovered.
>
> For example, simply extracting the first 400 characters of an HTML
> block may result in an <i> opening tag being included, but its
> closing tag being excluded. Or a link may get chopped halfway - [...
> blah blah <a href="ht] may be the last few characters of the recovered
> phrase.
>
> Ideally, if any html opening tag is included in the first n
> characters, then any number of extra characters should continue to be
> extracted from the source block until all paired closing tags are
> found.
>
> We can assume that the source block is well-formed HTML, and every
> opening tag has a closing tag (whether optional or not). Furthermore
> (if it makes any difference), we can assume that all tags are given in
> their simplest forms with no attributes (e.g. <p>, <ul>, <li>, <b>),
> except for anchor tags, which have the href attribute of course.
>
> Can anyone suggest a regular expression to do this?
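
The requirement quoted above (keep extracting until every opened tag has been closed) can also be sketched without a single regular expression, by scanning tokens and tracking open tags on a stack. This is only an illustration under the poster's stated assumptions (well-formed input, every opening tag has a closing tag, no void tags such as <br>); the names `TOKEN` and `truncate_balanced` are hypothetical, and it is written in Python rather than .NET.

```python
import re

# One token is either a tag (group 1 = "/" for closing tags,
# group 2 = tag name) or a single text character.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|[^<]")


def truncate_balanced(html: str, n: int) -> str:
    """Return the shortest prefix with at least n text characters
    in which every opened tag has also been closed."""
    open_tags = []   # stack of currently open tag names
    text_chars = 0   # text characters seen so far (tags excluded)
    end = 0          # end offset of the last consumed token
    for m in TOKEN.finditer(html):
        if text_chars >= n and not open_tags:
            break                      # enough text, nothing left open
        if m.group(2) is None:
            text_chars += 1            # plain text character
        elif m.group(1):
            if open_tags:
                open_tags.pop()        # closing tag (input assumed well-formed)
        else:
            open_tags.append(m.group(2).lower())
        end = m.end()
    return html[:end]


print(truncate_balanced("<p>Hello <b>world</b>!</p><p>More</p>", 8))
# <p>Hello <b>world</b>!</p>
```

If the block is wrapped in one outer element, for example `<body>...</body>`, this returns the entire input for any n >= 1, because the outer tag is only closed at the very end.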