Re: regex puzzle!
From: G. Stewart (galenstewart_at_yahoo.com)
Date: 11/24/04
- Next message: jiangyh: "FastGetExistingType() method question?"
- Previous message: Jon Skeet [C# MVP]: "Re: confusing remark in MSDN Control.BeginInvoke() doc"
- In reply to: Niki Estner: "Re: regex puzzle!"
- Next in thread: Niki Estner: "Re: regex puzzle!"
- Reply: Niki Estner: "Re: regex puzzle!"
Date: 24 Nov 2004 00:26:34 -0800
Niki:
Thanks. The HTML source that I am extracting from does not include
<html>, <head>, or <body> tags. Just the block or in-line element
tags: <p>, <i>, <em>, <a>, etc.
What I want to do is to extract a snippet or preview of the source
block, while preserving all the html tags in the snippet/preview,
including formatting and links. Any ideas?
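
One way to approach this without relying on a single regex is to walk the block token by token and keep a stack of open tags. The sketch below is in Python rather than .NET, and the names `TOKEN` and `extract_preview` are hypothetical, not from this thread. It assumes what the original question assumes: well-formed input, every opening tag has a closing tag, and no stray "<" characters in the text.

```python
import re

# Hypothetical tokenizer: matches either one tag or one run of plain text.
# Group 1 is "/" for a closing tag, group 2 is the tag name (None for text).
TOKEN = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|[^<]+")


def extract_preview(html, n):
    """Return a prefix holding at least n text characters, extended
    until every tag opened inside the prefix has been closed."""
    out = []
    open_tags = []  # stack of tag names that are currently open
    count = 0       # text characters emitted so far (tags are not counted)
    for m in TOKEN.finditer(html):
        tag_name = m.group(2)
        if tag_name:
            if m.group(1):
                # Closing tag: assume it pairs with the most recent opener.
                if open_tags:
                    open_tags.pop()
            else:
                open_tags.append(tag_name.lower())
            out.append(m.group(0))
        else:
            text = m.group(0)
            if not open_tags:
                # Outside any element we can cut exactly at the limit.
                text = text[: n - count]
            out.append(text)
            count += len(text)
        # Stop only when the limit is reached and nothing is left open.
        if count >= n and not open_tags:
            break
    return "".join(out)


if __name__ == "__main__":
    block = "<p>Hello <i>brave new</i> world</p><p>Second paragraph</p>"
    print(extract_preview(block, 8))
```

With a limit of 8 the sample prints `<p>Hello <i>brave new</i> world</p>`: the limit falls inside the `<i>` element, so extraction continues until both `</i>` and `</p>` have been seen. Entities such as `&amp;` are counted as several characters here, which may or may not matter for a preview.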
"Niki Estner" <niki.estner@cube.net> wrote in message news:<OCz2Aqb0EHA.1264@TK2MSFTNGP12.phx.gbl>...
> A regex like this one:
> ^((<[^>]*>)*[^<]){400}
> will extract 400 characters from an HTML source, not counting any HTML tags
> (i.e. ignoring characters between <...>), but I'm not sure about the
> opening/closing-tag matching. I think it is possible to do this (thanks to
> certain special features of MS's regex implementation); however, as a typical
> HTML page starts with a "body" tag that spans the entire page, I'm not sure if
> this is really what you want.
>
> Niki
>
> "G. Stewart" <galenstewart@yahoo.com> wrote in
> news:258fa3a8.0411222044.6c3b9ec@posting.google.com...
> > The objective is to extract the first n characters of text from an
> > HTML block. I wish to preserve all HTML (links, formatting etc.), and
> > at the same time, extend the size of the block to ensure that all
> > closing tags are recovered.
> >
> > For example, simply extracting the first 400 characters of an HTML
> > block may result in an <i> opening tag being included, but its
> > closing tag being excluded. Or a link may get chopped halfway - [...
> > blah blah <a href="ht] may be the last few characters of the recovered
> > phrase.
> >
> > Ideally, if any html opening tag is included in the first n
> > characters, then any number of extra characters should continue to be
> > extracted from the source block until all paired closing tags are
> > found.
> >
> > We can assume that the source block is well-formed HTML, and every
> > opening tag has a closing tag (whether optional or not). Furthermore
> > (if it makes any difference), we can assume that all tags are given in
> > their simplest forms with no attributes (e.g. <p>, <ul>, <li>, <b>),
> > except for anchor tags, which have the href attribute of course.
> >
> > Can anyone suggest a regular expression to do this?
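
To see what the pattern quoted above does and does not solve, here is a small check in Python (not .NET; the wrapper `first_n_text_chars` is a hypothetical name, and the count is made a parameter instead of the fixed 400). Each repetition consumes any number of tags followed by exactly one non-tag character, so tags are skipped when counting, but nothing forces opened tags to be closed.

```python
import re


def first_n_text_chars(html, n):
    """Apply the quoted pattern with a configurable count.
    Returns the matched prefix, or None if the block has fewer
    than n text characters."""
    pattern = re.compile(r"^((<[^>]*>)*[^<]){%d}" % n)
    m = pattern.match(html)
    return m.group(0) if m else None


if __name__ == "__main__":
    sample = "<p>Some <i>italic text</i> here</p>"
    print(first_n_text_chars(sample, 8))
```

For the sample this prints `<p>Some <i>ita`: eight text characters, with both `<p>` and `<i>` left open. That is the chopped-tag problem described in the original question, so the match would still need a second step that carries on to the closing tags.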