RE: regex puzzle!

From: Andrew (Andrew_at_discussions.microsoft.com)
Date: 11/23/04


Date: Tue, 23 Nov 2004 13:47:03 -0800

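Classic regular expressions can't count, so a single pattern can't keep
track of how many opening tags are still waiting for their closing tags.
You may need to look into a grammar or parsing tool to accomplish this,
or write some custom code.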
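
For what it's worth, the custom-code route can be fairly short. Below is a
rough sketch in Python, not a tested solution. It assumes what the question
assumes (well-formed HTML, every opening tag has a closing tag) and it
counts n against the raw source, markup included. The names truncate_html
and TOKEN are made up for illustration. A regex is used only to split the
input into tags and text runs; the counting is done by a plain depth
counter.

```python
import re

# Splits the source into whole tags ("<...>") and runs of text between them.
TOKEN = re.compile(r"<[^>]*>|[^<]+")

def truncate_html(html: str, n: int) -> str:
    """Return roughly the first n characters of html, extended so that
    no tag is cut in half and every opened tag is closed.

    Assumption (taken from the question): the input is well-formed and
    every opening tag has a matching closing tag.
    """
    out = []
    length = 0   # characters emitted so far, markup included
    depth = 0    # opening tags still waiting for their closing tag
    for match in TOKEN.finditer(html):
        token = match.group(0)
        # Stop only once the quota is met AND nothing is left open.
        if length >= n and depth == 0:
            break
        if token.startswith("</"):
            depth -= 1          # closing tag
        elif token.startswith("<"):
            depth += 1          # opening tag, always kept whole
        elif depth == 0:
            token = token[: n - length]  # plain text outside any tag: cut at n
        # Text inside an open tag is kept whole, even past n.
        out.append(token)
        length += len(token)
    return "".join(out)
```

With the input 'Hello <i>world and more</i> tail' and n = 10, the cut would
land inside the <i> element, so the sketch keeps going and returns
'Hello <i>world and more</i>'. If text-only counting is wanted (markup not
counted towards n), only the bookkeeping on length needs to change.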

"G. Stewart" wrote:

> The objective is to extract the first n characters of text from an
> HTML block. I wish to preserve all HTML (links, formatting etc.), and
> at the same time, extend the size of the block to ensure that all
> closing tags are recovered.
>
> For example, simply extracting the first 400 characters of an HTML
> block may result in an <i> opening tag being included, but its
> closing tag being excluded. Or a link may get chopped halfway - [...
> blah blah <a href="ht] may be the last few characters of the recovered
> phrase.
>
> Ideally, if any html opening tag is included in the first n
> characters, then any number of extra characters should continue to be
> extracted from the source block until all paired closing tags are
> found.
>
> We can assume that the source block is well-formed HTML, and every
> opening tag has a closing tag (whether optional or not). Furthermore
> (if it makes any difference), we can assume that all tags are given in
> their simplest forms with no attributes (e.g. <p>, <ul>, <li>, <b>),
> except for anchor tags, which have the href attribute of course.
>
> Can anyone suggest a regular expression to do this?
>


