Re: regex puzzle!

From: G. Stewart (galenstewart_at_yahoo.com)
Date: 11/24/04

    Date: 24 Nov 2004 00:26:34 -0800
    
    

    Niki:

    Thanks. The HTML source that I am extracting from does not include
    <html>, <head>, or <body> tags. Just the block or in-line element
    tags: <p>, <i>, <em>, <a>, etc.

    What I want to do is to extract a snippet or preview of the source
    block, while preserving all the html tags in the snippet/preview,
    including formatting and links. Any ideas?
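
For anyone who wants to try the regex quoted below outside of .NET, here is a minimal Python sketch. The function name `text_prefix` and the choice to fall back to the whole input when it holds fewer than n text characters are assumptions of this sketch, not part of Niki's suggestion. It shows that the pattern counts only text characters, and that it can leave tags open, which is exactly the problem being asked about.

```python
import re

def text_prefix(html, n):
    """Return the shortest prefix of html containing n text characters.

    Uses the quoted pattern with n substituted for 400. Each repetition
    consumes any number of tags followed by exactly one text character.
    """
    pattern = r'^((<[^>]*>)*[^<]){%d}' % n
    m = re.match(pattern, html)
    # Assumption of this sketch: if there are fewer than n text
    # characters, return the input unchanged instead of failing.
    return m.group(0) if m else html

sample = '<p>Hello <i>brave new</i> world.</p>'
print(text_prefix(sample, 9))  # the <p> and <i> tags are left unclosed
```

The result stops in the middle of the <i> element, so the closing tags still have to be recovered some other way.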

    "Niki Estner" <niki.estner@cube.net> wrote in message news:<OCz2Aqb0EHA.1264@TK2MSFTNGP12.phx.gbl>...
    > A regex like this one:
    > ^((<[^>]*>)*[^<]){400}
    > will extract 400 characters from an HTML source, not counting any HTML tags
    > (i.e. ignoring characters between <...>), but I'm not sure about the
    > opening/closing-tag matching. I think it is possible to do this (thanks to
    > certain special features in MS's regex implementation). However, as a usual
    > HTML page starts with a "body" tag that spans the entire page, I'm not sure
    > if this is really what you want.
    >
    > Niki
    >
    > "G. Stewart" <galenstewart@yahoo.com> wrote in
    > news:258fa3a8.0411222044.6c3b9ec@posting.google.com...
    > > The objective is to extract the first n characters of text from an
    > > HTML block. I wish to preserve all HTML (links, formatting etc.), and
    > > at the same time, extend the size of the block to ensure that all
    > > closing tags are recovered.
    > >
    > > For example, simply extracting the first 400 characters of an HTML
    > > block may result in an <i> opening tag being included, but its
    > > closing tag being excluded. Or a link may get chopped halfway - [...
    > > blah blah <a href="ht] may be the last few characters of the recovered
    > > phrase.
    > >
    > > Ideally, if any html opening tag is included in the first n
    > > characters, then any number of extra characters should continue to be
    > > extracted from the source block until all paired closing tags are
    > > found.
    > >
    > > We can assume that the source block is well-formed HTML, and every
    > > opening tag has a closing tag (whether optional or not). Furthermore
    > > (if it makes any difference), we can assume that all tags are given in
    > > their simplest forms with no attributes (e.g. <p>, <ul>, <li>, <b>),
    > > except for anchor tags, which have the href attribute of course.
    > >
    > > Can anyone suggest a regular expression to do this?
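
The behaviour described in the original question can also be sketched without a single regex, by walking the tags and text runs and keeping a stack of open tags. This is only an illustration under the assumptions stated in the question (well-formed input, every opening tag has a closing tag); the names `TOKEN` and `balanced_snippet` are made up for this sketch, and it is written in Python rather than C#.

```python
import re

# A token is either a tag (group 1 = "/" for closing tags, group 2 = tag
# name) or a run of text containing no "<".
TOKEN = re.compile(r'<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|[^<]+')

def balanced_snippet(html, n):
    """Take about n text characters, then extend until all tags are closed.

    Assumes well-formed HTML in which every opening tag has a closing
    tag, as stated in the question (so no bare <br> or <img>).
    """
    out = []          # pieces of the snippet, in order
    open_tags = []    # stack of currently open tag names
    count = 0         # text characters taken so far (tags not counted)
    for m in TOKEN.finditer(html):
        if count >= n and not open_tags:
            break     # budget used up and nothing left to close
        token, closing, name = m.group(0), m.group(1), m.group(2)
        if name is None:
            # Text run: cut it at the budget only when no tag is open,
            # otherwise keep it whole so the closing tag can be reached.
            if not open_tags:
                token = token[:n - count]
            count += len(token)
        elif closing:
            if open_tags:
                open_tags.pop()
        else:
            open_tags.append(name.lower())
        out.append(token)
    return ''.join(out)

print(balanced_snippet('ab <b>cd</b> ef', 4))
```

Note that when a <p> (or any other block tag) is open at the cut-off point, the snippet grows to the end of that element, which is the same concern Niki raised about a tag that spans the entire source.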

