Re: regex puzzle!

From: Niki Estner (niki.estner_at_cube.net)
Date: 11/24/04


Date: Wed, 24 Nov 2004 11:55:33 +0100


"G. Stewart" <galenstewart@yahoo.com> wrote in
news:258fa3a8.0411240026.670605c5@posting.google.com...
> Niki:
>
> Thanks. The HTML source that I am extracting from does not include
> <html>, <head>, or <body> tags. Just the block or in-line element
> tags: <p>, <i>, <em>, <a>, etc.
>
> What I want to do is to extract a snippet or preview of the source
> block, while preserving all the html tags in the snippet/preview,
> including formatting and links. Any ideas?

The point is that matching parentheses is possible with regexes, but it's
quite tricky (i.e., I'd have to look it up in a book myself...). However, I
still don't see why you need that. Consider an input like this:
  "This text contains <i>italic</i>,<em>bold</em> and even <a
...>hyperlinked</a> text"
If you extract 20 characters from it, not counting tag characters (using a
regex like the one I suggested in my previous post), you'd get:
  "This text contains <i>i"
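Now, if you put this in an HTML element like:
  "<span>This text contains <i>i</span>..."
you'd produce HTML that browsers will accept (though it isn't well-formed
XML). I think this should work for any input: the closing tag for <p> is
optional, and browsers are generally lenient about unclosed <i>, <em>,
<a>... tags, even though the HTML spec formally requires those closing tags.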
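
A minimal sketch of the extraction step described above, written in Python
for brevity (the thread itself is about .NET). The regex is not the one from
the earlier post, which isn't quoted here; it is an assumed stand-in that
treats each tag as one token and each other character as one visible
character. The names TOKEN, truncate_html and preview are hypothetical, and
the sketch assumes no literal ">" inside attribute values and counts an
entity such as &amp; as several characters.

```python
import re

# Assumed tokenizer: a whole tag, or a single non-"<" character.
# A stray "<" that never closes is silently skipped by finditer.
TOKEN = re.compile(r"<[^>]*>|[^<]")

# The example input from the post, with the elided <a ...> kept as-is.
SAMPLE = ("This text contains <i>italic</i>,<em>bold</em> and even "
          "<a ...>hyperlinked</a> text")


def truncate_html(source, limit):
    """Return the prefix of source that holds `limit` visible characters.

    Tags are copied through and do not count toward the limit.
    Copying stops as soon as the limit is reached, so tags that
    follow the last visible character are not included.
    """
    out = []
    visible = 0
    for match in TOKEN.finditer(source):
        if visible >= limit:
            break
        token = match.group(0)
        out.append(token)
        if not token.startswith("<"):
            visible += 1
    return "".join(out)


def preview(source, limit):
    """Wrap the truncated snippet in a <span>, as suggested in the post."""
    return "<span>" + truncate_html(source, limit) + "</span>..."


if __name__ == "__main__":
    print(truncate_html(SAMPLE, 20))
    print(preview(SAMPLE, 20))
```

With a limit of 20 this yields the same snippet as above, with the <i> left
open and the wrapping <span> closing the preview.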

Niki


