QUERY: comparing website contents




I've got two websites: one original, and the other based on the original.

I'd like to compare the two websites using automatic diff tools to see
what text/information has changed. The problem is that the HTML code and
layout have been changed drastically, so I can't do a straight text file
compare. What I'm interested in is purely the raw content (paragraphs,
sentences, etc.). The original site has no JavaScript, onmouseover
hovers, etc. The new revamped website has JavaScript, onmouseover
hovers, popups, etc.

How can I create a script (Perl? C++?) that extracts the main text
bodies from both sites? I guess I'd also have to specify starting and
ending delimiters. Once extracted, it would need to convert <p></p>
paragraph tags and strip out <a onmouseover...> anchor links (while
keeping the words in between the anchor tags, of course). The new
website uses two spaces after each full stop while the old website uses
one space. Will this matter?
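
One possible approach to the extraction step, sketched in Python rather
than Perl or C++ and using only the standard library's html.parser
module. The names TextExtractor and extract_paragraphs, the set of tags
treated as paragraph breaks, and the optional start/end delimiter
strings are all assumptions made for illustration, not anything taken
from either site. Because every run of whitespace is collapsed to a
single space, the two-spaces-versus-one-space difference should not show
up in the output of this sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text as paragraphs, dropping tags but keeping their words."""

    SKIP = {"script", "style"}  # contents of these are code, not page text
    # Assumed set of tags that end one paragraph and start another.
    BLOCK = {"p", "br", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._current = []
        self._skip_depth = 0

    def _flush(self):
        # Collapse all whitespace runs (including double spaces) to one space.
        text = " ".join("".join(self._current).split())
        if text:
            self.paragraphs.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._flush()
        # Any other tag (e.g. <a onmouseover=...>) is ignored; its text is kept.

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def extract_paragraphs(html, start=None, end=None):
    """Return the page text as a list of paragraphs.

    start and end are optional literal delimiter strings (hypothetical
    markers you pick per site); only the HTML between them is parsed.
    """
    if start is not None:
        html = html.split(start, 1)[-1]
    if end is not None:
        html = html.split(end, 1)[0]
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

For example, extract_paragraphs(page, start="<!-- content -->",
end="<!-- /content -->") would parse only the region between those two
markers, assuming the page happens to contain them.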

Once we've got the plain text, how do we wrap the paragraphs at 80
characters per line so that we can easily do file compares?
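
A minimal sketch of the wrapping and comparison step, again in Python
with only the standard library (textwrap and difflib). It assumes each
site's text has already been reduced to a list of plain paragraph
strings; the function names wrap_paragraphs and diff_sites and the
"old"/"new" labels are made up for illustration.

```python
import difflib
import textwrap


def wrap_paragraphs(paragraphs, width=80):
    """Wrap each paragraph to at most `width` columns.

    Returns a flat list of lines, with an empty line after each paragraph.
    """
    lines = []
    for para in paragraphs:
        lines.extend(textwrap.wrap(para, width=width))
        lines.append("")
    return lines


def diff_sites(old_paragraphs, new_paragraphs, width=80):
    """Return unified-diff lines between the two wrapped texts."""
    old_lines = wrap_paragraphs(old_paragraphs, width)
    new_lines = wrap_paragraphs(new_paragraphs, width)
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="old", tofile="new", lineterm=""
        )
    )


if __name__ == "__main__":
    for line in diff_sites(["The cat sat."], ["The dog sat."]):
        print(line)
```

One thing to watch with fixed-width wrapping: a single changed word near
the start of a paragraph can reflow every line after it, so the diff may
look bigger than the real change. Comparing one unwrapped paragraph per
line, or using a word-level diff tool, may give cleaner results.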



--
andrewwan1980
------------------------------------------------------------------------
Posted via http://www.codecomments.com
------------------------------------------------------------------------
