Re: how to screen scrape content + images

From: MWells (outbound__at_sygnal.com)
Date: 01/12/05


Date: Wed, 12 Jan 2005 14:12:41 +1300

Rachel,

If your extraction is a one-time effort, designed to gather the basic
content for your new version of the website, it's easiest to use a tool like
the one Juan recommended, or even just to extract the details by hand.
Real-estate listings can be fairly complex, containing a couple of hundred
fields per property listing, so you might consider whipping up some tools
for yourself to pull the data out of the page. Regular expressions are very
useful for this purpose.
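
To give a rough idea of the regular-expression approach, here is a minimal
sketch in Python (the same pattern syntax should carry over to the .NET Regex
class with little change). The markup, the class names and the field names
are all made up for illustration; the agent's real page will look different,
so the pattern has to be adjusted to match it.

```python
import re

# Hypothetical markup: the real page will differ, so adjust the pattern.
SAMPLE_HTML = """
<div class="listing">
  <img src="/photos/101.jpg">
  <span class="address">12 Oak St</span>
  <span class="price">$250,000</span>
</div>
<div class="listing">
  <img src="/photos/102.jpg">
  <span class="address">98 Elm Ave</span>
  <span class="price">$310,500</span>
</div>
"""

# One named group per field; DOTALL lets ".*?" span line breaks.
LISTING_RE = re.compile(
    r'<div class="listing">.*?'
    r'<img src="(?P<image>[^"]+)".*?'
    r'<span class="address">(?P<address>[^<]+)</span>.*?'
    r'<span class="price">\$(?P<price>[\d,]+)</span>',
    re.DOTALL,
)


def extract_listings(html):
    """Return one dict per listing block found in the HTML."""
    results = []
    for match in LISTING_RE.finditer(html):
        results.append({
            "image": match.group("image"),
            "address": match.group("address").strip(),
            # Strip thousands separators so the price is a plain integer.
            "price": int(match.group("price").replace(",", "")),
        })
    return results


listings = extract_listings(SAMPLE_HTML)
for item in listings:
    print(item)
```

A pattern like this is brittle by nature: any change to the page layout
breaks it, which is fine for a one-time extraction and is one more reason not
to rely on it for a recurring job.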

If your content-extraction need is recurring, I would at all costs avoid
screen scraping. That's akin to using their existing website as a database
for your new site. Among other things, it means they have to keep their old
site running somewhere and in good working order.

Instead, do some digging to find out where the content is originating from.
If they're taking the photographs and entering the content directly into
their website themselves, you'll probably have to mimic that functionality
through a set of web-based administrative tools. In that case you may be
able to skip the listing-content extraction entirely, build the tools, and
have your client re-enter all of the listings. Sell the idea as
"training"... =)

There's a good chance that they are using a third party provider to acquire
the listings, or are feeding the data in directly from their local MLS. In
the US, most multiple listing services (MLSs) now comply with the national
IDX and VOW standards for publishing listings. Assuming your client's MLS
does, you can acquire a developer license and pull the content yourself from
the MLS, store it in a database, and then embed the data in the website as
desired.
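
As a sketch of the "store it in a database" step, something like the
following would do for a start. It uses Python's built-in sqlite3 module
purely for illustration; the table layout, column names and sample rows are
my own assumptions, not anything an MLS prescribes, and a production site
would likely sit on a full database server.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) listings table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        " mls_id TEXT PRIMARY KEY,"
        " address TEXT NOT NULL,"
        " price INTEGER,"
        " status TEXT)"
    )
    return conn


def upsert_listing(conn, listing):
    """Insert a listing, or overwrite the stored row with the same mls_id."""
    conn.execute(
        "INSERT OR REPLACE INTO listings (mls_id, address, price, status)"
        " VALUES (:mls_id, :address, :price, :status)",
        listing,
    )
    conn.commit()


def active_listings(conn):
    """Return (mls_id, address, price) for active listings, cheapest first."""
    cur = conn.execute(
        "SELECT mls_id, address, price FROM listings"
        " WHERE status = 'active' ORDER BY price"
    )
    return cur.fetchall()


conn = open_store()
upsert_listing(conn, {"mls_id": "1001", "address": "12 Oak St",
                      "price": 250000, "status": "active"})
upsert_listing(conn, {"mls_id": "1002", "address": "98 Elm Ave",
                      "price": 310500, "status": "sold"})
# The same listing arrives again in a later feed with a reduced price.
upsert_listing(conn, {"mls_id": "1001", "address": "12 Oak St",
                      "price": 245000, "status": "active"})
rows = active_listings(conn)
print(rows)
```

Keying on the MLS listing number means re-running a feed simply refreshes
the rows already stored, and the website only ever queries your own table.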

We do this for the Chicago region, so I should note that the effort is
fairly significant. The raw data is often published daily in large CSV
files (100 MB+ in size), retrieved from an FTP server. It's fully
de-normalized, so you'll probably want to do a ton of scrubbing and
normalization to make it useful. You'll likely need to decode all of the
coded fields to English text so that the general public can make sense of
the listing content. Images are also often FTP'd, although some MLSs offer
URL access to the photos for active listings only (i.e. you'd have to cache
the photos yourself if you want to display sold listings for your client).
In the VOW ("Virtual Office Website") program, regulations are such that you
also need to have an enrollment process before visitors are permitted to see
the listings, do an email address verification by sending an account
activation email, etc. etc. etc.
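
To make the decoding and normalization step a bit more concrete, here is a
minimal Python sketch. The column names, the pipe-separated feature field
and the code tables are invented for the example; every MLS publishes its
own layout and its own lookup tables, so treat all of them as placeholders.

```python
import csv
import io

# Hypothetical code tables; a real MLS supplies its own lookup values.
PROPERTY_TYPES = {"SFR": "Single-family home", "CND": "Condominium"}
FEATURES = {"FP": "Fireplace", "GAR": "Garage", "POOL": "Pool"}

# Stand-in for the daily feed file (assumed column names).
SAMPLE_FEED = (
    "MLS_ID,PROP_TYPE,FEATURES,LIST_PRICE\n"
    "1001,SFR,FP|GAR,250000\n"
    "1002,CND,POOL,180000\n"
)


def decode_feed(lines):
    """Yield (listing, features) pairs one row at a time.

    Reading row by row means a 100 MB+ file never has to sit in memory.
    """
    for row in csv.DictReader(lines):
        listing = {
            "mls_id": row["MLS_ID"],
            # Unknown codes fall through unchanged rather than being dropped.
            "property_type": PROPERTY_TYPES.get(row["PROP_TYPE"],
                                                row["PROP_TYPE"]),
            "price": int(row["LIST_PRICE"]),
        }
        # Split the multi-valued field into one (mls_id, feature) row each,
        # ready for a separate child table.
        codes = [code for code in row["FEATURES"].split("|") if code]
        features = [(row["MLS_ID"], FEATURES.get(code, code))
                    for code in codes]
        yield listing, features


decoded = list(decode_feed(io.StringIO(SAMPLE_FEED)))
for listing, features in decoded:
    print(listing, features)
```

With a real feed you would pass an open file instead of the StringIO object
and write each pair to the database as it comes out, rather than collecting
everything in a list.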

Nothing insurmountable, but expect to grind out some code if you go this
route. Alternatively, you may be able to find a third-party service to
handle the listing display entirely, and if your client likes the appearance
(you rarely have choices...), then you can just focus on the rest of the
website.

/// M

"rachel" <rachel@hotmail.com> wrote in message
news:055e01c4f7e7$1b3c4fe0$a601280a@phx.gbl...
> Hello,
>
> I am currently contracted out by a real estate agent. He
> has a page that he has created himself that has a list of
> homes.. their images and data in html format.
>
> He wants me to take this page and reformat it so that it
> looks different.
> Do I use screen scraping to do this?
> Could someone please point me to a good screen scraping
> article... I am using ASP.NET and C#
>
> Thanks,
> Rachel


