RE: Crawling a blog--How to set up the include/exclude rules

From: Hollis_Paul_mvp (Hollis_Paul_mvp_at_noemail.nospam)
Date: 03/24/05


Date: Wed, 23 Mar 2005 16:39:04 -0800

Wrong conclusion!! Specifically in light of the initial statement that both
Google and MSN were searching the blog.

So, to try something different, I deleted the content source and re-created
it as http://www.msmvps.com/OBTS/ . This is the same as the original
definition except that it now ends with a trailing slash.

Then I changed my rules to:
www.msmvps.com                                     included
http://www.msmvps.com/obts/*                       included
http://www.msmvps.com/obts/archive/*/*/*/*.aspx    included
http://www.msmvps.com/*                            excluded
http://www.msmvps.com/obts/archive/*.aspx          excluded
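
To see which of the path rules above a given address would hit, here is a
minimal Python sketch. It is only a model: it ASSUMES that rules are tried
top to bottom, that the first match wins, that * matches any run of
characters including slashes, and that matching ignores case. None of that
is confirmed for the SPS gatherer. The site-level entry (www.msmvps.com) is
left out, and decide() is a hypothetical helper name.

```python
from fnmatch import fnmatchcase

# Path rules in the order listed in the post: (pattern, action).
# ASSUMPTION: first matching rule wins and matching ignores case.
RULES = [
    ("http://www.msmvps.com/obts/*", "include"),
    ("http://www.msmvps.com/obts/archive/*/*/*/*.aspx", "include"),
    ("http://www.msmvps.com/*", "exclude"),
    ("http://www.msmvps.com/obts/archive/*.aspx", "exclude"),
]


def decide(url, rules=RULES, default="no rule"):
    """Return the action of the first rule whose pattern matches url."""
    lowered = url.lower()
    for pattern, action in rules:
        # fnmatchcase also treats ? and [ ] as wildcards, so this sketch
        # is only suitable for URLs without those characters.
        if fnmatchcase(lowered, pattern.lower()):
            return action
    return default


for url in [
    "http://www.msmvps.com/OBTS/",
    "http://www.msmvps.com/obts/archive/2004/12.aspx",
    "http://msmvps.com/obts/archive/2004/12.aspx",  # host form seen in the log
    "http://www.datalan.com/",
]:
    print(f"{decide(url):8}  {url}")
```

Under these assumptions the last exclude rule is never reached, because the
first include pattern already matches everything under /obts/. The address
without the www. prefix and the outside address match none of the path
rules at all.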

That didn't get me any pages, so I went into the source properties,
unchecked "crawl all pages on this site", changed that to custom selection,
and put in a page depth of 10.

That didn't help, so I changed the hop limit to 2. Below you will see the
end of the gatherer log, captured before I managed to uncheck all the
logging options. As you can see, it is going all over. It was still
indexing, and the page count had grown to 7092 by the time the stop took
effect.

So, what should those two parameters be set to so that the crawl is
restricted to just the pages on my blog?
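
For anyone reasoning about the two limits, here is a toy Python model of a
crawler with a page depth limit and a hop limit. It is a sketch under
stated assumptions, not the gatherer's actual algorithm: it ASSUMES that a
link to a different host costs one hop and resets the depth counter, and
that a link on the same host costs one level of depth. The link graph and
the example.com / example.org hosts are made up for illustration.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph; none of these links are taken from the post.
LINKS = {
    "http://www.msmvps.com/obts/": [
        "http://www.msmvps.com/obts/archive/2004/12.aspx",
        "http://blog.example.com/",
    ],
    "http://www.msmvps.com/obts/archive/2004/12.aspx": [],
    "http://blog.example.com/": ["http://other.example.org/"],
    "http://other.example.org/": [],
}


def crawl(start, page_depth, site_hops):
    """Breadth-first walk that skips links beyond either limit."""
    seen = {start}
    queue = deque([(start, 0, 0)])  # (url, depth within host, hops so far)
    while queue:
        url, depth, hops = queue.popleft()
        for link in LINKS.get(url, []):
            same_host = urlsplit(link).hostname == urlsplit(url).hostname
            # ASSUMPTION: changing host costs a hop and resets the depth.
            next_depth = depth + 1 if same_host else 0
            next_hops = hops if same_host else hops + 1
            if next_depth > page_depth or next_hops > site_hops or link in seen:
                continue
            seen.add(link)
            queue.append((link, next_depth, next_hops))
    return sorted(seen)


for hops in (0, 1, 2):
    print(hops, crawl("http://www.msmvps.com/obts/", page_depth=10, site_hops=hops))
```

In this model only a hop limit of 0 keeps the walk on the starting host,
while a hop limit of 2 reaches every page in the toy graph. Whether the
product counts hops the same way is exactly the open question.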

Gatherer log at end of logging:

3/23/2005 3:36:57 PM Add http://www.datalan.com
      The address has been redirected to http://www.datalan.com/
 
 3/23/2005 3:36:57 PM Add http://www.datalan.com
      Done
 
 3/23/2005 3:36:57 PM Add http://www.parallelspace.com
      The address has been redirected to http://www.parallelspace.com/
 
 3/23/2005 3:36:57 PM Add http://www.parallelspace.com
      Done (The document contains invalid utf-8 encoded characters)
 
 3/23/2005 3:36:57 PM Add
http://blog.u2u.info/DottextWeb/patrick/archive/2004/10.aspx
      Done
 
 3/23/2005 3:36:57 PM Add
http://blog.seattlepi.nwsource.com/microsoft/archives/004519.html
      Done
 
 3/23/2005 3:36:57 PM Add
http://support.microsoft.com/Personalization/MyProducts.aspx
      Links from this address were excluded because the page contains a META
NAME="ROBOTS" tag
 
 3/23/2005 3:36:57 PM Add
http://support.microsoft.com/Personalization/MyProducts.aspx
      The address has been redirected to
http://support.microsoft.com/gp/nolinks
 
 3/23/2005 3:36:57 PM Add
http://support.microsoft.com/Personalization/MyProducts.aspx
      Content for this URL is excluded by the server because a no-index
attribute.
 
 3/23/2005 3:36:57 PM Add http://msmvps.com/obts/archive/2004/12.aspx
      Done
 
 3/23/2005 3:36:57 PM Add
http://support.microsoft.com/personalization/MySupportFavorites.aspx
      Links from this address were excluded because the page contains a META
NAME="ROBOTS" tag
 
 3/23/2005 3:36:57 PM Add
http://support.microsoft.com/personalization/MySupportFavorites.aspx
      The address has been redirected to
http://support.microsoft.com/gp/nolinks
 
 3/23/2005 3:36:57 PM Add
http://support.microsoft.com/personalization/MySupportFavorites.aspx
      Content for this URL is excluded by the server because a no-index
attribute.
 
 3/23/2005 3:36:57 PM Add
http://blog.seattlepi.nwsource.com/microsoft/archives/004508.html
      Done
 
 3/23/2005 3:36:57 PM Add http://www.kdkeys.net/forums/TopicsNotAnswered.aspx
      Done
 
 3/23/2005 3:36:57 PM Add http://www.tzunami.com/careers.htm
      Done
 
 3/23/2005 3:36:57 PM Add http://www.sharepointsolutions.com
      The address has been redirected to http://www.sharepointsolutions.com/
 
 3/23/2005 3:36:57 PM Add http://www.sharepointsolutions.com
      Done
 
 3/23/2005 3:36:57 PM Add
http://google.blognewschannel.com/index.php/archives/category/general/
      Done
 
 3/23/2005 3:36:57 PM Add http://www.enderminh.com/webservices
      The address has been redirected to
http://www.enderminh.com/webservices/
 
 3/23/2005 3:36:57 PM Add http://www.enderminh.com/webservices
      Done
 
 3/23/2005 3:36:57 PM Add
http://www.realdn.net/msblog/default,date,2005-03-16.aspx
      Done
 
 3/23/2005 3:36:57 PM Add
http://microsoft.weblogsinc.com/forward/entry/1234000510037137/
      Done
 
 3/23/2005 3:36:57 PM Add
http://www.veritest.com/certification/verified/default.asp
      Done
 
 3/23/2005 3:36:56 PM Add
http://www.visitorville.com/counter/count.php?ProfileID=3289
      The address was excluded because crawl depth restrictions were exceeded
 
 3/23/2005 3:36:56 PM Add http://www.visitorville.com/top/?profile_id=3289
      The address was excluded because crawl depth restrictions were exceeded
 
 3/23/2005 3:36:56 PM Add
http://www.visitorville.com/js/plgtrafic.js.php?ProfileID=3289
      The address was excluded because crawl depth restrictions were exceeded
 
 3/23/2005 3:36:56 PM Add http://rpc.bloglines.com/blogroll?id=montevino
      The address was excluded because crawl depth restrictions were exceeded
 
 3/23/2005 3:36:56 PM Add http://www.FreeiPodShuffle.com/?r=16393306
      The address was excluded because crawl depth restrictions were exceeded
 
 3/23/2005 3:36:56 PM Add http://www.FreeMiniMacs.com/?r=14072206
      The address was excluded because crawl depth restrictions were exceeded
 
 3/23/2005 3:36:56 PM Add http://img55.exs.cx/img55/8309/basic_vert.gif
      The address was excluded because crawl depth restrictions were exceeded
 
 3/23/2005 3:36:56 PM Add http://www.marqui.com/Paybloggers/
      The address was excluded because crawl depth restrictions were exceeded
 
 3/23/2005 3:36:56 PM Add
http://proxy.blogads.com/npoufwjopipunbjmdpn/insidegoogleinsidemicrosoft/feed.js
      The address was excluded because crawl depth restrictions were exceeded
 
 3/23/2005 3:36:56 PM Add
http://proxy.blogads.com/npoufwjopipunbjmdpn/insidegoogleinsidemicrosoft/ba_as.css
      The address was excluded because crawl depth restrictions were exceeded
 
 3/23/2005 3:36:56 PM Add http://wordpress.org/
      The address was excluded because crawl depth restrictions were exceeded
 
 3/23/2005 3:36:56 PM Add http://validator.w3.org/check/referer
      The address was excluded because crawl depth restrictions were exceeded
 
 3/23/2005 3:36:56 PM Add http://google.blognewschannel.com/wp-register.php
      The address was excluded because its file extension is restricted in
the file type rules.
 
 3/23/2005 3:36:56 PM Add http://google.blognewschannel.com/wp-login.php
      The address was excluded because its file extension is restricted in
the file type rules.
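
A pasted log like the one above is easier to judge once it is tallied by
host. The Python sketch below counts distinct "Add" addresses per host. It
ASSUMES the layout seen in this excerpt: a timestamp line ending in "Add",
with the address either on the same line or wrapped onto the next one. The
function name hosts_in_log is hypothetical, and SAMPLE is a short extract
of the entries above.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Timestamp line such as " 3/23/2005 3:36:57 PM Add <url>"; the URL may be
# missing here when the newsreader wrapped it onto the following line.
ENTRY = re.compile(
    r"^\s*\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M Add\s*(\S*)"
)


def hosts_in_log(text):
    """Count distinct URLs per host among the 'Add' entries of a pasted log."""
    urls = set()
    expect_url = False
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            url = match.group(1)
            if url:
                urls.add(url)
            expect_url = not url  # URL wrapped onto the next line
        elif expect_url and line.strip():
            urls.add(line.strip())
            expect_url = False
    return Counter(urlsplit(u).hostname for u in urls)


SAMPLE = """\
3/23/2005 3:36:57 PM Add http://www.datalan.com
      The address has been redirected to http://www.datalan.com/

 3/23/2005 3:36:57 PM Add
http://support.microsoft.com/Personalization/MyProducts.aspx
      Links from this address were excluded because the page contains a META
NAME="ROBOTS" tag

 3/23/2005 3:36:57 PM Add http://msmvps.com/obts/archive/2004/12.aspx
      Done
"""

for host, count in hosts_in_log(SAMPLE).most_common():
    print(count, host)
```

Run against the full log, a tally like this shows at a glance how many of
the crawled addresses are on the blog's own host and how many are not.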

Hollis_Paul_mvp

"Wei-Dong XU [MSFT]" wrote:

> Hi Hollis,
>
> From your log and my test, SPS search returns the error message:
> "The address could not be found, (0x80041201 - The object was not found. ) "
> My log:
> " The address could not be found, (0x80041208 - The address appears to be
> in error. Check that the address is valid. )"
>
> It appears that the remote server networking blocks the Sharepoint
> crawling. I don't know whether the remote server uses the F5 Networks
> Big-IP Load Balancer. There is one kb article introducing how to configure
> the load balancer to allow the crawling, which may be helpful for the
> msmvps.com server.
> 889652 Configuring the F5 Networks Big-IP Load Balancer to allow successful
> http://support.microsoft.com/?id=889652
>
> Please feel free to let me know if you have any question.
>
> Best Regards,
> Wei-Dong XU
> Microsoft Product Support Services
>
> When responding to posts, please "Reply to Group" via your newsreader so
> that others may learn and benefit from your issue.
> =====================================================
> Business-Critical Phone Support (BCPS) provides you with technical phone
> support at no charge during critical LAN outages or "business down"
> situations. This benefit is available 24 hours a day, 7 days a week to all
> Microsoft technology partners in the United States and Canada.
> This and other support options are available here:
> BCPS:
> https://partner.microsoft.com/US/technicalsupport/supportoverview/40010469
> Others: https://partner.microsoft.com/US/technicalsupport/supportoverview/
>
> If you are outside the United States, please visit our International
> Support page:
> http://support.microsoft.com/default.aspx?scid=%2finternational.aspx.
> =====================================================
> Get Secure! - www.microsoft.com/security
> This posting is provided "AS IS" with no warranties, and confers no rights


