RE: Crawling a blog--How to set up the include/exclude rules
From: Hollis_Paul_mvp (Hollis_Paul_mvp_at_noemail.nospam)
Date: 03/24/05
- Next message: David Wei: "How to get area object based on it's URL?"
- Previous message: dbsearch04_at_yahoo.com: "Re: webpart debugging."
- In reply to: Wei-Dong XU [MSFT]: "RE: Crawling a blog--How to set up the include/exclude rules"
- Next in thread: Hollis_Paul_mvp: "RE: Crawling a blog--How to set up the include/exclude rules"
- Messages sorted by: [ date ] [ thread ]
Date: Wed, 23 Mar 2005 16:39:04 -0800
Wrong conclusion!! Specifically in light of the initial statement that both
Google and MSN were searching the blog.
So, to try something different, I deleted the source and re-created it as
http://www.msmvps.com/OBTS/ . This is the same as the original definition
except there is a final slash.
Then I changed my rules to:
www.msmvps.com included
http://www.msmvps.com/obts/* included
http://www.msmvps.com/obts/archive/*/*/*/*.aspx included
http://www.msmvps.com/* excluded
http://www.msmvps.com/obts/archive/*.aspx excluded
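To reason about what these rules can and cannot cover, here is a rough sketch that tests a URL against the four path rules in the order listed. It rests on assumptions: that the first matching rule wins, that matching is case-insensitive, and that `*` matches any run of characters. None of that is confirmed SPS behavior. The function name `rule_verdict` and the first test URL are made up for illustration, and the site-level entry (www.msmvps.com) is left out.

```python
from fnmatch import fnmatchcase

# Path rules copied from the list above, in the order given.
# True = included, False = excluded.
RULES = [
    ("http://www.msmvps.com/obts/*", True),
    ("http://www.msmvps.com/obts/archive/*/*/*/*.aspx", True),
    ("http://www.msmvps.com/*", False),
    ("http://www.msmvps.com/obts/archive/*.aspx", False),
]


def rule_verdict(url, rules=RULES):
    """Return True/False from the first rule matching url, or None if none match.

    First-match-wins and case-insensitive matching are assumptions made
    for this sketch, not documented SPS behavior.
    """
    lowered = url.lower()
    for pattern, include in rules:
        if fnmatchcase(lowered, pattern.lower()):
            return include
    return None


if __name__ == "__main__":
    for url in (
        "http://www.msmvps.com/OBTS/archive/2005/03/23/1.aspx",  # hypothetical post URL
        "http://www.msmvps.com/other/",
        "http://msmvps.com/obts/archive/2004/12.aspx",  # host without www, as in the log
        "http://www.datalan.com/",
    ):
        print(url, "->", rule_verdict(url))
```

Under these assumptions, addresses on other hosts (and on msmvps.com without the www) match none of the rules, so whether they get crawled would depend on the crawl settings rather than on this list.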
That didn't get me any pages, so I went into the source properties,
unchecked the "crawl all pages on this site" option, changed that to custom
selection, and put in a page depth of 10.
That didn't help, so I changed the hop limit to 2. Below you will see the
end of the gatherer log before I managed to uncheck all the logging options.
As you can see, it is going all over. It kept indexing, and the page count
had grown to 7092 by the time the stop took effect.
So, how should those two parameters (page depth and hop limit) be set to
restrict the crawl to just the pages on my blog?
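To make the question concrete, here is a toy model of how a page-depth limit and a hop limit might interact. It is only a sketch under assumptions: depth is taken to count link clicks since the crawl last changed host, and hops to count how many times the host changed on the way to a page. I have not confirmed that SPS defines them this way. The `crawl` function, the link graph, and the example.* hosts are all invented.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl(start, links, max_depth, max_hops):
    """Breadth-first walk of a link graph under depth and hop limits.

    links maps a URL to the URLs it links to (a stand-in for fetching pages).
    The meaning given to depth and hops here is an assumption of this sketch.
    """
    seen = {start}
    queue = deque([(start, 0, 0)])
    visited = []
    while queue:
        url, depth, hops = queue.popleft()
        visited.append(url)
        for target in links.get(url, []):
            if target in seen:
                continue
            same_host = urlsplit(target).hostname == urlsplit(url).hostname
            next_depth = depth + 1 if same_host else 0
            next_hops = hops if same_host else hops + 1
            if next_depth > max_depth or next_hops > max_hops:
                continue  # outside the configured limits
            seen.add(target)
            queue.append((target, next_depth, next_hops))
    return visited


# Invented link graph: the blog links to one of its own pages and to an
# outside site, which in turn links further out.
LINKS = {
    "http://www.msmvps.com/obts/": [
        "http://www.msmvps.com/obts/archive/2004/12.aspx",
        "http://blog.example.com/",
    ],
    "http://blog.example.com/": ["http://other.example.org/"],
    "http://other.example.org/": ["http://third.example.net/"],
}

if __name__ == "__main__":
    start = "http://www.msmvps.com/obts/"
    print(crawl(start, LINKS, max_depth=10, max_hops=2))
    print(crawl(start, LINKS, max_depth=10, max_hops=0))
```

If the real settings behave anything like this model, a hop limit of 2 lets the crawl reach sites two links away from the blog, while a hop limit of 0 keeps it on the starting host.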
Gatherer log at end of logging:
3/23/2005 3:36:57 PM Add http://www.datalan.com
The address has been redirected to http://www.datalan.com/
3/23/2005 3:36:57 PM Add http://www.datalan.com
Done
3/23/2005 3:36:57 PM Add http://www.parallelspace.com
The address has been redirected to http://www.parallelspace.com/
3/23/2005 3:36:57 PM Add http://www.parallelspace.com
Done (The document contains invalid utf-8 encoded characters)
3/23/2005 3:36:57 PM Add
http://blog.u2u.info/DottextWeb/patrick/archive/2004/10.aspx
Done
3/23/2005 3:36:57 PM Add
http://blog.seattlepi.nwsource.com/microsoft/archives/004519.html
Done
3/23/2005 3:36:57 PM Add
http://support.microsoft.com/Personalization/MyProducts.aspx
Links from this address were excluded because the page contains a META
NAME="ROBOTS" tag
3/23/2005 3:36:57 PM Add
http://support.microsoft.com/Personalization/MyProducts.aspx
The address has been redirected to
http://support.microsoft.com/gp/nolinks
3/23/2005 3:36:57 PM Add
http://support.microsoft.com/Personalization/MyProducts.aspx
Content for this URL is excluded by the server because a no-index
attribute.
3/23/2005 3:36:57 PM Add http://msmvps.com/obts/archive/2004/12.aspx
Done
3/23/2005 3:36:57 PM Add
http://support.microsoft.com/personalization/MySupportFavorites.aspx
Links from this address were excluded because the page contains a META
NAME="ROBOTS" tag
3/23/2005 3:36:57 PM Add
http://support.microsoft.com/personalization/MySupportFavorites.aspx
The address has been redirected to
http://support.microsoft.com/gp/nolinks
3/23/2005 3:36:57 PM Add
http://support.microsoft.com/personalization/MySupportFavorites.aspx
Content for this URL is excluded by the server because a no-index
attribute.
3/23/2005 3:36:57 PM Add
http://blog.seattlepi.nwsource.com/microsoft/archives/004508.html
Done
3/23/2005 3:36:57 PM Add http://www.kdkeys.net/forums/TopicsNotAnswered.aspx
Done
3/23/2005 3:36:57 PM Add http://www.tzunami.com/careers.htm
Done
3/23/2005 3:36:57 PM Add http://www.sharepointsolutions.com
The address has been redirected to http://www.sharepointsolutions.com/
3/23/2005 3:36:57 PM Add http://www.sharepointsolutions.com
Done
3/23/2005 3:36:57 PM Add
http://google.blognewschannel.com/index.php/archives/category/general/
Done
3/23/2005 3:36:57 PM Add http://www.enderminh.com/webservices
The address has been redirected to
http://www.enderminh.com/webservices/
3/23/2005 3:36:57 PM Add http://www.enderminh.com/webservices
Done
3/23/2005 3:36:57 PM Add
http://www.realdn.net/msblog/default,date,2005-03-16.aspx
Done
3/23/2005 3:36:57 PM Add
http://microsoft.weblogsinc.com/forward/entry/1234000510037137/
Done
3/23/2005 3:36:57 PM Add
http://www.veritest.com/certification/verified/default.asp
Done
3/23/2005 3:36:56 PM Add
http://www.visitorville.com/counter/count.php?ProfileID=3289
The address was excluded because crawl depth restrictions were exceeded
3/23/2005 3:36:56 PM Add http://www.visitorville.com/top/?profile_id=3289
The address was excluded because crawl depth restrictions were exceeded
3/23/2005 3:36:56 PM Add
http://www.visitorville.com/js/plgtrafic.js.php?ProfileID=3289
The address was excluded because crawl depth restrictions were exceeded
3/23/2005 3:36:56 PM Add http://rpc.bloglines.com/blogroll?id=montevino
The address was excluded because crawl depth restrictions were exceeded
3/23/2005 3:36:56 PM Add http://www.FreeiPodShuffle.com/?r=16393306
The address was excluded because crawl depth restrictions were exceeded
3/23/2005 3:36:56 PM Add http://www.FreeMiniMacs.com/?r=14072206
The address was excluded because crawl depth restrictions were exceeded
3/23/2005 3:36:56 PM Add http://img55.exs.cx/img55/8309/basic_vert.gif
The address was excluded because crawl depth restrictions were exceeded
3/23/2005 3:36:56 PM Add http://www.marqui.com/Paybloggers/
The address was excluded because crawl depth restrictions were exceeded
3/23/2005 3:36:56 PM Add
http://proxy.blogads.com/npoufwjopipunbjmdpn/insidegoogleinsidemicrosoft/feed.js
The address was excluded because crawl depth restrictions were exceeded
3/23/2005 3:36:56 PM Add
http://proxy.blogads.com/npoufwjopipunbjmdpn/insidegoogleinsidemicrosoft/ba_as.css
The address was excluded because crawl depth restrictions were exceeded
3/23/2005 3:36:56 PM Add http://wordpress.org/
The address was excluded because crawl depth restrictions were exceeded
3/23/2005 3:36:56 PM Add http://validator.w3.org/check/referer
The address was excluded because crawl depth restrictions were exceeded
3/23/2005 3:36:56 PM Add http://google.blognewschannel.com/wp-register.php
The address was excluded because its file extension is restricted in
the file type rules.
3/23/2005 3:36:56 PM Add http://google.blognewschannel.com/wp-login.php
The address was excluded because its file extension is restricted in
the file type rules.
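A log like this is easier to judge when tallied by host. The sketch below assumes only the layout visible above: a timestamp line ending in "Add", the address either on that line or wrapped onto the next, then one or more status lines. The names `parse_entries` and `hosts_crawled` are made up, and the sample is a handful of entries copied from the log.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Timestamp line, e.g. "3/23/2005 3:36:57 PM Add http://..." (address may be absent).
ENTRY = re.compile(r"^\d+/\d+/\d+ \d+:\d+:\d+ [AP]M Add\s*(\S*)$")


def parse_entries(log_text):
    """Return (url, status) pairs from gatherer-log text shaped like the above."""
    entries = []
    current = None  # [url, status_lines] for the entry being read
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match:
            if current is not None:
                entries.append((current[0], " ".join(current[1])))
            current = [match.group(1), []]
        elif current is not None:
            if not current[0]:
                current[0] = line  # address wrapped onto the next line
            else:
                current[1].append(line)
    if current is not None:
        entries.append((current[0], " ".join(current[1])))
    return entries


def hosts_crawled(entries):
    """Count entries whose status starts with 'Done', grouped by host."""
    return Counter(
        urlsplit(url).hostname for url, status in entries if status.startswith("Done")
    )


SAMPLE = """\
3/23/2005 3:36:57 PM Add http://www.datalan.com
The address has been redirected to http://www.datalan.com/
3/23/2005 3:36:57 PM Add http://www.datalan.com
Done
3/23/2005 3:36:57 PM Add
http://blog.u2u.info/DottextWeb/patrick/archive/2004/10.aspx
Done
3/23/2005 3:36:57 PM Add http://msmvps.com/obts/archive/2004/12.aspx
Done
3/23/2005 3:36:56 PM Add http://wordpress.org/
The address was excluded because crawl depth restrictions were exceeded
"""

if __name__ == "__main__":
    for host, count in hosts_crawled(parse_entries(SAMPLE)).most_common():
        print(count, host)
```

Running the same tally over the full log would show how many of the indexed pages are on hosts other than the blog's.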
Hollis_Paul_mvp
"Wei-Dong XU [MSFT]" wrote:
> Hi Hollis,
>
> From your log and my test, SPS search returns the error message:
> "The address could not be found, (0x80041201 - The object was not found. ) "
> My log:
> " The address could not be found, (0x80041208 - The address appears to be
> in error. Check that the address is valid. )"
>
> It appears that the remote server networking blocks the Sharepoint
> crawling. I don't know whether the remote server uses the F5 Networks
> Big-IP Load Balancer. There is one kb article introducing how to configure
> the load balancer to allow the crawling, which may be helpful for the
> msmvps.com server.
> 889652 Configuring the F5 Networks Big-IP Load Balancer to allow successful
> http://support.microsoft.com/?id=889652
>
> Please feel free to let me know if you have any question.
>
> Best Regards,
> Wei-Dong XU
> Microsoft Product Support Services
>
> When responding to posts, please "Reply to Group" via your newsreader so
> that others may learn and benefit from your issue.
> =====================================================
> Business-Critical Phone Support (BCPS) provides you with technical phone
> support at no charge during critical LAN outages or "business down"
> situations. This benefit is available 24 hours a day, 7 days a week to all
> Microsoft technology partners in the United States and Canada.
> This and other support options are available here:
> BCPS:
> https://partner.microsoft.com/US/technicalsupport/supportoverview/40010469
> Others: https://partner.microsoft.com/US/technicalsupport/supportoverview/
>
> If you are outside the United States, please visit our International
> Support page:
> http://support.microsoft.com/default.aspx?scid=%2finternational.aspx.
> =====================================================
> Get Secure! - www.microsoft.com/security
> This posting is provided "AS IS" with no warranties, and confers no rights