Re: Index Server and TIF files

From: John Lavender (john.lavender_at_virgin.net)
Date: 08/16/04


Date: Mon, 16 Aug 2004 02:05:48 -0700

Thanks for your help Hilary, "bewildered" just about
describes my feelings too.

Just to conclude, I've managed to reproduce this
behaviour on a production server installation at the
customer site and on my own desktop server, so it's not
machine/environment specific.

I've read Greg's email (thanks Greg) and he does seem to
be aware of problems like this with Index Server.
However, the solution involves additional software, a
change in the file storage format and a new filter. It's
avoiding the problem altogether by eliminating the
Microsoft filter rather than resolving it. If it does
work, and I have to assume it does, then it points the
finger at the Microsoft TIF filter.

Regards,

John.
>-----Original Message-----
>Answers in line:
>BTW - did you review Greg Stobie's post? It seems like he has more
>experience in this matter than I and has a workable solution.
>
>--
>Hilary Cotter
>Looking for a book on SQL Server replication?
>http://www.nwsu.com/0974973602.html
>
>
>"John Lavender" <John.Lavender@Virgin.Net> wrote in message
>news:567901c48111$8b93c180$a501280a@phx.gbl...
>> Hi Hilary,
>>
>> Thanks for your response. Bear with me on this; I don't
>> profess to be accomplished in this area.
>>
>> I accept that the OCR is not a perfect science and the
>> quality of the scans will influence the results. However,
>> I would expect the result to be the same (good or bad) on
>> both catalogues and any discrepancies to be limited.
>>
>> The results were very different. For the same single word
>> search term, the complete catalogue returned 9 hits, the
>> subset catalogue 16 hits. Despite re-indexing numerous
>> times, these figures continued to disagree significantly.
>> Would ranking make any difference with such a limited
>> number of hits?
>
>Absolutely not. There is something else at work here. The catalogs, if they
>are indexing the same underlying files/documents and folders, should return
>the same results if indexing is completed.
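One way to pin down which documents account for a discrepancy like 9 versus 16 hits is to save the hit list from each catalog and diff them. The following is only a minimal Python sketch under stated assumptions: `compare_hits` and its arguments are hypothetical names, and it assumes the hit lists have already been exported by hand as lists of file paths (it does not query Index Server itself).

```python
def compare_hits(main_hits, subset_hits, subset_docs):
    """Compare hit lists for one query run against two catalogs.

    main_hits   -- paths returned by the complete catalog (assumed input)
    subset_hits -- paths returned by the subset catalog (assumed input)
    subset_docs -- every path the subset catalog is supposed to index
    """
    # Windows paths are case-insensitive, so normalise case before comparing.
    main = {p.lower() for p in main_hits}
    subset = {p.lower() for p in subset_hits}
    scope = {p.lower() for p in subset_docs}

    # Only documents covered by both catalogs can be compared fairly.
    main_in_scope = main & scope

    return {
        "missing_from_main": sorted(subset - main_in_scope),
        "missing_from_subset": sorted(main_in_scope - subset),
    }
```

Any path listed under `missing_from_main` is a document the subset catalog found but the complete catalog did not, which makes it a natural candidate to run through Filtdump or to check against the re-indexing queue.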
>
>>
>> I have already conducted tests with Filtdump on the files
>> in question and the search term was amongst those emitted
>> on all of the documents.
>>
>
>Well, bang goes the theory that the quality of scan was the problem.
>
>> I agree, given the results, that indexing has probably
>> not been completed successfully. I have reviewed the
>> registry settings (found on this site) to try to
>> alleviate this, without success. I have been unable to
>> find any log or error message that would confirm a
>> problem during the indexing process.
>
>One thing I have noticed is that sometimes applications will hold on to
>files. For instance, if I save a Word doc and exit Word, sometimes Word
>remains in the task list and if I go to do something to that document, I get
>a file in use error message. Check Task Manager to ensure that there are no
>applications hanging on to that file. Antivirus software and open file
>agents for backup software are notorious for doing this. Make sure also that
>IE and any imaging programs are closed down.
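The "file in use" check can also be scripted across a whole folder of TIF files rather than done by eye in Task Manager. This is a rough Python sketch, not a definitive test: `find_locked_files` is a hypothetical helper, and it assumes that a file held by another process without write sharing will refuse to open for read/write on Windows. Missing and read-only files are reported too, so the error name is returned alongside each path.

```python
import os


def find_locked_files(paths):
    """Return (path, error_name) pairs for files that cannot be opened
    for reading and writing."""
    problems = []
    for path in paths:
        try:
            # Open read/write without truncating or creating the file.
            # On Windows this typically raises PermissionError when another
            # process holds the file without sharing write access.
            fd = os.open(path, os.O_RDWR)
        except OSError as exc:
            problems.append((path, type(exc).__name__))
        else:
            os.close(fd)
    return problems
```

A `PermissionError` entry for a file that is not marked read-only would suggest another process is holding it, which matches the antivirus and backup-agent scenario described above.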
>
>>
>> Those documents that were marked for re-indexing were
>> never actually re-indexed and sat permanently in that
>> queue. Only one or two of these files were candidate
>> documents in the search discrepancies.
>
>I confess that I am bewildered by your problems. I would try the following:
>
>1) reboot and see if this helps
>2) try to repro the problem on another machine
>3) call PSS.
>
>Good luck.
>
>>
>> Regards,
>>
>> John.
>>
>> >-----Original Message-----
>> >When you are running into problems like this you really should open a
>> >support incident with Microsoft. The $245US per call will probably pay off
>> >in the long run, especially when you consider the cost of some of the
>> >alternatives, e.g. the Google Search Appliance, which last time I checked was
>> >$35k.
>> >
>> >The problem with TIFFs is that the OCR does take some time, OCR is not a
>> >perfect science, and depending on the quality of your TIFF, the emitted text
>> >might not match the actual content.
>> >
>> >To test this you really should download filtdump from the Platform SDK and
>> >run some tests.
>> >
>> >If you are using a new catalog, you should see the same results, but ranking
>> >might be different as ranking is skewed by the number of documents you are
>> >indexing and the density of words in your content, the distribution of
>> >words, and the density of hits.
>> >
>> >The fact that you didn't means that indexing was possibly not complete, and
>> >you may have had another process which "touched" these TIFFs and marked
>> >them for re-indexing.
>> >--
>> >Hilary Cotter
>> >Looking for a book on SQL Server replication?
>> >http://www.nwsu.com/0974973602.html
>> >
>> >
>> >"John Lavender" <john.lavender@virgin.net> wrote in message
>> >news:4b2c01c48073$99520460$a301280a@phx.gbl...
>> >> I've submitted the query below twice now without any
>> >> response from Microsoft staff on this newsgroup.
>> >>
>> >> I can only conclude their silence means that the
>> >> combination of the Microsoft Office Professional TIF
>> >> filter and Index Server is not suitable for production
>> >> use. We are now working on an alternative solution.
>> >>
>> >> It would have saved a great deal of effort if this could
>> >> have been pointed out earlier by specialists monitoring
>> >> this site.
>> >>
>> >> -------------------------------------------------------
>> >>
>> >> Hi,
>> >>
>> >> I've posted a general query on this recently. Hopefully
>> >> with more detail someone can help.
>> >>
>> >> We use Index Server to index 60,000 TIF files in a single
>> >> catalog. We have installed Office XP Imaging to index the
>> >> TIF file content. The catalogue build is slow, but
>> >> the 'total docs' eventually reflects that expected and
>> >> the 'Docs to Index' reduces to zero.
>> >>
>> >> Searches will return documents, but further testing with
>> >> the Query Catalog interface has shown that the number of
>> >> documents returned is inconsistent or incomplete.
>> >>
>> >> Indexing a subset of the main catalog in a separate
>> >> catalog proved this. Different totals are returned from
>> >> the main and subset catalogs for the same search when
>> >> the same documents should have been returned. The same
>> >> search will return different totals if tried over two
>> >> sessions, regardless of which catalog is used.
>> >>
>> >> Additionally, some filtered documents are seemingly
>> >> returned to the unfiltered list for no apparent reason.
>> >> These document files are never updated.
>> >>
>> >> I'd be extremely grateful for any help from the Microsoft
>> >> reps or any other members.
>> >>
>> >> Regards,
>> >>
>> >> John Lavender.

