Re: Architecture question
- From: "Hilary Cotter" <hilary.cotter@xxxxxxxxx>
- Date: Mon, 3 Jul 2006 07:44:38 -0400
Answers inline.
--
Hilary Cotter
Director of Text Mining and Database Strategy
RelevantNOISE.Com - Dedicated to mining blogs for business intelligence.
This posting is my own and doesn't necessarily represent RelevantNoise's
positions, strategies or opinions.
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"Jeremy" <grand@xxxxxxxxxxx> wrote in message
news:%23iOV2SlnGHA.4728@xxxxxxxxxxxxxxxxxxxxxxx
Hilary, thanks. Yes, I have read the blog article, but the djvu format is
new to me & I'll study that.
Part of my question had to do with the overall architecture of the process
(I know it's not a technical FTS question, but what better forum to ask it
on?). Boiled down, I'm thinking of a process along these lines:
- The fax machine would store incoming images in a windows folder. Should
it store each transmission in a single file (might be a batch of dox, not
just a multi-page doc), or each page?
If it's multiple pages per fax and you search on a word and get a hit on
one page of a fax, how are you going to display all pages for that fax?
Unless you can figure out a way to do this, each file should hold a single
page.
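One possible way around the display problem is to keep one row per page but
tag each row with the fax it came from, so a hit on any page can pull back
the whole fax. The sketch below is only an illustration: the table name
fax_page, its columns, and the sample rows are hypothetical, SQLite stands in
for SQL Server, and LIKE stands in for a full-text CONTAINS predicate.

```python
import sqlite3

# Hypothetical schema: one row per OCR'd page, tagged with its parent fax.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fax_page (fax_id TEXT, page_no INTEGER, ocr_text TEXT, "
    "PRIMARY KEY (fax_id, page_no))"
)
conn.executemany(
    "INSERT INTO fax_page VALUES (?, ?, ?)",
    [
        ("fax-001", 1, "purchase order for Acme Corp"),
        ("fax-001", 2, "ship to warehouse 7"),
        ("fax-002", 1, "invoice for Globex"),
    ],
)


def pages_for_hits(conn, term):
    """Return every page of each fax that has at least one page matching term."""
    return conn.execute(
        "SELECT fax_id, page_no FROM fax_page "
        "WHERE fax_id IN (SELECT fax_id FROM fax_page WHERE ocr_text LIKE ?) "
        "ORDER BY fax_id, page_no",
        (f"%{term}%",),
    ).fetchall()


# The word appears only on page 2, but both pages of fax-001 come back.
result = pages_for_hits(conn, "warehouse")
```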
- An unattended program would monitor the output of the OCR process & suck
the images with (embedded or separate?) text into a SQL table, where
incremental indexing is going on. Up to this point, the incoming stuff has
not been touched or viewed by a human being.
You would probably want the OCR'd text in a text column for faster indexing
speed. Store the binary image data in the file system. If it makes
administration easier, you could store the images in the database instead.
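A minimal sketch of that layout, assuming the OCR step writes each page as
an image plus a matching .txt file in the same folder (page.tif next to
page.txt). The function name load_ocr_output, the table fax_doc, and the
file naming are all assumptions made for illustration, and SQLite again
stands in for SQL Server. Only the text and the image path go into the
table; the image itself stays on disk.

```python
import sqlite3
from pathlib import Path


def load_ocr_output(conn, folder):
    """Insert one row per OCR'd page: text in the table, image left on disk."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fax_doc "
        "(image_path TEXT PRIMARY KEY, ocr_text TEXT)"
    )
    loaded = 0
    for txt in sorted(Path(folder).glob("*.txt")):
        image = txt.with_suffix(".tif")  # assumed naming: page.tif + page.txt
        if not image.exists():
            continue  # OCR text without its image: leave it for a later pass
        cur = conn.execute(
            "INSERT OR IGNORE INTO fax_doc VALUES (?, ?)",
            (str(image), txt.read_text(encoding="utf-8")),
        )
        loaded += cur.rowcount  # 0 when the page was already loaded
    conn.commit()
    return loaded
```

An unattended job could call this on a timer; because already-loaded pages
are ignored, rerunning it over the same folder is harmless.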
- A program looking for variants on customer name will classify the incoming
images by customer & index the dox. A user would handle exceptions, and
hopefully the bulk of the images would be correctly indexed without
intervention.
Use fuzzy grouping or the thesaurus option to handle spelling variations.
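To make the idea concrete, here is a rough sketch of matching an OCR'd name
against a list of known customers, with anything below a similarity
threshold left for a person to handle. It uses Python's difflib as a
stand-in and is not the fuzzy grouping or thesaurus feature itself; the
customer names, the classify function, and the 0.8 threshold are
hypothetical.

```python
from difflib import SequenceMatcher

KNOWN_CUSTOMERS = ["Acme Corporation", "Globex Industries"]  # hypothetical names


def classify(ocr_name, threshold=0.8):
    """Return the best-matching known customer, or None to flag for a human."""

    def score(candidate):
        # Similarity ratio in [0, 1], compared case-insensitively.
        return SequenceMatcher(None, ocr_name.lower(), candidate.lower()).ratio()

    best = max(KNOWN_CUSTOMERS, key=score)
    return best if score(best) >= threshold else None
```

A misread such as "Acme Corporaton" would still resolve to the right
customer, while an unknown name returns None and lands in the exception
queue.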
- Customer specialists will work a list of newly arrived and OCR'd images
for their assigned customer, keying into our system from the images. FTS can
attempt to extract the target data & prefill various fields. The users will
verify & correct.
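As a rough illustration of the prefill step, a pattern-matching pass over
the OCR text could propose field values for the user to verify. Full-text
search finds documents rather than pulling out fields, so this sketch uses
regular expressions instead; the field names and patterns are hypothetical
and would have to be tuned to the real fax layouts.

```python
import re

# Hypothetical field patterns; real faxes need patterns tuned to their layout.
FIELD_PATTERNS = {
    "po_number": re.compile(
        r"P\.?O\.?\s*(?:No\.?|Number|#)?\s*[:\-]?\s*(\d{4,})", re.I
    ),
    "invoice_date": re.compile(
        r"Date\s*[:\-]?\s*(\d{1,2}/\d{1,2}/\d{2,4})", re.I
    ),
}


def prefill(ocr_text):
    """Propose a value for each field, or None when nothing matches."""
    proposed = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        proposed[field] = match.group(1) if match else None
    return proposed
```

Fields that come back as None stay blank on the entry screen, so the user
keys them in from the image as before.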
Any thoughts are appreciated.
Jeremy
"Hilary Cotter" <hilary.cotter@xxxxxxxxx> wrote in message
news:uFOwoWEnGHA.2264@xxxxxxxxxxxxxxxxxxxxxxx
You might want to look at this technology http://www.djvuzone.org/
Office does automatic OCR on TIFFs, which you can receive on your computer.
It works with SQL FTS.
You could also push the PDFs into SQL Server and have them indexed there as
well. The problem is the quality of your OCR.
Here is an article on how to do it.
http://www.indexserverfaq.com/blobs.htm