Re: Convert doc, xls, pdf, rtf to plain Text (string) - I will pay
- From: Frank Uray <FrankUray@xxxxxxxxxxxxxxxxxxxxxxxxx>
- Date: Fri, 12 Dec 2008 14:28:01 -0800
Hi Pete
Thanks for your answer.
The functionality is easy; I do not care about
formatting or structure. I just need all the text,
with the spaces between the words, out of these files.
For example, I would expect the text above to come out as:
"Hi Pete Thanks for your answer. The functionality is ..."
I know that .csv/.txt/.rtf are simple, and I can handle those
on my own. The ones I cannot do myself are .pdf and .doc.
If somebody can provide all this (including extracting text
from pictures), I am prepared to pay $ 5'000.-.
I need an offer for the finished solution, not a price
per hour or day.
Thanks and best regards
Frank Uray
"Peter Duniho" wrote:
On Fri, 12 Dec 2008 04:37:00 -0800, Frank Uray
<FrankUray@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
I am looking for a C# class to read all text information
(including spaces) out of files.
I need a C# class with just one function like this:
public string GetTextFromFile(string local_PathFile, string local_FileType)
I will call this function like:
string pdfText = GetTextFromFile(@"C:\temp\somePDF.pdf", "PDF");
string docText = GetTextFromFile(@"C:\temp\someDOC.doc", "DOC");
The File Types are: .pdf, .doc, .xls, .txt, .csv, .rtf
The text in these files needs to be read without using COM.
There will be no Office or Acrobat Reader installed.
Please offer what this solution would cost.
That depends on exactly what functionality you want and who you're paying.
Are you looking for code that can successfully extract structured text
from documents that support text contents? Or simply some code that would
return anything in a file that looks like text, without making any attempt
to recover the structure of that text? If the latter, what's your
definition of "looks like text"?
A traditional solution to the latter problem includes defining "looks like
text" as any "significant" (e.g. more than N characters, where N is small)
string of bytes where every byte is less than 128. An experienced
programmer could probably solve that problem in a day or two, even
allowing for making the delivered code tidy and tested.
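
As a rough illustration of that byte-run approach, here is a minimal C# sketch. The class name AsciiRunExtractor, the method names, and the default of N = 4 are all assumptions made for the example, not anything from the thread. It is also slightly stricter than "every byte is less than 128": it accepts only printable ASCII and tab, on the assumption that control bytes should not count as text. Keep in mind that a heuristic like this only sees text stored as plain bytes; PDF content streams, for instance, are commonly compressed and would be missed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class AsciiRunExtractor
{
    // Returns every run of at least minRun "text" bytes, joined by single spaces.
    // minRun plays the role of N in the description above (4 is an arbitrary default).
    public static string Extract(byte[] data, int minRun = 4)
    {
        var runs = new List<string>();
        var current = new StringBuilder();

        foreach (byte b in data)
        {
            // Assumption: printable ASCII (space through tilde) plus tab is "text".
            bool isText = (b >= 0x20 && b <= 0x7E) || b == (byte)'\t';
            if (isText)
            {
                current.Append((char)b);
            }
            else
            {
                Flush(current, runs, minRun);
            }
        }
        Flush(current, runs, minRun); // do not lose a run that ends at end of file
        return string.Join(" ", runs.ToArray());
    }

    // Keeps the pending run if it is long enough, then resets the buffer.
    private static void Flush(StringBuilder current, List<string> runs, int minRun)
    {
        if (current.Length >= minRun)
        {
            string s = current.ToString().Trim();
            if (s.Length > 0) runs.Add(s);
        }
        current.Length = 0;
    }

    public static string ExtractFromFile(string path, int minRun = 4)
    {
        return Extract(File.ReadAllBytes(path), minRun);
    }
}
```

Calling AsciiRunExtractor.ExtractFromFile(@"C:\temp\someDOC.doc") would return whatever plain-ASCII runs happen to be in the file, which may include internal names and other non-content strings alongside the real text.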
A more complex solution to the latter problem would be to support Unicode
or other encodings and rely on a dictionary to identify strings that are
actual words. This would be both more robust and more fragile; it would
be more robust because it could support arbitrary encodings, and could
eliminate strings that are actually not language text (such as you
might find in RTF). It would be more fragile because, of course, if there
is real text in the file that includes words not found in your dictionary,
you might miss that text (you'd include logic to include text not found in
the dictionary as long as matched text is still nearby, but that's not
going to catch everything).
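
The "keep unmatched text as long as matched text is nearby" idea could be sketched roughly as follows. This is only one possible reading of it: the class name DictionaryFilter, the window parameter, and the punctuation-stripping in Normalize are assumptions for the example, and it presumes the candidate strings have already been decoded and split into tokens.

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryFilter
{
    // Keeps a token if it is in the dictionary, or if some token within
    // `window` positions of it is in the dictionary.
    public static List<string> Filter(IList<string> tokens, ISet<string> dictionary, int window = 2)
    {
        // First pass: mark which tokens are known words.
        var known = new bool[tokens.Count];
        for (int i = 0; i < tokens.Count; i++)
        {
            known[i] = dictionary.Contains(Normalize(tokens[i]));
        }

        // Second pass: keep known tokens and unknown tokens near a known one.
        var kept = new List<string>();
        for (int i = 0; i < tokens.Count; i++)
        {
            bool keep = known[i];
            int lo = Math.Max(0, i - window);
            int hi = Math.Min(tokens.Count - 1, i + window);
            for (int j = lo; !keep && j <= hi; j++)
            {
                keep = known[j];
            }
            if (keep) kept.Add(tokens[i]);
        }
        return kept;
    }

    // Assumption: dictionary entries are lower case and carry no punctuation.
    private static string Normalize(string token)
    {
        return token.Trim('.', ',', ';', ':', '!', '?', '"', '(', ')').ToLowerInvariant();
    }
}
```

The trade-off described above shows up directly in the window size: a larger window rescues more names and unusual words, but also lets more non-text strings through when they sit next to real words.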
I'd estimate that more complex solution could take an experienced
programmer a week to complete.
The most complex, most reliable solution would be to actually parse the
file formats and extract the text. Possibly this solution would even
inspect formatting data in the files so as to try to preserve the general
layout of the text as best as is possible using just text. For an
experienced programmer, assuming the simplest case of just extracting the
text without formatting, I'd estimate this would be roughly a week _per
file type_.
Some of the types are easy (CSV, RTF) or even trivial (TXT), but others
are more problematic (PDF, but Adobe has opened their specification so at
least it's documented) or very poorly documented (DOC, XLS). And of
course, there are other formats you might want to support (e.g. Office
2007's new format, the ZIP-ed, XML-based .docx, .xlsx, etc.). The trivial
formats should be doable in less than a week (much less, for CSV and TXT),
but the more complex ones might take longer; in the end, I think averaging
a week-per is probably about right.
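
To show how the per-format work might hang off the single entry point asked for in the original post, here is a hedged skeleton. Only the trivial TXT case is filled in; the class name TextExtractor and the CollapseWhitespace helper are assumptions for the example, and the remaining cases are placeholders where a real parser for each format would have to go.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class TextExtractor
{
    // Single entry point, dispatching on the caller-supplied file type.
    public static string GetTextFromFile(string pathFile, string fileType)
    {
        switch (fileType.ToUpperInvariant())
        {
            case "TXT":
                // Trivial case: read the file and flatten line breaks to spaces.
                return CollapseWhitespace(File.ReadAllText(pathFile));
            case "CSV":
            case "RTF":
            case "PDF":
            case "DOC":
            case "XLS":
                // Placeholder: each of these needs its own format parser.
                throw new NotSupportedException("No parser written yet for " + fileType);
            default:
                throw new ArgumentException("Unknown file type: " + fileType, "fileType");
        }
    }

    // Replaces every run of whitespace (including line breaks) with one space.
    public static string CollapseWhitespace(string text)
    {
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

With this shape, each format can be added and tested on its own, which fits an estimate made per file type.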
Additionally (please offer):
If somebody can provide a function to read all text and spaces
out of picture files, this would be great :-)
See above.
As for actual costs, that depends a lot on the qualifications of the
programmer you're hiring. That said, for an experienced freelance
programmer, you might expect to pay as much as $1000/day (which you can
apply to each of the time estimates above). You could pay a
less-experienced programmer less, but not all will be able to solve the
problem in the same time, so you could wind up paying more anyway.
Of course, with the economic slowdown, you may find out-of-work
programmers desperate for work, and as such might get away with paying
less for the same experience. But I wouldn't count on it. :)
Pete