Re: Convert doc, xls, pdf, rtf to plain Text (string) - I will pay




Hi Pete

Thanks for your answer.

The functionality is simple: I do not care about any
formatting or structure. I just need all the text, with
all the spaces between the words, out of these files.

For example the text above I expect like:
"Hi Pete Thanks for your answer. The functionality is ..."

I know .csv/.txt/.rtf are simple, and I can do those
on my own. The ones I cannot do myself are .pdf and .doc.

If somebody can provide all this (including extracting text
from pictures), I am prepared to pay $ 5'000.-.
I need an offer for the finished solution, not a price
per hour or day.

Thanks and best regards
Frank Uray



"Peter Duniho" wrote:

On Fri, 12 Dec 2008 04:37:00 -0800, Frank Uray
<FrankUray@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

I am looking for a C# class to read all text information
(including spaces) out of files.

I need a C# class with just one function like this:
public string GetTextFromFile(string local_PathFile, string local_FileType)

I will call this function like:
string SomeText = GetTextFromFile(@"C:\temp\somePDF.pdf", "PDF");
string SomeText = GetTextFromFile(@"C:\temp\someDOC.doc", "DOC");

The File Types are: .pdf, .doc, .xls, .txt, .csv, .rtf

The text from these files needs to be read without using COM.
There will be no Office or Acrobat Reader installed.

Please offer what this solution would cost.

That depends on exactly what functionality you want and who you're paying.

Are you looking for code that can successfully extract structured text
from documents that support text contents? Or simply some code that would
return anything in a file that looks like text, without making any attempt
to recover the structure of that text? If the latter, what's your
definition of "looks like text"?

A traditional solution to the latter problem includes defining "looks like
text" as any "significant" (e.g. more than N characters, where N is small)
string of bytes where every byte is less than 128. An experienced
programmer could probably solve that problem in a day or two, even
allowing for making the delivered code tidy and tested.
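
The "runs of bytes below 128" approach described above can be sketched roughly as follows. This is only a minimal illustration, not code from this thread: the class name AsciiRunExtractor, the method names, and the minLength parameter (standing in for N) are hypothetical, and it assumes "looks like text" is narrowed to printable ASCII plus tab rather than every byte below 128.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class AsciiRunExtractor
{
    // Returns every run of at least minLength consecutive text-like bytes.
    // Assumption: "text-like" means printable ASCII (32..126) or a tab.
    public static List<string> ExtractRuns(byte[] data, int minLength)
    {
        var runs = new List<string>();
        var current = new StringBuilder();

        foreach (byte b in data)
        {
            bool isText = (b >= 32 && b < 127) || b == 9;
            if (isText)
            {
                current.Append((char)b);
            }
            else
            {
                // A non-text byte ends the run; keep it only if long enough.
                if (current.Length >= minLength) runs.Add(current.ToString());
                current.Clear();
            }
        }

        // Flush a run that reaches the end of the data.
        if (current.Length >= minLength) runs.Add(current.ToString());
        return runs;
    }

    // Hypothetical convenience wrapper: reads the file and joins the runs with spaces.
    public static string GetTextFromFile(string path, int minLength)
    {
        return string.Join(" ", ExtractRuns(File.ReadAllBytes(path), minLength));
    }
}
```

A small minLength (say 4) keeps more real text but also more noise; on compressed content such as most PDF streams this approach will typically find little or nothing useful.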

A more complex solution to the latter problem would be to support Unicode
or other encodings and rely on a dictionary to identify strings that are
actual words. This would be both more robust and more fragile; it would
be more robust because it could support arbitrary encodings, and could
eliminate strings that are actually not language text (such as you
might find in RTF). It would be more fragile because, of course, if there
is real text in the file that includes words not found in your dictionary,
you might miss that text (you'd include logic to include text not found in
the dictionary as long as matched text is still nearby, but that's not
going to catch everything).
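
The dictionary idea, including the "keep unknown words when matched text is nearby" rule, might look something like the sketch below. It is an illustration under assumptions, not a tested design: the names DictionaryFilter, KeepLikelyWords and window are hypothetical, the input is assumed to be already decoded and split into words, and punctuation handling is left out.

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryFilter
{
    // Keeps a word if it is in the dictionary, or if some dictionary word
    // lies within `window` positions of it. Dictionary entries are assumed
    // to be stored in lower case.
    public static List<string> KeepLikelyWords(IList<string> words, ISet<string> dictionary, int window)
    {
        // First pass: mark which candidate words are known.
        var known = new bool[words.Count];
        for (int i = 0; i < words.Count; i++)
        {
            known[i] = dictionary.Contains(words[i].ToLowerInvariant());
        }

        // Second pass: keep known words and unknown words near a known one.
        var kept = new List<string>();
        for (int i = 0; i < words.Count; i++)
        {
            bool keep = known[i];
            int last = Math.Min(words.Count - 1, i + window);
            for (int j = Math.Max(0, i - window); !keep && j <= last; j++)
            {
                keep = known[j];
            }
            if (keep) kept.Add(words[i]);
        }
        return kept;
    }
}
```

As noted above, this still lets through junk that happens to sit next to a real word, and it drops real text that is far from any dictionary match; the window size trades one error for the other.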

I'd estimate that more complex solution could take an experienced
programmer a week to complete.

The most complex, most reliable solution would be to actually parse the
file formats and extract the text. Possibly this solution would even
inspect formatting data in the files so as to try to preserve the general
layout of the text as best as is possible using just text. For an
experienced programmer, assuming the simplest case of just extracting the
text without formatting, I'd estimate this would be roughly a week _per
file type_.

Some of the types are easy (CSV, RTF) or even trivial (TXT), but others
are more problematic (PDF, but Adobe has opened their specification so at
least it's documented) or very poorly documented (DOC, XLS). And of
course, there are other formats you might want to support (e.g. Office
2007's new format, the ZIP-ed, XML-based .docx, .xlsx, etc.). The trivial
formats should be doable in less than a week (much less, for CSV and TXT),
but the more complex ones might take longer; in the end, I think averaging
a week-per is probably about right.
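
For the ZIP-ed, XML-based .docx format mentioned above, actually parsing the format is comparatively approachable. The following is a minimal sketch only, assuming a .NET version that provides System.IO.Compression.ZipArchive and LINQ to XML; the class name DocxTextExtractor is hypothetical. It reads the main document part (word/document.xml) and ignores headers, footers, footnotes and the like. The old binary .doc and .xls formats need far more work than this.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxTextExtractor
{
    // WordprocessingML main namespace, used by the w:p and w:t elements.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of the main document part, paragraphs joined by spaces.
    public static string GetText(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException("word/document.xml not found");
            }

            using (Stream xml = entry.Open())
            {
                XDocument doc = XDocument.Load(xml);

                // Concatenate the text runs (w:t) inside each paragraph (w:p).
                // Paragraphs nested in other paragraphs (e.g. text boxes)
                // would be counted twice; this sketch does not handle that.
                IEnumerable<string> paragraphs = doc.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));

                return string.Join(" ", paragraphs.Where(s => s.Length > 0));
            }
        }
    }
}
```

Calling it with a stream from File.OpenRead on a .docx path would return the flattened text in the form requested in this thread.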

Additionally (please offer):
If somebody can provide a function to read all text and spaces
out of picture files, this would be great :-)

See above.

As for actual costs, that depends a lot on the qualifications of the
programmer you're hiring. That said, for an experienced freelance
programmer, you might expect to pay as much as $1000/day (which you can
apply to each of the time estimates above). You could pay a
less-experienced programmer less, but not all will be able to solve the
problem in the same time, so you could wind up paying more anyway.

Of course, with the economic slowdown, you may find out-of-work
programmers desperate for work, and as such might get away with paying
less for the same experience. But I wouldn't count on it. :)

Pete
