Re: Most efficient ebook format and fastest reader?





"casioculture" <casioculture@xxxxxxxxx> wrote in message
news:1115918540.059193.174890@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> Hello, thanks a lot for the reply. I did use another tool which is the
> open source pdftohtml, http://pdftohtml.sourceforge.net/. The problems
> I had were 1) file import in the tomeraiders 3 will only list files
> with the *.txt or *.tr2 extensions, and 2) even when I manually changed
> the .html file extension to .txt, the file import in tomeraider 3
> objected to characters and tags such as ! and doctype and gave me
> errors. What is it that I'm not doing right?
>


Well, you are not doing anything wrong. The HTML tags supported by the
import are basically a subset of true HTML, and the PDF-to-HTML generator
must be writing fuller HTML (doctype and all) than that subset allows. And
yes, you need to rename the file to .txt to have it load correctly. The
installation includes a doc with more detail, but that is the gist of it.

At this point I would probably suggest using something like OpenOffice to
convert the HTML to text and trying it again. But if you have a lot of
formatting in the HTML (from the PDF) you may be a bit disappointed in the
end result.

The second option is to write a quick script to rip out the offending tags.
This is attractive because you keep all the image tags in place, if there
are any. Once you have the script you can quickly fix all your files in a
three-step process (PDF -> HTML -> stripped HTML -> TR3), but it will take
some trial and error to get them working correctly. On top of that, if the
PDFs are small you would need to combine them into one large file in order
to have searching optimized. If you do use images, be sure you are using at
least version 3.12.
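
For what it's worth, here is a minimal sketch of such a stripping script in
Python. It is only an illustration: the ALLOWED_TAGS list is a guess on my
part (the real list of supported tags is in the doc that ships with the
install), and the function names strip_unsupported and combine are made up
for this example. It drops the doctype, comments, and the head/script/style
blocks, then removes every tag that is not on the allow list while keeping
the text and the image tags.

```python
import re

# ASSUMPTION: placeholder allow list. Replace it with the tags the import
# actually accepts, as listed in the doc that ships with the installation.
ALLOWED_TAGS = {"b", "i", "u", "p", "br", "img", "a"}

# <!DOCTYPE ...> declarations and <!-- ... --> comments.
DOCTYPE_OR_COMMENT = re.compile(r"<!--.*?-->|<![^>]*>", re.DOTALL)
# Elements whose content is not page text and should go entirely.
DROP_WITH_CONTENT = re.compile(
    r"<(script|style|head)\b.*?</\1\s*>", re.DOTALL | re.IGNORECASE
)
# Any opening or closing tag; group 1 is the tag name.
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def strip_unsupported(html, allowed=ALLOWED_TAGS):
    """Remove doctype, comments, and tags not in `allowed`; keep the text."""
    html = DOCTYPE_OR_COMMENT.sub("", html)
    html = DROP_WITH_CONTENT.sub("", html)

    def keep_or_drop(match):
        # Keep the tag verbatim if its name is allowed, else delete it.
        return match.group(0) if match.group(1).lower() in allowed else ""

    return TAG.sub(keep_or_drop, html)


def combine(pages, allowed=ALLOWED_TAGS):
    """Strip each page and join them into one large document."""
    return "\n".join(strip_unsupported(page, allowed) for page in pages)
```

Read each generated .html file, pass the contents through combine(), and
write the result to a single file with a .txt extension so the import will
list it. Expect to adjust the allow list as you go, since that is where the
trial and error comes in.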

The last thing to note is that TR's strength is in massive amounts of data,
not so much in pretty pages, at least until the import process and image
handling improve. For general-purpose reading a simpler reader (mobibook,
for example) may be the best alternative. But once you get the process
working you might be surprised at the tons of data you can have at your
fingertips.


Hope that helps, but it looks like you have issues to work out, sorry to
say.

BTW, if you are curious about Wikipedia and TR, check out these:
http://en.wikipedia.org/wiki/Wikipedia:TomeRaider
http://download.wikimedia.org/tomeraider/current_tr3/ (The 517 MB file is
pretty sweet, if you have the room. I am running a different (newer) dump,
but this one is still darn current for a general purpose encyclopedia.)


Good Luck








