Re: Check folder for duplicate files



Co wrote:
On 14 aug, 21:03, "Mike Williams" <mi...@xxxxxxxxxxxxxxxxx> wrote:
"Co" <vonclausow...@xxxxxxxxx> wrote in message

news:1187112810.989536.322340@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

I need a code to check a folder on my HDD for duplicate files.
. . . just to expand on what I said in my previous post (which seems a
little ambiguous now that I've looked at it again), if you discover a file
called test1.doc and another file called test1.pdf they may in fact
represent two totally different documents, regardless of the fact that their
names happen to be the same. Also, checking the actual file contents will
not produce any meaningful result because they will be completely different.
....
The only thing I want to do is check for the names of the files.
If they match (without the extension) then I have to open them both and check whether they contain the same data. Believe me, in my business it
happens.
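
For reference, the name-matching step described above (same name, extension ignored) could be sketched roughly as follows. This is only an illustration in Python, not anything posted in the thread; the function name group_by_base_name is made up, and it assumes a case-insensitive match on the name with only the last extension stripped.

```python
from collections import defaultdict
from pathlib import Path


def group_by_base_name(folder):
    """Return {base name: [paths]} for base names shared by 2+ files.

    Hypothetical helper: "base name" here means the file name minus its
    last extension, lower-cased so Test1.doc and test1.pdf match.
    """
    groups = defaultdict(list)
    for path in Path(folder).iterdir():
        if path.is_file():
            groups[path.stem.lower()].append(path)
    # Keep only the names that occur more than once.
    return {stem: sorted(paths) for stem, paths in groups.items() if len(paths) > 1}
```

This only reports candidates; whether test1.doc and test1.pdf actually "contain the same data" is a separate question.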

So, what is the definition of "contain same data"?

Are you asking to examine the contents of a .pdf file as compared to a .doc as compared to a plain text .txt file, and parse them to see whether the same words exist in the same order, despite all the formatting someone apparently worked hard to accomplish, and trash that formatting to get back to the original?

Or is it simply that two copies of the _same_ identical text file have been saved, but one was inadvertently saved w/ an extension that doesn't represent the actual file content?

If the former, you've bitten off a big job, as Mike says, as there are an infinite number of possible ways to format the same text, and you'll have to do a complete lexical parsing of each file format to remove the superfluous information and uncover the fundamental "sameness" underlying it.

If the latter, that's a job for a diff utility, of which there are a zillion, and I'd suggest that using one of them pre-rolled would be the place to start...
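
For the second case (identical copies saved under different extensions), a byte-for-byte comparison is all that's needed. As a rough sketch only, here is how that might look in Python using the standard library's filecmp module; the wrapper name same_bytes is hypothetical, and this is a stand-in for whatever diff tool you end up choosing, not a recommendation of a particular one.

```python
import filecmp


def same_bytes(path_a, path_b):
    """Return True if the two files have identical contents.

    shallow=False makes filecmp.cmp read and compare the file contents
    instead of trusting matching size/modification-time metadata.
    """
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Note that this says nothing about the first case: a .doc and a .pdf of the same document will always compare as different at the byte level.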

Again, if we're off target, post back w/ more detail as the problem still seems quite poorly defined.



