Re: File Duplication check


That's the thing: from a file perspective, the files are different
because the metadata is embedded in the file. If you want to check whether
specific portions of the file are different, then you are going to have
to open the file using Word and then compare word for word, style for
style, and so on. Not an easy task.
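
As a rough illustration of comparing only part of a file, here is a
minimal sketch in Python. It is not the Word-automation approach described
above. It assumes the documents are .docx packages (zip containers), where
the created/modified properties live in docProps/core.xml and the body text
lives in word/document.xml. Legacy binary .doc files are not covered. The
function name docx_body_hash is hypothetical, and the body part itself may
still differ between saves (for example if Word records revision
identifiers in the markup), so treat a mismatch as inconclusive.

```python
import hashlib
import zipfile


def docx_body_hash(path, part="word/document.xml"):
    """Hash only the main body part of a .docx package.

    Assumption: `path` is a zip-based .docx file and `part` is the
    member that holds the document body. Metadata members such as
    docProps/core.xml are deliberately ignored.
    """
    with zipfile.ZipFile(path) as package:
        body = package.read(part)  # raises KeyError if the part is missing
    return hashlib.sha256(body).hexdigest()
```

Two files whose docx_body_hash values match have identical body markup
even if their whole-file MD5 or SHA1 values differ. A full comparison of
text and styles would still need Word or a real document parser.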


--
- Nicholas Paldino [.NET/C# MVP]
- mvp@xxxxxxxxxxxxxxxxxxxxxxxxxxx

<giftson.john@xxxxxxxxx> wrote in message
news:1176994738.686643.251010@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
On Apr 19, 7:37 pm, "Nicholas Paldino [.NET/C# MVP]"
<m...@xxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
John,

Well, using a hash is the right way to go, but I don't understand why
everything gives you different values. I mean, if you have no duplicates,
then yes, you SHOULD get different values.

What you have to do is scan the contents of the directory, hashing each
file as you go. You then store the values of the hashes. While scanning
the directory, you check the value of the hash against the list you have
already compiled. If the hash exists in the list, then the two files could
be duplicates (you really have to check both files against each other at
that point, byte by byte, to see if they are, if you want to be completely
accurate).
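
The scan described above can be sketched roughly as follows. This is a
minimal illustration in Python, not production code; the function names
(file_hash, same_bytes, find_duplicates) are hypothetical, and SHA-256 is
an arbitrary choice of hash. The same structure carries over to any
language with a hashing library.

```python
import hashlib
import os
from collections import defaultdict


def file_hash(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b, chunk_size=65536):
    """Confirm a hash match by comparing the two files byte by byte."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both files ended at the same point
                return True


def find_duplicates(root):
    """Scan `root` recursively; return (earlier_file, duplicate) pairs."""
    seen = defaultdict(list)  # hash -> paths already scanned with that hash
    duplicates = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h = file_hash(path)
            # A matching hash only means "possible duplicate"; verify it.
            for earlier in seen[h]:
                if same_bytes(earlier, path):
                    duplicates.append((earlier, path))
                    break
            seen[h].append(path)
    return duplicates
```

Calling find_duplicates on the source directory before migration gives the
list of files to skip. Note that this only finds files that are identical
byte for byte.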

Hope this helps.

--
- Nicholas Paldino [.NET/C# MVP]
- m...@xxxxxxxxxxxxxxxxxxxxxxxxxxx

<giftson.j...@xxxxxxxxx> wrote in message

news:1176983002.868965.88530@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx



Hi,

I am creating an application which migrates all documents from one
repository to another repository. Before migration I have to verify that
all the documents are unique. No duplicates should be uploaded. Even
the document created date, modified date, and filename can be different.
How can I find out whether a document is a duplicate?

What I did is, I created a file, did a Save As, and saved it to another
location. I am not able to detect that the document is a duplicate. I have
tried MD5 hash, CRC check, SHA1. Everything gives different values.

Can anyone give me a solution for this?

Thanks in advance.

Giftson John

Hi Nicholas,

I was a bit confused about the MD5 hashing.

Could you please tell me how to compare the contents of Word
documents? What is happening is that MS Word stores some set of metadata,
and even when the file contents are the same, the metadata difference
gives a different MD5 hash value.

Thanks for your help.


