IWETHEY v. 0.3.0 | TODO

Best duplicate finder/deleter for Win32?
I'm using [link|http://www.hugmot.is/acutefinder/|AcuteFinder] on Windows 2000 to search for duplicate files on 4 partitions. It works pretty well initially, but gets very pokey and hogs the CPU. I have something like 600,000 files on this system, it's less than half-way through after about 14 hours of searching, and it seems to slow down over time, so I'd like to find something better.

Any suggestions?

Thanks.

Cheers,
Scott.
That's a Perl problem if ever I saw one.
Hey, Baz/Ben:

Is File::Find fast enough for this to be worth coding up?


Peter
[link|http://www.no2id.net/|Don't Let The Terrorists Win]
[link|http://www.kuro5hin.org|There is no K5 Cabal]
[link|http://guildenstern.dyndns.org|Home]
Use P2P for legitimate purposes!
ICLRPD: That's a Perl problem if ever I saw one. (new thread)
Created as new thread #235660 titled [link|/forums/render/content/show?contentid=235660|ICLRPD: That's a Perl problem if ever I saw one.]
--
Steve
[link|http://www.ubuntulinux.org|Ubuntu]
Baz? What's a Baz?
Simple enough problem.
Walk the file system, compute a crc for every file found.
Might want to salt it with the filesize if scared of a crc collision.
Push each filename into an array ref, stored in a hash indexed by crc.
When done, loop through the crcs, printing the list of files whenever there's more than one.

A couple of minutes of code.

Have fun.
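That recipe really is only a few minutes of code. Here's a sketch of it in Python (using MD5 from hashlib in place of a plain CRC, with the file-size "salt" folded into the hash key; this is an illustration of the approach, not tuned for a 600,000-file tree):

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so huge files don't fill RAM."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(*roots):
    """Walk the given trees, grouping files by (size, checksum).

    Keys with more than one path are candidate duplicates.
    """
    groups = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    # "salt" the checksum with the file size,
                    # as suggested above
                    key = (os.path.getsize(path), file_md5(path))
                except OSError:
                    continue  # unreadable file; skip it
                groups[key].append(path)
    return {k: v for k, v in groups.items() if len(v) > 1}
```

A single pass over the files plus a hash lookup per file, instead of comparing every file against every other one.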
Alternatively...
File::Find::Duplicates


Peter
[link|http://www.no2id.net/|Don't Let The Terrorists Win]
[link|http://www.kuro5hin.org|There is no K5 Cabal]
[link|http://guildenstern.dyndns.org|Home]
Use P2P for legitimate purposes!
Any gotchas?
I've never run a Perl script before, at least not of my own volition. :-)

I found [link|http://aspn.activestate.com/ASPN/CodeDoc/File-Find-Duplicates/Duplicates.html|File::Find::Duplicates] on CPAN. Any gotchas? It's not terribly descriptive, but I assume that the description of things given for [link|http://search.cpan.org/~jhi/perl-5.8.0/lib/File/Find.pm|File::Find] will help clarify things. I also assume I'll be able to figure out how to direct the output to a file.

I'll be downloading ActivePerl in a few minutes to try it out tomorrow, after my download of SimplyMEPIS_3.4-1.rc1 completes.

Thanks a bunch.

Cheers,
Scott.
The sample program in the documentation...
is close to what you want. Change the call to the function to:

my %dupes = find_duplicate_files('c:\\');

Write it to a file and from the command line you can write something like this:

perl myprogram > output_file

Cheers,
Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
Thanks Ben. I appreciate it.
You're a Baz.


Peter
[link|http://www.no2id.net/|Don't Let The Terrorists Win]
[link|http://www.kuro5hin.org|There is no K5 Cabal]
[link|http://guildenstern.dyndns.org|Home]
Use P2P for legitimate purposes!
If you want to be really cautious....
When you find possible duplicates, go scan the files again and verify that they really are duplicates.

Cheers,
Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
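That recheck is nearly free to write with Python's standard filecmp module (a sketch; it assumes you already have a list of checksum-matched candidate paths):

```python
import filecmp
from itertools import combinations


def confirm_duplicates(candidates):
    """Given paths whose checksums matched, return only the pairs
    that really are byte-for-byte identical."""
    confirmed = []
    for a, b in combinations(candidates, 2):
        # shallow=False forces an actual content comparison
        # rather than trusting os.stat() metadata
        if filecmp.cmp(a, b, shallow=False):
            confirmed.append((a, b))
    return confirmed
```

Checksum collisions are rare, so this second pass only ever touches the handful of files that matched in the first place.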
Yes
It sounds like his initial application was written in the stupidly naive way that C programmers love - a double loop.

With 600,000 files that means that it wants to do about 180,000,000,000 comparisons of one file against another. Sure, C is fast, but that will take a while.

Cheers,
Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
I'm trying a Python solution now.
I wasn't able to immediately figure out the Perl solutions mentioned, so I went and looked at what was available for Python. I'm experimenting with [link|http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/364953|FindDuplicateFileNames] and it seems to work pretty well. It doesn't hog the CPU, and the system seems very responsive while the script is running from the command line (it also works OK inside the PythonWin IDE).

At the moment I'm just redirecting the output to a .txt file. When it's done I'll see how easy it is to sort that (probably with OO.org's spreadsheet, as I think it'll choke Excel) or experiment with some other Python scripts.

Thanks for the tips.

[edit:]

I'm not sure when the Python script finished, but it was done in less than 90 minutes, searching through 4 partitions of ~ 600,000 files (~ 500 GB) and writing the matches to a ~ 157 MB text file. (Note that this version does not calculate CRC-32 or MD5 checksums.)

I loaded the text file in OpenOffice.org 2.0, which opened it in Writer. It took a minute or so on this Athlon64 3000+ with 1 GB of RAM, but it didn't choke. It's a 43,085-page document in 10 pt Courier New. ;-)

I'll mess around some more, checking matches based on some sort of checksum or at least the file size before doing any actual deleting of duplicates. (Note that there's a Python script called [link|http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/362459|Dupinator] that isn't very flexible but does MD5 checksums (first quickly checking the first 1024 bytes, then checking the entire file if it's a potential duplicate), and something called [link|http://www.pixelbeat.org/fslint/|FSLint] that needs GTK2 and thus seems to be designed for Linux systems.)
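Dupinator's two-stage trick (hash just the first 1024 bytes to weed out obvious non-matches, and only hash whole files when the prefixes collide) can be sketched like this (an illustration of the idea, not Dupinator's actual code):

```python
import hashlib
from collections import defaultdict


def two_stage_duplicates(paths):
    """Group candidate files by a hash of their first 1024 bytes,
    then confirm survivors with a full-file MD5."""
    by_prefix = defaultdict(list)
    for path in paths:
        with open(path, 'rb') as f:
            head = hashlib.md5(f.read(1024)).hexdigest()
        by_prefix[head].append(path)

    dupes = defaultdict(list)
    for group in by_prefix.values():
        if len(group) < 2:
            continue  # unique prefix: can't be a duplicate
        for path in group:
            with open(path, 'rb') as f:
                full = hashlib.md5(f.read()).hexdigest()
            dupes[full].append(path)
    return {h: ps for h, ps in dupes.items() if len(ps) > 1}
```

Most files get rejected after reading only 1 KB, which is why the two-stage version stays fast on large trees.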

Cheers,
Scott.
Edited by Another Scott Dec. 7, 2005, 02:36:50 PM EST
Any reason you can't use gnu/sort ?
===

Purveyor of Doc Hope's [link|http://DocHope.com|fresh-baked dog biscuits and pet treats].
[link|http://DocHope.com|http://DocHope.com]
Dunno. I'll check into it. Thanks.
I'm trying DUFF now.
AcuteFinder seems to want me to register it again, so I'm looking at other things that are out there. While [link|http://rdfind.paulsundvall.net/|rdfind] looks to be speedy, I don't have time to get Cygwin working, etc., etc. I didn't get very far with Python or Perl either... :-/

I'm trying [link|http://dff.sourceforge.net/|DUFF] now. It's listed as an alpha, but it has lots of nice options (like adjustable CPU load). We'll see how it goes.

Just in case it comes up again... :-)

Cheers,
Scott.
<drool> Mm, Duff... </homer>
Ha! :-)
It's called an alpha for a reason.
In trying it on 4 PCs, it only completed successfully on 1 (with only a 30 GB partition). On the other 3, it closed up without warning after scanning for a day or more. Unless you've got a small drive, it's not worth bothering with, and even then there are likely more robust tools. Another strike against it is that you can't save or export the results...

It looks pretty nice though. ;-)

Cheers,
Scott.
Edupe seems pretty good. (More details.)
At least it works pretty quickly to give a list of duplicate files sorted by file size. It's written in [link|http://www.erlang.org/|Erlang].

[link|http://blog.diginux.net/2007/04/03/writing-a-duplicate-file-finder-in-erlang/|Writing a Duplicate File Finder in Erlang].

Instructions on compiling and using it are [link|https://bohr.diginux.net/wiki/index.php/EDupe|here]. Remember to pipe the output to a file (e.g. C:\\Program Files\\erl5.5.4\\bin>erl -noshell -run edupe start c:/ -s erlang halt > 070526-DriveC.txt).

[edit:] Note that (on Windows) you'll need the DLL - libeay32.dll. It's easily found. You can get a copy of it in the WGet package - [link|http://www.christopherlewis.com/WGet/WGetFiles.htm|here]. Simply copy the DLL to the c:\\Program Files\\erl5.5.4\\bin\\ directory. Also, don't forget to compile the edupe package by running "erlc edupe.erl" before running the erl -noshell ... command.

HTH.

Cheers,
Scott.
Edited by Another Scott May 26, 2007, 09:38:32 AM EDT
Edupe crashes on some of my machines. Trying another...
On some systems I get an error dump with comments about being unable to create enough virtual memory, but it seems to be related to very deep directories with very long directory names. I don't know enough to try to debug it.

I'm looking at [link|http://www.tucows.com/preview/373411|Duplicate File Finder 1.1.0.0] right now. It's very well done, very fast (even when calculating CRC32s), has a nice GUI, and has lots of output options. It seems nearly perfect for Win32. The price is right too - it's free.

It had no problem with my T41 laptop. I'm having it search my Opteron box now - it's got ~ 2 M files with a sizable fraction being duplicates...

Cheers,
Scott.
     Best duplicate finder/deleter for Win32? - (Another Scott) - (19)
         That's a Perl problem if ever I saw one. - (pwhysall) - (9)
             ICLRPD: That's a Perl problem if ever I saw one. (new thread) - (Steve Lowe)
             Baz? What's a Baz? - (broomberg) - (6)
                 Alternatively... - (pwhysall) - (3)
                     Any gotchas? - (Another Scott) - (2)
                         The sample program in the documentation... - (ben_tilly) - (1)
                             Thanks Ben. I appreciate it. -NT - (Another Scott)
                 You're a Baz. -NT - (pwhysall)
                 If you want to be really cautious.... - (ben_tilly)
             Yes - (ben_tilly)
         I'm trying a Python solution now. - (Another Scott) - (2)
             Any reason you can't use gnu/sort ? -NT - (drewk) - (1)
                 Dunno. I'll check into it. Thanks. -NT - (Another Scott)
         I'm trying DUFF now. - (Another Scott) - (3)
             <drool> Mm, Duff... </homer> -NT - (CRConrad) - (1)
                 Ha! :-) -NT - (Another Scott)
             It's called an alpha for a reason. - (Another Scott)
         Edupe seems pretty good. (More details.) - (Another Scott) - (1)
             Edupe crashes on some of my machines. Trying another... - (Another Scott)
