IWETHEY v. 0.3.0 | TODO

Best duplicate finder/deleter for Win32?
I'm using [link|http://www.hugmot.is/acutefinder/|AcuteFinder] on Windows 2000 to search for duplicate files on 4 partitions. It works pretty well initially, but gets very pokey and hogs the CPU. I have something like 600,000 files on this system, it's less than half-way through after about 14 hours of searching, and it seems to slow down over time, so I'd like to find something better.

Any suggestions?

Thanks.

Cheers,
Scott.
That's a Perl problem if ever I saw one.
Hey, Baz/Ben:

Is File::Find fast enough for this to be worth coding up?


Peter
[link|http://www.no2id.net/|Don't Let The Terrorists Win]
[link|http://www.kuro5hin.org|There is no K5 Cabal]
[link|http://guildenstern.dyndns.org|Home]
Use P2P for legitimate purposes!
ICLRPD: That's a Perl problem if ever I saw one. (new thread)
Created as new thread #235660 titled [link|/forums/render/content/show?contentid=235660|ICLRPD: That's a Perl problem if ever I saw one.]
--
Steve
[link|http://www.ubuntulinux.org|Ubuntu]
Baz? What's a Baz?
Simple enough problem.
Walk the file system, compute a crc for every file found.
Might want to salt it with the filesize if scared of a crc collision.
Push each filename into an array ref, stored in a hash indexed by crc.
When done, loop through the crcs, printing the list of files whenever there's more than one.

A couple of minutes of code.

Have fun.
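That recipe really is only a few minutes of code. Here's a sketch of it in Python (using MD5 from hashlib in place of a plain CRC, with the file-size "salt" folded into the hash key; this is an illustration of the approach, not tuned for a 600,000-file tree):

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so huge files don't fill RAM."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(*roots):
    """Walk the given trees, grouping files by (size, checksum).

    Keys with more than one path are candidate duplicates.
    """
    groups = defaultdict(list)
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    # "salt" the checksum with the file size,
                    # as suggested above
                    key = (os.path.getsize(path), file_md5(path))
                except OSError:
                    continue  # unreadable file; skip it
                groups[key].append(path)
    return {k: v for k, v in groups.items() if len(v) > 1}
```

A single pass over the files plus a hash lookup per file, instead of comparing every file against every other one.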
Alternatively...
File::Find::Duplicates


Peter
[link|http://www.no2id.net/|Don't Let The Terrorists Win]
[link|http://www.kuro5hin.org|There is no K5 Cabal]
[link|http://guildenstern.dyndns.org|Home]
Use P2P for legitimate purposes!
Any gotchas?
I've never run a Perl script before, at least not of my own volition. :-)

I found [link|http://aspn.activestate.com/ASPN/CodeDoc/File-Find-Duplicates/Duplicates.html|File::Find::Duplicates] on CPAN. Any gotchas? It's not terribly descriptive, but I assume that the description of things given for [link|http://search.cpan.org/~jhi/perl-5.8.0/lib/File/Find.pm|File::Find] will help clarify things. I also assume I'll be able to figure out how to direct the output to a file.

I'll be downloading ActivePerl in a few minutes to try it out tomorrow, after my download of SimplyMEPIS_3.4-1.rc1 completes.

Thanks a bunch.

Cheers,
Scott.
The sample program in the documentation...
is close to what you want. Change the call to the function to:

my %dupes = find_duplicate_files('c:\\');

Write it to a file and from the command line you can write something like this:

perl myprogram > output_file

Cheers,
Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
Thanks Ben. I appreciate it.
You're a Baz.


Peter
[link|http://www.no2id.net/|Don't Let The Terrorists Win]
[link|http://www.kuro5hin.org|There is no K5 Cabal]
[link|http://guildenstern.dyndns.org|Home]
Use P2P for legitimate purposes!
If you want to be really cautious....
When you find possible duplicates, go scan the files again and verify that they really are duplicates.

Cheers,
Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
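That recheck is nearly free to write with Python's standard filecmp module (a sketch; it assumes you already have a list of checksum-matched candidate paths):

```python
import filecmp
from itertools import combinations


def confirm_duplicates(candidates):
    """Given paths whose checksums matched, return only the pairs
    that really are byte-for-byte identical."""
    confirmed = []
    for a, b in combinations(candidates, 2):
        # shallow=False forces an actual content comparison
        # rather than trusting os.stat() metadata
        if filecmp.cmp(a, b, shallow=False):
            confirmed.append((a, b))
    return confirmed
```

Checksum collisions are rare, so this second pass only ever touches the handful of files that matched in the first place.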
Yes
It sounds like his initial application was written in the stupidly naive way that C programmers love - a double loop.

With 600,000 files that means that it wants to do about 180,000,000,000 comparisons of one file against another. Sure, C is fast, but that will take a while.

Cheers,
Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
I'm trying a Python solution now.
I wasn't able to immediately figure out the Perl solutions mentioned, so I went and looked at what was available for Python. I'm experimenting with [link|http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/364953|FindDuplicateFileNames] and it seems to work pretty well. It doesn't hog the CPU, and the system seems very responsive while the script is running from the command line (it also works OK inside the PythonWin IDE).

At the moment I'm just redirecting the output to a .txt file. When it's done I'll see how easy it is to sort that (probably with OO.org's spreadsheet, as I think it'll choke Excel) or experiment with some other Python scripts.

Thanks for the tips.

[edit:]

I'm not sure when the Python script finished, but it was done in less than 90 minutes, searching through 4 partitions of ~ 600,000 files (~ 500 GB) and writing the matches to a ~ 157 MB text file. (Note that this version does not calculate CRC-32 or MD5 checksums.)

I loaded the text file in OpenOffice.org 2.0, which opened it in Writer. It took a minute or so on this Athlon64 3000+ with 1 GB of RAM, but it didn't choke. It's a 43,085-page document in 10 pt Courier New. ;-)

I'll mess around some more, checking matches based on some sort of checksum or at least the file size before doing any actual deleting of duplicates. (Note that there's a Python script called [link|http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/362459|Dupinator] that isn't very flexible but does MD5 checksums (first quickly checking the first 1024 bytes, then checking the entire file if it's a potential duplicate), and something called [link|http://www.pixelbeat.org/fslint/|FSLint] that needs GTK2 and thus seems to be designed for Linux systems.)
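Dupinator's two-stage trick (hash just the first 1024 bytes to weed out obvious non-matches, and only hash whole files when the prefixes collide) can be sketched like this (an illustration of the idea, not Dupinator's actual code):

```python
import hashlib
from collections import defaultdict


def two_stage_duplicates(paths):
    """Group candidate files by a hash of their first 1024 bytes,
    then confirm survivors with a full-file MD5."""
    by_prefix = defaultdict(list)
    for path in paths:
        with open(path, 'rb') as f:
            head = hashlib.md5(f.read(1024)).hexdigest()
        by_prefix[head].append(path)

    dupes = defaultdict(list)
    for group in by_prefix.values():
        if len(group) < 2:
            continue  # unique prefix: can't be a duplicate
        for path in group:
            with open(path, 'rb') as f:
                full = hashlib.md5(f.read()).hexdigest()
            dupes[full].append(path)
    return {h: ps for h, ps in dupes.items() if len(ps) > 1}
```

Most files get rejected after reading only 1 KB, which is why the two-stage version stays fast on large trees.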

Cheers,
Scott.
Edited by Another Scott Dec. 7, 2005, 02:36:50 PM EST
Any reason you can't use gnu/sort ?
===

Purveyor of Doc Hope's [link|http://DocHope.com|fresh-baked dog biscuits and pet treats].
[link|http://DocHope.com|http://DocHope.com]
Dunno. I'll check into it. Thanks.
I'm trying DUFF now.
AcuteFinder seems to want me to register it again, so I'm looking at other things that are out there. While [link|http://rdfind.paulsundvall.net/|rdfind] looks to be speedy, I don't have time to get Cygwin working, etc., etc. I didn't get very far with Python or Perl either... :-/

I'm trying [link|http://dff.sourceforge.net/|DUFF] now. It's listed as an alpha, but it has lots of nice options (like adjustable CPU load). We'll see how it goes.

Just in case it comes up again... :-)

Cheers,
Scott.
<drool> Mm, Duff... </homer>
Ha! :-)
It's called an alpha for a reason.
In trying it on 4 PCs, it only completed successfully on 1 (with only a 30 GB partition). On the other 3, it closed up without warning after scanning for a day or more. Unless you've got a small drive, it's not worth bothering with, and even then there are likely more robust tools. Another strike against it is that you can't save or export the results...

It looks pretty nice though. ;-)

Cheers,
Scott.
Edupe seems pretty good. (More details.)
At least it works pretty quickly to give a list of duplicate files sorted by file size. It's written in [link|http://www.erlang.org/|Erlang].

[link|http://blog.diginux.net/2007/04/03/writing-a-duplicate-file-finder-in-erlang/|Writing a Duplicate File Finder in Erlang].

Instructions on compiling and using it are [link|https://bohr.diginux.net/wiki/index.php/EDupe|here]. Remember to pipe the output to a file (e.g. C:\\Program Files\\erl5.5.4\\bin>erl -noshell -run edupe start c:/ -s erlang halt > 070526-DriveC.txt).

[edit:] Note that (on Windows) you'll need the DLL - libeay32.dll. It's easily found. You can get a copy of it in the WGet package - [link|http://www.christopherlewis.com/WGet/WGetFiles.htm|here]. Simply copy the DLL to the c:\\Program Files\\erl5.5.4\\bin\\ directory. Also, don't forget to compile the edupe package by running "erlc edupe.erl" before running the erl -noshell ... command.

HTH.

Cheers,
Scott.
Edited by Another Scott May 26, 2007, 09:38:32 AM EDT
Edupe crashes on some of my machines. Trying another...
On some systems I get an error dump with comments about being unable to create enough virtual memory, but it seems to be related to very deep directories with very long directory names. I don't know enough to try to debug it.

I'm looking at [link|http://www.tucows.com/preview/373411|Duplicate File Finder 1.1.0.0] right now. It's very well done, very fast (even when calculating CRC32s), has a nice GUI, and has lots of output options. It seems nearly perfect for Win32. The price is right too - it's free.

It had no problem with my T41 laptop. I'm having it search my Opteron box now - it's got ~ 2 M files with a sizable fraction being duplicates...

Cheers,
Scott.
     Best duplicate finder/deleter for Win32? - (Another Scott) - (19)
         That's a Perl problem if ever I saw one. - (pwhysall) - (9)
             ICLRPD: That's a Perl problem if ever I saw one. (new thread) - (Steve Lowe)
             Baz? What's a Baz? - (broomberg) - (6)
                 Alternatively... - (pwhysall) - (3)
                     Any gotchas? - (Another Scott) - (2)
                         The sample program in the documentation... - (ben_tilly) - (1)
                             Thanks Ben. I appreciate it. -NT - (Another Scott)
                 You're a Baz. -NT - (pwhysall)
                 If you want to be really cautious.... - (ben_tilly)
             Yes - (ben_tilly)
         I'm trying a Python solution now. - (Another Scott) - (2)
             Any reason you can't use gnu/sort ? -NT - (drewk) - (1)
                 Dunno. I'll check into it. Thanks. -NT - (Another Scott)
         I'm trying DUFF now. - (Another Scott) - (3)
             <drool> Mm, Duff... </homer> -NT - (CRConrad) - (1)
                 Ha! :-) -NT - (Another Scott)
             It's called an alpha for a reason. - (Another Scott)
         Edupe seems pretty good. (More details.) - (Another Scott) - (1)
             Edupe crashes on some of my machines. Trying another... - (Another Scott)
