I wasn't able to immediately figure out the Perl solutions mentioned, so I went and looked at what was available for Python. I'm experimenting with [link|http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/364953|FindDuplicateFileNames] and it seems to work pretty well. It doesn't hog the CPU, and the system stays very responsive while the script is running from the command line (it also works OK inside the PythonWin IDE).
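The basic idea behind a name-matching script like that is simple: walk the directory trees, bucket every full path under its bare file name, and report any name that shows up in more than one place. Here's a minimal sketch of that approach (my own, not the recipe's actual code; the case-folding of names is an assumption that suits Windows file systems):

```python
import os
from collections import defaultdict

def find_duplicate_names(roots):
    """Group full paths by bare file name; any name seen in more
    than one place is a potential duplicate (by name only --
    contents are not compared)."""
    seen = defaultdict(list)
    for root in roots:
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                # lower() because Windows file names are case-insensitive
                seen[name.lower()].append(os.path.join(dirpath, name))
    # keep only names that occur more than once
    return dict((name, paths) for name, paths in seen.items()
                if len(paths) > 1)

# e.g. find_duplicate_names(["C:\\", "D:\\"]) and print or
# redirect the result to a text file for sorting later
```

Since it only ever holds names and paths in memory (never file contents), it stays light on the CPU and disk, which matches the behavior described above.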
At the moment I'm just redirecting the output to a .txt file. When it's done I'll see how easy it is to sort that (probably with OO.org's spreadsheet, as I think it'll choke Excel) or experiment with some other Python scripts.
Thanks for the tips.
[edit:]
I'm not sure when the Python script finished, but it was done in less than 90 minutes, searching through 4 partitions of ~ 600,000 files (~ 500 GB) and writing the matches to a ~ 157 MB text file. (Note that this version does not calculate CRC-32 or MD5 checksums.)
I loaded the text file in OpenOffice.org 2.0 and it opened in Writer. It took a minute or so on this Athlon64 3000+ with 1 GB of RAM, but it didn't choke. It's a 43085-page document in 10 pt Courier New. ;-)
I'll mess around some more, checking matches against some sort of checksum, or at least the file size, before actually deleting any duplicates. (Note that there's a Python script called [link|http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/362459|Dupinator] that isn't very flexible but does MD5 checksums (first quickly checking the first 1024 bytes, then checking the entire file if it's a potential duplicate), and something called [link|http://www.pixelbeat.org/fslint/|FSLint] that needs GTK2 and thus seems to be designed for Linux systems.)
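That cheap-to-expensive winnowing (file size first, then a hash of the first 1024 bytes, then a full-file hash only for the survivors) is worth spelling out, since it avoids reading most files in full. Here's a rough sketch of the idea in Python (my own paraphrase of the technique, not Dupinator's actual code):

```python
import hashlib
import os
from collections import defaultdict

def _md5(path, limit=None):
    """MD5 of a whole file, or of just its first `limit` bytes."""
    h = hashlib.md5()
    f = open(path, "rb")
    try:
        if limit is None:
            # read in chunks so huge files don't fill memory
            while True:
                chunk = f.read(65536)
                if not chunk:
                    break
                h.update(chunk)
        else:
            h.update(f.read(limit))
    finally:
        f.close()
    return h.hexdigest()

def find_duplicates(paths):
    """Three passes, each cheaper test weeding out candidates
    before the next, more expensive one runs."""
    # pass 1: group by size -- free, from the directory entry
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    # pass 2: same size -> hash only the first 1024 bytes
    by_head = defaultdict(list)
    for size, group in by_size.items():
        if len(group) > 1:
            for p in group:
                by_head[(size, _md5(p, limit=1024))].append(p)

    # pass 3: same head -> hash the entire file
    by_full = defaultdict(list)
    for key, group in by_head.items():
        if len(group) > 1:
            for p in group:
                by_full[_md5(p)].append(p)

    return [g for g in by_full.values() if len(g) > 1]
```

Files with a unique size never get opened at all, and only files that also agree on their first kilobyte get read end to end, which is why this runs so much faster than hashing everything.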
Cheers,
Scott.