Is Perl a good candidate for memory-constrained processing?
I just set up a new dual Xeon to act as a business-specific
compute server. Since this box only has 3GB of RAM, I figured I'd
get a feel for my limitations and the required programming style.
I'm running Linux 2.6.9-34.ELsmp - RH ES 4.
I typically need to load a bunch of records in memory,
match them against other records, sort, munge, report, etc.
I try to find a unit of work that fits in memory without
forcing me onto some external storage. I also need to figure
out the point at which it makes sense to externalize
everything and end up working in a database such as Postgres
or Oracle.
This is not for our database group, which means the files are small:
100,000 records to process would be a LOT. Of course, I wanted
a margin of safety, so I made the base file 10 times
that, i.e. 1 million records.
The file is 1.2GB in size.
I wanted a baseline, best case scenario. So I read the data
and pushed it into an array:
    my @h;
    while (<>){
        push (@h, $_);
    }
This took 9 seconds and consumed 1.2 GB.
Ok, so I can store over 100MB per second, and an array seems
to have negligible space overhead.
Now do the same thing in a hash:
    my %h;
    while (<>){
        $h{$.} = $_;
    }
This took 11 seconds and consumed 1.2 GB. Hmm, slight CPU
increase, about the same memory.
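For what it's worth, these resident-memory numbers can be watched from inside the script on Linux by reading VmRSS out of /proc/self/status. A minimal sketch, with a small stand-in load loop instead of the real 1.2GB file:

```perl
use strict;
use warnings;

# Read our own resident set size (in kB) from /proc/self/status.
# Linux-specific; returns undef if the file or field is missing.
sub rss_kb {
    open my $fh, '<', '/proc/self/status' or return;
    while (<$fh>) { return $1 if /^VmRSS:\s+(\d+)\s+kB/ }
    return;
}

my $before = rss_kb() || 0;

# Stand-in for the real load loop: 10,000 keyed records of ~1KB each.
my %h;
$h{$_} = 'x' x 1000 for 1 .. 10_000;

my $after = rss_kb() || 0;
printf "RSS grew by about %d kB\n", $after - $before;
```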
Now extract the standard name, address, city, state, and zip
fields and save them into a hash. I also saved the complete
input record:
    my @fields = qw/first last addr2 addr1 city state zip/;
    my $pat = '@91 A12 @103 A20 @123 A27 @151 A28 @179 A20 @199 A2 @201 A9';

    my %holder;

    while (<>){
        my %rec;
        (@rec{@fields}) = unpack($pat, $_);
        $rec{ALL} = $_;
        $holder{$.} = \%rec;
    }
The various key fields can be between 50 and 120 bytes (or so).
It takes 35 seconds to load the data and consumes 1.7 GB
of resident memory.
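The unpack template works by jumping to an absolute column with @ and grabbing a space-trimmed ASCII field with A. A toy version of the same idiom, using a made-up 17-byte layout instead of the real 209-byte record:

```perl
use strict;
use warnings;

# Made-up layout: name at offset 0 (10 chars), state at offset 10
# (2 chars), zip at offset 12 (5 chars).
my @fields = qw/name state zip/;
my $pat    = '@0 A10 @10 A2 @12 A5';

# Build one fixed-width line; %-Ns pads each field with spaces.
my $line = sprintf '%-10s%-2s%-5s', 'SMITH', 'NY', '10001';

my %rec;
@rec{@fields} = unpack $pat, $line;

# A-format strips the trailing padding for us.
print "$_ => $rec{$_}\n" for @fields;   # name => SMITH, etc.
```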
I can copy the hash to another hash in about a second, consuming
only 100MB or so of memory. This means it copied only the
top-level keys and reference values - a shallow copy. The record
hashes themselves are shared between the two hashes, so a change
made to a record through either hash is visible through both.
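A quick way to see what that one-second copy actually did: the new hash ends up holding the same record references, so the inner record hashes are shared rather than duplicated. A toy demonstration:

```perl
use strict;
use warnings;

# One toy record, then a copy of the top-level hash.
my %holder = ( 1 => { first => 'SMITH' } );
my %copy   = %holder;

# The copy holds the same reference, not a new record...
print "shared record\n" if $copy{1} == $holder{1};

# ...so a change made through one hash shows up in the other.
$copy{1}{first} = 'JONES';
print "$holder{1}{first}\n";   # prints JONES
```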
I had always thought that Perl arrays would be a bit quicker
or a bit less memory intensive, and that it might make sense
to work a bit harder to use them when tight on memory.
I was wrong.