
Hey Ben - Perl memory / performance research - (broomberg)
I believe you have mentioned before that you did not consider
Perl a good candidate for memory-constrained processing.

I just set up a new dual Xeon to act as a business-specific
compute server. Since this box only has 3GB of RAM, I figured I'd
get a feel for my limitations and the required programming style.

I'm running Linux 2.6.9-34.ELsmp - RH ES 4.

I typically need to load a bunch of records in memory,
match them against other records, sort, munge, report, etc.

I try to figure out a unit of work that fits in memory
without forcing me to fall back on external storage. I
also need to figure out the point at which it makes sense to
externalize everything and work in a database such as Postgres
or Oracle.

This is not for our database group, which means the files are small:
100,000 records to process would be a LOT. Of course, I wanted
a margin of safety, so I made the base file 10 times
that, i.e. 1 million records.

The file is 1.2GB in size.

I wanted a baseline, best case scenario. So I read the data
and pushed it into an array:

my @h;
while (<>) {
   push (@h, $_);
}


This took 9 seconds and consumed 1.2 GB.

OK, so I can store over 100MB per second, and it seems to
have negligible space overhead when using an array.
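
For reference, here is a minimal sketch of how the timing and resident-memory figures above can be checked from inside the script; the VmRSS lookup assumes a Linux /proc filesystem (as on the RH ES 4 box above) and is illustrative rather than part of the original test.

my @h;
my $start = time();
while (<>) {
   push (@h, $_);
}
print "load time: ", time() - $start, " seconds\n";

# On Linux, /proc/$$/status has a VmRSS line with the resident set size.
open(STAT, "<", "/proc/$$/status") or die "can't open /proc/$$/status: $!";
while (<STAT>) {
   print if /^VmRSS/;
}
close(STAT);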

Now do the same thing in a hash:

my %h;
while (<>) {
   $h{$.} = $_;
}


This took 11 seconds and consumed 1.2 GB. Hmm, slight CPU
increase, about the same memory.

Now extract the standard name, address, city, state, and zip
fields and save them into a hash. I also save the complete input
record:

my @fields = qw/first last addr2 addr1 city state zip/;
my $pat = '@91 A12 @103 A20 @123 A27 @151 A28 @179 A20 @199 A2 @201 A9';

my %holder;

while (<>) {
   my %rec;
   (@rec{@fields}) = unpack($pat, $_);
   $rec{ALL} = $_;
   $holder{$.} = \%rec;
}


The various key fields can be between 50 and 120 bytes (or so).

It takes 35 seconds to load the data and consumes 1.7 GB
of resident memory.

I can copy the hash to another hash in about a second, consuming
only 100MB of memory or so. This means it did a pointer copy
only, and has now set the memory to be COW (copy on write),
so it will only really copy the records if I modify the data.
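
For clarity, the copy described above amounts to something like the following sketch, using the %holder structure from the previous snippet; because each value is a reference to a %rec hash, only the top-level keys and reference scalars get duplicated.

# Copy the top-level hash. The values are hash references, so this
# duplicates the 1 million keys and reference scalars, not the
# per-record %rec hashes themselves.
my %copy = %holder;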


I had always thought that Perl arrays would be a bit quicker
or a bit less memory intensive, and that it might make sense
to work a bit harder to use them if I was tight on memory.

I was wrong.
I wouldn't expect them to allow you to save space - (jake123)
Only arrays that have a declared type and size would let you save space. If your arrays allow dynamic resizing, then you don't really have an array in the classic C sense; you have something else, with the typical overheads required for those kinds of data structures.
--
-------------------------------------------------------------------
* Jack Troughton                            jake at consultron.ca *
* http://consultron.ca                    irc://irc.ecomstation.ca *
* Kingston Ontario Canada                news://news.consultron.ca *
-------------------------------------------------------------------
Array vs. Hash performance - (dws)
When you push new data onto an array, Perl has to periodically grow the underlying data structure (which involves some copying). A lot of data pushed means a lot of growing/copying. With a hash, you're still growing the underlying data structure, but the impact isn't as bad, depending on the mix of keys and how they hash.

You can preallocate the array by doing something like

my @array = (undef) x 100_000;

and then assign into it by index. That'll eliminate the overhead you're seeing from push().
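
A minimal sketch of that preallocate-and-assign-by-index pattern, sized for the million-record test above (the exact count is whatever your input happens to be):

# Preallocate the slots up front so Perl never has to grow the array,
# then assign by line number instead of push()ing.
my @h = (undef) x 1_000_000;
while (<>) {
   $h[$. - 1] = $_;    # $. is the current input line number (1-based)
}
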
Shaves a second off - (broomberg)
Not consistently though.
Thanks.
The fastest way to pull the file in - (dws)
To get a baseline on how quickly you can get the entire file into memory, something like
use Fcntl;   # for O_RDONLY

my $contents;
sysopen(F, "somefile.dat", O_RDONLY) or die "open: $!";
sysread(F, $contents, -s F);
close(F);

is about as fast as you can go in Perl. But that might not help if you need to slice the file into pieces for processing, since the pieces will need yet more memory.
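
If the slices are fixed-width records like the ones in this thread, one way to keep the extra copies small is to walk $contents with substr() and pull out one record at a time; a rough sketch, assuming a fixed record length (the 1200 bytes here is just illustrative):

my $reclen = 1200;                              # assumed fixed record length
my $nrecs  = int(length($contents) / $reclen);
for my $i (0 .. $nrecs - 1) {
   my $rec = substr($contents, $i * $reclen, $reclen);
   # unpack()/process $rec here; only one record is held as a copy at a time
}
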
Sorry for not responding, I've been on vacation - (ben_tilly)
Yes, I have said that. Repeatedly.

How does that square with your results?

There are two problems that I was referring to.

The first problem is that the Perl interpreter takes up a minimum amount of space for itself plus your program. In a tight memory environment this can be an issue, particularly if you have lots of interpreters running concurrently, as happens with mod_perl.

The second problem is that Perl data structures have a lot of fixed overhead that you cannot remove. For instance, a string takes up 28 bytes plus whatever data is in the string. If you've used the string as both a string and a number, it takes up more. An array takes up 28 bytes. Etc. (You can use Devel::Size to figure out these numbers.)
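
A quick sketch of what that looks like with Devel::Size (the exact byte counts depend on the Perl version and build):

use Devel::Size qw(size total_size);

my $str   = "x" x 100;
my @array = (1 .. 10);
my %hash  = (a => 1, b => 2);

# size() measures the structure itself; total_size() follows references
# and includes everything the structure points to.
print "string: ", size($str), " bytes\n";
print "array:  ", size(\@array), " bytes (", total_size(\@array), " total)\n";
print "hash:   ", size(\%hash),  " bytes (", total_size(\%hash),  " total)\n";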

But in your case your data far outweighs the overhead of the interpreter. Also, you had 1 million lines averaging about 1,200 bytes each. With the string overhead, each line takes about 1,228 bytes to store, so your memory overhead is negligible. When you break those lines up into smaller pieces, the overhead gets bigger. If you then tried to save space by trimming off the blank space created by the fixed-width format, you'd save space, but you'd find that the overhead becomes a lot bigger proportionately.

Basically if you want to have lots of copies of Perl running, or a data structure with tons of small data elements, Perl burns memory quickly. If you have few Perl interpreters and only big data elements, Perl's memory overhead won't seem bad at all. And since Perl doesn't do anything that is particularly stupid internally, its performance is generally reasonable. But if you start working with data byte by byte, you'll find that the overhead of assigning and unassigning those big Perl data structures kills your performance.

Incidentally, the picture has improved in recent Perls. The 5.8 series, as you discovered, enabled copy-on-write in various circumstances, which can save a lot of memory for some folks. And the upcoming 5.10 series found ways to trim a lot of basic data structures.

But still there are a lot of cases where people will be left asking how Perl managed to waste so much memory.

Cheers,
Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
     Hey Ben - Perl memory / performance research - (broomberg) - (5)
         I wouldn't expect them to allow you to save space - (jake123)
         Array vs. Hash performance - (dws) - (2)
             Shaves a second off - (broomberg) - (1)
                 The fastest way to pull the file in - (dws)
         Sorry for not responding, I've been on vacation - (ben_tilly)
