

New Java String field optimization tips?
We are developing _many_ server resident components in Java.
Our design pattern uses JDBC - with most selects for a component sharing everything except the terminal 'AND' clause.

I've been placing the common string into a field declared

private final String RETRIEVE_SQL = "select column, column2 "
        + "from table "
        + "where AnyCommonWhereCondition ";


One of the developers seems to think that changing the string declaration to
private static final String 
will make things dramatically faster. I think that is premature optimization - if we are that concerned we should be using StringBuffers anyway.

Any real world experience? [JDK 1.2.2 or 1.3.1, with or without HotSpot]

Dave Levitt
How many CPU cycles, to draw the head of a pin?
New Strings shouldn't matter.
You should be storing queries that occur more than once in PreparedStatements, so they only need to be parsed once.
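Something like this, for instance - a minimal sketch with made-up class, table, and column names, and error handling omitted:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ComponentDao {
    // Prepared (parsed) once, reused for every lookup.
    private static final String RETRIEVE_SQL =
        "select column1, column2 from table where id = ?";

    private final PreparedStatement retrieveStmt;

    public ComponentDao(Connection conn) throws SQLException {
        retrieveStmt = conn.prepareStatement(RETRIEVE_SQL);
    }

    public void printRow(int id) throws SQLException {
        retrieveStmt.setInt(1, id);            // only the bind variable changes
        ResultSet rs = retrieveStmt.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString(1) + ", " + rs.getString(2));
        }
        rs.close();
    }
}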
Regards,

-scott anderson

"Welcome to Rivendell, Mr. Anderson..."
New JDBC PreparedStatements
The strings are used in the creation of PreparedStatements - oddly enough, we are reconsidering the use of PreparedStatements based on some information in an O'Reilly book on Oracle & JDBC - the sample chapter on the web site shows that [in the author's tests] PreparedStatement operations are slower than plain Statements.
New Depends on the database
I've seen that assertion in regards to SQL Server, but not Oracle.

Link?
Regards,

-scott anderson

"Welcome to Rivendell, Mr. Anderson..."
New But back to the main question...
Why are you spending so much time worrying about a few Strings if they only get used once?

Use a performance profiling tool and find your real bottlenecks, instead of guessing.
Regards,

-scott anderson

"Welcome to Rivendell, Mr. Anderson..."
New Yeeha, Scott
But you wouldn't believe (er, on second thought, maybe you would) the companies that won't spend $500 or $1000 (or whatever it is, I think the max cost I've seen was $3000) on a profiling tool that *really* seeks out inefficiencies.

On second thought, many of the profiling tools I've used under Unix became worthless when linked against vendors' object-only libraries. cc compiled with profiling (with third-party object libraries) loses itself in untraceable links more often than not, in my experience.
"Beware of bugs in the above code; I have only proved it correct, not tried it."
-- Donald Knuth
New Java profiling usually works pretty well
I'll dig up some links.
Regards,

-scott anderson

"Welcome to Rivendell, Mr. Anderson..."
New I believe that, no links required
Unless they call native code, Java classes are reasonably exposed to a profiler.

Alas, I've run into C code that ran into the roadblock of vendor libraries. You can get some good information out of cc compiled for gprof, but many times it is only tantalizing as to where it leads you.
"Beware of bugs in the above code; I have only proved it correct, not tried it."
-- Donald Knuth
New Try: JProbe...
New Oh I believe it
Like the people I've talked to who don't like using generalized functions stored in Postgres. They prefer to hand tune the queries. When you're querying terabytes of data, maybe. But we were talking about a templating system for a (fairly) low-volume extranet. Take the money you save in developers' time (ten minutes per query x several dozen pages) and double the amount of RAM in your webserver.

There are times throwing hardware at it is the solution. Otherwise we'd all be writing in assembly.
We have to fight the terrorists as if there were no rules and preserve our open society as if there were no terrorists. -- [link|http://www.nytimes.com/2001/04/05/opinion/BIO-FRIEDMAN.html|Thomas Friedman]
New Kinda missing the point...
... the point is to optimize the stuff that gets run often. Loops, queries, repeated operations. Not one-time inits, or static vs. non-static strings that are used once as in this case.

The profiler will tell you what's taking the most time in the app. Hit the hot spots with the optimizations and you'll get the most return for your time.
Regards,

-scott anderson

"Welcome to Rivendell, Mr. Anderson..."
New Real world experience
We used to create CSV files from our application using the string concatenation operator. We recently changed them to StringBuffer.append() calls.

Sped up creation of said CSV files from hours to minutes. Literally.
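Roughly the kind of change I mean (field names invented for the sketch):

public class CsvSketch {
    // Builds one CSV line per record.  The "+" version allocates a new
    // String (and recopies the old contents) on every append; the
    // StringBuffer version appends into a single growable buffer.
    static String slowRow(String[] fields) {
        String row = "";
        for (int i = 0; i < fields.length; i++) {
            row += fields[i];                      // new String each time
            if (i < fields.length - 1) row += ",";
        }
        return row;
    }

    static String fastRow(String[] fields) {
        StringBuffer row = new StringBuffer();
        for (int i = 0; i < fields.length; i++) {
            row.append(fields[i]);                 // appends in place
            if (i < fields.length - 1) row.append(",");
        }
        return row.toString();
    }

    public static void main(String[] args) {
        String[] fields = { "id", "name", "amount" };
        System.out.println(slowRow(fields));
        System.out.println(fastRow(fields));
    }
}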
-YendorMike

"The problems of the world cannot possibly be solved by the skeptics or the cynics whose horizons are limited by the obvious realities. We need people who dream of things that never were." - John F. Kennedy
New (The technical reason)
For Java newbies, the technical reason is that when you add two strings together, you're actually creating new String object(s) and destroying others. In bad situations, that can lead to creating and destroying objects like mad. Not intuitive behavior, but that's how String works.
"Beware of bugs in the above code; I have only proved it correct, not tried it."
-- Donald Knuth
New The technical reason behind the technical reason
Efficiency, believe it or not. Strings are immutable objects. The interpreter knows that a String will never change. As a result, Strings store their internals as a char array. If you create a new String as a substring of a string, you actually just create an object with an offset and length into the old string's char array. This is quite fast when you are doing a lot of substrings and the like.

But when you add them together, you end up copying two char arrays into a third char array, and creating a new String object from that.

Unfortunately, more people add strings together than take substrings, so you end up needing something like a StringBuffer, which appends strings without all that copying.
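A toy sketch of the idea in Java - not the real java.lang.String source, just the shape of it:

// Toy illustration: substring() shares the original char array instead of
// copying, while concat() has to copy both halves into a brand-new array.
final class ToyString {
    private final char[] value;   // shared backing array
    private final int offset;
    private final int count;

    ToyString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    ToyString substring(int begin, int end) {
        // No copying - just a new "view" into the same array.
        return new ToyString(value, offset + begin, end - begin);
    }

    ToyString concat(ToyString other) {
        // Copying - both halves land in a new array.
        char[] joined = new char[count + other.count];
        System.arraycopy(value, offset, joined, 0, count);
        System.arraycopy(other.value, other.offset, joined, count, other.count);
        return new ToyString(joined, 0, joined.length);
    }

    public String toString() {
        return new String(value, offset, count);
    }
}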
Regards,

-scott anderson

"Welcome to Rivendell, Mr. Anderson..."
New I don't believe it
If substrings are the issue, then you can have your cake and eat it too.

Make every string a structure with a length, offset, and a pointer to a string storage structure. Make the string storage structure have a maximum length and a pointer to the start of the actual string data. The string data is just an array of bytes.

The space is allocated in powers of 2. If you go to append and you still have room, you just insert the data. Otherwise you reallocate room, move the existing string, free the old space and allocate at the new place. This makes incremental appends perform just fine. Taking substrings is just as fast. The overhead is seen from following one extra pointer when finding the string.
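The append half of that, sketched in Java for the sake of this thread (a toy - no sharing or separate storage structure, just the power-of-2 growth):

// Toy mutable string: the backing array only gets reallocated (doubled)
// when it runs out of room, so building a big string stays roughly linear.
final class GrowableString {
    private char[] data = new char[16];
    private int length = 0;

    void append(String s) {
        int needed = length + s.length();
        if (needed > data.length) {
            int newSize = data.length;
            while (newSize < needed) newSize *= 2;   // powers of 2
            char[] bigger = new char[newSize];
            System.arraycopy(data, 0, bigger, 0, length);
            data = bigger;                           // the old array gets collected
        }
        s.getChars(0, s.length(), data, length);
        length = needed;
    }

    public String toString() {
        return new String(data, 0, length);
    }
}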

This is not entirely dissimilar to what you find in, say, Perl or Ruby. (Neither is exactly like this, though I think that Ruby is closer. A lot closer.) It makes building up large strings incrementally perform just fine. Substrings can be taken efficiently (well, not in Perl - but look at how Ruby implements backreferences lazily, for instance).

But even so in Ruby the += operator is *still* slow, for exactly the technical reason given above. You are creating lots of new objects. Why? Well, very simple. += has defined semantic effects. You can't get those semantic effects without creating new objects. If you want an efficient string append you have to use the special << operator because the semantic effects are visible. Observe:

# Create 2 strings
init_a = "Hello";
init_b = "Hello";
# Duplicate them.
dup_a = init_a;
dup_b = init_b;
# Append to the dups
dup_a += ", World";
dup_b << ", World";
# What do we have?
puts "init_a is '#{init_a}'"; #-> "Hello"
puts "init_b is '#{init_b}'"; #-> "Hello, World"
puts "dup_a is '#{dup_a}'"; #-> "Hello, World"
puts "dup_b is '#{dup_a}'"; #-> "Hello, World"


And there we see why, even with smart data structures where efficiency is achievable, the += operator still has to be slow.

Cheers,
Ben
New That would help, but....
1. I believe Java tries to conserve memory by merging duplicate strings of class String. So you can easily have two String objects pointing to the same buffer. Edit one of them, and the runtime has to alloc a completely new buffer for it. Even if you're shrinking it.

2. If you alloc in powers of 2, you'll have beaucoup wasted space, and the spectre of disk thrashing lurks in the shadows. This isn't a criticism of the basic concept, but some fine tuning is called for. Now if you increment by a fraction, say 5/4, you get a different tradeoff ratio. Less wasted memory, but more frequent allocs.

In fact, it's rumored that some Java implementations do something like this.
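A quick back-of-the-envelope look at that tradeoff (the target of 1,000,000 chars is arbitrary):

public class GrowthFactors {
    // Grow a buffer from 16 chars until it can hold the target, counting
    // reallocations and the slack left over at the end.
    static void simulate(String label, double factor, long target) {
        long capacity = 16;
        int reallocs = 0;
        while (capacity < target) {
            capacity = (long) Math.ceil(capacity * factor);
            reallocs++;
        }
        System.out.println(label + ": " + reallocs + " reallocations, "
            + (capacity - target) + " chars of slack at the end");
    }

    public static void main(String[] args) {
        simulate("x 2  ", 2.0, 1000000);   // few reallocs, more wasted space
        simulate("x 5/4", 1.25, 1000000);  // more reallocs, less wasted space
    }
}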

But there's no beating the fine control you get with your own StringBuffers. That way, the efficiency is more directly influenced by the ability of the programmer. I say use String for prototyping and for code that doesn't get exercised much, but use StringBuffer for high usage production code. Or else code it in C or C++.
[link|http://www.angelfire.com/ca3/marlowe/index.html|http://www.angelfir...e/index.html]
Sometimes "tolerance" is just a word for not dealing with things.
New The scheme addresses those acceptably well
It is, of course, a compromise between issues. You wind up wasting about a third of your space. (Much less if you just apply the simple heuristic of only creating a string with as little memory as you can, then using the power of 2 trick if someone starts appending to it.) If you want to share objects you either need to do some fancy footwork, or else define semantics in which programmers choose what happens. (Ruby does this, see the code example I gave.)

That algorithmic efficiency and memory usage conflict is no surprise. That is usually true. Scalable algorithms tend to use buffering. Buffering costs memory. And vice versa: contortions to avoid using extra memory take extra operations.

As for your suggestion, mine is to not use Java. In other languages you get reasonable default behaviour without having to know language trivia about available types with precise promised tradeoffs. Besides which, string manipulation is not exactly one of Java's strengths.

Cheers,
Ben
New Compile time only
We are not concatenating the strings at runtime - only at compile time.

Hopefully the compiler is smart enough to use a single literal string - literals are supposed to be java.lang.String.intern()'d by the compiler - so only one copy of each should exist - even if multiple copies of the class are constructed.
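Easy enough to check with a quick sketch:

// Constant string expressions should fold to a single interned String,
// so an identity (==) comparison comes back true, not just equals().
public class LiteralCheck {
    private static final String A = "select * " + "from table ";
    private static final String B = "select * from table ";

    public static void main(String[] args) {
        System.out.println(A == B);             // true: one interned constant
        String runtime = "select * ";
        runtime = runtime + "from table ";      // concatenated at runtime
        System.out.println(runtime == B);       // false: a distinct object
        System.out.println(runtime.equals(B));  // true: same contents
    }
}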
New Are you sure?
If it's static, I'm almost certain the expression will be evaluated at compile time, or at the latest, evaluated just once when the program starts up. But if you leave out the "static," the "final" may not be enough to do it. The compiler may not be clever enough, and it would end up being evaluated every time you create an instance.

Put in the static keyword. At worst it's redundant. Better safe than sorry.
[link|http://www.angelfire.com/ca3/marlowe/index.html|http://www.angelfir...e/index.html]
Sometimes "tolerance" is just a word for not dealing with things.
New Compile time concatenation
The JDK compiler will generate identical byte code for the following:

private final String x = "ABC";
private final String y = "A" + "B" + "C";

The concatenation is done by the compiler, not at runtime. The compiler is even smart enuf to do the compile time concatenation with a constant variable (i.e. final) if the value of the constant is fixed at compile time. This means that the z string below will produce the same results as above:

private final String b = "B";
private final String z = "A" + b + "C";

OTOH, if b was not a final value, then this optimization would not take place, and two concats would happen at runtime, at the point in the code where the declaration appears.
New Sounds like a shitty compiler.
Seems to me, with a wee bit of work this bit of code:

String x = "Hello ";
String y = System.getProperty("user");

String z;

the expression

z = x + y;

should be interpreted by the compiler the same as

z = new StringBuffer(x).append(y).toString();

and any longer expression like z = a + b + c + d;

should be

z = new StringBuffer(a).append(b).append(c).append(d).toString();

The fact that this is not what happens sounds to me like bad compiler implementation.
The average hunter gatherer works 20 hours a week.
The average farmer works 40 hours a week.
The average programmer works 60 hours a week.
What the hell are we thinking?
New Wrong, syntactically
should be interpreted by the compiler the same as

z = new StringBuffer(x).append(y).toString();

You can't assign a StringBuffer to z because it's a String. And Java doesn't support the "+" operator for StringBuffers, so declaring everything as StringBuffers is hosed, too.

Basically, string handling in Java looks sorta neat but it's almost as half-baked as the AWT.
"Beware of bugs in the above code; I have only proved it correct, not tried it."
-- Donald Knuth
New Read it again
The expression:

String z = new StringBuffer(x).append(y).toString();

is of type String (note the toString() on the end there).

What I meant to say was if the compiler sees

z = x + y;

it should read it as if the user wrote the StringBuffer version.

It's not wrong.
The average hunter gatherer works 20 hours a week.
The average farmer works 40 hours a week.
The average programmer works 60 hours a week.
What the hell are we thinking?
New But what about the general case?
The programmer writes a function to append another row onto the string in memory. That function calls other functions to get individual entries, and that function is in turn called many times by other functions. This is exactly the case in question.

Knowing the overall usage of that function, it would be easy to optimize; compiling just that function in isolation, it is no surprise at all that the compiler has trouble. (Also see my Ruby example, where attempting to aggressively optimize += can lead to visible semantic changes you wouldn't want.)
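To make that concrete (the names and sizes are invented for the sketch):

// The compiler only ever sees appendRow() in isolation; it cannot know the
// caller invokes it thousands of times, so each "+" quietly recopies the
// entire report built so far - O(n^2) in total characters copied.
public class ReportSketch {
    static String appendRow(String report, String[] fields) {
        String row = "";
        for (int i = 0; i < fields.length; i++) {
            row += fields[i] + "\t";       // small and harmless-looking
        }
        return report + row + "\n";        // copies the whole report again
    }

    public static void main(String[] args) {
        String report = "";
        for (int i = 0; i < 10000; i++) {
            report = appendRow(report, new String[] { "a", "b", "c" });
        }
        System.out.println(report.length());
    }
}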

Cheers,
Ben
New Re: But what about the general case?
That's sort of a degenerate case, I think.

The right answer is that the language should enforce that Strings are immutable instead of being wishy washy about it. The very existence of += on String is evil. Were it eliminated, programmers would have to declare mutable string references as StringBuffer (which is a stupid name - MutableString would have been better), which is more efficient.

Interestingly, in ObjectiveC, the opposite is true.

NSString *a = @"Hello";
[a stringByAppendingString: @", world!"];

is more efficient than

NSMutableString *a = [NSMutableString stringWithString:@"Hello"];
[a appendString: @", world!"];

I'm not sure why, but it's cheaper to create the new strings in ObjC than it is to use the mutable string.
The average hunter gatherer works 20 hours a week.
The average farmer works 40 hours a week.
The average programmer works 60 hours a week.
What the hell are we thinking?
New (arguing)
One can argue that Java string handling is just plain bad, even if you do know what they did and why they did it. Better to have ignored the "+" case than to have weirded around it.
"Beware of bugs in the above code; I have only proved it correct, not tried it."
-- Donald Knuth
New I always get dubious...
When someone says there is a single right answer.

Particularly when it is an answer that makes a set of tradeoffs which doesn't fit some common situations.

Immutable strings can have a simpler structure and are faster to access (to allow resizing you need to allow moving the string). However, if you are incrementally building up a large string out of immutable strings, the recopying forces you into an O(n^2) algorithm.

Now, you say, just use a StringBuffer (or whatever the current flavour is) object. Well yes. That works. If you try to implement it in code, though, you will eventually hit other bottlenecks as the garbage collector tries to collect a huge number of little strings. This happens much later, but I have been there, done that. Collecting lots of little objects eventually becomes its own problem. If the buffer is not natively implemented, then you have to put some real thought into how it scales. (Having done this in JavaScript I can report that a good heuristic is that when you push a StringBuffer onto a StringBuffer, you resolve the smaller one. Now create one buffer per row, push those onto the one per table, and scalability is acceptable.)
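The row/table heuristic, translated from the JavaScript original into Java just to show the shape of it (the names and sizes are invented):

public class NestedBuffers {
    public static void main(String[] args) {
        StringBuffer table = new StringBuffer();
        for (int r = 0; r < 1000; r++) {
            StringBuffer row = new StringBuffer();    // one small buffer per row
            for (int c = 0; c < 10; c++) {
                row.append("cell_").append(r).append('_').append(c).append('\t');
            }
            row.append('\n');
            table.append(row.toString());             // resolve the smaller buffer
        }
        System.out.println(table.length());
    }
}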

You know, this is really a headache. I have been there, done that, and don't like it.

Back to the mutable string. OK, it is an immediate significant overhead. Hrm. But go to incrementally build up a string and it bloody works! The first time. The first thing you try. No need to show how good of a programmer you are, how smart you are. You can go off and be smart about something else.

If that initial overhead is acceptable, you don't ever get into having to waste your time worrying about this cr*p. Been there, done that as well. And given current CPU power, I really prefer always wasting some computer time and not having to take out a day or 3 doing performance profiling so I can figure out how to rewrite code that should work the first time.

So I would say that there is no truly "right answer". There are cases where mutable strings make sense. Cases where immutable strings make sense. For me, most of the time, I prefer having mutable strings that work out reasonably well for any access problem, that is good enough. If it isn't good enough some time, then I will make the hard decisions.

Cheers,
Ben
New Don't know why
When I say right answer - I mean right answer to the Java language design decision to have immutable strings that appear to be mutable. It's an efficiency trap. Things that are runtime inefficient shouldn't be easy to write. This is the philosophy behind the C++ STL. You don't implement += on an iterator that has access time O(n). It's deceitful. You would implement += on an iterator that has access time O(1); that makes sense.

I feel the same way about Java's String += operator. If String is immutable, then why did you give me this little thing that appears to modify the string? Again, it's deceitful.

(OT - This is one of the things that caused me to pitch C++, its inherent unpredictability. Given the wackiness of operator overloading, I was never entirely sure if a given line of code was expensive or not.)

If your goal is to incrementally build up a big string by continuously modifying some string, then what you want is a string you can modify. Declaring your container a String is a bad design choice that appears to be sanctioned by the language designers, since they gave you the nifty += and + operators. It's misleading.

I think your other comments had more to do with making the transition from string to text. Text is a much more complicated thing than a string.
The average hunter gatherer works 20 hours a week.
The average farmer works 40 hours a week.
The average programmer works 60 hours a week.
What the hell are we thinking?
New Whacky overloading. Yes.
"Beware of bugs in the above code; I have only proved it correct, not tried it."
-- Donald Knuth
New That I will agree with
If you give people basic operations, try to make them efficient ones.

That is one thing that Perl has historically done very well on. Everything has a significant overhead. Assume it is about a factor of 10. But its native data types (of which there are very few) can be put together easily to implement virtually any easily stated algorithm. No, you can't choose to make subtle trade-offs. But it is darned easy to get something that works, and when it works it is probably not, barring your stupidity and that factor of 10, going to be that bad.

Getting everything just right might be really fun, but get it wrong and you are in trouble. If you have that factor of 10 to throw away up front, well, I create enough of my own problems to think about. (What do you mean I have deep recursion...?)

Cheers,
Ben
New Depends
Strictly speaking, the Java compilers are a bit smarter in that the compiler does the concatenation at compile time for adjacent constant strings. In other words, the plus signs (concat operators) in the expression you give are not done at runtime. Both versions will store a single string in the constant pool, which will be loaded into the heap when assigned to a variable (or field). The result is that the GC will be called in both instances.

Been awhile since I've looked at the decompiled code, but just glancing at the results, I don't see any significant difference:


class TestMe {
    private final String RETRIEVE_SQL = "select column1, column2 "
        + "from table "
        + "where AnyCommonWhereCondition ";

    private static final String STATIC_SQL = "select column1, column2 "
        + "from table "
        + "where AnyCommonWhereCondition ";

    public static void main(String argv[]) {
        TestMe me = new TestMe();
    }

    TestMe() {
        String s;

        s = RETRIEVE_SQL;
        s = STATIC_SQL;
    }
}


.source TestMe.java
.class TestMe
.super java/lang/Object

.field private final RETRIEVE_SQL Ljava/lang/String;
= "select column1, column2 from table where AnyCommonWhereCondition "
.field private static final STATIC_SQL Ljava/lang/String;
= "select column1, column2 from table where AnyCommonWhereCondition "

.method public static main([Ljava/lang/String;)V
.limit stack 2
.limit locals 2
.line 11
new TestMe
dup
invokespecial TestMe/<init>()V
astore_1
.line 12
return
.end method

.method <init>()V
.limit stack 2
.limit locals 2
.line 14
aload_0
invokespecial java/lang/Object/<init>()V
.line 2
aload_0
ldc "select column1, column2 from table where AnyCommonWhereCondition "
putfield TestMe/RETRIEVE_SQL Ljava/lang/String;
.line 17
ldc "select column1, column2 from table where AnyCommonWhereCondition "
astore_1
.line 18
ldc "select column1, column2 from table where AnyCommonWhereCondition "
astore_1
.line 19
return
.end method


New I don't think it will affect speed
but you'll have an extra field per instance if you don't use static. I don't see any reason not to use it.
New Yep, its a size issue.
Without static, you have an instance variable in every instance, all pointing to the same string (the compiler arranges for equivalent string literals to be identical). So you pay for 1 object reference per instance of the enclosing class.

If you make it static, you have one object reference, period. So you might save 4 bytes per instance by using static, and since it's a constant (it *is* a constant, right?) it's a more rational way to do things.

The only speed impact would be at object construction time, the cost of initializing the per instance reference to the string. This is likely negligible - I doubt you could measure it without creating 100k objects in a tight loop and measuring the difference.
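The sort of loop you would need just to notice it (a rough sketch; the numbers will vary wildly by VM and machine):

public class ConstructionTiming {
    static class WithInstanceField {
        private final String SQL = "select column1, column2 from table ";
    }

    static class WithStaticField {
        private static final String SQL = "select column1, column2 from table ";
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) new WithInstanceField();
        long mid = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) new WithStaticField();
        long end = System.currentTimeMillis();
        System.out.println("instance field: " + (mid - start) + " ms");
        System.out.println("static field:   " + (end - mid) + " ms");
    }
}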
The average hunter gatherer works 20 hours a week.
The average farmer works 40 hours a week.
The average programmer works 60 hours a week.
What the hell are we thinking?
New Re: Java String field optimization tips?
Let me provide some clarification on this issue (even though it appears to have been beaten to death).

As of JDK 1.3 (I believe), there is no difference between doing a String concatenation ( "foo" + "bar" ) and using a StringBuffer to do the same thing. So, moving string concatenations to StringBuffers will make no speed difference in the newer JVMs (I'm assuming we're referring to Sun's JVM, as I'm not sure what IBM's does in the same situation).

As for the need for static...I don't believe it will add value. The final means that it's a constant, which means that it shouldn't even create a new value for each instance (I believe, when it's final, that it affects the *scope* of the field, and nothing else). I'm not 100% sure of that, but I believe that to be the case.

The point is that, to a certain extent, the programmer shouldn't be too worried about certain optimizations, as it's a JVM issue, not a language issue. Obviously, this can be a problem when the JVM is a bottleneck and the programmer has to work around it. The nice thing about this is that newer JVM's speed up existing code (for free...no code changes).

Another example is the way the garbage collector works...lots of small, local object creations aren't that interesting anymore (that is, they aren't the bottleneck). So, doing the string concatenation doesn't cost as much in the new JVMs. In fact, in one project I worked on, I'd built an object pool. It worked really nicely in JDK 1.2, but when we ran it in JDK 1.3, it actually was slower (it sped up after I removed the pool). Of course, I'd better note that the pool was accessible across multiple threads, so the synchronization was the speed cost (but I had to have it, because of the multiple threads).

Dan Shellman
     Java String field optimization tips? - (dlevitt) - (33)
         Strings shouldn't matter. - (admin) - (9)
             JDBC PreparedStatements - (dlevitt) - (8)
                 Depends on the database - (admin)
                 But back to the main question... - (admin) - (6)
                     Yeeha, Scott - (wharris2) - (5)
                         Java profiling usually works pretty well - (admin) - (2)
                             I believe that, no links required - (wharris2)
                             Try: JProbe... -NT - (slugbug)
                         Oh I believe it - (drewk) - (1)
                             Kinda missing the point... - (admin)
         Real world experience - (Yendor) - (18)
             (The technical reason) - (wharris2) - (7)
                 The technical reason behind the technical reason - (admin) - (6)
                     I don't believe it - (ben_tilly) - (5)
                         That would help, but.... - (marlowe) - (1)
                             The scheme addresses those acceptably well - (ben_tilly)
                         Compile time only - (dlevitt) - (2)
                             Are you sure? - (marlowe)
                             Compile time concatenation - (ChrisR)
             Sounds like a shitty compiler. - (tuberculosis) - (9)
                 Wrong, syntactically - (wharris2) - (8)
                     Read it again - (tuberculosis) - (7)
                         But what about the general case? - (ben_tilly) - (6)
                             Re: But what about the general case? - (tuberculosis) - (5)
                                 (arguing) - (wharris2)
                                 I always get dubious... - (ben_tilly) - (3)
                                     Don't know why - (tuberculosis) - (2)
                                         Whacky overloading. Yes. -NT - (wharris2)
                                         That I will agree with - (ben_tilly)
         Depends - (ChrisR)
         I don't think it will affect speed - (Arkadiy) - (1)
             Yep, its a size issue. - (tuberculosis)
         Re: Java String field optimization tips? - (dshellman)
