Performance Quiz #8 -- The problems with parsing -- Part 4
In my last posting I made some recommendations about how to drastically improve the evaluation process, based on the key observation that even the basic operations required to evaluate the truth of the facts were quite costly -- more so than all of the extra allocations.
I suggested a plan like this one:
preparse the predicates so that no string processing is required for evaluation
assign facts a number in sequence as they are encountered in the predicates
the preparsed predicate is a mix of fact numbers and operators probably in postfix order
when looking up facts simply access an array of truth bits indexed by fact number
as facts change simply change the array contents and then reevaluate predicates as desired
do the evaluation itself with a simple stack machine; no recursion in the evaluation
And as promised here's some code that does just that: parser_compling.cs
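If you don't want to open the file, the heart of the approach is the postfix evaluator. A stripped-down sketch of its shape looks something like this (the operator codes below are invented for illustration; the attached file is the real thing):

```csharp
// Sketch only: negative codes stand for operators, non-negative codes are fact numbers.
const int OpAnd = -1, OpOr = -2, OpNot = -3;

static bool[] facts;   // truth bit for each fact, indexed by fact number

static bool EvaluatePredicate(int[] predicate)
{
    bool[] stack = new bool[32];   // far deeper than these predicates ever need
    int top = -1;

    foreach (int code in predicate)
    {
        switch (code)
        {
            case OpNot:
                stack[top] = !stack[top];
                break;
            case OpAnd:
                stack[top - 1] = stack[top - 1] & stack[top];
                top--;
                break;
            case OpOr:
                stack[top - 1] = stack[top - 1] | stack[top];
                top--;
                break;
            default:
                stack[++top] = facts[code];   // operand: one array lookup, no strings, no hashing
                break;
        }
    }
    return stack[0];
}
```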
Now how do we fare with this code? Well the raw number I get on my machine is now 225ms to do those same predicate evaluations. That's down from 4.839s -- so it's over 21 times faster. Let's have a look at how the CPU usage is distributed now:
Module/Function | Exclusive % |
ParserCompiling.exe | 93.50 |
ParserCompiling.Program.EvaluatePredicate(int32[]) | 76.52 |
ParserCompiling.Program.Main(string[]) | 16.98 |
ParserCompiling.Program.CompilePredicate(string) | 0.00 |
ParserCompiling.Program.CompileOptionalNot() | 0.00 |
ParserCompiling.Program.GetToken() | 0.00 |
ParserCompiling.Program.CompileAnd() | 0.00 |
ParserCompiling.Program.CompileOr() | 0.00 |
mscorwks.dll | 3.15 |
ntdll.dll | 1.47 |
mscorjit.dll | 0.63 |
Unknown | 0.42 |
SHLWAPI.dll | 0.21 |
KERNEL32.dll | 0.21 |
mscorlib.ni.dll | 0.21 |
MSVCR80.dll | 0.21 |
shell32.dll | 0.00 |
mscoree.dll | 0.00 |
Things have really improved at this point: the vast majority of the time (76.52%) is now being spent directly doing predicate evaluation -- that's just where we want the time to go. Our new implementation is also much more cache friendly as the number of facts increases, because now we simply have an array of booleans (it could be even denser if it were an array of bits). Even if there were a few thousand facts we'd still have a tight representation -- the hash table approach tends to scatter things (deliberately) throughout a larger table. There are no redundant string comparisons and no hashing; it just goes voom :)
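For instance, System.Collections.BitArray would pack the facts one bit each; a wrapper along these lines (not in the posted code, just to show the idea) would keep even a few thousand facts in a few hundred bytes:

```csharp
using System.Collections;

// Hypothetical denser fact store: one bit per fact instead of one bool (one byte) per fact.
class FactBits
{
    readonly BitArray bits;

    public FactBits(int factCount)
    {
        bits = new BitArray(factCount);
    }

    public bool this[int factNumber]
    {
        get { return bits[factNumber]; }
        set { bits[factNumber] = value; }   // flip this as facts change, then reevaluate predicates
    }
}
```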
Now, what if we get out the big guns and start doing native code generation? Could we do even better? After all, this is managed code; we have the JIT at our disposal.
Tune in tomorrow :)
NOTE: When I originally posted this I reported the times for the debug build... I had to restate the results because the release build is approximately three times faster. While I was at it I re-verified that my previous numbers were on the release build and they were. Sorry about that...
Concluded in Performance Quiz #8 -- The problems with parsing -- Part 5
Comments
- Anonymous
November 29, 2005
Your evaluation code is basically identical to what I had, with two exceptions.
1. As I mentioned in an earlier comment, the top of the stack is an operand in every operation. By using an implicit top-of-stack element, you can eliminate quite a bit of unnecessary pushing and popping (see the sketch after this list).
2. I seem to recall that indexed array access is still faster than a foreach in .NET 2.0.
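For point 1, I mean something along these lines (just a sketch; the negative operator codes are made up, and anything that can't collide with a fact number will do):

```csharp
// The top of the stack lives in a local, so NOT touches no memory at all
// and AND/OR pop only one operand.
const int OpAnd = -1, OpOr = -2, OpNot = -3;

static bool EvaluatePredicate(int[] predicate, bool[] facts)
{
    bool[] stack = new bool[32];
    int depth = 0;
    bool top = false;

    for (int i = 0; i < predicate.Length; i++)   // indexed access, per point 2
    {
        int code = predicate[i];
        switch (code)
        {
            case OpNot:
                top = !top;                       // no push or pop
                break;
            case OpAnd:
                top = stack[--depth] & top;       // one pop; the result stays in 'top'
                break;
            case OpOr:
                top = stack[--depth] | top;
                break;
            default:
                stack[depth++] = top;             // spill the old top only when a new operand arrives
                top = facts[code];                // (the very first operand spills a dummy that is never popped)
                break;
        }
    }
    return top;
}
```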
I see you haven't optimized the compilation step much. With a 1,000,000:1 evaluation to compilation ratio, the compile time rounds down to a deceptively comforting 0.00. I would like to see the numbers with a more realistic 20:1 ratio. My expectation is that compilation takes up a lot (5-7x?) more than its fair share of the time.
A simple optimization would be to have GetToken return an int. This would eliminate about 60% of string allocations, and would avoid nearly all duplicate string comparisons. I see this shaving at least 20% off the compile time.
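To show the shape of what I mean (the real GetToken takes no arguments, and the token set here is assumed to be fact names plus '&', '|', '!' and parentheses, so everything below is illustrative):

```csharp
using System.Collections.Generic;

// Operators come back as negative codes and fact names come back as their fact
// numbers, so the compiler never allocates or compares operator strings.
const int TokEnd = -1, TokAnd = -2, TokOr = -3, TokNot = -4, TokLParen = -5, TokRParen = -6;

static int GetToken(string predicate, ref int pos, Dictionary<string, int> factIds)
{
    while (pos < predicate.Length && predicate[pos] == ' ')
        pos++;
    if (pos >= predicate.Length)
        return TokEnd;

    switch (predicate[pos])
    {
        case '&': pos++; return TokAnd;
        case '|': pos++; return TokOr;
        case '!': pos++; return TokNot;
        case '(': pos++; return TokLParen;
        case ')': pos++; return TokRParen;
    }

    // a fact name: the only remaining string allocation
    int start = pos;
    while (pos < predicate.Length && "&|!() ".IndexOf(predicate[pos]) < 0)
        pos++;

    string name = predicate.Substring(start, pos - start);
    int id;
    if (!factIds.TryGetValue(name, out id))
    {
        id = factIds.Count;        // facts are numbered in order of first appearance
        factIds.Add(name, id);
    }
    return id;
}
```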
Turning the compiler into a hand-crafted state machine would be even better. The grammar is simple enough, but it may not be worth the effort.
As for native compilation: I don't believe the overhead of JITting the predicate code is worth it for only 20 executions. Our stack machine is very efficient.
- Anonymous
November 29, 2005
I'd like to disagree with Jeffrey. Native code generation is the only sensible optimization. 76% of the time is in EvaluatePredicate just running through that switch statement.
Jeffrey says 20:1 is more "realistic" than 1,000,000 to one. I don't know how he can say that. We were never told where this code would be used. I will restrict my comments to the benchmark as it was presented.
Further optimization of the compiler isn't worth it right now. None of the current compilation routines shows up at more than a hundredth of a percent, so optimizing the compiler cannot give you more than 3-4 hundredths of a percent improvement, even if you made them take no time at all.
Let me put in a word here for the lightweight code gen components of System.Reflection.Emit. I have found it very speedy, and perfectly suited to this "small task" environment. We would only have to do a genuinely lightweight code generation and JIT once to save between 1 million and 6 million trips through the foreach/switch statement.
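Here's a rough sketch of the kind of thing I mean (names and operator codes are invented; this is not code from the post):

```csharp
using System;
using System.Reflection.Emit;

// The same idea as the stack machine, but emitting IL once per predicate so
// each evaluation is a single call into jitted native code.
// Negative codes stand for operators, non-negative codes are fact numbers.
const int OpAnd = -1, OpOr = -2, OpNot = -3;

static Func<bool[], bool> CompileToNative(int[] predicate)
{
    var method = new DynamicMethod("EvalPredicate", typeof(bool), new[] { typeof(bool[]) });
    ILGenerator il = method.GetILGenerator();

    foreach (int code in predicate)
    {
        switch (code)
        {
            case OpNot:
                il.Emit(OpCodes.Ldc_I4_0);      // !x  is  (x == 0)
                il.Emit(OpCodes.Ceq);
                break;
            case OpAnd:
                il.Emit(OpCodes.And);           // both operands are already on the IL stack
                break;
            case OpOr:
                il.Emit(OpCodes.Or);
                break;
            default:
                il.Emit(OpCodes.Ldarg_0);       // the facts array
                il.Emit(OpCodes.Ldc_I4, code);  // the fact number
                il.Emit(OpCodes.Ldelem_U1);     // load that fact's truth bit
                break;
        }
    }
    il.Emit(OpCodes.Ret);

    return (Func<bool[], bool>)method.CreateDelegate(typeof(Func<bool[], bool>));
}
```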
I think it's a win.
- Anonymous
November 29, 2005
Tomorrow's posting should clear the air :)
- Anonymous
November 29, 2005
Mike, I totally agree with you for the current benchmark. Native code generation wins hands down. There is simply no comparison.
However, Rico posted a comment in Part 3 that realistic numbers were: a few dozen facts, a few thousand predicates, and around 20 evaluations. My comment was made with these numbers in mind.
LCG does come with significant overhead. Constructing the IL stream is significantly more expensive than building our highly specialized stack machine. The JITter itself takes time too, especially when it tries to optimize; it ends up going through the code several times. All this overhead has to be outweighed by the difference between our stack-based evaluation and the native code... I have my doubts.
- Anonymous
November 29, 2005
Tomorrow I'll include some results for the benchmark as given (and I'll post source for that) plus I'll also include results for some scenarios that are closer to how it's actually going to be used. I'm not going to post source for that because 2000 lines of predicates seems just dumb :)
With so few predicates I had to up the iteration count to get repeatable times but I think with the extra data points I give tomorrow it should be clear what the impact of the setup is for both cases.
Tomorrow is the last installment I'm planning.
Hint: When have I ever not thrown a curve ball? :)