Costs, Modelling, and Managing Risk
Here's a little piece of mail I sent out to some folks today discussing some root causes of performance problems generally. I've written about basically all of this before, but here it is in summary form, perhaps mostly to prove that what I tell my colleagues here really is the same as what I tell you guys.
----
We do very little to encourage people to understand the properties of their algorithms before they start coding. Generally this is a much bigger problem than any machine level anomalies you might encounter. The “real costs” people need to know are more often at a much higher level than the machine.
The big two problems are almost invariably:
- The developer is using an algorithm that is fundamentally unsuitable for the task at hand
- The developer has taken a dependency on a technology that is fundamentally too costly in the relevant context
Those are broad categories but I say them that way to emphasize that the problem is almost certainly not something like “your algorithm requires 35 TLB slots and you only have 32 available”.
That said there is a prescription for success, and it is not “code it all up and then measure the heck out of it.” It’s too late by then and teaching such a practice teaches despair.
Engineering is about achieving predictable results for predictable costs. Notwithstanding that basic truth we rarely set out to predict anything in any kind of reasonable way. What I’ve been trying to teach for the last 3+ years now is a fairly simple process:
1. Decide what level of performance you are looking for in rough terms – do you want an “A+” or is a “C-” good enough? An “F” is never acceptable by definition.
2. Understand what that grade of performance looks like in the terms your customer thinks about.
3. Consider the limits the metrics above place on your consumption of resources (cycles, disk reads, network round trips, whatever is likely to be relevant).
4. Postulate an algorithm and then take steps to cost it out in terms of the resource(s) in (3) before you code it all. Be as detailed as is necessary but not gratuitously detailed.
The idea is to control risk. If you are shooting for a C-, chances are you can very easily demonstrate that you’ll be able to meet the goal because it should be an easy goal to hit. A few quick calculations on the back of a napkin will do the job. Contrariwise, if you are shooting for an A+ – world class performance – chances are that your margins are razor thin and you will be testing the limits of the hardware. You will want to spend a considerable amount of time trying things out and perhaps creating proof-of-concept implementations, models, etc. It will be cost effective to do so under strenuous requirements.
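To make the napkin calculation concrete, here is a minimal sketch of costing out two candidate designs against a budget. Everything in it is an assumption made up for illustration: the 200 ms budget, the unit costs, the resource names, and the two hypothetical designs. Substitute figures from your own requirements and experiments.

```python
from dataclasses import dataclass

# Assumed unit costs in milliseconds. Illustrative placeholders only,
# not measurements of any real system.
UNIT_COST_MS = {
    "disk_read": 8.0,
    "network_round_trip": 1.0,
    "cpu_ms": 1.0,
}


@dataclass
class Design:
    name: str
    usage: dict  # resource name -> count consumed per user operation


def predicted_ms(design):
    """Sum each resource count multiplied by its assumed unit cost."""
    return sum(UNIT_COST_MS[res] * count for res, count in design.usage.items())


def margin(design, budget_ms):
    """Fraction of the budget left over; negative means over budget."""
    return (budget_ms - predicted_ms(design)) / budget_ms


# Hypothetical goal: the customer-visible operation completes in 200 ms.
budget_ms = 200.0

# Two hypothetical ways of fetching 300 rows from a remote store.
naive = Design(
    "one query per row",
    {"disk_read": 5, "network_round_trip": 300, "cpu_ms": 20},
)
batched = Design(
    "one batched query",
    {"disk_read": 5, "network_round_trip": 2, "cpu_ms": 18},
)

for d in (naive, batched):
    print(f"{d.name}: {predicted_ms(d):.0f} ms, margin {margin(d, budget_ms):+.0%}")
```

With these made-up numbers the first design is well over budget before any code is written, and the second has a comfortable margin. A thin margin is the signal that a prototype or a more detailed model is worth the time.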
The bottom line is that you should know, very early in your cycle, that you are substantially likely to succeed.
All of this plays directly into having a basic understanding of elementary framework costs and architecture costs. It’s not that hard to get the facts you need via experiment.
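As one example of such an experiment, the sketch below uses Python's standard `timeit` module to measure the cost of a single membership test in a list versus a set at two sizes. The helper name `cost_per_call_us`, the sizes, and the repeat counts are arbitrary choices for illustration, and the timings you see will depend on your machine.

```python
import timeit


def cost_per_call_us(stmt, setup, number=200, repeat=5):
    """Best-of-`repeat` time for one execution of `stmt`, in microseconds."""
    best = min(timeit.repeat(stmt, setup=setup, number=number, repeat=repeat))
    return best / number * 1e6


for n in (100, 10_000):
    # Look up the last element: the worst case for a linear scan of the list.
    setup = f"data = list(range({n})); lookup = set(data); key = {n - 1}"
    as_list = cost_per_call_us("key in data", setup)
    as_set = cost_per_call_us("key in lookup", setup)
    print(f"n={n}: list {as_list:.3f} us, set {as_set:.3f} us")
```

Running the same measurement at more than one size matters: it shows how the cost grows, which is the fact you need when the production scenario is much larger than the test scenario.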
I get very worried when people say things like “Productivity and cleanliness always trump performance.” Productivity is about creating product. A “clean” design which fundamentally fails to address performance requirements is not an example of a productive enterprise, it is a looming disaster. A developer productively engaged in creating a failure is uninteresting.
I like to teach that it is best to consider the entire cycle from a risk management perspective. Complex design incurs risk; significant unknowns incur risk; unmodeled security threats, unwritten code, and messy code all incur risk.
At any given stage you take the steps most needed to best control the remaining risks with the time you have (including the risk of not finishing). Keep in mind that a messy yet performant design has merely introduced a different category of risks, maybe worse than the performance risk was in the first place. Risk is the ultimate equalizer and, importantly, it teaches balance in approach.
After all, overdoing your performance work is just another way to fail.
Comments
Anonymous
November 15, 2006
I don't hear many people saying "Productivity and cleanliness always trump performance." What I hear is "Productivity and cleanliness trump performance, until performance degrades the user experience. Furthermore, we are betting that our clean design will be easier to optimize than a design we tried to optimize from the get-go." Also, everyone performs preemptive optimization, at least at an architectural level. We can do this because we have enough information early on to decide which architectural decisions will result in a reasonably performant system (e.g. do we need to use an enterprise class RDB, will we have to scale up to the point where clustering is necessary, etc.). However, this is all infrastructure, which developers should be more versed in. Domain logic is a different story, however, since there are many more unknowns.
Anonymous
November 15, 2006
+1 for Colin just wouldn't be right. Fortunately there's a solution: I second Colin by an order of magnitude.
Anonymous
November 15, 2006
The comment has been removed
Anonymous
November 16, 2006
Wise words indeed, and I also agree with Colin's sentiments to a degree: productivity & cleanliness do win, until such time as the user experience degrades. However, I'm not sure that I agree that a clean design will be easier to optimize just because it's clean. I think Rico is trying to say that if performance isn't considered up front, then when the system starts to creak you may find that your clean design is more in need of a re-write than some "optimization". The word (along with "refactoring") tends to imply that it's a relatively simple task with little associated cost, whereas my experience suggests that retrofitting features such as performance and security into an existing system where they were not originally given due consideration is actually a hugely expensive exercise.
Anonymous
November 16, 2006
Steve, I agree that a clean design is not always easier to optimize, but rather it will usually be easier to optimize. Also, clean designs are ALWAYS easier to optimize than designs that are optimized incorrectly. These kinds of solutions, where developers guessed early on where the performance bottlenecks would be with little to no information and missed the mark entirely, are where most of the backlash against premature optimization comes from. I think we're actually all in agreement that balance is the key.
Anonymous
November 16, 2006
You know Mr. Mariani is pretty close to Mr. Miyagi :) Must have balance. I always did like Pat Morita. RIP.
Anonymous
November 16, 2006
Steve Strong wrote:
> I think Rico is trying to say that if performance isn't considered up front, then when the system starts to creak you may find that your clean design is more in need of a re-write rather than some "optimization"
I agree with that second-guessing and I agree with the opinion expressed therein. But I still agree with Colin first. If you code it right first, then it will be far far easier to do a rewrite with the additional performance considerations that you need. If you do the opposite, if you do too much premature guessing about optimizations in the 95% of the code that wouldn't even need it in the first place, then you have to do 10 times as much rewriting and they won't be easy to rewrite. You'll have to start over by figuring out what the requirements were again.
Anonymous
November 16, 2006
Colin writes:
> Also, everyone performs preemptive optimization, at least at an architectural level.
The sad fact is that if the above were true anything like universally I would not have a job. Failure to consider scale, at all, is a leading cause of performance problems. Testing against baby-sized scenarios is commonplace. Tragic but true.
Anonymous
November 21, 2006
Hmm, I think that before you can evaluate a statement like "Productivity and cleanliness always trump performance," you must first identify what "performance" means. If "performance" means "meeting the performance requirements of the solution," then that statement is clearly ridiculous. Meeting the requirements doesn't make the software good, but without doing so, the software will never succeed. However, if "performance" means "completing a task/action/routine as quickly as possible," then I tend to agree with the statement. As long as you are meeting the requirements and have a good understanding of the stakeholders' vision for the solution, then I think productivity and cleanliness do always trump performance.
I think Rico's point may be similar to an idea a more experienced coworker was discussing with me recently. I haven't been in the business long enough to know the validity of his statement, but this is essentially what he said. For the first several decades of CS/software development, the hardware was slow, communication among computers was slow, and eking every bit of performance out of your solution was a top priority. In the last, oh, 6-9 years, hardware and communication performance have been increasing so rapidly that a new trend has emerged where performance is almost an afterthought. The sentiment is something like this: "Let's just write the code 'correctly,' and if it's a little slow right now, then that's fine; 6 months from now, the software will be plenty fast on new hardware." Many young or new developers never bother to learn the internals and details of their development platform. They simply learn how to implement functionality using patterns that are "elegant," with productivity and clean design being their primary goals.
The pendulum has swung too far away from performance, and as our industry moves towards developing applications to process and mine the vast amount of information we generate and collect nowadays, a lot of these elegant-design-focused developers are going to have a rude awakening.
I'm not so sure I agree with all of that, but on the other hand, I do take pride in knowing as much as possible about my development platforms.
Anonymous
November 21, 2006
I think those 6-9 years are more like 25 years. The first time I read part of the code for the X Window System, I was astounded that it performed fast enough to be usable, but indeed it did. Remember programs like rsh round-tripping every single character that was typed: unbearable over the WANs of those days, but astoundingly usable on a LAN.