# Posts tagged performance

## 10 reasons my O(n²) algorithm is better than your O(n) algorithm

OK, so it’s a controversial title, but the point I want to make is that programmers often confuse the time complexity of an algorithm, i.e. its O() order, with its performance.  They will happily declare that their O(n) algorithm is so much better than someone else’s O(n²) algorithm as they rip out some simple, stable, working code and replace it with their vastly more complex, untested, but ‘optimized’ code, often without even measuring to see if there is in fact any performance gain.  The fact is that O() notation refers to the asymptotic performance of an algorithm, and in a typical environment with typical data you aren’t going to be anywhere close to asymptotic behavior.  Performance is, in fact, influenced by many other factors besides the time complexity of an algorithm.

So what factors can make an O(n²) algorithm better than an O(n) algorithm?

1. It might take a very large N before the O(n²) algorithm is slower than the O(n) algorithm.  And we aren’t talking n=10, n=100, … here, but perhaps millions or billions before the cross-over in performance is achieved.
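To make this concrete, here is a sketch with invented per-step costs; the 5 ns and 5000 ns constants are assumptions chosen for illustration, not measurements:

```python
# Hypothetical per-step costs (not measured): the point is only that the
# crossover n depends on the constants, which O() notation throws away.
COST_SIMPLE_NS = 5     # assumed cost of one inner step of the O(n^2) loop
COST_CLEVER_NS = 5000  # assumed cost of one step of the O(n) algorithm
                       # (hashing, allocation, pointer chasing, ...)

def simple_cost(n):
    return COST_SIMPLE_NS * n * n

def clever_cost(n):
    return COST_CLEVER_NS * n

# The O(n) algorithm only wins once 5000*n < 5*n^2, i.e. n > 1000.
crossover = next(n for n in range(1, 10**7) if clever_cost(n) < simple_cost(n))
print(crossover)  # 1001
```

With less forgiving constants the crossover can easily land in the millions or billions.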

2. O() refers to time complexity, not to how long the algorithm actually takes to execute on a particular CPU.  For that you need to look at the mix of instructions used and how long each of those instructions takes.  But that in itself is a gross simplification, as we shall see next …

Even if the two algorithms have a very similar number of operations using similar CPU instructions, they still are not going to run at the same speed:

3. The compiler might be able to apply optimizations to one algorithm that it cannot apply to the other, dramatically boosting performance.

4. The CPU itself might be able to optimize the instruction execution sequence through pipelining or other techniques to dramatically boost performance.  A dumb scan of an array for example can be accomplished really quickly by most CPUs because they have been optimized to handle such frequent, repetitive tasks.

5. A simpler O(n²) algorithm might have tighter loops in it than the more complex O(n) algorithm, allowing it to run entirely in on-chip memory, dramatically boosting performance.

Even if two algorithms have similar operations and similar raw CPU performance they may have different space requirements.  The O(n) algorithm might require 10x as much data space to run, or even a different order of space requirement than the other, e.g. O(n²) space vs O(n) space.

6. The algorithm that requires least data space in the inner loop may find that its data is being held in registers or on-chip RAM during execution and thus receive a significant performance boost.

7. The more space-hungry algorithm might have its data paged to disk during operation, incurring a huge additional time penalty.

Some algorithms are inherently more parallelizable than others:

8. As we move to a world of multi-core processors it’s going to be more important to be able to parallelize code to get performance gains of 4x, 8x, … PLINQ and other parallel extensions in modern languages can easily be applied to ‘dumb’ nested for-loops to gain significant improvements in performance.  Applying those same speedups to your ‘optimized’ tree-based code might be much harder or even impossible.
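The same idea can be sketched in Python, using multiprocessing as a stand-in for PLINQ: the rows of a dumb pairwise computation are independent, so the outer loop splits cleanly across cores.

```python
from multiprocessing import Pool

N = 500

def row_sum(i):
    # One iteration of the outer loop: entirely independent of the others,
    # which is what makes the 'dumb' nested loop trivially parallelizable.
    return sum(i * j for j in range(N))

def pairwise_total(workers=4):
    # Hand each row of the nested loop to a worker process -- the same
    # shape of speedup PLINQ's AsParallel() gives a for-loop in .NET.
    with Pool(workers) as pool:
        return sum(pool.map(row_sum, range(N)))

if __name__ == "__main__":
    print(pairwise_total())
```

Splitting a pointer-chasing tree algorithm this cleanly is much harder, because the work items are not independent.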

‘Better’ is a very subjective word; better performance is just one of many considerations:

9. The better algorithm is probably the one that’s easiest to write, easiest for others to read, easiest to test and easiest to maintain.  A simple nested for-loop may be 1000x better than some O(n) algorithm when ‘better’ is defined like this.

10. Even if your algorithm beats mine with all of these factors taken into account that’s no guarantee it will do so in the future.  Compilers evolve, CPUs get smarter and much of that effort is devoted to making dumb code faster.

### Conclusion

Premature optimization is always a bad idea.

Many ‘optimizations’ can actually lead to worse, not better, performance.

Always measure, then optimize, then measure again.
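A minimal measurement harness makes this habit cheap; here the two functions are placeholders standing in for the old code and the ‘optimized’ replacement:

```python
import timeit

def before(xs):
    # the simple, stable, working version
    return sorted(xs)[0]

def after(xs):
    # the 'optimized' version -- is it actually faster? Measure it.
    return min(xs)

xs = list(range(1000, 0, -1))

# First check the 'optimization' still gives the same answer, then time both.
t_before = timeit.timeit(lambda: before(xs), number=1000)
t_after = timeit.timeit(lambda: after(xs), number=1000)
print(f"before: {t_before:.4f}s  after: {t_after:.4f}s")
```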

It’s often cheaper to buy another server than it is to have someone ‘optimize’ your code.

O() notation is a much misunderstood topic.

On modern servers, space complexity is nearly always more important than time complexity.  If you can run in on-chip RAM you have a huge advantage.

## Measuring website browser performance

www.webpagetest.org is a great resource for anyone looking into web site performance.

## Shaving seconds off page load times

IIS6 comes with some antiquated settings for Gzip compression.  Even after enabling it in the IIS management console nothing happens until you edit the metabase file to enable the files you really want to compress.
With the correct settings the main page for my site now loads in 5.3s instead of 6.0s, and subsequent loads are down to about 1 second with just three requests to the server.
Browser caching and Gzip compression is improving our customer experience significantly now.
Here’s what my metabase settings look like after removing ‘asp’ and adding ‘aspx’, ‘css’ and ‘js’:

```
HcFileExtensions="htm
html"
HcScriptFileExtensions="aspx
js
css"
```

Don’t forget to change both sections in the metabase, one for deflate and one for gzip.
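As a rough illustration of why compressing text assets matters, here is what gzip does to a repetitive (invented) stylesheet:

```python
import gzip

# A made-up, repetitive stand-in for a real CSS file; real stylesheets
# are less repetitive, but still compress very well.
css = ("body { margin: 0; padding: 0; font-family: sans-serif; }\n" * 200).encode()
compressed = gzip.compress(css)
print(f"{len(css)} bytes -> {len(compressed)} bytes "
      f"({len(compressed) / len(css):.1%} of original)")
```

Every byte saved here is a byte that doesn’t cross the wire on page load.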

## Amazon High-CPU Medium Instance is about 30% slower than a dedicated Core 2 Duo server

Recently we benchmarked a dedicated GoDaddy server against an Amazon EC2 instance and \$ for \$ the GoDaddy server was about 30% faster.  Here are some stats:

GoDaddy dedicated server:
- 2 GB memory
- Processor: Intel Core 2 Duo, 2.13 GHz
- 32-bit platform
- RAID: none
- \$146 per month
- UPDATETEXT … took 116ms
- SLIDEHANDLER3 … took 2722ms

Amazon EC2 High-CPU Medium Instance:
- 1.7 GB memory
- 5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each)
- 32-bit platform
- I/O performance: moderate
- Instance type name: c1.medium (used in EC2 APIs)
- Price: \$0.20 per instance hour (about \$144 per month)
- UPDATETEXT … took 144ms
- SLIDEHANDLER3 … took 3525ms

Conclusion: unless you need proximity to Amazon S3, scalability, … you are better off with a dedicated server than an instance in the cloud.

Optimization is the root cause of much unmaintainable code. It is also the root cause of many bugs. My advice to the vast majority of programmers is to stay away from it: if you have a good architecture you should not need to optimize it at this level.

Personally I divide optimization into three buckets: 1. Optimization measured in seconds, 2. Optimization measured in milliseconds, 3. Optimization measured in microseconds.

#1 and #2 are worth doing; #3 is rarely worth doing, and most novices will make performance worse, not better, when they attempt it.

For #1, start by reducing round trips to the server, set caching headers correctly, compress your JS and CSS files, … and follow the other advice Yahoo’s YSlow plugin offers.

#2 is usually concerned with your database architecture and the queries you are making. Novices often fetch too much data either breadth-wise (columns) or depth-wise (rows). Linq-to-Entities is a great help here for many developers because it defers the actual database work to the latest possible moment when all of the query parameters are known and thus does the least possible amount of work.
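The deferred-execution idea can be sketched with Python generators; this is a loose analogy, not how Linq-to-Entities actually works:

```python
def rows():
    # Stand-in for a table scan; a real provider would translate the
    # composed query into SQL instead of iterating Python objects.
    for i in range(10):
        yield {"id": i, "active": i % 2 == 0}

# Composing the query does no work yet: like an IQueryable, the
# generator pipeline only executes when results are actually pulled,
# and it only pulls as many rows as the consumer asks for.
active_ids = (r["id"] for r in rows() if r["active"])
first_three = [next(active_ids) for _ in range(3)]
print(first_three)  # [0, 2, 4]
```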

Many developers think that caching is the answer for #2, but I advise you to stay well away from trying to implement your own caching scheme. Use what’s provided for you in Linq-to-Entities or some other ORM, but don’t try to roll your own; you are likely to introduce all manner of bugs related to stale data and concurrency issues. And if your queries aren’t efficient to begin with, your application will not survive a restart under heavy load.

#3 is the hardest of all and the least likely to produce any benefit to the end user. Often it will be detrimental. My advice is to stay well away unless (i) you understand AMD and Intel processor architecture, and (ii) you have a loop that’s executed a billion times or more. Microsoft, Intel, AMD and others have spent decades optimizing inefficient code at the CPU and compiler level. Sadly this means that the cleverest algorithm can sometimes be significantly slower than the simplest ‘dumb’ algorithm. A tight loop doing a dumb linear search in O(n) can perform better than some clever tree search that theoretically runs in O(log n) time because the former can fit entirely in on-chip cache and execute many times faster than the latter. If you are tempted to use a tree or other complex data structure to ‘optimize’ access to a small number of items, think again: a simple array might be better and the code will probably be more maintainable and easier to read.
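To illustrate the last point, here is a toy comparison of a dumb linear scan against a binary search over a handful of items. No winner is asserted, since the result varies by machine, which is exactly why you measure:

```python
import bisect
import timeit

items = list(range(16))  # a small, sorted collection

def linear(x):
    # the 'dumb' O(n) scan
    for v in items:
        if v == x:
            return True
    return False

def binary(x):
    # the 'clever' O(log n) search
    i = bisect.bisect_left(items, x)
    return i < len(items) and items[i] == x

t_linear = timeit.timeit(lambda: linear(13), number=100_000)
t_binary = timeit.timeit(lambda: binary(13), number=100_000)
print(f"linear: {t_linear:.4f}s  binary: {t_binary:.4f}s")
```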

Overall, in terms of optimization, developers need to understand that CPU time is no longer at a premium. These days the priorities are A) Network access, B) Disk access, C) Memory access, D) CPU time. If you can use a million CPU cycles to avoid some expensive disk access then it’s probably worth it. Too often I see people thinking they are ‘optimizing’ their code by stuffing a private member variable with some calculated value so they don’t have to recalculate it. Guess what: the calculation takes microseconds, but the extra RAM used to store a million of those private member variables now causes your application to page more often (both from on-chip cache to RAM and from RAM to disk).

Before you do any optimization you should get a tool like JetBrains’ dotTrace and, for memory usage profiling, the SciTech .NET Memory Profiler. These will give you an insight into what needs fixing first and will help you understand where you can trim memory consumption.

Another key rule of optimization is to compare your development (and maintenance costs) for optimized code to the costs of a new server. You could easily spend more trying to get 10% performance gain than it would cost to just get a new quad-core server that will run 20% faster!

## It’s all about disk speed

Recently I upgraded to a 10,000 rpm drive in my desktop and a 7,200 rpm drive in my laptop. Quite frankly this is the most cost effective way to improve performance on any machine. CPU speeds have grown dramatically over the last few years, so has memory capacity and CPU to memory bandwidth, but disk seek speeds and disk transfer rates have grown much more slowly. Chances are your CPU spends most of its time either idle or waiting for the disk drive now.

Windows Vista was designed with assumptions about us all having faster computers, more memory and bigger disk drives, but without faster seek speeds and higher disk-to-memory performance, any move towards ‘bloat’ is going to result in a more sluggish computer and worse battery life.

Personally I run Windows 2003 Server on all my computers including the laptop – it’s the most stable OS Microsoft has ever shipped and, to me, stability beats features and flashy UI.

## Cache optimized scanning of pairwise combinations of values

Suppose you wanted to calculate all of the pairwise combinations of the values in a vector of n elements.  The result would be an n × n matrix.  A naive approach would be two nested for loops evaluating each of the n² values.  But now suppose that the values in the source vector are actually really large objects, too large to hold all of them in memory, so you need to page them in from disk as necessary, caching maybe just a few of them.

If you take the naive approach you will thrash the cache: every single calculation will result in a cache miss and a disk access.

So I went looking for a way to improve the cache hit rate by visiting the pairs in a better order, and here’s what I found:

Various space-filling curves, such as the Z-curve, the Peano curve and the Hilbert curve, can turn a 2D space into an ordered list of items such that items adjacent in the list are close spatially.

Applying a simple Z-curve to the problem improved the cache hit rate by 6x in my case!
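Here is a rough simulation of the effect with toy sizes: a 4-element LRU cache stands in for the real page cache, and Morton (Z-order) indexing reorders the visits. The 6x figure above came from the real workload, not this sketch.

```python
from collections import OrderedDict

def z_index(x, y, bits=4):
    # Morton (Z-order) code: interleave the bits of x and y so that
    # points close in 2D get nearby positions in the 1D ordering.
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b)
        z |= ((y >> b) & 1) << (2 * b + 1)
    return z

def cache_loads(order, capacity=4):
    # Simulate an LRU cache over the large source objects and count
    # how many times an element has to be (re)loaded from disk.
    cache, loads = OrderedDict(), 0
    for i, j in order:
        for key in (i, j):
            if key in cache:
                cache.move_to_end(key)
            else:
                loads += 1
                cache[key] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)
    return loads

n = 16
row_major = [(i, j) for i in range(n) for j in range(n)]
z_order = sorted(row_major, key=lambda p: z_index(*p))
print(cache_loads(row_major), cache_loads(z_order))
```

The Z-order walk keeps revisiting a small neighbourhood of indices, so far more of its accesses hit the cache than the row-by-row scan.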

The same techniques can be applied to in-memory data structures to try to keep them in the CPU cache.