Continuous Integration -> Continuous Deployment

There was some interesting discussion today around the topic of continuous deployment (pushing incremental builds to production rapidly rather than batching them up). Here are some thoughts on that topic:

Personally, I think you should be able to make a fix and deploy it to production in a matter of minutes, but that you should do this rarely, and whether you do depends on the type of fix.

During the early stage of a lean startup you can go without a staging environment: the minimum is a continuous integration server deploying to a local development environment and then the ability to manually push to production as needed. (Note: I consider continuous integration a bare minimum.)

How do you define quality? The formula I have for perceived quality is:

Sum (Severity of bug x number of times experienced by users)

Clearly, data-loss bugs rank high on severity and should be caught before you get to production. But for many other classes of bug, if you can fix them quickly, you can minimize the overall user impact more effectively with a faster build-deploy cycle than with a perfect test suite. And since the perfect test suite is impossible in any case, you may as well invest in improving your build-deploy cycle first. In an ideal world, no user ever experiences the same bug twice and no two users experience the same bug.
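To make the formula concrete, here is a minimal sketch in Python. The `Bug` class, the 1-10 severity scale and all of the numbers are assumptions made up for illustration; the only thing taken from the formula above is the sum of severity times occurrences.

```python
from dataclasses import dataclass


@dataclass
class Bug:
    name: str
    severity: int           # hypothetical scale: 1 (cosmetic) to 10 (data loss)
    times_experienced: int  # how many times users hit the bug before it was fixed


def perceived_quality_cost(bugs):
    """Sum of (severity x times experienced). Lower is better."""
    return sum(b.severity * b.times_experienced for b in bugs)


# The same cosmetic bug with a slow and a fast build-deploy cycle (made-up numbers).
slow_cycle = [Bug("misaligned button", severity=2, times_experienced=500)]
fast_cycle = [Bug("misaligned button", severity=2, times_experienced=10)]
# A severe bug hurts even when only a few users hit it.
data_loss = [Bug("lost upload", severity=10, times_experienced=10)]

print(perceived_quality_cost(slow_cycle))  # 1000
print(perceived_quality_cost(fast_cycle))  # 20
print(perceived_quality_cost(data_loss))   # 100
```

With these made-up numbers, fixing the cosmetic bug quickly cuts its cost by a factor of fifty, while the data-loss bug still costs five times as much as the quickly fixed cosmetic one despite the same exposure.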

This means, however, that you need a rock-solid build and deployment system that you trust as much as you trust the compiler. (And, by the way, if you have this, the need to pull configuration settings into XML files that you can edit on production servers goes away: if you trust your build and deployment system, you can make the change in code and rely on the process to make the change in production. As an added benefit, you now also have an audit trail of who changed what and when, and you can lock down your production environment so that very few people have direct access beyond deploying new bits to it using the prescribed process. As we all know, most problems in production are caused by human error, and if you allow people to make random changes there, it’s hard to figure out the who, what, when and where of any critical error.)

Another corollary of this approach is that you need in-depth logging and exception reporting. You need to be able to understand what caused a bug to appear, and how to reproduce it, from a single instance of it happening. Your logging around an exception should include the entire state (which file, which user, cookies, referrer, steps leading up to it, …). You should record exceptions in a database so you can see which ones are most common and can correlate their occurrence with any changes you made to the site. Your error reporting also needs to encompass the JavaScript running on your customers’ computers, with every JavaScript exception reported back to your site using a web service. After all, my formula is ‘bugs experienced’, NOT ‘bugs experienced that we happen to know about’!
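As a rough sketch of what recording exceptions in a database might look like, here is a small Python example using the standard library's `sqlite3`. The table layout, the function names, the `build` labels and the context fields are all assumptions for illustration, not a description of any particular system. The `source` column is there so that reports posted back from the browser could sit in the same table as server-side ones.

```python
import json
import sqlite3
import traceback
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open (or create) the exception store. Schema is a hypothetical example."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS exceptions ("
        "occurred_at TEXT, build TEXT, source TEXT, kind TEXT, "
        "message TEXT, stack TEXT, context TEXT)"
    )
    return db


def record_exception(db, exc, build, source, context):
    """Store one exception together with the state needed to reproduce it."""
    stack = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    db.execute(
        "INSERT INTO exceptions VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            build,                # which deployed revision was running
            source,               # e.g. "server" or "browser"
            type(exc).__name__,
            str(exc),
            stack,
            json.dumps(context),  # user, URL, referrer, recent steps, ...
        ),
    )
    db.commit()


def most_common(db, limit=10):
    """Exception counts per kind and build, most frequent first."""
    return db.execute(
        "SELECT kind, build, COUNT(*) AS n FROM exceptions "
        "GROUP BY kind, build ORDER BY n DESC, kind LIMIT ?",
        (limit,),
    ).fetchall()


db = open_store()
for _ in range(3):
    try:
        int("abc")
    except ValueError as exc:
        record_exception(db, exc, build="r101", source="server",
                         context={"user": "alice", "url": "/orders"})
try:
    {}["missing"]
except KeyError as exc:
    record_exception(db, exc, build="r102", source="server",
                     context={"user": "bob", "url": "/profile"})

print(most_common(db))  # [('ValueError', 'r101', 3), ('KeyError', 'r102', 1)]
```

Grouping by build as well as by exception type is what lets you line up a spike in one kind of error with the deployment that introduced it.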

Another trick a lean startup can use for deployment is to employ Subversion as a binary version control system: your continuous integration server does a build and then checks the binaries into a different Subversion tree. To deploy a fix that involves no database changes, you simply do an SVN update on the production server. It’s fast, efficient and, most importantly, atomic (unlike XCOPY). It also provides an immediate roll-back capability: simply go back a revision. Another advantage is that you can apply a fix to just one file (e.g. an image or HTML file) by updating just that file, and can be sure that no other files changed in the process. And, again, it gives you a complete audit trail, so you can see how any file has changed over time and relate that to any changes in the exceptions being logged.
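The deploy, roll-back and single-file cases described above could be wrapped in a small script. The sketch below is only an illustration: the paths and revision number are made up, the helper names are hypothetical, and it assumes the `svn` command-line client is installed on the production server and that the site directory is a working copy of the binaries tree. By default it only builds the command, so nothing runs unless you pass `dry_run=False`.

```python
import subprocess


def update_command(working_copy, revision=None, only_file=None):
    """Build the `svn update` command for a deploy, roll-back or one-file fix."""
    cmd = ["svn", "update"]
    if revision is not None:
        cmd += ["-r", str(revision)]  # pin to a revision, e.g. to roll back
    cmd.append(only_file if only_file else working_copy)
    return cmd


def deploy(working_copy, revision=None, only_file=None, dry_run=True):
    """Run the update on the production working copy (or just return the command)."""
    cmd = update_command(working_copy, revision, only_file)
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises if svn reports an error
    return cmd


site = "/srv/site"  # hypothetical production working copy

print(deploy(site))                                   # deploy the latest build
print(deploy(site, revision=1041))                    # roll back to revision 1041
print(deploy(site, only_file=site + "/img/logo.png")) # update a single file
```

Because each call maps to a single `svn update`, the working copy's revision tells you which build is live, and the history of the binaries tree serves as the audit trail.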

So, in summary: major UI changes and database changes should be pushed to production infrequently and in a very controlled fashion, but minor UI changes, critical bug fixes, … can happen all day long if you have the right process in place.

Tue Dec 29 2009 22:27:00 GMT-0800 (Pacific Standard Time)
