operations
Why don’t you trust your build system?
Mar 31st
In this post I’m going to challenge the conventional wisdom that the best place to store configuration information is in XML config files or database entries rather than in code files. A typical comment I see goes something like this: “I can’t change routes without recompiling and deploying code. I thought it better to configure routes in a more dynamic environment, specifically the database …”
Some observations:
1) Why is ‘recompiling and deploying’ an issue? If you have continuous integration and continuous deployment to a test server it should be a non-issue.
2) Databases (alone) lack any tracking metadata – who changed what and when? Who do you blame when your site stops working because someone edited the configuration?
3) Databases and XML files do not give you the strong typing and Intellisense that the IDE provides when you access configuration settings defined in code.
4) Do you *want* your ops team to be able to edit your configuration settings?
So instead of using XML files that anyone can change in production without any tracking, why not build your configuration settings into code files, where they can be strongly typed and are always under version control, giving you a full history of who changed what and when? If, like me, you deploy several times a day because you trust your build and deployment system, this is actually the easiest and safest way to make configuration changes.
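As a rough illustration of configuration living in a code file, here is a minimal sketch in Python. The names `Route`, `SiteConfig`, `CONFIG` and `find_handler`, and the routes themselves, are hypothetical and chosen only for the example; the point is that the settings are typed, immutable at runtime, and change only through a commit and a deployment.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Route:
    """One URL route (hypothetical shape, for illustration only)."""
    pattern: str
    handler: str


@dataclass(frozen=True)
class SiteConfig:
    """Typed settings: a misspelled field name fails immediately
    instead of silently returning nothing, as a string lookup might."""
    max_upload_mb: int
    routes: Tuple[Route, ...]


# The configuration itself is ordinary code under version control,
# so every change has an author, a date and a diff.
CONFIG = SiteConfig(
    max_upload_mb=25,
    routes=(
        Route(pattern="/players/{id}", handler="show_player"),
        Route(pattern="/playlists/{id}", handler="show_playlist"),
    ),
)


def find_handler(config: SiteConfig, pattern: str) -> Optional[str]:
    """Return the handler name registered for a pattern, or None."""
    for route in config.routes:
        if route.pattern == pattern:
            return route.handler
    return None
```

Because the dataclasses are frozen, nothing can reassign a setting on the running site; the only way to change one is to edit this file and let the build and deployment process carry it to production.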
Looking forward to the new year and our new data center
Dec 31st
In the new year I’m going to be moving all our servers over to a data center in Issaquah, WA. I’m looking forward to having some faster hardware and a better connection able to provide a better experience to all the customers of our digital signage solution.
Happy New Year!
Continuous Integration -> Continuous Deployment
Dec 29th
There was some interesting discussion today around the topic of continuous deployment (pushing incremental builds to production rapidly rather than batching them up). Here are some thoughts on that topic:
Personally I think you should be able to make a fix and deploy it to production in a matter of minutes, but you should do this rarely, and whether you do depends on the type of fix.
During the early stage of a lean startup you can go without a staging environment: the minimum is a continuous integration server deploying to a local development environment and then the ability to manually push to production as needed. (Note: I consider continuous integration a bare minimum.)
How do you define quality? The formula I have for perceived quality is:
Sum over all bugs of (severity of bug × number of times users experienced it)
Clearly data-loss bugs rank high on severity and should be caught before you get to production, but for many other classes of bug, IF you can fix them quickly, you can minimize the overall user impact more effectively with a faster build-deploy cycle than with a perfect test suite. And since the perfect test suite is impossible in any case, you may as well invest in improving your build-deploy process first. In an ideal world no user ever experiences the same bug twice and no two users experience the same bug.
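To make the arithmetic concrete, here is a small sketch in Python. It reads the sum as a penalty, so lower is better, which is my interpretation of the formula. The `Bug` type, the 1 to 10 severity scale and all of the numbers are invented for illustration; they are not measurements.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Bug:
    name: str
    severity: int           # hypothetical scale: 1 (cosmetic) to 10 (data loss)
    times_experienced: int  # how many times users hit the bug before it was fixed


def impact_score(bugs: Iterable[Bug]) -> int:
    """Sum of severity x times experienced, over all bugs."""
    return sum(bug.severity * bug.times_experienced for bug in bugs)


# The same two bugs under two release processes (made-up numbers).
# With a slow cycle each bug stays live for days, so many users hit it.
slow_cycle = [Bug("broken thumbnail", 2, 400), Bug("login error", 5, 120)]
# With a fast cycle the fix ships within minutes of the first reports.
fast_cycle = [Bug("broken thumbnail", 2, 15), Bug("login error", 5, 4)]

print(impact_score(slow_cycle))  # 2*400 + 5*120 = 1400
print(impact_score(fast_cycle))  # 2*15 + 5*4 = 50
```

The severities are identical in both lists; only the number of times each bug was experienced changes, and that is the term a faster build-deploy cycle reduces.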
This means, however, that you need a rock-solid build and deployment system that you trust as much as you trust the compiler. (And, by the way, if you have this, the need to pull configuration settings into XML files that you can edit on production servers goes away: if you trust your build and deployment system you can make the change in code and rely on the process to make the change in production. As an added benefit you now also have an audit trail of who changed what and when, and you can lock down your production environment so that very few people have direct access beyond deploying new bits to it using the prescribed process. As we all know, most problems in production are caused by human error, and if you allow people to make random changes there, it’s hard to figure out the who, what, when and where behind any critical error.)
Another corollary of this approach is that you need in-depth logging and exception reporting. You need to be able to understand what caused a bug to appear, and how to reproduce it, from a single instance of it happening. Your logging around an exception should include the entire state (which file, which user, cookies, referrer, steps leading up to it, …). You should record exceptions in a database so you can see which ones are most common and can correlate their occurrence with any changes you made to the site. Your error reporting also needs to encompass the JavaScript running on your customers’ computers, with every JavaScript exception reported back to your site using a web service. After all, my formula is ‘bugs experienced’ NOT ‘bugs experienced that we happen to know about’!
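A minimal sketch of recording exceptions in a database, using Python’s built-in sqlite3 module. The table layout, the `build` label used to tie an exception to a deployment, and the function names are all assumptions made for this example; a real site would capture far more state.

```python
import json
import sqlite3
import traceback
from datetime import datetime, timezone


def open_log(path=":memory:"):
    """Open (or create) the exception log. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS exception_log (
               id          INTEGER PRIMARY KEY,
               occurred_at TEXT NOT NULL,
               build       TEXT NOT NULL,
               exc_type    TEXT NOT NULL,
               message     TEXT NOT NULL,
               stack       TEXT NOT NULL,
               context     TEXT NOT NULL
           )"""
    )
    return conn


def record_exception(conn, exc, build, context):
    """Store one exception together with the state surrounding it.

    `context` is a dict such as user, cookies, referrer and recent steps.
    """
    stack = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    conn.execute(
        "INSERT INTO exception_log "
        "(occurred_at, build, exc_type, message, stack, context) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            build,
            type(exc).__name__,
            str(exc),
            stack,
            json.dumps(context),
        ),
    )
    conn.commit()


def most_common(conn):
    """Count exceptions per type and build, most frequent first."""
    return conn.execute(
        "SELECT exc_type, build, COUNT(*) FROM exception_log "
        "GROUP BY exc_type, build ORDER BY COUNT(*) DESC"
    ).fetchall()
```

Grouping by build is what lets you line up a spike in one exception type against the deployment that introduced it. Exceptions reported from the browser could be written to the same table by whatever web service receives them.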
Another trick a lean startup can use for deployment is to employ Subversion as a binary version control system, i.e. your continuous integration server does a build and then checks the binaries into a different Subversion tree. To deploy a fix that involves no database changes, you simply do an SVN update on the production server. It’s fast, efficient and, most importantly, works from atomically committed revisions, so you always update to a known, consistent set of files (unlike XCOPY). It also provides an immediate roll-back capability: simply go back a revision. Another advantage is that you can apply fixes to just one file (e.g. an image or HTML file) by updating just that file, and can be sure that no other files changed in the process. And, again, it gives you a complete audit trail so you can see how any file has changed over time and relate that to any changes in the exceptions being logged.
So, in summary: major UI changes and database changes should be pushed to production infrequently and in a very controlled fashion, but minor UI changes, critical bug fixes, … can happen all day long if you have the right process in place.
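The deploy, roll-back and single-file cases can be sketched as a thin wrapper around the `svn update` command. This is only an illustration: the working-copy path, the revision numbers and the function names are hypothetical, and it assumes the production directory is already a checkout of the binaries tree.

```python
import subprocess


def update_command(working_copy, revision=None, only_path=None):
    """Build the argv for `svn update`.

    revision=None  -> update to the latest revision (a normal deploy)
    revision=N     -> move to revision N (a roll-back, or a pinned deploy)
    only_path=...  -> update just that file or directory, nothing else
    """
    cmd = ["svn", "update", "--non-interactive"]
    if revision is not None:
        cmd += ["-r", str(revision)]
    cmd.append(only_path if only_path else working_copy)
    return cmd


def deploy(working_copy, revision=None, only_path=None, run=subprocess.run):
    """Run the update; `run` is injectable so the sketch can be tested
    without a Subversion server. check=True raises if svn fails."""
    return run(update_command(working_copy, revision, only_path), check=True)


# Example calls (hypothetical path and revision numbers):
#   deploy("/srv/site")                                  # deploy latest build
#   deploy("/srv/site", revision=41)                     # roll back to r41
#   deploy("/srv/site", only_path="/srv/site/logo.png")  # fix a single file
```

Rolling back is the same operation as deploying, just with an older revision number, which is why it is immediate.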
When will people learn to back up?
Dec 11th
Had another friend lose a hard drive today without a proper backup. Pain!
I now have at least 3 copies of everything with staggered backups to different hardware. For the digital signage software I manage there are two copies on the servers locally and another copy in the cloud on Amazon’s S3 storage which is itself replicated multiple times.
The basic concept that people need to use is “SHARED NOTHING”.
RAID is NOT the answer to data security. It’s a convenient recovery mechanism for failed hard drives, but if your data is on two drives connected to the same RAID controller card on the same computer in the same room, you have plenty of opportunities to lose it all.
Backups should ROTATE. Backing up to the same location every time risks a failure during the backup wiping both copies, or bad data being copied over good before the mistake is discovered. For really critical data I have a daily backup, a weekly backup, and a monthly backup. I also have two backup schedules: a morning backup that copies to one set of drives on controller A, and an evening backup that copies to a different set of drives on a different controller.
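One way such a rotation could be expressed is as a function that picks the destination from the date and time. This is a sketch under assumptions of my own: the directory names, the noon cut-off between morning and evening, weekly backups on Sundays, and monthly backups on the first of the month are all invented for the example.

```python
from datetime import datetime


def backup_targets(now):
    """Return the destination folders for a backup run at time `now`.

    Morning runs go to one controller and evening runs to the other,
    so no single controller ever holds the only copy.
    """
    controller = "controller_a" if now.hour < 12 else "controller_b"

    # Daily: seven slots, so a copy is only overwritten a week later.
    targets = [f"{controller}/daily/{now.weekday()}"]

    # Weekly: taken on Sundays (weekday 6), slot = week of the month.
    if now.weekday() == 6:
        targets.append(f"{controller}/weekly/{(now.day - 1) // 7}")

    # Monthly: taken on the 1st, one slot per calendar month.
    if now.day == 1:
        targets.append(f"{controller}/monthly/{now.month:02d}")

    return targets


# A Sunday morning that is also the first of the month hits all three tiers.
print(backup_targets(datetime(2023, 1, 1, 6, 0)))
# A Wednesday evening only writes its daily slot, on the other controller.
print(backup_targets(datetime(2023, 1, 4, 20, 0)))
```

Because each run writes to a slot chosen by the calendar, a bad backup can only overwrite one older copy, and the other slots still hold good data from before the mistake.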