I knew one software company that failed their SaaS transition because they chose to cut a few corners with the operations. Since they were software engineers, they did not really want to spend time on such mundane tasks as security, auditing, backups and so on. One day they let a disgruntled employee go, the guy went to an internet cafe, logged into the hosting account with the shared admin credentials, and deleted all customer data.
There were no backups or data replicas to bring the data back, no personal admin accounts or procedures to prevent such an incident from happening, and even no monitoring to learn about the issue before customers did. This was the end of this SaaS application – it just never recovered.
Cloud business is more than just putting some code online (and collecting money ;)) Whether you are offering Software-as-a-Service (SaaS) web application, Platform-as-a-Service (PaaS) or Infrastructure-as-a-Service (IaaS) – what you are offering is more than just your code – it is your service.
Even if you do not offer a formal service level agreement (SLA) and have a statement in your Terms of Service that you are not liable for anything, your online application or platform is still a service so your customers expect it to be reliable and secure.
At our recent WSO2Con, Chamith Kumarage delivered an excellent session on how our Cloud DevOps team works. If you are delivering a service online (or considering doing so) – make sure to watch the recording (quick registration required).
Here’s my quick summary of Chamith’s advice:
1. Automate everything: repetitive tasks not only are inefficient and mundane, and eat your time. When done manually they are unreliable. Humans tend to do things slightly differently each time they do them, or not do them at all.
2. Tasks are really parts of processes: when you come up with something that needs to be done, ask yourself what is the process flow for this task? For example, a data backup is really a part of a process that includes:
- Scheduled (e.g. at 1 a.m. every day) script which creates a backup,
- Some sort of monitoring system which verifies that the script ran and the backup got created,
- Notifications on failures and procedures that need to be followed in not,
- Backup testing: automated and/or regular manual recovery drills (if manual then documented and performed by different team members).
3. Design for failure: everything will be failing so make sure that your system can sustain the failures. For example, if your system uses multiple virtual machines in the cloud, keep running a “chaos monkey” script which keeps randomly killing the instances and automated tests which ensure that these instance failures do not affect the overall system (by the way, see how Netflix does that.)
4. Self-healing and success verification are critical for all tasks. Any task and operation can fail (see above) so the system should not get “surprised” but should always automatically validate the action results and if something didn’t go right – implement the healing procedures (start new instances, retry, and so on).
5. Enforce discipline, processes, automation, checklists. Document everything. This will make your processes repeatable and reliable.
“Bus monkey test” (related to the above) if one of your team members gets hit by a bus – all operations should keep working: everything needs to be documented and tried by other team members. (* This is a mental experiment – do not actually hit your team-members by busses :))
6. Monitoring and analytics: the key is not to collect and show tons of data and alerts, but be able to quickly detect abnormal behavior.
7. Communications: your dashboards should quickly and clearly give you the big picture and relevant details. Key metrics and system state should be something that everyone sees and understands, effective drill-downs should make it easy to understand and fix stuff.
8. Agile delivery: waterfall processes in the cloud are bad and stressful.The smaller the changes and the more often and in more automated fashion they are – the more mundane they become: which lowers the risks and improves the skills and reliability. Cloud and big-bang releases do not go well together.
9. Use standard tools and native systems of underlying platforms – do not reinvent the wheels. For example, if the platform gives you SQL-as-a-service (Amazon RDS, Azure SQL and so on) – use those and not your own MySQL running on a virtual machine.
10. Post-mortem analysis is a must. If something did get wrong after all, you need a formal investigation process:
- What happened?
- Why and what needs to be done to prevent this in the future?
- If automated monitoring didn’t catch it, why and what needs to be done to prevent this in the future?
- If validation and self-healing didn’t catch it, why and what needs to be done to prevent this?