Rules for GitOps - Part 2
If you read part 1 of my series on the rules for GitOps, you’ll know I feel strongly about 2 points - first, don’t do ANYTHING manually, and second, make everything auditable.
Rule 3: Everything must be declarative
Maybe this should be a subset of Rule 1, but it’s so important that I think it deserves its own spot. Everything from charts to monitoring to security rules needs to be defined declaratively. Obviously this is in contrast to doing things manually, but I’d also rule out init-only containers and other batch processes (even automated ones) that can go wrong and leave behind state your repo doesn’t describe.
K8S with Prometheus is a great example of this done well. With the current prometheus helm chart deployed, to start it scraping a particular pod / service you add the following metadata to the service definition:
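A minimal sketch of what that looks like - the service name and port here are placeholders, and the annotation keys are the ones the stock prometheus chart’s scrape config looks for:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service                   # hypothetical service name
  annotations:
    prometheus.io/scrape: "true"     # opt this service in to scraping
    prometheus.io/port: "8080"       # port to scrape (assumed metrics port)
    prometheus.io/path: "/metrics"   # metrics path, if not the default
spec:
  selector:
    app: my-service
  ports:
    - port: 8080
```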
That is truly beautiful - it encapsulates so much of what is powerful about GitOps. First, there’s no central repository to update for new services - a pod gets deployed and the K8S cluster just magically starts monitoring it. This can work for dashboards and alerting too, though for dashboards you still tend to need to modify a central repo. But, as the tooling catches up, a given service will be able to declaratively define its monitors, alerts, and charts / dashboards, and everything will just work at the ops level. This is one of the reasons I tend to like Terraform over more imperative tooling.
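For alerts, you can already see the shape of this in the Prometheus Operator’s PrometheusRule CRD, where alert definitions live alongside the service’s own manifests. A sketch, with a made-up service name, expression, and threshold:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-service-alerts      # hypothetical; shipped with the service's own chart
  labels:
    release: prometheus        # whatever label your Prometheus instance uses to select rules
spec:
  groups:
    - name: my-service.rules
      rules:
        - alert: MyServiceHighErrorRate
          expr: rate(http_requests_total{service="my-service",code=~"5.."}[5m]) > 0.05
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "my-service 5xx rate above 5% for 10 minutes"
```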
Rule 4: Only the CD runner and script runner have full access to production
The idea of removing engineers’ access to the production environment is somewhat radical, but I think it fits with this model of thinking of your platform as a service that your engineers are consuming. Just like I don’t need access to GitHub’s servers to use the platform, maybe we don’t need anyone to have access to the platform we’re building, just the tooling layers.
You’ll still need a few super-admins on the project, but ideally that will be a small number of people, with highly audited access, and maybe even permissions limited to defining other high-privilege accounts.
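In Kubernetes terms, a rough sketch of this access model might look like the following - the service account, namespace, and group names are all assumptions, not anything your cluster ships with:

```yaml
# Full cluster access for the CD runner's service account only
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cd-runner-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: cd-runner          # hypothetical CD runner service account
    namespace: ci
---
# Engineers get read-only visibility into production, nothing more
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: engineers-read-only
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view
subjects:
  - kind: Group
    name: engineering        # hypothetical IdP group for engineers
    apiGroup: rbac.authorization.k8s.io
```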
The idea here is simple - insiders can do bad things, intentionally or not. Some of the worst security incidents I’ve seen in my career came from malicious internal users, so limiting that access is a real win. Of course, a lot of this advice depends on the sensitivity of your workload, but more and more companies are having to seriously consider what it means to be under attack from a nation state. I’m not sure Facebook, Twitter, or even Gmail thought when they first started that nations with near-limitless resources would be attacking them, yet here we are. In that scenario, it costs a state practically nothing to put a bad actor on your team. Is this going to happen to everyone? Is this overly paranoid? Definitely, and your mileage may vary. But is there a chance a junior dev will mistype a command and wipe out all the production machines? Definitely. So minimizing human interaction at this level, unless it’s truly necessary, is definitely a goal.
There’s an old joke among pilots that eventually autopilots will be so advanced that you’ll only need a pilot and a dog in the cockpit. The pilot is there to make the passengers feel comfortable, and the dog is there to bite the pilot if he tries to touch the controls. Add the bite to your process.
In the next part of this series, we’re going to take a break from rule making and discuss how to actually implement some of the more complex concepts here, like fully declarative monitoring / dashboards and maintenance tasks.