Rules for GitOps - Part 1
I’ve recently been involved in two major efforts to retool development and operations at large cloud companies, and as such I’ve had a front-row seat to the incredibly fast progression from development automation to Infrastructure as Code and now GitOps. The changes have been so fast and furious that, in all honesty, I’m not sure I could really articulate what GitOps is. Sure, if you google around you’ll find articles and solutions for running some infrastructure on Kubernetes, but that can’t really be the crux of GitOps, right?
That being said, I can tell intuitively that GitOps is a methodology I would like, even if I couldn’t necessarily find a concrete set of guidelines that I could read and agree / disagree with. So, I decided to put something together. I wanted a set of rules - a set of firm guidelines that said, if you’re doing this, you’re doing GitOps. If you’re not, you may be on the road, but you’re not there yet.
This is a tough challenge. I’m not sure a list like this even exists for DevOps, and that’s been around longer than GitOps. The difference is that GitOps is more of a set of processes, while DevOps tends to be more concerned with org structures and pager duty responsibilities. So, while it’s a tough challenge, I think it’s very tractable.
Rule 1: You can only set up 2 things manually - your AWS / GCP account and your git repo
Ok, I thought I’d start at the beginning. We’re working on a new project, and I wanted to draw some lines in the sand. I had a colleague I really liked who had a tradition on his platform team - if anyone ever did anything manual on the servers (install a patch, reboot, whatever) they had to wear a pink sombrero for the rest of the day. This was to instill in the engineers the importance of automation for even the simplest of tasks.
It actually was this pink sombrero!
For GitOps, I think we should follow in his footsteps - every single change you make to your infrastructure has to go through a git checkin, and every action on the platform must be executed from some script, preferably not one you’ve cloned to your desktop (more on that later).
While this seems somewhat simple at first glance, when I say everything, I mean everything. A notoriously manual part of many platform teams is the monitoring and alerting tooling. In a GitOps org, dashboards, alerts, metrics gathering, everything has to come from a git checkin, essentially from code.
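To make "alerts come from a git checkin" a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `AlertRule` fields, the metric names, and the thresholds are hypothetical and not tied to any particular monitoring tool. The idea is only that alert definitions live in a reviewed file, and a script renders them deterministically so every change shows up as a small, readable diff.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert definition; field names are illustrative only."""
    name: str
    metric: str
    threshold: float
    for_minutes: int


def render_rules(rules):
    """Serialize rules in a stable order so git diffs stay small and reviewable."""
    names = [rule.name for rule in rules]
    if len(names) != len(set(names)):
        raise ValueError("duplicate alert rule name")
    ordered = sorted(rules, key=lambda rule: rule.name)
    return json.dumps([asdict(rule) for rule in ordered], indent=2, sort_keys=True)


# Example rules (made-up metrics and thresholds). Changing an alert means
# editing this list and sending the change through review.
RULES = [
    AlertRule("high_cpu", "cpu_utilization_percent", 90.0, 5),
    AlertRule("disk_almost_full", "disk_used_percent", 85.0, 10),
]

if __name__ == "__main__":
    print(render_rules(RULES))
```

In a real setup, an automated job (not someone’s laptop) would take the rendered output and push it to whatever monitoring system you use; that step is deliberately left out here because it depends entirely on the tool.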
Other tricky places are scripting solutions - obviously the scripts themselves can be checked into git, but how do you really bake that into your process? I don’t think anyone would want to handle kicking a server through a git checkin, so what’s the solution? Finally, security. Security groups, permissions, and ACLs are all places where people traditionally resort to manual setup. No more!
Rule 2: Every action has to be auditable
As so often is the case, we’ve gotten so far trying to figure out if we can do GitOps, perhaps we’ve forgotten to ask if we should do GitOps? There are a few major reasons GitOps and Infrastructure as Code appeal to me - first, I’ve run commands on servers at scale during upgrades, and I’ve messed it up. At one of my first jobs I was screwing around, mistyped something, and dropped a production database. Humans make mistakes, and humans entering commands make more than their fair share.
However, another incredible benefit of GitOps is the audit log. The idea of seeing exactly who upgraded a server, or when, and who reviewed it, and if they raised any concerns, is incredibly powerful. For what it’s worth, I’ve been part of a number of organizations that keep very careful track through external change management systems of system changes, but things were always missed, and I never really felt deep confidence in the change log. GitOps has the potential to change that so every action has full traceability.
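As a rough illustration of how the git history can double as that audit log, here is a small Python sketch. It assumes `git` is installed and that your infrastructure definitions live in a repo you can read; the `AuditEntry` class and both function names are hypothetical. It only covers who changed what and when. Reviewer names and review comments usually live in your git hosting platform’s pull request records rather than in the commit itself, so they are out of scope here.

```python
import subprocess
from dataclasses import dataclass

# Unit separator: very unlikely to appear in author names or commit subjects.
FIELD_SEP = "\x1f"
# git pretty-format placeholders: full hash, author name, ISO 8601 author date, subject.
LOG_FORMAT = "%H%x1f%an%x1f%aI%x1f%s"


@dataclass(frozen=True)
class AuditEntry:
    """Hypothetical record type for one change to the infrastructure repo."""
    commit: str
    author: str
    timestamp: str
    summary: str


def parse_log(text):
    """Turn `git log` output produced with LOG_FORMAT into AuditEntry records."""
    entries = []
    for line in text.split("\n"):
        if not line:
            continue
        commit, author, timestamp, summary = line.split(FIELD_SEP, 3)
        entries.append(AuditEntry(commit, author, timestamp, summary))
    return entries


def audit_trail(repo_path, path="."):
    """List every commit that touched `path` inside the repo at `repo_path`."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{LOG_FORMAT}", "--", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_log(result.stdout)
```

Calling `audit_trail(".", "alerts/")` (the path is made up) would return the newest change first, which is usually the one you want when something has just broken.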
This is far from the totality of GitOps rules, but it’s probably enough for one post. Next time we’ll tackle rules around GitOps security and how to handle service deployments (not just infra).