GitLab is an awesome product! Although I don’t use their hosted service at GitLab.com, I’ve been a very happy user of the product in an internally hosted setup.
They had a pretty bad (and well-publicized) incident a couple of days back, which started with spammers hammering their Postgres DBs and unfortunately ended with a sysadmin accidentally removing almost 300GB of production data.
I can empathize (#HugOps) with the engineers who were working tirelessly to rectify the situation. Shit can hit the fan anytime you have a production system with so many users open to the wild internet. The transparency shown by the GitLab team to keep their users informed during the incident was awesome and required amazing guts!
Now, most blogs/experts talk about the technical aspects of the unfortunate incident. These mainly focus on DB backup, replication and restoration processes, which are, no doubt, highly valid points.
I’d like to suggest another key aspect that came to mind when going through the incident report: the human aspect!
This aspect seems to be ignored by many. From all accounts it looks like the team member working on the database issue was alone, tired and frustrated. The data removal disaster may have been averted if not one but two engineers had been working on the problem together. Think pair-programming. Obviously, screen sharing can be used if the engineers are not co-located.
I know this still does not guarantee against a serious f*ck up, but as a company/startup you would probably have better odds on your side.
An engineer should never work alone when fixing a highly critical production issue.
When trying to fix critical production issues in software systems, it’s super important to have an aircraft-style co-pilot working with you, on the lookout for potential howlers, e.g. running rm -rf on the wrong folder.
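To make the co-pilot idea a bit more concrete, here is a minimal sketch of a "two sets of eyes" check that could sit in front of a destructive command. Everything in it is my own assumption for illustration: the `Confirmation` type, the `approve_removal` function and the host names are hypothetical and have nothing to do with GitLab's actual tooling. The idea is simply that the removal is approved only when two different engineers have each independently typed the host and path they believe they are about to wipe, and both match the machine the command would really run on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confirmation:
    """What one engineer believes is about to happen (hypothetical type)."""
    engineer: str  # who is confirming
    host: str      # the host they think the command will run on
    path: str      # the path they think will be removed


def approve_removal(actual_host, path, confirmations):
    """Approve only if two *different* engineers named the real host and path."""
    matching_engineers = {
        c.engineer
        for c in confirmations
        if c.host == actual_host and c.path == path
    }
    return len(matching_engineers) >= 2


if __name__ == "__main__":
    # Both engineers believe they are on the replica (hypothetical names).
    pilot = Confirmation("pilot", "db-replica", "/var/lib/data")
    copilot = Confirmation("copilot", "db-replica", "/var/lib/data")

    # On the replica the removal is approved; on the primary it is refused.
    print(approve_removal("db-replica", "/var/lib/data", [pilot, copilot]))
    print(approve_removal("db-primary", "/var/lib/data", [pilot, copilot]))
```

In a real setup the `actual_host` value would come from the machine itself (for example via `socket.gethostname()`) rather than from whoever is typing, which is exactly the kind of mismatch a second person is there to catch.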
There is always something to learn from adversity. Rock on, GitLab! Still a big fan.