Posted on: 2018-12-26
Many years ago I "inherited" a website which contains a lot of user generated content. I can't go into details here in order to protect the innocent guilty - and myself.
It was based on a framework I knew nothing about so the external company which built it - let's call them Initech - was also maintaining the code. We were managing the servers. The site had different types of entries which were all stored in separate tables. Minor maintenance was done by us - like changing HTML templates or optimizing database queries to reduce server load - but new features hat to be built by Initech. Over time a few things got bolted on and the site jugged along just fine. Until one day I got a call: "Simon, no one can log in anymore!". I checked the server and found most of the users missing. Just a few days ago a new coworker took over the management of the data on that site so we assumed that he clicked something wrong. I restored the user data from the backup and hoped it wouldn't do that again but still worrying why it happened in the first place. Just to be on the safe side we increased the backup frequency.
As you can imagine, it happened again. Right the next day. The strange thing was that not all of the users were gone but around 90%. The rest remained. Before I did another restore I exported the decimated user table in order to figure out what was so special about those remaining 10% and set up a dev instance to try out various things myself.
We could have just contacted Initech but that would have taken a few days and meanwhile I didn't want to constantly restore stuff that went missing for unknown reasons. Maybe someone even had broken into the site. Also I was very curious.
First I wanted to figure out where all the deletions happened in the code. Since this was written in a scripted language I could just grep through the whole codebase. Most of the deletions where benign but one thing struck me as odd. It came from a new part that was hastily bolted on to the website recently which deleted "expired" records. The odd thing was that it took the table name for the delete-query from a variable and there was no list of which tables to delete from. On a hunch I tried to match that query to the user table and, guess what, it fit. All the fields overlapped and by some coincidence a column which was used to check the expiry of a record was also present in the user table but had a different meaning there which meant it was set on almost all users. Ok, so that's the most likely culprit. But when is it even called?
When this auto-deletion thing got implemented we didn't get any cronjob to set up and as far as I know this framework has no built in cron facility. So where does it happen? Yes, of course it happens when you attempt to view the tables in the backend. Because that's where to put it, right? .... right...? There was a facility to edit user accounts so no one ever directly viewed the user table through this system. Except for the new coworker who wanted to familiarize himself with the system and clicked on all the tables in the list. When I tried it in my dev setup, the user table took quite a while to load while the MySQL was grinding over the harddisk, deleting all the user accounts created more than a year ago with that specific flag set.
I quickly developed a one-line fix which checks the table names against a whitelist and deployed it into production straight away.
Also I sent a mail to Initech about the changes - and that we would have preferred a cronjob over this mess ...