Ergo

The ephemeral nature of good decisions

Per Lönn Wege

Sometimes good ideas are simply good ideas. Sometimes they’re not. Today was such a day.

For the last month or so I’ve been working on migrating our GCP Cloud functions from Gen1 to Gen2. It’s a necessary move: Gen1 has been deprecated for some time now and is slated for complete removal in the coming year. Gen2 also improves many things, adding the ability to have concurrent calls to the Cloud function running, longer-lived processes, and more.

To take a step back and describe my role: I’m a Software Engineer at <Large-scale IT company>. Technically I work with the entire stack, though my interests rest solely in the backend, maintenance, ops, and CI. As such I dove into the task of migrating the slice of Cloud functions my team controls, around 50. Over the course of around 5 weeks I went through every line of code involved in building and deploying the functions, upgrading our config management and going over each Cloud function to make sure the init code uses the new pattern and that every message is received and handled correctly.

And things seemed great. The code got a nice spit-shine, everything was upgraded cleanly, every initial test of the code seemed to work well. The PR was approved and we hit the button.

As with all things, it rarely goes right on the first try. We realised Gen1 to Gen2 is not a straight upgrade in GCP (Gen1 supports camelCased names, Gen2 does not, so we ended up with two copies of each Cloud function). The metrics and dashboards had changed names and fields. But we managed to iron it all out eventually, and finally pushed it to production.
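For the naming mismatch, a small helper like the one below could be used to work out the new names up front and spot the duplicates before deploying. It's only a sketch: `to_gen2_name` is a made-up helper, not something from our codebase, and it assumes the Gen2-friendly form is simply the same name in lowercase with hyphens.

```python
import re


def to_gen2_name(name: str) -> str:
    """Convert a camelCased function name to lowercase-with-hyphens.

    Hypothetical helper; the target format is an assumption.
    """
    # Hyphen between a lowercase letter or digit and the next uppercase letter:
    # "processOrder" -> "process-Order"
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", name)
    # Hyphen between an acronym and the word that follows it:
    # "HTTPRequest" -> "HTTP-Request"
    s = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "-", s)
    # Underscores become hyphens, then everything is lowercased.
    return s.replace("_", "-").lower()
```

Running every existing name through something like this gives a mapping from old to new names, which makes it easier to see which Gen1 functions will be left behind as duplicates and need to be removed by hand.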

This is where the proverbial shit hit the fan. Everything went down. Our connections to Redis errored out. Services were unresponsive. After a lot of panic and poking at the system we finally got it working again. Small errors aside, everything worked fine.

Until the next morning, when everything died once again. We’ve spent the day debugging and re-running our deployment to get the system back into a stable state. As far as we can tell it’s in part related to us using global state in code that can now live for a long time. Our Gen1 Cloud functions were all getting recycled regularly, forcing a cold start so often that we never noticed the imperfections in our init; nothing ever lived long enough to time out. As such we never saw issues with our Redis code and never ran into connections dying. And we never saw the complete deadlock that plagues our system at the moment.
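To make the global state problem a bit more concrete, here is a minimal sketch of the kind of pattern I expect we'll need: a connection that is created lazily, shared between concurrent calls, and re-checked after it has been idle for a while instead of being trusted forever. This is not our actual code. `ManagedConnection`, `factory`, `is_healthy` and `max_idle_seconds` are all names invented for the illustration, and it uses only the standard library so that it doesn't pretend to know the real cause of our outage.

```python
import threading
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class ManagedConnection(Generic[T]):
    """Lazily created, health-checked holder for a long-lived connection.

    Hypothetical sketch: all names here are assumptions for illustration.
    """

    def __init__(
        self,
        factory: Callable[[], T],
        is_healthy: Callable[[T], bool],
        max_idle_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._factory = factory
        self._is_healthy = is_healthy
        self._max_idle = max_idle_seconds
        self._clock = clock
        self._lock = threading.Lock()  # calls may now run concurrently
        self._conn: Optional[T] = None
        self._last_used = 0.0

    def _usable(self, now: float) -> bool:
        if self._conn is None:
            return False
        # A recently used connection is trusted without a check.
        if now - self._last_used <= self._max_idle:
            return True
        # After a long idle period, verify before reuse.
        try:
            return bool(self._is_healthy(self._conn))
        except Exception:
            # A failing health check counts as a dead connection.
            return False

    def get(self) -> T:
        """Return a usable connection, recreating it if it has gone stale."""
        with self._lock:
            now = self._clock()
            if not self._usable(now):
                self._conn = self._factory()
            self._last_used = now
            return self._conn
```

In a real function the `factory` would build the Redis client and `is_healthy` would do something cheap like a ping with a short timeout, which matters because the check runs while the lock is held. The point is only that on a long-lived instance the connection needs to be treated as something that can die, which a cold start every few minutes used to hide from us.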

At the time of this writing I’m in the middle of reverting all our changed code, pulling everything back, cleaning up and resetting, and pushing it out to our staging environment. For the time being we’ll need to stay on Gen1. We will try again another day with what we learned today, but it’s a very bitter lesson.

We learned a lot the hard way about how Gen2 manages function lifecycles differently. Our Redis connections and global state weren’t designed for that kind of longevity, and it caused serious issues. Despite the downtime and frustration, we’ve now pinpointed several problems with our codebase and will refactor our connection logic before trying again. Sometimes a lesson learned the hard way is the one that sticks.

Sometimes good ideas are simply good ideas. Sometimes they’re not. But you always learn something along the way.

Onward!