Linux in the Time of COVID-19

Can’t believe I managed to spend more than a year without updating this blog! In hindsight, perhaps sticking to open-source communication channels should have been in my 2019 resolutions…

I’ve been meaning to write a comprehensive status update, but perhaps starting small and updating more often is the way to go. So here goes…

On WFH and moving

Thanks to the #COVID19 situation, everyone at Facebook has been working from home for a few weeks now. It’s a blessing and a curse: my wife and I were in the process of relocating, so while not having to commute meant I had more time to pack, it also meant I was working from a tiny apartment where everything was being boxed up!

The move itself was, luckily, relatively uneventful, if you don’t count finding out the day before, just as we were about to check in, that the airline had canceled all flights on our departure date, so we had to scramble to rebook and fly out on the same day. On the bright side, the airport was virtually deserted, everyone was doing the physical distancing thing, and our household goods and vehicle were both delivered early. Even the cable guy showed up early to set up our Internet access!

I don’t have my home office set up yet (another week for that), but my blood pressure monitor has been much happier with me this week.

On Fedora, CentOS, and that conference talk

My team had a presentation accepted for SCaLE 18x in early March, but unfortunately we had to call it off. We were already on site in Pasadena, but between our company’s updated travel guidance and not wanting to get sick (or worse, infect others) just before an expensive house move, I figured caution was warranted. The talk was going to be an update on our Flock 2019 talk (video here), and since the material has already been approved for public use, here goes.

The upgrade firehose problem

Note: This is of increasing relevance now that everyone is working from home. Problems that can be easily fixed on site (if worse comes to worst, you can just reimage a machine to a known good state) can leave a remote employee without access to the tools they need to do their job.

I’ve quipped at work that “move fast, break things” (which, to be fair, we abandoned as a slogan years ago) should really be “move slow, fix things”. That’s not my phrasing; I’m borrowing it from someone on Mastodon, but will need to dig up my old toots to see who I got it from.

Sadly, some Linux components are just updated too fast and do not get enough quality assurance. Not throwing anyone under the bus here; as long as hardware manufacturers don’t consider Linux a primary platform, which is the case on desktops and laptops, there’ll never be enough eyes to catch regressions. But it does seem to have gotten worse recently. In short order we got hit by:

  • trackpad issues, affecting some but not all new ThinkPad T490s (because of course within the same SKU you might still get different components)
  • Intel video lock-ups in kernel 5.4
  • ThinkPads with Nvidia GPUs locking up if you connect an external monitor, again on kernel 5.4
  • built-in microphone not recognized if you have an updated ALSA library but the latest stable Pulseaudio
  • a pre-release Pulseaudio pushed to Fedora, which fixes that microphone issue but breaks Bluetooth audio on some hardware

Ideally we would have at least one Fedora tester for each hardware type in our fleet, but that’s not going to be practical in the short term. So what to do instead?

Move slow

Rather than consuming the latest upstream kernel within roughly a month of it coming out (when Fedora releases its build), why not use the CentOS kernel? It’s stable (only critical fixes are backported), and since CentOS 8 is relatively new it happens to be the newest kernel officially supported by Nvidia anyway.

For Chef users, we open sourced cpe_kernel_channel, our cookbook for opting to use the CentOS kernel instead of the regular Fedora kernel.
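If you want to check which kind of kernel a given machine is actually running, the release string is usually enough. The snippet below is a minimal sketch and not part of cpe_kernel_channel: it assumes the conventional RPM dist tags (".fcNN" for Fedora builds, ".elN" for CentOS/RHEL builds), and the function name kernel_channel is made up for illustration.

```python
import platform
import re

# Dist tags as they conventionally appear in RPM kernel release strings, e.g.
# "5.4.8-200.fc31.x86_64" (Fedora) or "4.18.0-147.el8.x86_64" (CentOS/RHEL).
# This pattern is an assumption about naming, not an official interface.
_DIST_TAG = re.compile(r"\.(fc|el)(\d+)(?:_\d+)?\.")


def kernel_channel(release: str) -> str:
    """Classify a kernel release string as 'fedora', 'el', or 'unknown'."""
    match = _DIST_TAG.search(release)
    if match is None:
        return "unknown"
    # "el" covers both CentOS and RHEL builds; the tag alone cannot tell them apart.
    return "fedora" if match.group(1) == "fc" else "el"


def running_kernel_channel() -> str:
    """Classify the kernel this process is running on."""
    return kernel_channel(platform.release())
```

Calling running_kernel_channel() on a laptop and collecting the answers across the fleet gives a quick picture of who is on the slower channel.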

The next obvious step is to run CentOS itself rather than Fedora. Happily CentOS 8 runs well enough even on most recent ThinkPad laptops (let’s forget about that Yoga with a suspend issue). The one notable exception is Bluetooth audio support - bouncing bluetooth and pulseaudio repeatedly to get A2DP working is nobody’s idea of fun. We might need to ship backported Fedora components to address this (ironic, yes). If you see recent commits to our IT-CPE repo adding CentOS support, that’s why.

Fix things

Apart from moving slower on updates to reduce the chance of regressions being rolled out, the flip side of the coin is being able to quickly recover from such breakages.

Switching to Btrfs would help here: being able to snapshot before every Chef run, and roll back in case a bad change is deployed, would be a huge time saver. There’s some work to do in kickstarting a system with both LUKS and Btrfs, but that’s not an intractable problem; more interesting would be getting Btrfs re-added to CentOS.
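To make the snapshot-before-Chef idea concrete, here is a rough sketch of what a wrapper could look like. It is not what we run in production. The snapshot directory, the naming scheme, and the guarded_chef_run helper are all assumptions for illustration; only the btrfs subvolume snapshot and chef-client commands themselves are real. It covers just the snapshot half: how you roll back depends on your subvolume layout and boot setup, so that part is left out.

```python
import subprocess
from datetime import datetime, timezone
from typing import Callable, List, Optional, Sequence

Runner = Callable[[Sequence[str]], int]


def snapshot_cmd(subvolume: str, snapshot_path: str) -> List[str]:
    """Build the command for a read-only Btrfs snapshot of `subvolume`."""
    return ["btrfs", "subvolume", "snapshot", "-r", subvolume, snapshot_path]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (needs root for btrfs/chef-client)."""
    return subprocess.run(list(cmd), check=False).returncode


def guarded_chef_run(
    subvolume: str = "/",               # assumed: root is a Btrfs subvolume
    snapshot_dir: str = "/.snapshots",  # assumed location, must already exist
    runner: Runner = run,
    now: Optional[datetime] = None,
) -> dict:
    """Snapshot first, then run Chef; report which snapshot to roll back to."""
    now = now or datetime.now(timezone.utc)
    snapshot_path = f"{snapshot_dir}/pre-chef-{now:%Y%m%dT%H%M%SZ}"
    if runner(snapshot_cmd(subvolume, snapshot_path)) != 0:
        # Without a snapshot there is nothing to roll back to, so skip Chef.
        return {"snapshot": None, "chef_ok": False}
    chef_ok = runner(["chef-client"]) == 0
    return {"snapshot": snapshot_path, "chef_ok": chef_ok}
```

The runner is injectable so the logic can be exercised without root or a Btrfs filesystem. If chef_ok comes back False, the returned snapshot path is the known good state to restore from.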

What’s next?

One resolution I failed to keep is to learn a new programming language - being stuck using Python and Ruby at work starts to feel constraining after a while, and somehow Go never really appealed to me personally. Hoping to work on my Rust personal project in the next few weeks!

Michel Alexandre Salim
Production Engineer

Michel Alexandre Salim is a Production Engineer at Facebook. He likes automating things.
