When we think of F1, we think "high stakes" and "fast paced". To win races, feedback from the drivers who use the engineering has to be acted on quickly. For typical tech products, it can take weeks or even months to get clear feedback on a piece of technology and what needs to change. In F1, it takes anywhere from a few hours to a few days. The results of choices made in the design, technology and engineering of the car are observed almost immediately, so things have to change fast. We often say MLOps differs from DevOps mainly in that models change very quickly; ML operations within F1 change much faster than in most sectors.
We are talking about multiple simulations and models for each race track and each driver, to ensure they have the most competitive car possible. How, then, do you build for scale and resilience in such a fast-paced environment?
Technological systems can break down at any point: sensors can suddenly develop a fault and disrupt real-time data streaming, and major outages can hit technology providers, as the recent CrowdStrike incident showed. As already indicated, there is largely no room for error in F1. Every data point counts and systems need to stay up, so situations like these must be thoroughly addressed.
One obvious way to ensure resilience is to have enough backup capacity for quick restoration. While this can be costly, it provides a high level of assurance that systems will stay up and performant. For example, it is useful to have two cloud providers/platforms. Load balancing, self-healing and restoration with tools like Kubernetes on cloud platforms such as AWS are already commonly employed. But what happens if the cloud platform itself experiences an outage? That is when real panic sets in. One answer is to keep pre-designed replicas of the existing infrastructure on multiple cloud platforms, with only one running at a time to reduce cost.
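To make this concrete, here is a minimal sketch in Python of the kind of health-check-and-failover loop that could sit in front of such replicas. The endpoints, thresholds and failover step are all hypothetical; in a real setup the switch would be handled by a load balancer or infrastructure-as-code tooling rather than a script.

```python
import time
import urllib.request

# Hypothetical endpoints for the same telemetry service, deployed as
# pre-built replicas on two different cloud platforms. Only the primary
# is normally active; the standby is brought up when the primary fails.
PRIMARY_ENDPOINT = "https://telemetry.primary-cloud.example.com/health"
STANDBY_ENDPOINT = "https://telemetry.standby-cloud.example.com/health"

CHECK_INTERVAL_SECONDS = 5
FAILURES_BEFORE_FAILOVER = 3


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers its health check in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        return False


def monitor_and_failover() -> None:
    """Poll the active endpoint; after repeated failures, switch to the standby."""
    consecutive_failures = 0
    active = PRIMARY_ENDPOINT

    while True:
        if is_healthy(active):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= FAILURES_BEFORE_FAILOVER and active == PRIMARY_ENDPOINT:
                # In practice this would trigger infrastructure-as-code to start
                # the standby replica and repoint DNS or the load balancer;
                # here we simply swap the URL we monitor and route to.
                print("Primary unreachable, failing over to standby replica")
                active = STANDBY_ENDPOINT
                consecutive_failures = 0
        time.sleep(CHECK_INTERVAL_SECONDS)


if __name__ == "__main__":
    monitor_and_failover()
```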
I once worked as a technician for a fibre optic company, managing fibre cables for a well-known mobile network provider. Each time there was a failure, we had to restore the faulty cable cores within a pre-agreed "mean time to restoration" (MTTR). These cables sometimes ran across busy bridges, through wooded areas and even under small bodies of water. The team and I always had to do whatever it took to restore connectivity within the MTTR. One thing that helped us achieve this was that we always kept spare fibre optic cable on hand for replacements. We also had good rapport with the technicians managing other providers' cables, so we could often borrow some of their cores or cables temporarily to restore connectivity as quickly as possible.
Another way to ensure resilience is to have an optimal default configuration for each track. Questions that help create this could include:
At what point do the tyres typically degrade on this track?
Which parts of the track demand more power? And so on.
It is not easy to get this right, especially on the first attempt, but over time a team can create templates that deliver good enough car performance even if all of their real-time data systems go down. That way the drivers do not panic when unforeseen outages occur and can remain focused on racing. Events like this can arise at any point between the start of the race and the end. It is the responsibility of the engineers managing these systems to keep their drivers confident and to provide the best configuration possible at any given moment.
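As a rough illustration, here is a minimal Python sketch of how such fallback templates might be stored and used when real-time systems are unavailable. The track names, fields and numbers are made up for illustration; real templates would come from simulation and past race data.

```python
from dataclasses import dataclass


@dataclass
class TrackSetupTemplate:
    """A pre-agreed fallback setup for one circuit, usable without live telemetry."""
    track: str
    expected_tyre_drop_off_lap: int   # lap by which tyre performance typically falls away
    high_power_sectors: list[int]     # sectors where engine power matters most
    default_engine_mode: str
    default_brake_balance: float      # percentage toward the front axle


# Illustrative values only, not real setup data.
FALLBACK_TEMPLATES = {
    "monza": TrackSetupTemplate(
        track="monza",
        expected_tyre_drop_off_lap=22,
        high_power_sectors=[1, 3],
        default_engine_mode="standard",
        default_brake_balance=56.0,
    ),
    "monaco": TrackSetupTemplate(
        track="monaco",
        expected_tyre_drop_off_lap=30,
        high_power_sectors=[2],
        default_engine_mode="conservative",
        default_brake_balance=58.5,
    ),
}


def get_setup(track: str, live_recommendation: TrackSetupTemplate | None) -> TrackSetupTemplate:
    """Prefer the live, data-driven recommendation; fall back to the template if systems are down."""
    if live_recommendation is not None:
        return live_recommendation
    return FALLBACK_TEMPLATES[track]


if __name__ == "__main__":
    # Simulate a telemetry outage: no live recommendation is available.
    setup = get_setup("monza", live_recommendation=None)
    print(f"Using fallback setup for {setup.track}: engine mode {setup.default_engine_mode}")
```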