Alain, our system administrator, is back on the blog today following up on his first article. Today we’re sharing more secrets about RadioKing’s infrastructure!
Hello everybody, it’s been a while since I’ve written on the blog, so here I am! Today, I’d like to talk to you about RadioKing’s infrastructure. Before we dive in, here’s a short glossary that may help you better understand the rest of the article:
- Cluster: a group of machines or services that work together (for example, we'll talk about a radio cluster or an API cluster)
- Kubernetes: a technology that orchestrates containers (a container may, for example, run a radio station)
- Git Repository: a place where code is stored along with its full history, so you can make changes without losing previous versions.
The new radio hosting system
As I explained in Part 1, we have set up a new system to manage and host your radios. It gives each radio its own operating environment: your radio stations are isolated from each other, which lets us manage them better.
Restarts, updates and even tests of new versions can therefore be done independently, without impacting other radio stations. This new system, which is fully automated, is based on Kubernetes. A Git repository holds all the code used to manage and deploy the radios, which allows us to guarantee that the configuration applied to the different clusters is valid.
This approach, called "GitOps" (I'm not a big fan of buzzwords), ensures resilience in the sense that, if a cluster were to be completely destroyed (which I hope never happens), an identical one could be rebuilt quickly.
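The core idea behind GitOps can be sketched as a reconciliation loop: the desired state lives in Git, and an operator continuously compares it with the live state of the cluster, applying whatever differs. Here is a minimal illustration in Python — the dictionaries and function name are hypothetical, invented for this sketch; real tools do this against the Kubernetes API:

```python
def reconcile(desired: dict, live: dict) -> dict:
    """Return the actions needed to make `live` match `desired`."""
    actions = {}
    for name, spec in desired.items():
        if live.get(name) != spec:
            actions[name] = ("apply", spec)   # create or update to match Git
    for name in live:
        if name not in desired:
            actions[name] = ("delete", None)  # prune what Git no longer declares
    return actions

# Desired state, as declared in the Git repository:
desired = {"radio-1": {"image": "radio:2.1"}, "radio-2": {"image": "radio:2.1"}}
# Live state of a cluster that was wiped and only partially rebuilt:
live = {"radio-1": {"image": "radio:2.0"}}

print(reconcile(desired, live))
# radio-1 gets updated and radio-2 gets recreated: the cluster converges back to Git.
```

This is also why a destroyed cluster can be rebuilt quickly: run the same loop against an empty `live` state and every resource is simply reapplied.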
The new Radio Manager
By now, you've probably gotten to grips with the new Radio Manager (and we're super proud of that!). This new Manager is better thought out and, above all, better equipped.
We have reworked all the APIs behind the features of this new Manager. Following the logic of stateless (and therefore disposable) microservices, these APIs are also hosted on a Kubernetes cluster, separate from the radio cluster for better resilience.
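"Stateless" here means that a request carries, or references, everything the API needs: no replica keeps anything in local memory or on local disk, so any instance can answer any request and Kubernetes can kill and replace instances freely. A hypothetical sketch (the handler, field names and in-memory "database" are invented for illustration):

```python
def handle_request(request: dict, db: dict) -> dict:
    """A stateless handler: all state lives in the request and the shared
    database, never in the process itself, so any replica gives the same answer."""
    station = db[request["station_id"]]
    return {"station": request["station_id"], "status": station["status"]}

# A shared backing store stands in for the real database:
db = {"radio-42": {"status": "on-air"}}

# Two "replicas" are just two calls to the same pure function — identical results,
# which is what makes it safe to destroy one replica and route to another:
r1 = handle_request({"station_id": "radio-42"}, db)
r2 = handle_request({"station_id": "radio-42"}, db)
assert r1 == r2 == {"station": "radio-42", "status": "on-air"}
```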
Once again, management, updates and troubleshooting are much easier and more reliable. Following the "cattle vs. pets" principle (cattle are replaceable; pets are irreplaceable), every component of this infrastructure can be replaced without damage.
The new file storage system
You may have experienced an outage affecting our broadcasting service in June 2020. Since then, we have worked hard to provide you with more reliable, distributed file storage. The new technology is state of the art and is now in production: we have deployed a Ceph cluster, hosted on powerful machines with a large storage volume.
All servers have received the latest updates, making them more secure and reliable. Each component of this infrastructure has been tested as part of our Disaster Recovery Plan: we simulated a failure of all or part of the system and analysed how it behaved in response. We are thus able to maintain the distributed file system more efficiently; it stores all the audio tracks you add to your Media Library.
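A disaster-recovery test of a replicated store like Ceph boils down to: write data with replication, kill a node, and check that every object is still readable. Here is a toy Python model of that exercise — the node names, replica count and round-robin placement are made up for the sketch; Ceph's real placement algorithm (CRUSH) is far more sophisticated:

```python
import itertools

class ReplicatedStore:
    """Toy object store that writes each object to `replicas` different nodes."""
    def __init__(self, nodes, replicas=2):
        self.nodes = {n: {} for n in nodes}
        self.replicas = replicas
        self._ring = itertools.cycle(nodes)  # naive round-robin placement

    def put(self, key, value):
        for _ in range(self.replicas):
            self.nodes[next(self._ring)][key] = value

    def get(self, key):
        for data in self.nodes.values():  # any surviving replica will do
            if key in data:
                return data[key]
        raise KeyError(key)

    def fail_node(self, node):
        self.nodes[node] = {}  # simulate a dead disk or server

store = ReplicatedStore(["node-a", "node-b", "node-c"], replicas=2)
store.put("track-001.mp3", b"audio bytes")
store.fail_node("node-a")                             # disaster: one node lost
assert store.get("track-001.mp3") == b"audio bytes"   # data is still readable
```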
Finally, knowing that two precautions are better than one, this system is also backed up with another hosting provider, on another server. Just like our current system, by the way.
Metrics, metrics and more metrics…
We're continuing to work on our overview of the entire infrastructure. A large number of metrics covering servers, virtual machines and RadioKing containers (CPU, RAM, network, etc.) are collected so that we are alerted at an early stage and can intervene quickly and precisely in case of a malfunction.
These metrics give us greater visibility into the infrastructure, as well as an overall performance index, which lets us scale the various servers that make up our infrastructure appropriately.
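Early-stage alerting of the kind described usually fires only when a metric stays above a threshold for a sustained window, so a brief spike doesn't page anyone. A simplified sketch — the threshold and window are arbitrary example values, and in practice this logic lives in a dedicated monitoring system rather than hand-written code:

```python
def should_alert(samples, threshold=90.0, window=3):
    """Fire only when the last `window` samples are all above `threshold`."""
    if len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])

# CPU usage samples, in percent:
cpu = [45.0, 97.0, 52.0, 95.0, 96.0, 98.0]
assert not should_alert(cpu[:3])  # a single spike: no alert
assert should_alert(cpu)          # three sustained high samples: alert
```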
In addition, the collection of server logs is now generalized and centralized. We are currently building dashboards to visualize the log flows, which also allow us to detect abnormal behaviour in the infrastructure.
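Spotting abnormal behaviour in centralized logs often starts with something very simple: count error lines per time window and flag windows far above the average. A hypothetical sketch, with invented log lines and an arbitrary anomaly factor:

```python
from collections import Counter

def error_counts(log_lines):
    """Count ERROR entries per minute (lines are assumed to start with 'HH:MM')."""
    counts = Counter()
    for line in log_lines:
        if "ERROR" in line:
            minute = line.split()[0]
            counts[minute] += 1
    return counts

def anomalous_windows(counts, factor=2):
    """Flag minutes whose error count exceeds `factor` times the average."""
    if not counts:
        return []
    avg = sum(counts.values()) / len(counts)
    return [minute for minute, c in counts.items() if c > factor * avg]

logs = [
    "10:00 ERROR disk timeout",
    "10:01 INFO stream started",
    "10:02 ERROR retry",
] + ["10:03 ERROR cascade failure"] * 20

print(anomalous_windows(error_counts(logs)))  # → ['10:03']
```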