Overview

At the time of writing, the latest version of Prometheus is 2.25, so the following discussion is based on the features of Prometheus 2.25.

What problem does Federation solve

Federation only partially solves the performance bottleneck of a single Prometheus instance. Prometheus describes two usage scenarios for federation: hierarchical federation and cross-service federation.

Hierarchical Federation

In hierarchical federation, detailed metrics are split across several sub-Prometheus servers, each of which manages the detailed metrics within its own scope.

For an overall view, a top-level Prometheus then fetches selected metrics from each sub-Prometheus.

As a concrete example, take a 3-node cluster with many services running on each node. Each node runs a Prometheus that is responsible for monitoring all metrics of all services on that node. For cluster-wide metrics, such as host CPU usage, a leader node can be selected to run a separate Prometheus that pulls only metrics like CPU usage from the Prometheus on each of the 3 nodes. This spreads the load across the Prometheus instances.
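As a minimal sketch of what the leader's Prometheus would request, the snippet below builds the URLs for the `/federate` endpoint, which takes one or more `match[]` selectors. The node addresses and the choice of `node_cpu_seconds_total` as the CPU metric are assumptions made for illustration, and the helper name `federate_url` is hypothetical.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL asking a sub-Prometheus for the series
    matching each selector (one match[] parameter per selector)."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical addresses of the per-node Prometheus servers.
NODE_PROMETHEUS = [
    "http://node-1:9090",
    "http://node-2:9090",
    "http://node-3:9090",
]

# Assumption: host CPU usage is exposed as node_cpu_seconds_total.
CLUSTER_SELECTORS = ['{__name__="node_cpu_seconds_total"}']

urls = [federate_url(base, CLUSTER_SELECTORS) for base in NODE_PROMETHEUS]
for url in urls:
    print(url)
```

Because only the selected series cross the federation boundary, the leader stores the cluster-wide view while the detailed per-service metrics stay on each node.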

Cross-service federation

The other model of Prometheus federation is cross-service federation. In this model, the Prometheus servers are organized by service.

For example, in a 3-node cluster, each node (host) may run both a platform service (e.g. K8S) and a business service (e.g. a web server), and each kind of service uses a different Prometheus to scrape and store its metrics. However, the business Prometheus may also care about the CPU usage of a particular business application, so it can fetch that application's CPU usage metric from the platform Prometheus. This forms a cross-service federation.
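The scrape job on the business Prometheus could look roughly like the sketch below, written as a Python dictionary that mirrors the fields of a federation scrape job (`metrics_path`, `params`, `honor_labels`, `static_configs`). The target address, the job name, and the selector (a cAdvisor-style metric filtered by a `pod` label) are assumptions for illustration only.

```python
import json


def cross_service_federation_job(platform_target, app_selector):
    """Describe a scrape job that pulls one application's series
    from the platform Prometheus via its /federate endpoint."""
    return {
        "job_name": "federate-platform",  # hypothetical job name
        "honor_labels": True,             # keep the labels set by the source server
        "metrics_path": "/federate",
        "params": {"match[]": [app_selector]},
        "static_configs": [{"targets": [platform_target]}],
    }


job = cross_service_federation_job(
    "platform-prometheus:9090",  # hypothetical platform Prometheus address
    'container_cpu_usage_seconds_total{pod=~"web-server-.*"}',  # assumed selector
)
print(json.dumps({"scrape_configs": [job]}, indent=2))
```

Keeping the selector narrow means the business Prometheus copies only the series it needs and does not duplicate the platform's full data set.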

Summary

Prometheus Federation relieves the single-instance performance bottleneck of Prometheus to some extent, but it leaves other problems unsolved. For example, in cross-service federation, if the business Prometheus goes down, the business metrics are no longer collected, and the gap in the data lasts for as long as it stays down. Likewise, if you want to see the top 10 services by CPU usage across all services, you may not be able to. It is still awkward.

The Prometheus maintainers have also addressed the problem of data loss at a single point of failure in some issues. Their proposed solution is a primary/backup setup: for the business Prometheus, run multiple Prometheus instances that scrape the same endpoints at the same frequency, but query only one of them when reading metrics.

Will a primary/backup solution work? Maybe, but not in most B2B scenarios. Metrics are real-time data, so even if two Prometheus instances scrape at the same time, they are likely to get different values. If the value collected by Prometheus A triggers an alarm and the value collected by Prometheus B does not, should the alarm fire or not? And if the alarm fires, A then goes down, and you view the historical data through B, how do you explain the alarm?
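The disagreement can be illustrated with a small, deterministic sketch. It assumes a hypothetical CPU signal with a short spike, two replicas that scrape at the same 15-second interval but start at different offsets, and a simple threshold alert. None of these numbers come from a real system.

```python
def cpu_usage(t):
    """Hypothetical CPU usage (%) at second t, with a short spike at t = 29..31."""
    return 95.0 if 29 <= t <= 31 else 40.0


def scrape(offset, interval=15, duration=60):
    """Sample the signal every `interval` seconds, starting at `offset`."""
    return [(t, cpu_usage(t)) for t in range(offset, duration, interval)]


THRESHOLD = 90.0  # assumed alert threshold in percent

replica_a = scrape(offset=0)  # samples at t = 0, 15, 30, 45
replica_b = scrape(offset=7)  # samples at t = 7, 22, 37, 52

alert_a = any(value > THRESHOLD for _, value in replica_a)
alert_b = any(value > THRESHOLD for _, value in replica_b)

print("Prometheus A fires:", alert_a)
print("Prometheus B fires:", alert_b)
```

Replica A happens to sample inside the spike and B does not, so A's data justifies an alarm that B's history cannot explain.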

Ref