Title: Explaining modern server monitoring stacks for self-hosting
Author: Solène
Date: 11 September 2022
Tags: nixos monitoring efficiency nocloud
Description: In this article, I'm exploring four different server
monitoring setups involving Grafana, VictoriaMetrics, Prometheus and
Collectd.
#!/bin/introduction
Hello 👋🏻, it's been a long time since I last had to look at server
monitoring. I set up a Grafana server six years ago, and I was using
Munin for my personal servers.
However, I recently moved my server to a small virtual machine with
CPU and memory constraints (1 core / 1 GB of memory), and Munin didn't
work very well there. I was curious to learn whether the Grafana stack
had changed since the last time I used it, and YES.
There is this project named Prometheus which is used absolutely
everywhere; it was time for me to learn about it. And as I like to go
against the flow, I tried various changes to the industry standard
stack by using VictoriaMetrics.
In this article, I'm using NixOS configuration for the examples,
however it should be obvious enough that you can still understand the
parts even if you don't know anything about NixOS.
# The components
VictoriaMetrics is a drop-in replacement for Prometheus that is a lot
more efficient (faster and uses fewer resources), and it also provides
various APIs such as Graphite or InfluxDB. It's the component that
stores the data. It comes with various programs, like the
VictoriaMetrics agent, to replace various parts of Prometheus.
Update: a dear reader showed me that VictoriaMetrics can scrape remote
agents without the VictoriaMetrics agent, which reduces the memory
usage and the configuration required.
VictoriaMetrics official website
VictoriaMetrics documentation "how to scrape prometheus exporters such as node …
Prometheus is a time series database, which also provides a collecting
agent named Node Exporter. It's also able to pull (scrape) data from
remote services offering a Prometheus API.
Prometheus official website
Node Exporter GitHub page
NixOS is an operating system built with the Nix package manager; it
has a declarative approach that requires you to reconfigure the system
when you need to make a change.
NixOS official website
Collectd is an agent that gathers metrics from the system and sends
them to a compatible remote database.
Collectd official website
Grafana is a powerful Web interface that pulls data from time series
databases to render it as useful charts for analysis.
Grafana official website
Node exporter full Grafana dashboard
# Setup 1: Prometheus server scraping remote node_exporter
In this setup, a Prometheus server is running on a server along with
Grafana, and connects to remote servers running node_exporter to gather
data.
Running it on my server, Grafana takes 67 MB, the local node_exporter
12.5 MB and Prometheus 63 MB.
```
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
grafana 837975 0.1 6.7 1384152 67836 ? Ssl 01:19 1:07 grafana-serv…
node-ex+ 953784 0.0 1.2 941292 12512 ? Ssl 16:24 0:01 node_exporter
prometh+ 983975 0.3 6.3 1226012 63284 ? Ssl 17:07 0:00 prometheus
```
Setup 1 diagram
* model: pull, Prometheus is connecting to all servers
## Pros
* it's the industry standard
* can use the "node exporter full" Grafana dashboard
## Cons
* uses memory
* you need to be able to reach all the remote nodes
## Server
```
{
  services.grafana.enable = true;
  services.prometheus.exporters.node.enable = true;
  services.prometheus = {
    enable = true;
    scrapeConfigs = [
      {
        job_name = "kikimora";
        static_configs = [
          {targets = ["10.43.43.2:9100"];}
        ];
      }
      {
        job_name = "interbus";
        static_configs = [
          {targets = ["127.0.0.1:9100"];}
        ];
      }
    ];
  };
}
```
## Client
```
{
  networking.firewall.allowedTCPPorts = [9100];
  services.prometheus.exporters.node.enable = true;
}
```
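Under the hood, Prometheus simply issues an HTTP GET to each target's
/metrics endpoint and parses the text exposition format it gets back.
As a rough illustration of what that payload looks like, here is a
minimal Python sketch of such a parser; the metric values below are
made up, and it ignores optional timestamps and label values
containing spaces:

```python
def parse_prometheus_text(payload):
    """Parse the Prometheus text exposition format (what node_exporter
    serves on /metrics) into a {metric: value} dict of floats.
    Comment lines (# HELP / # TYPE) and blanks are skipped."""
    metrics = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; everything
        # before it is the metric name, possibly with {labels}.
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

sample = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.21
node_cpu_seconds_total{cpu="0",mode="idle"} 18273.46
"""
print(parse_prometheus_text(sample))
```

This is only a sketch of the idea; the real servers handle the format
far more rigorously.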
# Setup 2: VictoriaMetrics + node-exporter in pull model
In this setup, a VictoriaMetrics server is running on a server along
with Grafana. A VictoriaMetrics agent is running locally to gather
data from remote servers running node_exporter.
Running it on my server, Grafana takes 67 MB, the local node_exporter
12.5 MB, VictoriaMetrics 30 MB and its agent 13.8 MB.
```
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
grafana 837975 0.1 6.7 1384152 67836 ? Ssl 01:19 1:07 grafana-serv…
node-ex+ 953784 0.0 1.2 941292 12512 ? Ssl 16:24 0:01 node_exporter
victori+ 986126 0.1 3.0 1287016 30052 ? Ssl 18:00 0:03 victoria-met…
root 987944 0.0 1.3 1086276 13856 ? Sl 18:30 0:00 vmagent
```
Setup 2 diagram
* model: pull, the VictoriaMetrics agent is connecting to all servers
## Pros
* can use the "node exporter full" Grafana dashboard
* lightweight and more performant than Prometheus
## Cons
* you need to be able to reach all the remote nodes
## Server
```
let
  configure_prom = builtins.toFile "prometheus.yml" ''
    scrape_configs:
    - job_name: 'kikimora'
      stream_parse: true
      static_configs:
      - targets:
        - 10.43.43.1:9100
    - job_name: 'interbus'
      stream_parse: true
      static_configs:
      - targets:
        - 127.0.0.1:9100
  '';
in {
  services.victoriametrics.enable = true;
  services.grafana.enable = true;
  systemd.services.export-to-prometheus = {
    path = with pkgs; [victoriametrics];
    enable = true;
    after = ["network-online.target"];
    wantedBy = ["multi-user.target"];
    script = "vmagent -promscrape.config=${configure_prom} -remoteWrite.url=htt…
  };
}
```
## Client
```
{
  networking.firewall.allowedTCPPorts = [9100];
  services.prometheus.exporters.node.enable = true;
}
```
# Setup 3: VictoriaMetrics + node-exporter in push model
In this setup, a VictoriaMetrics server is running on a server along
with Grafana; on each server, node_exporter and a VictoriaMetrics
agent are running to push data to the central VictoriaMetrics server.
Running it on my server, Grafana takes 67 MB, the local node_exporter
12.5 MB, VictoriaMetrics 30 MB and its agent 13.8 MB, which is exactly
the same as setup 2, except the VictoriaMetrics agent is running on
all the remote servers.
```
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
grafana 837975 0.1 6.7 1384152 67836 ? Ssl 01:19 1:07 grafana-serv…
node-ex+ 953784 0.0 1.2 941292 12512 ? Ssl 16:24 0:01 node_exporter
victori+ 986126 0.1 3.0 1287016 30052 ? Ssl 18:00 0:03 victoria-met…
root 987944 0.0 1.3 1086276 13856 ? Sl 18:30 0:00 vmagent
```
Setup 3 diagram
* model: push, each agent is connecting to the VictoriaMetrics server
## Pros
* can use the "node exporter full" Grafana dashboard
* memory efficient
* can bypass firewalls easily
## Cons
* more maintenance as you have one extra agent on each remote server
* may be bad for security, you need to allow remote servers to write to
your VictoriaMetrics server
## Server
```
{
  networking.firewall.allowedTCPPorts = [8428];
  services.victoriametrics.enable = true;
  services.grafana.enable = true;
  services.prometheus.exporters.node.enable = true;
}
```
## Client
```
let
  configure_prom = builtins.toFile "prometheus.yml" ''
    scrape_configs:
    - job_name: '${config.networking.hostName}'
      stream_parse: true
      static_configs:
      - targets:
        - 127.0.0.1:9100
  '';
in {
  services.prometheus.exporters.node.enable = true;
  systemd.services.export-to-prometheus = {
    path = with pkgs; [victoriametrics];
    enable = true;
    after = ["network-online.target"];
    wantedBy = ["multi-user.target"];
    script = "vmagent -promscrape.config=${configure_prom} -remoteWrite.url=htt…
  };
}
```
# Setup 4: VictoriaMetrics + Collectd
In this setup, a VictoriaMetrics server is running on a server along
with Grafana, and the other servers run Collectd, sending data to the
VictoriaMetrics Graphite API.
Running it on my server, Grafana takes 67 MB, VictoriaMetrics 30 MB
and Collectd 172 kB (yes).
```
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
grafana 837975 0.1 6.7 1384152 67836 ? Ssl 01:19 1:07 grafana-serv…
victori+ 986126 0.1 3.0 1287016 30052 ? Ssl 18:00 0:03 victoria-met…
collectd 844275 0.0 0.0 610432 172 ? Ssl 02:07 0:00 collectd
```
Setup 4 diagram
* model: push, VictoriaMetrics receives data from the Collectd servers
## Pros
* super memory efficient
* can bypass firewalls easily
## Cons
* you can't use the "node exporter full" Grafana dashboard
* may be bad for security, you need to allow remote servers to write to
your VictoriaMetrics server
* you need to configure Collectd for each host
## Server
The server requires VictoriaMetrics to run with its Graphite API
exposed on port 2003.
Note that in Grafana, you will have to escape "-" characters using
"\-" in the queries. I also didn't find a way to automatically
discover hosts in the data to use variables in the dashboard.
UPDATE: Using the write_tsdb exporter in Collectd, and exposing a TSDB
API with VictoriaMetrics, you can set a label on each host, and then
use the query "label_values(status)" in Grafana to automatically
discover hosts.
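The write_tsdb variant from the update could look roughly like this; a
minimal sketch I haven't run on my setup, assuming the default OpenTSDB
port 4242, a hypothetical "victoria-server.fqdn" hostname, and a
"status=production" tag chosen for illustration:
```
# On the VictoriaMetrics server: also listen for the OpenTSDB protocol
services.victoriametrics.extraOptions = [
  "-opentsdbListenAddr=:4242"
];

# On each client: replace write_graphite with write_tsdb,
# tagging every metric from this host
services.collectd.plugins."write_tsdb" = ''
  <Node>
    Host "victoria-server.fqdn"
    Port "4242"
    HostTags "status=production"
    StoreRates false
    AlwaysAppendDS false
  </Node>
'';
```
The "status=production" tag becomes a label in VictoriaMetrics, which
is what the "label_values(status)" Grafana query enumerates.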
```
{
  networking.firewall.allowedTCPPorts = [2003];
  services.victoriametrics = {
    enable = true;
    extraOptions = [
      "-graphiteListenAddr=:2003"
    ];
  };
  services.grafana.enable = true;
}
```
## Client
We only need to enable Collectd on the client:
```
{
  services.collectd = {
    enable = true;
    autoLoadPlugin = true;
    extraConfig = ''
      Interval 30
    '';
    plugins = {
      "write_graphite" = ''
        <Node "${config.networking.hostName}">
          Host "victoria-server.fqdn"
          Port "2003"
          Protocol "tcp"
          LogSendErrors true
          Prefix "collectd_"
        </Node>
      '';
      cpu = ''
        ReportByCpu false
      '';
      memory = "";
      df = ''
        Mountpoint "/"
        Mountpoint "/nix/store"
        Mountpoint "/home"
        ValuesPercentage True
        ValuesAbsolute False
      '';
      load = "";
      uptime = "";
      swap = ''
        ReportBytes false
        ReportIO false
        ValuesPercentage true
      '';
      interface = ''
        ReportInactive false
      '';
    };
  };
}
```
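What the write_graphite plugin sends over the wire is just the Graphite
plaintext protocol: one "<metric.path> <value> <unix-timestamp>" line
per data point over TCP. A small Python sketch of that line format (the
hostname and metric name are made up; "collectd_" is the Prefix from
the configuration above):

```python
import socket
import time

def graphite_line(metric, value, timestamp=None):
    """Build one line of the Graphite plaintext protocol:
    '<metric.path> <value> <unix-timestamp>\n'."""
    ts = int(timestamp if timestamp is not None else time.time())
    return f"{metric} {value} {ts}\n"

def send_metric(host, metric, value, port=2003):
    """Push a single data point to a Graphite listener, such as the
    one VictoriaMetrics exposes with -graphiteListenAddr=:2003."""
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(graphite_line(metric, value).encode())

# One line roughly as Collectd would emit it for a host "kikimora":
print(graphite_line("collectd_kikimora.load.load.shortterm", 0.25, 1662854400))
```

The protocol being this simple is also why the Collectd process can
stay so tiny compared to the other agents.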
# Trivia
The first section named "#!/bin/introduction" is on purpose and not a
mistake. It felt super fun when I started writing the article, and I
wanted to keep it that way.
The Collectd setup is the most minimalistic while still powerful, but
it requires a lot of work to make the dashboards and configure the
plugins correctly.
The setup I like best is setup 2.