Title: Explaining modern server monitoring stacks for self-hosting
Author: Solène
Date: 11 September 2022
Tags: nixos monitoring efficiency nocloud
Description: In this article, I'm exploring four different server
monitoring setups involving Grafana, VictoriaMetrics, Prometheus and
Collectd.
#!/bin/introduction
Hello 👋🏻, it's been a long time since I last had to look into server
monitoring. I set up a Grafana server six years ago, and I was using
Munin for my personal servers.
However, I recently moved my server to a small virtual machine with
CPU and memory constraints (1 core / 1 GB of memory), and Munin didn't
work very well on it. I was curious to see whether the Grafana stack
had changed since the last time I used it, and YES, it has.
There is this project named Prometheus which is now used absolutely
everywhere, so it was time for me to learn about it. And as I like to
go against the flow, I also tried a few variations of the industry
standard stack using VictoriaMetrics.
In this article, I'm using NixOS configuration for the examples,
however it should be easy enough to understand the important parts
even if you don't know anything about NixOS.
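If you have never touched NixOS: every configuration block in this
article is a NixOS module. As a minimal sketch (the monitoring.nix
file name is just an example of mine, not something used later), you
would save a snippet into a file and import it from your
configuration.nix:
```
{ config, pkgs, ... }:
{
  # import one of the monitoring snippets from this article, saved
  # next to configuration.nix (the file name is only an example)
  imports = [ ./monitoring.nix ];
}
```
Rebuilding the system, for example with nixos-rebuild switch, then
applies the change.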
# The components
VictoriaMetrics is a drop-in replacement for Prometheus that is a lot
more efficient (faster and uses fewer resources), and which also
provides various ingestion APIs such as Graphite or InfluxDB. It's the
component storing the data. It comes with companion programs such as
the VictoriaMetrics agent (vmagent) that can replace various parts of
Prometheus.
Update: a dear reader showed me that VictoriaMetrics can scrape remote
exporters itself, without the VictoriaMetrics agent, which reduces the
memory usage and the configuration required.
VictoriaMetrics official website
VictoriaMetrics documentation "how to scrape prometheus exporters such as node …
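I haven't rewritten the setups below to use it, but here is a rough
sketch of that simpler variant: you hand VictoriaMetrics a
Prometheus-style scrape configuration directly through its
-promscrape.config flag (the target address simply reuses the example
address from setup 1):
```
let
  scrape_config = builtins.toFile "scrape.yml" ''
    scrape_configs:
    - job_name: 'kikimora'
      static_configs:
      - targets:
        - 10.43.43.2:9100
  '';
in {
  services.victoriametrics = {
    enable = true;
    # VictoriaMetrics scrapes the node_exporter targets itself, no vmagent
    extraOptions = ["-promscrape.config=${scrape_config}"];
  };
}
```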
Prometheus is a time series database which also provides a collecting
agent named node_exporter. It is able to pull (scrape) data from
remote services exposing a Prometheus-compatible API.
Prometheus official website
Node Exporter GitHub page
NixOS is an operating system built on top of the Nix package manager;
it has a declarative approach that requires rebuilding the system
configuration whenever you need to make a change.
NixOS official website
Collectd is an agent gathering metrics from the system and sending
them to a remote compatible database.
Collectd official website
Grafana is a powerful web interface pulling data from time series
databases to render it as useful charts for analysis.
Grafana official website
Node exporter full Grafana dashboard
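In all the setups below, Grafana still needs to be pointed at the time
series database as a data source. You can do this from the web
interface, or declaratively; here is a minimal sketch of the
declarative way (the provisioning options have moved around between
NixOS releases, so the attribute names may need adjusting for your
release):
```
{
  services.grafana = {
    enable = true;
    provision = {
      enable = true;
      datasources = [
        {
          name = "Prometheus";
          type = "prometheus";
          # Prometheus listens on port 9090 by default; VictoriaMetrics
          # answers the same query API on port 8428
          url = "http://127.0.0.1:9090";
          isDefault = true;
        }
      ];
    };
  };
}
```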
# Setup 1: Prometheus server scraping remote node_exporter
In this setup, a Prometheus server is running on a server along with
Grafana, and connects to remote servers running node_exporter to
gather data.
Running it on my server, Grafana takes 67 MB of memory, the local
node_exporter 12.5 MB and Prometheus 63 MB.
```
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
grafana 837975 0.1 6.7 1384152 67836 ? Ssl 01:19 1:07 grafana-serv…
node-ex+ 953784 0.0 1.2 941292 12512 ? Ssl 16:24 0:01 node_exporter
prometh+ 983975 0.3 6.3 1226012 63284 ? Ssl 17:07 0:00 prometheus
```
Setup 1 diagram
* model: pull, Prometheus is connecting to all the servers
## Pros
* it's the industry standard
* can use the "node exporter full" Grafana dashboard
## Cons
* Prometheus uses quite a lot of memory
* you need to be able to reach all the remote nodes
## Server
```
{
  services.grafana.enable = true;

  # node_exporter for the monitoring server itself
  services.prometheus.exporters.node.enable = true;

  services.prometheus = {
    enable = true;
    # Prometheus pulls the metrics from each node_exporter target
    scrapeConfigs = [
      {
        job_name = "kikimora";
        static_configs = [
          {targets = ["10.43.43.2:9100"];}
        ];
      }
      {
        job_name = "interbus";
        static_configs = [
          {targets = ["127.0.0.1:9100"];}
        ];
      }
    ];
  };
}
```
## Client
```
{
  # open the port so the Prometheus server can scrape node_exporter
  networking.firewall.allowedTCPPorts = [9100];
  services.prometheus.exporters.node.enable = true;
}
```
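Optionally, if the monitoring server reaches the client over a VPN or
a private network like the 10.43.43.x addresses used above, you can
bind node_exporter to that address only, so it isn't exposed on public
interfaces (the address here is just the example one):
```
{
  networking.firewall.allowedTCPPorts = [9100];
  services.prometheus.exporters.node = {
    enable = true;
    # only listen on the private address the Prometheus server connects to
    listenAddress = "10.43.43.2";
  };
}
```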
# Setup 2: VictoriaMetrics + node-exporter in pull model
In this setup, a VictoriaMetrics server is running on a server along
with Grafana. A VictoriaMetrics agent (vmagent) runs locally on that
server to gather data from the remote servers running node_exporter.
Running it on my server, Grafana takes 67 MB of memory, the local
node_exporter 12.5 MB, VictoriaMetrics 30 MB and its agent 13.8 MB.
```
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
grafana 837975 0.1 6.7 1384152 67836 ? Ssl 01:19 1:07 grafana-serv…
node-ex+ 953784 0.0 1.2 941292 12512 ? Ssl 16:24 0:01 node_exporter
victori+ 986126 0.1 3.0 1287016 30052 ? Ssl 18:00 0:03 victoria-met…
root 987944 0.0 1.3 1086276 13856 ? Sl 18:30 0:00 vmagent
```
Setup 2 diagram
* model: pull, the VictoriaMetrics agent is connecting to all servers
## Pros
* can use the "node exporter full" Grafana dashboard
* lightweight and more performant than Prometheus
## Cons
* you need to be able to reach all the remote nodes
## Server
```
let
  # scrape configuration given to vmagent, in the Prometheus format
  configure_prom = builtins.toFile "prometheus.yml" ''
    scrape_configs:
    - job_name: 'kikimora'
      stream_parse: true
      static_configs:
      - targets:
        - 10.43.43.1:9100
    - job_name: 'interbus'
      stream_parse: true
      static_configs:
      - targets:
        - 127.0.0.1:9100
  '';
in {
  services.victoriametrics.enable = true;
  services.grafana.enable = true;

  # vmagent scrapes the targets above and forwards the samples to the
  # local VictoriaMetrics instance
  systemd.services.export-to-prometheus = {
    path = with pkgs; [victoriametrics];
    enable = true;
    after = ["network-online.target"];
    wantedBy = ["multi-user.target"];
    script = "vmagent -promscrape.config=${configure_prom} -remoteWrite.url=htt…
  };
}
```
## Client
```
{
  networking.firewall.allowedTCPPorts = [9100];
  services.prometheus.exporters.node.enable = true;
}
```
# Setup 3: VictoriaMetrics + node-exporter in push model
In this setup, a VictoriaMetrics server is running on a server along
with Grafana; on each server, node_exporter and a VictoriaMetrics
agent are running to push data to the central VictoriaMetrics server.
Running it on my server, Grafana takes 67 MB of memory, the local
node_exporter 12.5 MB, VictoriaMetrics 30 MB and its agent 13.8 MB,
which is exactly the same as setup 2, except that the VictoriaMetrics
agent is now running on all the remote servers.
```
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
grafana 837975 0.1 6.7 1384152 67836 ? Ssl 01:19 1:07 grafana-serv…
node-ex+ 953784 0.0 1.2 941292 12512 ? Ssl 16:24 0:01 node_exporter
victori+ 986126 0.1 3.0 1287016 30052 ? Ssl 18:00 0:03 victoria-met…
root 987944 0.0 1.3 1086276 13856 ? Sl 18:30 0:00 vmagent
```
Setup 3 diagram
* model: push, each agent is connecting to the VictoriaMetrics server
## Pros
* can use the "node exporter full" Grafana dashboard
* memory efficient
* can bypass firewalls easily
## Cons
* each remote server needs to be able to reach your VictoriaMetrics
server
* more maintenance as you have one extra agent on each remote
* may be bad for security, you need to allow remote servers to write to
your VictoriaMetrics server
## Server
```
{
  # allow the remote vmagents to push their metrics to VictoriaMetrics
  networking.firewall.allowedTCPPorts = [8428];
  services.victoriametrics.enable = true;
  services.grafana.enable = true;
  services.prometheus.exporters.node.enable = true;
}
```
## Client
```
let
  # each client only scrapes its own local node_exporter
  configure_prom = builtins.toFile "prometheus.yml" ''
    scrape_configs:
    - job_name: '${config.networking.hostName}'
      stream_parse: true
      static_configs:
      - targets:
        - 127.0.0.1:9100
  '';
in {
  services.prometheus.exporters.node.enable = true;

  # vmagent pushes the scraped metrics to the central VictoriaMetrics server
  systemd.services.export-to-prometheus = {
    path = with pkgs; [victoriametrics];
    enable = true;
    after = ["network-online.target"];
    wantedBy = ["multi-user.target"];
    script = "vmagent -promscrape.config=${configure_prom} -remoteWrite.url=htt…
  };
}
```
# Setup 4: VictoriaMetrics + Collectd
In this setup, a VictoriaMetrics server is running on a server along
with Grafana, and the other servers run Collectd, which sends data to
the VictoriaMetrics Graphite API.
Running it on my server, Grafana takes 67 MB of memory, VictoriaMetrics
30 MB and Collectd 172 kB (yes).
```
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
grafana 837975 0.1 6.7 1384152 67836 ? Ssl 01:19 1:07 grafana-serv…
victori+ 986126 0.1 3.0 1287016 30052 ? Ssl 18:00 0:03 victoria-met…
collectd 844275 0.0 0.0 610432 172 ? Ssl 02:07 0:00 collectd
```
Setup 4 diagram
* model: push, VictoriaMetrics receives data from the Collectd servers
## Pros
* super memory efficient
* can bypass firewalls easily
## Cons
* you can't use the "node exporter full" Grafana dashboard
* may be bad for security, you need to allow remote servers to write to
your VictoriaMetrics server
* you need to configure Collectd for each host
## Server
The server requires VictoriaMetrics to run with its Graphite API
exposed on port 2003.
Note that in Grafana, you will have to escape "-" characters using
"\-" in the queries. I also didn't find a way to automatically
discover hosts in the data to use variables in the dashboard.
UPDATE: using the write_tsdb plugin in Collectd, and exposing an
OpenTSDB API with VictoriaMetrics, you can set a label on each host,
and then use the query "label_values(status)" in Grafana to
automatically discover hosts (a sketch of this variant is shown after
the client configuration below).
```
{
  networking.firewall.allowedTCPPorts = [2003];
  services.victoriametrics = {
    enable = true;
    # expose the Graphite protocol used by collectd's write_graphite plugin
    extraOptions = [
      "-graphiteListenAddr=:2003"
    ];
  };
  services.grafana.enable = true;
}
```
## Client
We only need to enable Collectd on the client:
```
{
  services.collectd = {
    enable = true;
    autoLoadPlugin = true;
    extraConfig = ''
      Interval 30
    '';
    plugins = {
      "write_graphite" = ''
        <Node "${config.networking.hostName}">
          Host "victoria-server.fqdn"
          Port "2003"
          Protocol "tcp"
          LogSendErrors true
          Prefix "collectd_"
        </Node>
      '';
      cpu = ''
        ReportByCpu false
      '';
      memory = "";
      df = ''
        Mountpoint "/"
        Mountpoint "/nix/store"
        Mountpoint "/home"
        ValuesPercentage True
        ValuesAbsolute False
      '';
      load = "";
      uptime = "";
      swap = ''
        ReportBytes false
        ReportIO false
        ValuesPercentage true
      '';
      interface = ''
        ReportInactive false
      '';
    };
  };
}
```
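For reference, here is a minimal sketch of the write_tsdb variant
mentioned in the update above; the port 4242 and the "status" tag are
assumptions of mine, not something taken from a tested configuration.
On the server, VictoriaMetrics has to listen for the OpenTSDB put
protocol:
```
{
  networking.firewall.allowedTCPPorts = [4242];
  services.victoriametrics = {
    enable = true;
    # accept the OpenTSDB put protocol used by collectd's write_tsdb plugin
    extraOptions = ["-opentsdbListenAddr=:4242"];
  };
}
```
On each client, the write_graphite plugin is replaced by write_tsdb:
```
{
  services.collectd.plugins."write_tsdb" = ''
    <Node>
      Host "victoria-server.fqdn"
      Port "4242"
      # the "status" tag becomes the label read by label_values(status)
      HostTags "status=${config.networking.hostName}"
    </Node>
  '';
}
```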
# Trivia
The first section being named "#!/bin/introduction" is on purpose and
not a mistake. It felt super fun when I started writing the article,
and I wanted to keep it that way.
The Collectd setup is the most minimalistic while still powerful, but
it requires a lot of work to build the dashboards and configure the
plugins correctly.
The setup I like best is setup 2.