[00:40:13] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:01:27] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[01:07:43] PROBLEM - k8s API server requests latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:08:05] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:18:49] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:22:07] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:23:57] PROBLEM - k8s API server requests latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:26:03] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[01:30:31] PROBLEM - k8s API server requests latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:32:45] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=compareAndSwap https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:35:21] PROBLEM - k8s API server requests latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:35:25] PROBLEM - k8s API server requests latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:38:51] PROBLEM - k8s API server requests latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PUT https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:39:17] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=get https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:39:31] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:41:09] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:44:11] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=compareAndSwap https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:47:17] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[01:48:23] PROBLEM - k8s API server requests latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:55:09] PROBLEM - k8s API server requests latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=PUT https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:55:35] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=get https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:56:31] PROBLEM - k8s API server requests latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:04:45] RECOVERY - k8s API server requests latencies on argon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:05:27] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:06:37] RECOVERY - k8s API server requests latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:06:49] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:08:07] RECOVERY - k8s API server requests latencies on neon is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[02:40:21] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19754608 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:41:59] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 25288 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:30:55] (PS1) DannyS712: Add `autopatrol` to translation administrators on mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/541057
[03:32:35] (PS2) DannyS712: Add `autopatrol` to translation administrators on mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/541057
[03:33:27] (PS3) DannyS712: Add `autopatrol` to translation administrators on mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/541057
[06:47:13] !log delete old cron entry 'xenon_generate_svgs' (user xenon) on webperf[12]002 to reduce cronspam
[06:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:30] Cc: gilles, Krinkle --^
[07:16:00] (CR) Volans: "I did mostly a python-style pass, comments inline." (14 comments) [debs/debdeploy] - https://gerrit.wikimedia.org/r/506188 (owner: Jbond)
[07:17:50] (CR) Volans: [C: +1] "LGTM, minor typos inline" (3 comments) [debs/debdeploy] - https://gerrit.wikimedia.org/r/540849 (owner: Jbond)
[07:21:59] (CR) Volans: [C: +1] "LGTM, minor nits inline" (2 comments) [debs/debdeploy] - https://gerrit.wikimedia.org/r/540850 (owner: Jbond)
[09:54:03] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:15:15] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[10:54:07] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[11:10:21] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[11:54:09] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[12:05:33] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[12:08:49] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[12:20:11] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[12:23:25] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[12:31:31] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[12:39:39] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[12:42:22] elukey: I'm not aware of an old cron on those hosts. More details?
[12:43:10] elukey: are these from before the xenon>arc lamp rename perhaps?
[15:59:05] +/13
[16:15:43] (CR) Volans: [C: +1] "At first sight seems equivalent to the existing one, so lgtm." [debs/debdeploy] - https://gerrit.wikimedia.org/r/540851 (owner: Jbond)
[17:03:01] Operations, SRE-tools, Traffic, Goal, and 2 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (Volans) Thanks @BBlack for the very detailed and precise summary. >>! In T233183#5544784, @BBlack wrote: > == `$ORIGIN` issues and empty `$IN...
[17:58:45] PROBLEM - Check systemd state on labsdb1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:59:23] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[17:59:41] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[18:02:01] RECOVERY - Check systemd state on labsdb1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:59:51] PROBLEM - Check systemd state on labsdb1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:00:15] PROBLEM - haproxy failover on dbproxy1019 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[19:00:43] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[19:03:05] RECOVERY - Check systemd state on labsdb1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:15:25] !log Reload haproxy on dbproxy1010, dbproxy1011, dbproxy1018, dbproxy1019
[19:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:53] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[19:16:29] RECOVERY - haproxy failover on dbproxy1019 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[19:16:59] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[19:17:11] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[19:28:47] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:39:25] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:50:51] (CR) Ori.livneh: "Cool!" [mediawiki-config] - https://gerrit.wikimedia.org/r/540647 (https://phabricator.wikimedia.org/T156095) (owner: Krinkle)
[20:11:50] !log mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Racconish /home/urbanecm/T234741 (T234741)
[20:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:56] T234741: Server side upload for Racconish - https://phabricator.wikimedia.org/T234741
[21:34:03] (PS1) Paladox: gerrit: Remove master_host variable from profile::gerrit::server [puppet] - https://gerrit.wikimedia.org/r/541108
[21:35:05] (PS2) Paladox: gerrit: Remove master_host variable from profile::gerrit::server [puppet] - https://gerrit.wikimedia.org/r/541108
[21:35:31] (PS3) Paladox: gerrit: Remove master_host variable from profile::gerrit::server [puppet] - https://gerrit.wikimedia.org/r/541108
[21:35:36] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/541108 (owner: Paladox)
[21:38:30] (CR) Paladox: "Puppet compiler: https://puppet-compiler.wmflabs.org/compiler1001/297/" [puppet] - https://gerrit.wikimedia.org/r/541108 (owner: Paladox)
[21:38:53] (PS1) Paladox: Gerrit: Switch master from cobalt to gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/541110
[21:40:07] (PS2) Paladox: Gerrit: Switch master from cobalt to gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/541110
[21:40:17] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/541110 (owner: Paladox)
[21:45:15] Operations, DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for banwiki - https://phabricator.wikimedia.org/T234770 (jhsoby)
[21:47:10] (CR) Paladox: "Puppet Compiler: https://puppet-compiler.wmflabs.org/compiler1001/298/" [puppet] - https://gerrit.wikimedia.org/r/541110 (owner: Paladox)
[21:52:18] (PS3) Paladox: Gerrit: Switch master from cobalt to gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/541110
[21:52:25] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/541110 (owner: Paladox)
[21:54:28] (CR) Paladox: "Updated Puppet Compiler: https://puppet-compiler.wmflabs.org/compiler1001/299/" [puppet] - https://gerrit.wikimedia.org/r/541110 (owner: Paladox)
[21:56:32] (PS1) Paladox: Switch gerrit.wikimedia.org backend to gerrit1001 [dns] - https://gerrit.wikimedia.org/r/541111
[22:01:38] (PS2) Paladox: Switch gerrit.wikimedia.org backend to gerrit1001 [dns] - https://gerrit.wikimedia.org/r/541111
[22:03:38] (PS1) Paladox: Remove gerrit-slave from dns [dns] - https://gerrit.wikimedia.org/r/541112
[22:04:29] (PS2) Paladox: Remove gerrit-slave from dns [dns] - https://gerrit.wikimedia.org/r/541112
[22:08:29] (CR) Paladox: [C: -1] "@Dzahn we can do this when we upgrade to 2.15.17!" [puppet] - https://gerrit.wikimedia.org/r/532391 (owner: Paladox)
[22:08:45] (PS3) Paladox: Revert "Gerrit: Set base url for commitlink" [puppet] - https://gerrit.wikimedia.org/r/532391
[22:11:30] (PS5) Paladox: Gerrit: Migrate theme to support Polymer 2 [puppet] - https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509)
[22:11:33] (CR) Paladox: Gerrit: Migrate theme to support Polymer 2 (2 comments) [puppet] - https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) (owner: Paladox)
[22:12:27] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[22:12:45] (PS1) Paladox: Gerrit: Disable auto reloading replication config [puppet] - https://gerrit.wikimedia.org/r/541115
[22:13:50] (PS2) Paladox: Gerrit: Disable auto reloading replication config [puppet] - https://gerrit.wikimedia.org/r/541115
[22:13:56] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/541115 (owner: Paladox)
[22:16:46] (PS8) Paladox: Gerrit: Get cobalt to replicate to gerrit1001 [puppet] - https://gerrit.wikimedia.org/r/540164
[22:17:10] (CR) Paladox: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/540164 (owner: Paladox)
[22:18:47] (CR) Paladox: "Puppet Compiler: https://puppet-compiler.wmflabs.org/compiler1001/301/" [puppet] - https://gerrit.wikimedia.org/r/540164 (owner: Paladox)
[22:36:53] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[22:43:21] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[22:48:27] (PS2) Mathew.onipe: wdqs: add data-reload cookbook [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588)
[22:49:05] (CR) Mathew.onipe: wdqs: add data-reload cookbook (10 comments) [cookbooks] - https://gerrit.wikimedia.org/r/540153 (https://phabricator.wikimedia.org/T230588) (owner: Mathew.onipe)
[22:59:39] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[23:04:31] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[23:23:51] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[23:24:05] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[23:25:41] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[23:45:05] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers