[00:00:13] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4107106 (Dzahn) @faidon @mark Since Monday is a US holiday so i guess no ops meeting but i will be on duty next week: If you want to approve th...
[00:13:26] Operations, monitoring, Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4131191 (Dzahn)
[00:20:12] Operations, monitoring, Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4131201 (Dzahn) The package providing `/usr/bin/check_postgres_hot_standby_delay` is **check-postgres** https://tracker.debian.org/pkg/check-postgres Confirmed this is alre...
[00:24:21] (PS1) Bstorm: wiki replicas: Add new MCR tables to views [puppet] - https://gerrit.wikimedia.org/r/426325 (https://phabricator.wikimedia.org/T184446)
[00:28:15] (PS1) Dzahn: netbox: add postgreSQL slave monitoring [puppet] - https://gerrit.wikimedia.org/r/426326 (https://phabricator.wikimedia.org/T185504)
[00:31:17] (PS2) Dzahn: netbox: add postgreSQL slave monitoring [puppet] - https://gerrit.wikimedia.org/r/426326 (https://phabricator.wikimedia.org/T185504)
[00:32:28] (CR) Dzahn: [C: 2] netbox: add postgreSQL slave monitoring [puppet] - https://gerrit.wikimedia.org/r/426326 (https://phabricator.wikimedia.org/T185504) (owner: Dzahn)
[00:52:34] Operations, monitoring, Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4131264 (Dzahn) - puppet added the nrpe config part on netmon2001 (the slave) but not on netmon1002 (good) - puppet added the icinga config part on einsteinium - we got this...
[00:53:23] Operations, monitoring, Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4131278 (Dzahn) `ERROR: FATAL: no pg_hba.conf entry for host "2620:0:860:4:208:80:153:110", user "replication", database "template1", SSL on FATAL: no pg_hba.conf entry for ho...
[01:17:01] PROBLEM - Host labpuppetmaster1002 is DOWN: PING CRITICAL - Packet loss = 100%
[01:19:11] RECOVERY - Host labpuppetmaster1002 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[01:20:44] Operations, monitoring, Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4131306 (Dzahn) regarding a health check on the master itself: Which type of check exactly would that be on the master? So far we just have the replication check on slaves an...
[01:22:01] PROBLEM - Keyholder SSH agent on labpuppetmaster1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[05:40:22] PROBLEM - Apache HTTP on mw2261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:41:21] RECOVERY - Apache HTTP on mw2261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.111 second response time
[06:29:22] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.eqiad.wmnet.crt]
[06:31:21] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/RapidSSL_SHA256_CA_-_G3.crt]
[06:31:52] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt-upgrade-activity]
[06:43:51] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[06:56:12] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:56:52] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:21] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:19:47] Operations, Ops-Access-Requests, Maps-Sprint: sudoer access for pnorman on maps servers - https://phabricator.wikimedia.org/T192115#4131484 (Gehel) >>! In T192115#4131164, @Dzahn wrote: > How about a new group that lets you run any command as the users postgres and osmupdater (as requested) but not a...
[10:04:39] (CR) Aaron Schulz: [C: 1] Make xenon-log line-buffered [puppet] - https://gerrit.wikimedia.org/r/426224 (https://phabricator.wikimedia.org/T169249) (owner: Gilles)
[10:58:52] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))): 63794.91625124624 >= 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[10:59:51] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:31:38] Operations, Traffic, Wikimedia-Apache-configuration: Remove wildcard vhost for *.wikimedia.org - https://phabricator.wikimedia.org/T192206#4131644 (EddieGP)
[13:33:47] (CR) EddieGP: "See the thorough explanation at T192206. We're talking about 18 subdomains using this wildcard. I'll amend this patch to make all of thos" [puppet] - https://gerrit.wikimedia.org/r/424707 (https://phabricator.wikimedia.org/T173887) (owner: EddieGP)
[15:02:11] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 44.16, 36.81, 32.48
[15:34:21] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 48.41, 35.49, 33.18
[16:07:22] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 47.56, 35.75, 31.24
[16:23:00] (PS3) EddieGP: mediawiki: Move www.wikimedia.org to wwwportals.conf [puppet] - https://gerrit.wikimedia.org/r/424707 (https://phabricator.wikimedia.org/T173887)
[17:05:31] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 46.81, 37.52, 32.54
[17:07:44] (CR) EddieGP: "I think this can be abandoned. The comma-count article method has been removed from core, and we now have a maintenance script to update *" [puppet] - https://gerrit.wikimedia.org/r/363639 (owner: Reedy)
[17:13:14] (CR) EddieGP: "This seems to do mostly the same as https://gerrit.wikimedia.org/r/c/422571/" [puppet] - https://gerrit.wikimedia.org/r/322425 (owner: Alex Monk)
[17:22:31] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 39.80, 33.69, 32.57
[17:32:32] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 43.00, 34.25, 32.76
[17:53:41] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 41.21, 36.19, 32.63
[18:21:51] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.121 second response time
[18:36:51] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1965 bytes in 0.090 second response time
[18:42:41] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 53.72, 37.50, 31.43
[18:55:42] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 37.73, 33.28, 32.45
[19:14:51] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 35.64, 33.72, 32.01
[19:20:51] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 45.94, 35.39, 32.73
[19:31:51] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 41.61, 33.30, 32.36
[19:40:51] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 37.25, 32.69, 32.11
[19:49:51] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 42.11, 34.45, 33.04
[21:38:11] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 21.17, 22.76, 23.95
[23:31:41] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=75%)