[00:00:13] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to shell (snapshot, dumpsdata) for springle - https://phabricator.wikimedia.org/T191478#4107106 (Dzahn) @faidon @mark Since Monday is a US holiday so i guess no ops meeting but i will be on duty next week: If you want to approve th...
[00:13:26] Operations, monitoring, Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4131191 (Dzahn)
[00:20:12] Operations, monitoring, Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4131201 (Dzahn) The package providing `/usr/bin/check_postgres_hot_standby_delay` is **check-postgres** https://tracker.debian.org/pkg/check-postgres Confirmed this is alre...
[00:24:21] (PS1) Bstorm: wiki replicas: Add new MCR tables to views [puppet] - https://gerrit.wikimedia.org/r/426325 (https://phabricator.wikimedia.org/T184446)
[00:28:15] (PS1) Dzahn: netbox: add postgreSQL slave monitoring [puppet] - https://gerrit.wikimedia.org/r/426326 (https://phabricator.wikimedia.org/T185504)
[00:31:17] (PS2) Dzahn: netbox: add postgreSQL slave monitoring [puppet] - https://gerrit.wikimedia.org/r/426326 (https://phabricator.wikimedia.org/T185504)
[00:32:28] (CR) Dzahn: [C: 2] netbox: add postgreSQL slave monitoring [puppet] - https://gerrit.wikimedia.org/r/426326 (https://phabricator.wikimedia.org/T185504) (owner: Dzahn)
[00:52:34] Operations, monitoring, Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4131264 (Dzahn) - puppet added the nrpe config part on netmon2001 (the slave) but not on netmon1002 (good) - puppet added the icinga config part on einsteinium - we got this...
[00:53:23] Operations, monitoring, Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4131278 (Dzahn) `ERROR: FATAL: no pg_hba.conf entry for host "2620:0:860:4:208:80:153:110", user "replication", database "template1", SSL on FATAL: no pg_hba.conf entry for ho...
[01:17:01] PROBLEM - Host labpuppetmaster1002 is DOWN: PING CRITICAL - Packet loss = 100%
[01:19:11] RECOVERY - Host labpuppetmaster1002 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms
[01:20:44] Operations, monitoring, Patch-For-Review: Netbox: add Icinga check for PostgreSQL - https://phabricator.wikimedia.org/T185504#4131306 (Dzahn) regarding a health check on the master itself: Which type of check exactly would that be on the master? So far we just have the replication check on slaves an...
[01:22:01] PROBLEM - Keyholder SSH agent on labpuppetmaster1002 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[05:40:22] PROBLEM - Apache HTTP on mw2261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:41:21] RECOVERY - Apache HTTP on mw2261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.111 second response time
[06:29:22] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ssl/localcerts/api.svc.eqiad.wmnet.crt]
[06:31:21] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/share/ca-certificates/RapidSSL_SHA256_CA_-_G3.crt]
[06:31:52] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt-upgrade-activity]
[06:43:51] RECOVERY - Disk space on labtestnet2001 is OK: DISK OK
[06:56:12] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:56:52] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:59:21] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:19:47] Operations, Ops-Access-Requests, Maps-Sprint: sudoer access for pnorman on maps servers - https://phabricator.wikimedia.org/T192115#4131484 (Gehel) >>! In T192115#4131164, @Dzahn wrote: > How about a new group that lets you run any command as the users postgres and osmupdater (as requested) but not a...
[10:04:39] (CR) Aaron Schulz: [C: 1] Make xenon-log line-buffered [puppet] - https://gerrit.wikimedia.org/r/426224 (https://phabricator.wikimedia.org/T169249) (owner: Gilles)
[10:58:52] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))): 63794.91625124624 >= 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[10:59:51] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1003.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[13:31:38] Operations, Traffic, Wikimedia-Apache-configuration: Remove wildcard vhost for *.wikimedia.org - https://phabricator.wikimedia.org/T192206#4131644 (EddieGP)
[13:33:47] (CR) EddieGP: "See the thorough explanation at T192206. We're talking about 18 subdomains using this wildcard. I'll amend this patch to make all of thos" [puppet] - https://gerrit.wikimedia.org/r/424707 (https://phabricator.wikimedia.org/T173887) (owner: EddieGP)
[15:02:11] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 44.16, 36.81, 32.48
[15:34:21] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 48.41, 35.49, 33.18
[16:07:22] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 47.56, 35.75, 31.24
[16:23:00] (PS3) EddieGP: mediawiki: Move www.wikimedia.org to wwwportals.conf [puppet] - https://gerrit.wikimedia.org/r/424707 (https://phabricator.wikimedia.org/T173887)
[17:05:31] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 46.81, 37.52, 32.54
[17:07:44] (CR) EddieGP: "I think this can be abandoned. The comma-count article method has been removed from core, and we now have a maintenance script to update *" [puppet] - https://gerrit.wikimedia.org/r/363639 (owner: Reedy)
[17:13:14] (CR) EddieGP: "This seems to do mostly the same as https://gerrit.wikimedia.org/r/c/422571/" [puppet] - https://gerrit.wikimedia.org/r/322425 (owner: Alex Monk)
[17:22:31] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 39.80, 33.69, 32.57
[17:32:32] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 43.00, 34.25, 32.76
[17:53:41] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 41.21, 36.19, 32.63
[18:21:51] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.121 second response time
[18:36:51] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1965 bytes in 0.090 second response time
[18:42:41] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 53.72, 37.50, 31.43
[18:55:42] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 37.73, 33.28, 32.45
[19:14:51] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 35.64, 33.72, 32.01
[19:20:51] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 45.94, 35.39, 32.73
[19:31:51] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 41.61, 33.30, 32.36
[19:40:51] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 37.25, 32.69, 32.11
[19:49:51] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 42.11, 34.45, 33.04
[21:38:11] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 21.17, 22.76, 23.95
[23:31:41] PROBLEM - Disk space on labtestnet2001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=75%)