[00:04:08] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:14] PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:56] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:14] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:40] PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:58:35] Logstash still looks good from here. I estimate the logging queue will burn down in about four hours.
[03:09:56] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 68 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:15:50] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:49:28] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:51:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:07:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:17:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:19:12] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:32:46] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 7668 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:33:18] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[05:34:40] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 4.667 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:52:24] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[06:54:08] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:56:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:02:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:45:52] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:47:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:54:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:00:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:32:38] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 216 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[14:35:03] (03PS1) 10Tks4Fish: Change of 'rollbacker' group settings at jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614575 (https://phabricator.wikimedia.org/T258339)
[15:37:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:41:20] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:45:39] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613658 (https://phabricator.wikimedia.org/T258100) (owner: 10Tks4Fish)
[17:48:29] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614575 (https://phabricator.wikimedia.org/T258339) (owner: 10Tks4Fish)
[18:06:42] !log Run mwscript emptyUserGroup.php --wiki=testwiki contestadmin (T256555)
[18:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:49] T256555: Run emptyUserGroup on testwiki - https://phabricator.wikimedia.org/T256555
[18:07:29] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612629 (https://phabricator.wikimedia.org/T257925) (owner: 10Chico Venancio)
[18:08:02] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612636 (https://phabricator.wikimedia.org/T257925) (owner: 10Chico Venancio)
[18:11:00] (03PS2) 10Urbanecm: Create archiver group at itwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612873 (https://phabricator.wikimedia.org/T257927)
[18:11:19] (03PS3) 10Urbanecm: Create closer group at itwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612873 (https://phabricator.wikimedia.org/T257927)
[18:14:42] (03PS1) 10Urbanecm: Convert ukwikisource ns:250 and ns:251 to have subpages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614579 (https://phabricator.wikimedia.org/T255930)
[18:26:16] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[18:27:23] 10Operations, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (10wkandek) Should dbtree reflect the replication lag on db1124?
[18:28:06] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[18:34:28] PROBLEM - Host db1085 is DOWN: PING CRITICAL - Packet loss = 100%
[18:36:58] (03PS1) 10Urbanecm: Add media.farsnews.ir to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614584 (https://phabricator.wikimedia.org/T253800)
[18:43:22] PROBLEM - MariaDB Replica IO: s6 on db1125 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1085.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1085.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:44:54] sigh, *two* this weekend?
[18:44:56] Another host down
[18:44:57] right
[18:45:00] same batch
[18:45:02] same old batch
[18:45:12] !log cdanis@cumin1001 dbctl commit (dc=all): 'db1085 also crashed', diff saved to https://phabricator.wikimedia.org/P11952 and previous config saved to /var/cache/conftool/dbconfig/20200719-184511-cdanis.json
[18:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:24] we should preemptively replace the rest, if it is gonna be like this
[18:45:26] you were faster than me
[18:45:42] those are scheduled to be refreshed yep
[18:45:44] next q
[18:45:54] I have downtimed it
[18:46:01] not sure if it will prevent the page
[18:46:06] RECOVERY - Host db1085 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[18:46:18] 10Operations, 10DBA: db1085 crashed - https://phabricator.wikimedia.org/T258360 (10CDanis)
[18:46:19] I am sure it is the same issue
[18:47:00] cool, was about to do that
[18:47:01] I think it will
[18:47:14] 10Operations, 10DBA: db1085 crashed - https://phabricator.wikimedia.org/T258360 (10CDanis) p:05Triage→03High
[18:47:33] description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.
[18:47:42] HP being HP
[18:47:44] lol
[18:47:48] batteries are bad
[18:47:59] It is bad that they cause reboots
[18:48:03] it only happens on HPs...
[18:48:03] yeah that's... sigh
[18:48:07] "if condition persists"
[18:48:13] They make it sound like the battery might magically fix itself
[18:48:24] Reedy: sometimes they do flap
[18:48:28] anyway that batch of HP machines / those BBUs are definitely towards the right end of the bathtub curve
[18:48:30] but they are obviously broken
[18:48:42] cdanis: yeah, we are refreshing those hosts next q
[18:48:46] cool :)
[18:48:47] but maybe we should accelerate that
[18:50:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:50:15] 10Operations, 10DBA: db1085 crashed - https://phabricator.wikimedia.org/T258360 (10Marostegui) BBU issues as expected. This host is also scheduled to be refreshed next Q: ` /system1/log1/record16 Targets Properties number=16 severity=Caution date=07/19/2020 time=18:44 description=POST E...
[18:51:18] PROBLEM - MariaDB Replica Lag: s6 on db1125 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1103.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:51:41] !log Upgrade and reboot db1082 T258336
[18:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:47] T258336: db1082 crashed - https://phabricator.wikimedia.org/T258336
[18:54:35] 10Operations, 10DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[18:57:31] !log Start mysql on db1082 T258336
[18:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:36] T258336: db1082 crashed - https://phabricator.wikimedia.org/T258336
[18:59:33] 10Operations, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (10Marostegui) I have rebooted the host, the BBU looks ok. MySQL started ok as well, did the recovery finely too. Replication is started.
[18:59:34] RECOVERY - MariaDB Replica IO: s5 on db1124 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:00:04] RECOVERY - MariaDB read only s5 on db1082 is OK: Version 10.1.44-MariaDB, Uptime 159s, read_only: True, read_only: True, 2186.28 QPS, connection latency: 0.003472s, query latency: 0.000552s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[19:00:04] I'm late to the party marostegui, but if I can help you let me know, I'm around
[19:04:26] hey volans thanks - all sorted for now!
[19:04:54] sorry for being late
[19:07:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:07:07] volans: go back to your sunday
[19:07:33] * volans go back to the sunday's sunset :)
[19:08:19] marostegui: have you rebooted the host with the --do-not-crash-on-weekends option?
[19:09:02] Yes, I just compiled the kernel with that option
[19:09:16] jokes apart, have we checked if those crashes are triggered by the battery learning cycle maybe? I think we could force them to not be during weekend with some effort
[19:09:18] volans: finishing to reload LiLo and we should be good
[19:09:24] lol
[19:09:29] volans: the learning is disabled
[19:10:03] volans: Unfortunately those old HP hosts have issues with the BBU, we've seen that 3 years ago with the previous batch of them
[19:10:21] ok
[19:12:35] 10Operations, 10DBA: db1085 crashed - https://phabricator.wikimedia.org/T258360 (10Marostegui) And the BBU is gone: ` root@db1085:~# hpssacli controller all show detail | grep -i Battery No-Battery Write Cache: Disabled Battery/Capacitor Count: 0 `
[19:13:57] 10Operations, 10DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) p:05Triage→03Medium
[19:14:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:14:33] (03PS1) 10Zoranzoki21: Set $wgCategoryCollation to 'uca-bs' on Bosnian Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614039 (https://phabricator.wikimedia.org/T258346)
[19:14:35] (03PS1) 10Marostegui: db1085: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/614590 (https://phabricator.wikimedia.org/T258360)
[19:15:12] (03CR) 10Marostegui: [C: 03+2] db1085: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/614590 (https://phabricator.wikimedia.org/T258360) (owner: 10Marostegui)
[19:15:38] ACKNOWLEDGEMENT - HP RAID on db1085 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T258362 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:15:41] 10Operations, 10ops-eqiad: Degraded RAID on db1085 - https://phabricator.wikimedia.org/T258362 (10ops-monitoring-bot)
[19:16:10] !log Upgrade and reboot db1085 T258360
[19:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:16] T258360: db1085 crashed - https://phabricator.wikimedia.org/T258360
[19:16:43] 10Operations, 10ops-eqiad: Degraded RAID on db1085 - https://phabricator.wikimedia.org/T258362 (10Marostegui) This is a broken BBU being tracked at T258360
[19:16:58] 10Operations, 10ops-eqiad: Degraded RAID on db1085 - https://phabricator.wikimedia.org/T258362 (10Marostegui)
[19:17:00] 10Operations, 10DBA, 10Patch-For-Review: db1085 crashed - https://phabricator.wikimedia.org/T258360 (10Marostegui)
[19:20:28] 10Operations, 10DBA, 10Patch-For-Review: db1085 crashed - https://phabricator.wikimedia.org/T258360 (10Marostegui)
[19:23:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:26:22] RECOVERY - MariaDB Replica IO: s6 on db1125 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:26:46] 10Operations, 10DBA, 10Patch-For-Review: db1085 crashed - https://phabricator.wikimedia.org/T258360 (10Marostegui) Host upgraded and rebooted. MySQL looks ok, replication started
[19:28:46] RECOVERY - MariaDB Replica Lag: s6 on db1125 is OK: OK slave_sql_lag Replication lag: 0.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:34:43] (03PS1) 10Gerrit Patch Uploader: Set $wgCategoryCollation to 'uca-bs' on Bosnian Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614591 (https://phabricator.wikimedia.org/T258346)
[19:34:46] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614591 (https://phabricator.wikimedia.org/T258346) (owner: 10Gerrit Patch Uploader)
[19:35:08] (03Abandoned) 10Zoranzoki21: Set $wgCategoryCollation to 'uca-bs' on Bosnian Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614039 (https://phabricator.wikimedia.org/T258346) (owner: 10Zoranzoki21)
[19:53:17] (03CR) 10Urbanecm: [C: 03+1] Set $wgCategoryCollation to 'uca-bs' on Bosnian Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614591 (https://phabricator.wikimedia.org/T258346) (owner: 10Gerrit Patch Uploader)
[19:55:02] (03PS2) 10Urbanecm: Set $wgCategoryCollation to 'uca-bs' on Bosnian Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614591 (owner: 10Gerrit Patch Uploader)
[19:55:10] (03PS3) 10Urbanecm: Set $wgCategoryCollation to 'uca-bs' on Bosnian Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614591 (https://phabricator.wikimedia.org/T258346) (owner: 10Gerrit Patch Uploader)
[20:16:02] (03CR) 10MarcoAurelio: [C: 03+1] Convert ukwikisource ns:250 and ns:251 to have subpages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614579 (https://phabricator.wikimedia.org/T255930) (owner: 10Urbanecm)
[20:25:02] (03CR) 10MarcoAurelio: [C: 03+1] "If uca-bs is the desired collation (vid. T258346#6317852) then it looks good to me. Thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614591 (https://phabricator.wikimedia.org/T258346) (owner: 10Gerrit Patch Uploader)
[20:29:48] (03CR) 10MarcoAurelio: "They've just replied and said they'd prefer uca-bs-u-kn on Task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614591 (https://phabricator.wikimedia.org/T258346) (owner: 10Gerrit Patch Uploader)
[20:38:08] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:41:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:44:18] (03CR) 10MarcoAurelio: [C: 03+1] Change of 'rollbacker' group settings at jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614575 (https://phabricator.wikimedia.org/T258339) (owner: 10Tks4Fish)
[20:45:28] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:45:38] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:47:22] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:47:32] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:56:16] (03CR) 10MarcoAurelio: [C: 03+1] "Couple of nitpicks but LGTM." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612873 (https://phabricator.wikimedia.org/T257927) (owner: 10Urbanecm)
[21:17:38] RECOVERY - MariaDB Replica Lag: s5 on db1124 is OK: OK slave_sql_lag Replication lag: 0.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[21:36:06] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 51.78 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[21:55:19] (03PS4) 10Urbanecm: Create closer group at itwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612873 (https://phabricator.wikimedia.org/T257927)
[21:55:35] (03CR) 10Urbanecm: Create closer group at itwikinews (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612873 (https://phabricator.wikimedia.org/T257927) (owner: 10Urbanecm)
[21:57:58] 10Operations, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (10wkandek) PST UTC Lag Minutes Lag Minutes Lag Decay Rate/minute 12:30 19:30 17:43 12:56 19:56 12:44 0:26 299 11.50 13:42 20:42 5:38 0:46 426 9.26 13:52 20:52 4:31 0:10 67...
[22:22:09] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T258364 (10CGlenn)
[22:35:13] 10Operations, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Setup Mailman3 in Cloud VPS - https://phabricator.wikimedia.org/T258365 (10Ladsgroup)
[23:12:32] PROBLEM - SSH on webperf2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:14:16] RECOVERY - SSH on webperf2002 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:29:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:33:08] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:40:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:42:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:59:49] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton