[00:04:08] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:14] PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:56] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:14] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:40] PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:58:35] Logstash still looks good from here. I estimate the logging queue will burn down in about four hours.
[03:09:56] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 68 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:15:50] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 565 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:49:28] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:51:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:07:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:17:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:19:12] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:32:46] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 7668 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:33:18] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[05:34:40] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 4.667 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[06:52:24] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[06:54:08] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:56:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:02:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:45:52] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:47:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:54:22] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:00:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:32:38] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 216 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[14:35:03] (03PS1) 10Tks4Fish: Change of 'rollbacker' group settings at jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614575 (https://phabricator.wikimedia.org/T258339)
[15:37:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:41:20] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:45:39] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613658 (https://phabricator.wikimedia.org/T258100) (owner: 10Tks4Fish)
[17:48:29] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614575 (https://phabricator.wikimedia.org/T258339) (owner: 10Tks4Fish)
[18:06:42] !log Run mwscript emptyUserGroup.php --wiki=testwiki contestadmin (T256555)
[18:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:06:49] T256555: Run emptyUserGroup on testwiki - https://phabricator.wikimedia.org/T256555
[18:07:29] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612629 (https://phabricator.wikimedia.org/T257925) (owner: 10Chico Venancio)
[18:08:02] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612636 (https://phabricator.wikimedia.org/T257925) (owner: 10Chico Venancio)
[18:11:00] (03PS2) 10Urbanecm: Create archiver group at itwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612873 (https://phabricator.wikimedia.org/T257927)
[18:11:19] (03PS3) 10Urbanecm: Create closer group at itwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612873 (https://phabricator.wikimedia.org/T257927)
[18:14:42] (03PS1) 10Urbanecm: Convert ukwikisource ns:250 and ns:251 to have subpages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614579 (https://phabricator.wikimedia.org/T255930)
[18:26:16] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton
[18:27:23] 10Operations, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (10wkandek) Should dbtree reflect the replication lag on db1124?
[18:28:06] RECOVERY - proton LVS eqiad on proton.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Proton
[18:34:28] PROBLEM - Host db1085 is DOWN: PING CRITICAL - Packet loss = 100%
[18:36:58] (03PS1) 10Urbanecm: Add media.farsnews.ir to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614584 (https://phabricator.wikimedia.org/T253800)
[18:43:22] PROBLEM - MariaDB Replica IO: s6 on db1125 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1085.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1085.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:44:54] sigh, *two* this weekend?
[18:44:56] Another host down
[18:44:57] right
[18:45:00] same batch
[18:45:02] same old batch
[18:45:12] !log cdanis@cumin1001 dbctl commit (dc=all): 'db1085 also crashed', diff saved to https://phabricator.wikimedia.org/P11952 and previous config saved to /var/cache/conftool/dbconfig/20200719-184511-cdanis.json
[18:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:45:24] we should preemptively replace the rest, if it is gonna be like this
[18:45:26] you were faster than me
[18:45:42] those are scheduled to be refreshed yep
[18:45:44] next q
[18:45:54] I have downtimed it
[18:46:01] not sure if it will prevent the page
[18:46:06] RECOVERY - Host db1085 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[18:46:18] 10Operations, 10DBA: db1085 crashed - https://phabricator.wikimedia.org/T258360 (10CDanis)
[18:46:19] I am sure it is the same issue
[18:47:00] cool, was about to do that
[18:47:01] I think it will
[18:47:14] 10Operations, 10DBA: db1085 crashed - https://phabricator.wikimedia.org/T258360 (10CDanis) p:05Triage→03High
[18:47:33] description=POST Error: 313-HPE Smart Storage Battery 1 Failure - Battery Shutdown Event Code: 0x0400. Action: Restart system. Contact HPE support if condition persists.
[18:47:42] HP being HP
[18:47:44] lol
[18:47:48] batteries are bad
[18:47:59] It is bad that they cause reboots
[18:48:03] it only happens on HPs...
[18:48:03] yeah that's... sigh
[18:48:07] "if condition persists"
[18:48:13] They make it sound like the battery might magically fix itself
[18:48:24] Reedy: sometimes they do flap
[18:48:28] anyway that batch of HP machines / those BBUs are definitely towards the right end of the bathtub curve
[18:48:30] but they are obviously broken
[18:48:42] cdanis: yeah, we are refreshing those hosts next q
[18:48:46] cool :)
[18:48:47] but maybe we should accelerate that
[18:50:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:50:15] 10Operations, 10DBA: db1085 crashed - https://phabricator.wikimedia.org/T258360 (10Marostegui) BBU issues as expected. This host is also scheduled to be refreshed next Q: ` /system1/log1/record16 Targets Properties number=16 severity=Caution date=07/19/2020 time=18:44 description=POST E...
[18:51:18] PROBLEM - MariaDB Replica Lag: s6 on db1125 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1103.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:51:41] !log Upgrade and reboot db1082 T258336
[18:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:47] T258336: db1082 crashed - https://phabricator.wikimedia.org/T258336
[18:54:35] 10Operations, 10DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui)
[18:57:31] !log Start mysql on db1082 T258336
[18:57:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:36] T258336: db1082 crashed - https://phabricator.wikimedia.org/T258336
[18:59:33] 10Operations, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (10Marostegui) I have rebooted the host, the BBU looks ok. MySQL started ok as well, did the recovery finely too. Replication is started.
[18:59:34] RECOVERY - MariaDB Replica IO: s5 on db1124 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:00:04] RECOVERY - MariaDB read only s5 on db1082 is OK: Version 10.1.44-MariaDB, Uptime 159s, read_only: True, read_only: True, 2186.28 QPS, connection latency: 0.003472s, query latency: 0.000552s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[19:00:04] I'm late to the party marostegui, but if I can help you let me know, I'm around
[19:04:26] hey volans thanks - all sorted for now!
[19:04:54] sorry for being late
[19:07:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:07:07] volans: go back to your sunday
[19:07:33] * volans go back to the sunday's sunset :)
[19:08:19] marostegui: have you rebooted the host with the --do-not-crash-on-weekends option?
[19:09:02] Yes, I just compiled the kernel with that option
[19:09:16] jokes apart, have we checked if those crashes are triggered by the battery learning cycle maybe? I think we could force them to not be during weekend with some effort
[19:09:18] volans: finishing to reload LiLo and we should be good
[19:09:24] lol
[19:09:29] volans: the learning is disabled
[19:10:03] volans: Unfortunately those old HP hosts have issues with the BBU, we've seen that 3 years ago with the previous batch of them
[19:10:21] ok
[19:12:35] 10Operations, 10DBA: db1085 crashed - https://phabricator.wikimedia.org/T258360 (10Marostegui) And the BBU is gone: ` root@db1085:~# hpssacli controller all show detail | grep -i Battery No-Battery Write Cache: Disabled Battery/Capacitor Count: 0 `
[19:13:57] 10Operations, 10DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) p:05Triage→03Medium
[19:14:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:14:33] (03PS1) 10Zoranzoki21: Set $wgCategoryCollation to 'uca-bs' on Bosnian Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614039 (https://phabricator.wikimedia.org/T258346)
[19:14:35] (03PS1) 10Marostegui: db1085: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/614590 (https://phabricator.wikimedia.org/T258360)
[19:15:12] (03CR) 10Marostegui: [C: 03+2] db1085: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/614590 (https://phabricator.wikimedia.org/T258360) (owner: 10Marostegui)
[19:15:38] ACKNOWLEDGEMENT - HP RAID on db1085 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T258362 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[19:15:41] 10Operations, 10ops-eqiad: Degraded RAID on db1085 - https://phabricator.wikimedia.org/T258362 (10ops-monitoring-bot)
[19:16:10] !log Upgrade and reboot db1085 T258360
[19:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:16] T258360: db1085 crashed - https://phabricator.wikimedia.org/T258360
[19:16:43] 10Operations, 10ops-eqiad: Degraded RAID on db1085 - https://phabricator.wikimedia.org/T258362 (10Marostegui) This is a broken BBU being tracked at T258360
[19:16:58] 10Operations, 10ops-eqiad: Degraded RAID on db1085 - https://phabricator.wikimedia.org/T258362 (10Marostegui)
[19:17:00] 10Operations, 10DBA, 10Patch-For-Review: db1085 crashed - https://phabricator.wikimedia.org/T258360 (10Marostegui)
[19:20:28] 10Operations, 10DBA, 10Patch-For-Review: db1085 crashed - https://phabricator.wikimedia.org/T258360 (10Marostegui)
[19:23:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:26:22] RECOVERY - MariaDB Replica IO: s6 on db1125 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:26:46] 10Operations, 10DBA, 10Patch-For-Review: db1085 crashed - https://phabricator.wikimedia.org/T258360 (10Marostegui) Host upgraded and rebooted. MySQL looks ok, replication started
[19:28:46] RECOVERY - MariaDB Replica Lag: s6 on db1125 is OK: OK slave_sql_lag Replication lag: 0.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:34:43] (03PS1) 10Gerrit Patch Uploader: Set $wgCategoryCollation to 'uca-bs' on Bosnian Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614591 (https://phabricator.wikimedia.org/T258346)
[19:34:46] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614591 (https://phabricator.wikimedia.org/T258346) (owner: 10Gerrit Patch Uploader)
[19:35:08] (03Abandoned) 10Zoranzoki21: Set $wgCategoryCollation to 'uca-bs' on Bosnian Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614039 (https://phabricator.wikimedia.org/T258346) (owner: 10Zoranzoki21)
[19:53:17] (03CR) 10Urbanecm: [C: 03+1] Set $wgCategoryCollation to 'uca-bs' on Bosnian Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614591 (https://phabricator.wikimedia.org/T258346) (owner: 10Gerrit Patch Uploader)
[19:55:02] (03PS2) 10Urbanecm: Set $wgCategoryCollation to 'uca-bs' on Bosnian Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614591 (owner: 10Gerrit Patch Uploader)
[19:55:10] (03PS3) 10Urbanecm: Set $wgCategoryCollation to 'uca-bs' on Bosnian Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614591 (https://phabricator.wikimedia.org/T258346) (owner: 10Gerrit Patch Uploader)
[20:16:02] (03CR) 10MarcoAurelio: [C: 03+1] Convert ukwikisource ns:250 and ns:251 to have subpages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614579 (https://phabricator.wikimedia.org/T255930) (owner: 10Urbanecm)
[20:25:02] (03CR) 10MarcoAurelio: [C: 03+1] "If uca-bs is the desired collation (vid. T258346#6317852) then it looks good to me. Thanks." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614591 (https://phabricator.wikimedia.org/T258346) (owner: 10Gerrit Patch Uploader)
[20:29:48] (03CR) 10MarcoAurelio: "They've just replied and said they'd prefer uca-bs-u-kn on Task." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614591 (https://phabricator.wikimedia.org/T258346) (owner: 10Gerrit Patch Uploader)
[20:38:08] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:41:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:44:18] (03CR) 10MarcoAurelio: [C: 03+1] Change of 'rollbacker' group settings at jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/614575 (https://phabricator.wikimedia.org/T258339) (owner: 10Tks4Fish)
[20:45:28] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.199, interfaces up: 35, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:45:38] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:47:22] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 208.80.154.199, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:47:32] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:56:16] (03CR) 10MarcoAurelio: [C: 03+1] "Couple of nitpicks but LGTM." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612873 (https://phabricator.wikimedia.org/T257927) (owner: 10Urbanecm)
[21:17:38] RECOVERY - MariaDB Replica Lag: s5 on db1124 is OK: OK slave_sql_lag Replication lag: 0.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[21:36:06] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 51.78 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[21:55:19] (03PS4) 10Urbanecm: Create closer group at itwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612873 (https://phabricator.wikimedia.org/T257927)
[21:55:35] (03CR) 10Urbanecm: Create closer group at itwikinews (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/612873 (https://phabricator.wikimedia.org/T257927) (owner: 10Urbanecm)
[21:57:58] 10Operations, 10DBA: db1082 crashed - https://phabricator.wikimedia.org/T258336 (10wkandek) PST UTC Lag Minutes Lag Minutes Lag Decay Rate/minute 12:30 19:30 17:43 12:56 19:56 12:44 0:26 299 11.50 13:42 20:42 5:38 0:46 426 9.26 13:52 20:52 4:31 0:10 67...
[22:22:09] 10Operations, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T258364 (10CGlenn)
[22:35:13] 10Operations, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Setup Mailman3 in Cloud VPS - https://phabricator.wikimedia.org/T258365 (10Ladsgroup)
[23:12:32] PROBLEM - SSH on webperf2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:14:16] RECOVERY - SSH on webperf2002 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:29:22] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_proton_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:33:08] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:40:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:42:26] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:59:49] PROBLEM - proton LVS eqiad on proton.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Proton