[00:01:37] <icinga-wm>	 RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:04:57] <icinga-wm>	 PROBLEM - Check systemd state on centrallog1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:26:17] <icinga-wm>	 PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100%
[05:26:41] <icinga-wm>	 PROBLEM - Host kafka-main2003 is DOWN: PING CRITICAL - Packet loss = 100%
[05:26:41] <icinga-wm>	 PROBLEM - Host elastic2049 is DOWN: PING CRITICAL - Packet loss = 100%
[05:26:51] <icinga-wm>	 PROBLEM - Host elastic2059 is DOWN: PING CRITICAL - Packet loss = 100%
[05:26:51] <icinga-wm>	 PROBLEM - Host thanos-be2003 is DOWN: PING CRITICAL - Packet loss = 100%
[05:26:55] <icinga-wm>	 PROBLEM - Host elastic2048 is DOWN: PING CRITICAL - Packet loss = 100%
[05:27:11] <icinga-wm>	 PROBLEM - Host cp2038 is DOWN: PING CRITICAL - Packet loss = 100%
[05:27:55] <icinga-wm>	 PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100%
[05:28:01] <icinga-wm>	 PROBLEM - Host ms-be2054 is DOWN: PING CRITICAL - Packet loss = 100%
[05:28:05] <icinga-wm>	 PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100%
[05:28:26] <Bsadowski1>	 ?
[05:28:27] <icinga-wm>	 PROBLEM - Host ms-be2049 is DOWN: PING CRITICAL - Packet loss = 100%
[05:30:01] <icinga-wm>	 RECOVERY - Host elastic2048 is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms
[05:30:01] <icinga-wm>	 RECOVERY - Host ms-be2049 is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms
[05:30:01] <icinga-wm>	 RECOVERY - Host elastic2059 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms
[05:30:01] <icinga-wm>	 RECOVERY - Host elastic2049 is UP: PING OK - Packet loss = 0%, RTA = 30.29 ms
[05:30:03] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:30:03] <icinga-wm>	 RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 30.14 ms
[05:30:03] <icinga-wm>	 RECOVERY - Host cp2038 is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms
[05:30:03] <icinga-wm>	 RECOVERY - Host cp2037 is UP: PING OK - Packet loss = 0%, RTA = 30.20 ms
[05:30:03] <icinga-wm>	 RECOVERY - Host thanos-be2003 is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms
[05:30:03] <icinga-wm>	 RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms
[05:30:05] <icinga-wm>	 RECOVERY - Host kafka-main2003 is UP: PING OK - Packet loss = 0%, RTA = 30.17 ms
[05:30:05] <icinga-wm>	 RECOVERY - Host ms-be2054 is UP: PING OK - Packet loss = 0%, RTA = 30.17 ms
[05:30:15] <icinga-wm>	 PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp2037 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/HTTPS
[05:31:39] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:37:51] <icinga-wm>	 PROBLEM - Host kafka-main2003 is DOWN: PING CRITICAL - Packet loss = 100%
[05:38:05] <icinga-wm>	 PROBLEM - Host elastic2048 is DOWN: PING CRITICAL - Packet loss = 100%
[05:38:05] <icinga-wm>	 PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100%
[05:38:05] <icinga-wm>	 PROBLEM - Host elastic2049 is DOWN: PING CRITICAL - Packet loss = 100%
[05:38:17] <icinga-wm>	 PROBLEM - Host thanos-be2003 is DOWN: PING CRITICAL - Packet loss = 100%
[05:38:17] <icinga-wm>	 PROBLEM - Host elastic2059 is DOWN: PING CRITICAL - Packet loss = 100%
[05:38:21] <icinga-wm>	 PROBLEM - Host cp2038 is DOWN: PING CRITICAL - Packet loss = 100%
[05:39:21] <icinga-wm>	 PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100%
[05:39:25] <icinga-wm>	 PROBLEM - Host ms-be2054 is DOWN: PING CRITICAL - Packet loss = 100%
[05:39:33] <icinga-wm>	 PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100%
[05:39:37] <icinga-wm>	 PROBLEM - Host ms-be2049 is DOWN: PING CRITICAL - Packet loss = 100%
[05:41:17] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:41:23] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[05:42:51] <icinga-wm>	 RECOVERY - Host thanos-be2003 is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms
[05:42:51] <icinga-wm>	 RECOVERY - Host elastic2049 is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms
[05:42:51] <icinga-wm>	 RECOVERY - Host cp2037 is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms
[05:42:51] <icinga-wm>	 RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 30.20 ms
[05:42:51] <icinga-wm>	 RECOVERY - Host elastic2059 is UP: PING OK - Packet loss = 0%, RTA = 30.22 ms
[05:42:51] <icinga-wm>	 RECOVERY - Host ms-be2049 is UP: PING OK - Packet loss = 0%, RTA = 30.14 ms
[05:42:53] <icinga-wm>	 RECOVERY - Host cp2038 is UP: PING OK - Packet loss = 0%, RTA = 30.20 ms
[05:42:53] <icinga-wm>	 RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms
[05:42:55] <icinga-wm>	 RECOVERY - Host elastic2048 is UP: PING OK - Packet loss = 0%, RTA = 31.07 ms
[05:42:55] <icinga-wm>	 RECOVERY - Host kafka-main2003 is UP: PING OK - Packet loss = 0%, RTA = 31.82 ms
[05:42:57] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:42:57] <icinga-wm>	 RECOVERY - Host ms-be2054 is UP: PING OK - Packet loss = 0%, RTA = 31.78 ms
[05:43:05] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw-c-codfw is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[05:52:48] <wikibugs>	 10Operations, 10ops-codfw, 10cloud-services-team (Hardware): Network flap on cloudbackup2002 - https://phabricator.wikimedia.org/T267864 (10Bstorm)
[06:28:57] <icinga-wm>	 PROBLEM - ores on ores2003 is CRITICAL: connect to address 10.192.16.63 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:29:59] <icinga-wm>	 PROBLEM - ores on ores2005 is CRITICAL: connect to address 10.192.32.173 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:40:25] <icinga-wm>	 RECOVERY - ores on ores2003 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:41:29] <icinga-wm>	 RECOVERY - ores on ores2005 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[07:44:42] <elukey>	 ing]
[07:44:42] <elukey>	 14:37 @<fdans> elukey: yea it's an rsync in the stats.pp file within dumps (IIRC)
[07:44:49] <elukey>	 ahhaah
[07:45:14] <elukey>	 no idea why I pasted that sorry, weird combination of keys
[07:46:15] <elukey>	 so rack C7 blipped for a bit in codfw
[07:59:09] <elukey>	 fpc7 seems to be online since 2h ago.. opening a task
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201115T0800)
[08:09:21] <wikibugs>	 10Operations, 10netops: Switch on rack C7 in codfw got rebooted - https://phabricator.wikimedia.org/T267865 (10elukey)
[08:18:03] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:23:23] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on an-worker1098 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:24:16] <elukey>	 !log sudo truncate -s 10g /var/lib/hadoop/data/c/yarn/logs/application_1601916545561_173219/container_e25_1601916545561_173219_01_000019/stderr on an-worker1098
[08:24:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:03] <wikibugs>	 10Operations, 10ops-codfw, 10cloud-services-team (Hardware): Network flap on cloudbackup2002 - https://phabricator.wikimedia.org/T267864 (10Peachey88) I suspect this may be {T267865}
[08:27:20] <elukey>	 !log truncate -s 10g /var/lib/hadoop/data/n/yarn/logs/application_1601916545561_173219/container_e25_1601916545561_173219_01_000177/stderr on an-worker1100
[08:27:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:21] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on an-worker1100 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[08:29:53] <wikibugs>	 10Operations, 10netops: Switch on rack C7 in codfw got rebooted - https://phabricator.wikimedia.org/T267865 (10elukey)
[08:30:07] <wikibugs>	 10Operations, 10ops-codfw, 10cloud-services-team (Hardware): Network flap on cloudbackup2002 - https://phabricator.wikimedia.org/T267864 (10elukey) >>! In T267864#6622450, @Peachey88 wrote: > I suspect this may be {T267865}  Good point @Peachey88, cloudbackup2002 is indeed in rack C7!   @Bstorm closing this...
[08:30:17] <wikibugs>	 10Operations, 10ops-codfw, 10cloud-services-team (Hardware): Network flap on cloudbackup2002 - https://phabricator.wikimedia.org/T267864 (10elukey)
[08:44:35] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3064 is OK: HTTP OK: HTTP/1.0 200 OK - 23482 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[09:07:06] <Urbanecm>	 !log Change email for SUL user Botopol via resetUserEmail.php (T267866)
[09:07:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:13] <stashbot>	 T267866: User:Botopol has forgotten their password, need a reset via CLI - https://phabricator.wikimedia.org/T267866
[09:27:23] <icinga-wm>	 PROBLEM - Host kafka-main2003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:23] <icinga-wm>	 PROBLEM - Host ms-be2054 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:35] <icinga-wm>	 PROBLEM - Host cp2038 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:45] <icinga-wm>	 PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:45] <icinga-wm>	 PROBLEM - Host ms-be2049 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:51] <icinga-wm>	 PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:03] <icinga-wm>	 PROBLEM - Host elastic2059 is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:15] <icinga-wm>	 PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:35] <icinga-wm>	 PROBLEM - Host elastic2048 is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:59] <icinga-wm>	 PROBLEM - Host thanos-be2003 is DOWN: PING CRITICAL - Packet loss = 100%
[09:29:03] <icinga-wm>	 PROBLEM - Host elastic2049 is DOWN: PING CRITICAL - Packet loss = 100%
[09:29:40] <elukey>	 ouch this is c7 again
[09:31:23] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:31:33] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[09:33:34] <elukey>	 ok this time the router is down down
[09:34:05] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp4022 is CRITICAL: 3.677e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4022
[09:34:21] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp2028 is CRITICAL: 4.263e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2028
[09:35:13] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp4031 is CRITICAL: 4.209e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031
[09:35:13] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp4025 is CRITICAL: 4.328e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4025
[09:35:15] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp4028 is CRITICAL: 4.643e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028
[09:35:33] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp2042 is CRITICAL: 4.625e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2042
[09:35:39] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp2036 is CRITICAL: 4.9e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2036
[09:35:51] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp2039 is CRITICAL: 5.097e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2039
[09:36:58] <elukey>	 this is weird, is it due to kafka-main2003 down?
[09:38:55] <icinga-wm>	 PROBLEM - configured eth on lvs2007 is CRITICAL: ens3f0np0 reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[09:39:44] <elukey>	 yep purged[30904]: %4|1605433157.790|REQTMOUT|purged#consumer-1| [thrd:ssl://kafka-main2003.codfw.wmnet:909
[09:39:52] <elukey>	 vgutierrez, ema around?
[09:40:08] <elukey>	 I think it is a matter of restarting purged
[09:40:36] <vgutierrez>	 I can be online in 30m
[09:40:57] <elukey>	 vgutierrez: ok if I restart purged on a node?
[09:41:07] <vgutierrez>	 Go ahead please
[09:41:10] <elukey>	 super :)
[09:42:01] <elukey>	 !log restart purged on cp2028 (kafka-main2003 is down and there are connect timeouts errors)
[09:42:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:26] <elukey>	 takes ages to restart
[09:43:33] <elukey>	 ok seems to work, waiting a bit and then doing it on other nodes
[09:43:37] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2001 is CRITICAL: 412 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001
[09:43:37] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2002 is CRITICAL: 305 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002
[09:43:43] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[09:44:02] <elukey>	 ah really? We hit an ES master?
[09:44:04] <elukey>	 ahahaha
[09:44:17] <elukey>	 gehel, dcausse - around by any chance?
[09:44:23] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp2028 is OK: (C)5000 gt (W)3000 gt 184.2 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2028
[09:44:29] <elukey>	 this is good --^
[09:47:17] <wikibugs>	 10Operations, 10netops: Switch on rack C7 in codfw got rebooted - https://phabricator.wikimedia.org/T267865 (10elukey) Went down again, but this time no recovery..
[09:47:32] <elukey>	 XioNoX: around? 
[09:47:40] <gehel>	 elukey: I am
[09:47:46] <elukey>	 gehel: hello! 
[09:47:56] <elukey>	 So rack C7's switch decided to take holidays
[09:48:15] <elukey>	 in codfw
[09:48:41] <elukey>	 and it brough with it elastic2048 and 2049/2059
[09:49:06] <gehel>	 Booting the PC
[09:49:39] <elukey>	 ack thanks :)
[09:49:49] <gehel>	 Loosing a master should not be an issue short term. And we should have more than enough capacity in codfw
[09:49:58] <elukey>	 perfect
[09:50:06] <gehel>	 For elastic at least
[09:50:54] <elukey>	 !log restart purged on cp4022 (consumer stuck due to kafka-main2003 down)
[09:50:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:24] <gehel>	 response times look OK on codfw search: https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&from=now-6h&to=now
[09:52:29] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp4022 is OK: (C)5000 gt (W)3000 gt 1304 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4022
[09:53:12] <elukey>	 !log restart purged on cp4031 (consumer stuck due to kafka-main2003 down)
[09:53:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:38] <gehel>	 not looking too bad on the elasticsearch side, the cluster is recovering the lost shards
[09:54:04] <gehel>	 it should be back to green in a few, but even now it should be serving requests just fine
[09:54:25] * gehel goes back to taking care of the kids, scream if you need me
[09:55:02] <elukey>	 thanks gehel !
[09:55:44] <elukey>	 !log restart purged on cp4025 (consumer stuck due to kafka-main2003 down)
[09:55:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:51] <elukey>	 !log restart purged on cp4028 (consumer stuck due to kafka-main2003 down)
[09:57:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:39] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp4025 is OK: (C)5000 gt (W)3000 gt 201 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4025
[09:58:39] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp4031 is OK: (C)5000 gt (W)3000 gt 153.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031
[10:00:47] <elukey>	 !log cumin 'cp2042* or cp2036* or cp2039*' 'systemctl restart purged' -b 1
[10:00:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:01:15] <elukey>	 all right cp nodes should be taken care, and we have also ES recovering
[10:02:25] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp2036 is OK: (C)5000 gt (W)3000 gt 621.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2036
[10:02:39] <elukey>	 lvs2007 looks "staged" in netbox, and it is in row A
[10:03:45] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp4028 is OK: (C)5000 gt (W)3000 gt 287.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028
[10:04:19] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp2039 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2039
[10:07:23] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp2042 is OK: (C)5000 gt (W)3000 gt 329.5 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2042
[10:11:46] <elukey>	 ok so for lvs2007, I guess it is the link from the host to row C down? (possibly ending up on rack c7's switch)
[10:12:18] <elukey>	 elukey@asw-c-codfw> show interfaces descriptions | match lvs2007
[10:12:18] <elukey>	 xe-7/0/45                  lvs2007:enp175s0f0
[10:12:19] <elukey>	 yep
[10:12:28] <elukey>	 okok makes sense
[10:13:11] <elukey>	 nothing is really exploding so I'd wait for vgutierrez to double check, and then possibly alert netops via sms (not a call)
[10:15:07] <wikibugs>	 10Operations, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10elukey)
[10:15:25] <wikibugs>	 10Operations, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10elukey) p:05Triage→03High
[10:18:29] <wikibugs>	 10Operations, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10elukey) Current impact:  * purged on some cp2/cp4 nodes got stuck while connecting to kafka-main2003, a manual restart was needed. * the kafka-main cluster is currently in reduced capacity (2 nodes instead...
[10:20:11] <elukey>	 brb
[10:29:00] <XioNoX>	 elukey: yo
[10:30:00] <elukey>	 XioNoX: hey! 
[10:30:17] <XioNoX>	 elukey: did everything fail cleanly?
[10:30:57] <XioNoX>	 is it in a reboot loop or fully down?
[10:30:57] <elukey>	 XioNoX: afaics I think so, but if you could double check it would be great :)
[10:31:14] <elukey>	 XioNoX: rebooted a couple of times first, then it recovered and then it went down
[10:31:25] <XioNoX>	 fun times
[10:32:44] <wikibugs>	 10Operations, 10Traffic: purged is not resilient to kafka main nodes going down - https://phabricator.wikimedia.org/T267867 (10elukey)
[10:33:10] <XioNoX>	 console is not responsive
[10:33:47] <wikibugs>	 10Operations, 10Traffic: purged is not resilient to kafka main nodes going down - https://phabricator.wikimedia.org/T267867 (10elukey)
[10:34:56] <elukey>	 created the task for purged --^
[10:35:48] <XioNoX>	 opening a Juniper ticket
[10:41:06] <wikibugs>	 10Operations, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10ayounsi) Down around Nov 15 09:28:34 UTC.  Console is unresponsive.  Opening JTAC case for RMA.
[10:45:04] <elukey>	 XioNoX: the only thing that I have never seen is one lvs interface down (in this case, lvs2007(
[10:45:39] <XioNoX>	 elukey: are health checks going through that interface?
[10:45:41] <elukey>	 I also haven't checked if 2007 is the active of the backup
[10:46:13] <XioNoX>	 looks active https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?viewPanel=7&orgId=1
[10:46:14] <elukey>	 I have no idea, I've only checked that it is the interface that connects to row-c 
[10:47:11] <elukey>	 ah wow it is also a spine
[10:47:17] <elukey>	 lovely
[10:47:52] <elukey>	 anyway, the only thing that I'd check now is lvs, the rest looks fine
[10:47:53] <XioNoX>	 we're at codfw nadir, so that's nice
[10:48:17] <elukey>	 TIL nadir
[10:48:18] <XioNoX>	 elukey: can we see the state of the LVS2007 healthchecks somewhere?
[10:48:45] <XioNoX>	 if everything is fine they should be failing for row D appservers
[10:48:50] <XioNoX>	 er, row C
[10:50:17] <wikibugs>	 10Operations, 10ops-codfw, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10ayounsi) We also have spares QFX5100, so on monday we can swap the dead one.
[10:50:45] <elukey>	 XioNoX: there is /var/log/pybal.log on lvs2007
[10:51:13] <vgutierrez>	 back online.. getting to the laptop after 100k on the bicycle is kinda painful
[10:51:48] <elukey>	 vgutierrez: hahaha I am so sorry, every time I ping you in the least preferrable moment :(
[10:51:52] <XioNoX>	 elukey: https://grafana.wikimedia.org/d/000000421/pybal?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-server=lvs2007&var-service=All&from=now-6h&to=now
[10:52:02] <XioNoX>	 so looks like pooled servers went down
[10:52:12] <vgutierrez>	 elukey: don't worry, that can be handled with some beer on the next IRL offiste
[10:52:34] <elukey>	 XioNoX: yes so cp2035.codfw.wmnet for example is in C2 and pyball shows failures
[10:52:36] <XioNoX>	 we can see the two reboots and the final down
[10:52:46] <XioNoX>	 so no need to touch anything
[10:53:22] <elukey>	 XioNoX: mmm but in theory lvs2007 is row-c blind no? (trying to understand)
[10:53:58] <XioNoX>	 elukey: yeah
[10:54:09] <XioNoX>	 so it loads balance traffic between the other rows
[10:54:15] <XioNoX>	 that's my guess at least
[10:54:26] <XioNoX>	 I'll let traffic to decide if we should depool it
[10:54:40] <elukey>	 ok so either we leave as it is, or we failover to the standby
[10:55:13] <vgutierrez>	 not even pybal... lvs2007 can't even ping cp2035 or 2037
[10:55:13] <XioNoX>	 yep
[10:55:23] <elukey>	 and we have a fresh vgutierrez (with legs destroyed after 100k) that will decide! :D
[10:55:32] <XioNoX>	 vgutierrez: yeah that's expected
[10:56:22] <elukey>	 (be back in a few)
[10:56:25] <XioNoX>	 vgutierrez: yeah that's expected, one of its interface has row C IP config, so it tries to route through there
[10:56:39] <XioNoX>	 I'll be not too far from my laptop all day btw
[10:56:47] <vgutierrez>	 yup, I'll be around as well
[10:57:14] <elukey>	 ok so consensus to leave things as they are, and take action only if needed?
[10:57:41] <elukey>	 if so let's write it in the task and the go back to our sundays :)
[10:58:04] <elukey>	 (I have to step away for a bit but I'll do it after if nobody does it now)
[10:59:12] <vgutierrez>	 +1
[11:04:56] <wikibugs>	 10Operations, 10ops-codfw, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10ayounsi) Netbox device and list of connected servers: https://netbox.wikimedia.org/dcim/devices/1892/
[11:12:35] <vgutierrez>	 !log depooling lvs2007, lvs2010 taking over text traffic on codfw - T267865
[11:12:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:43] <stashbot>	 T267865: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865
[11:14:52] <vgutierrez>	 lvs2010 taking over as expected: https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1&viewPanel=7&from=now-30m&to=now&refresh=5s
[11:16:03] <wikibugs>	 10Operations, 10ops-codfw, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10Vgutierrez) switching over to lvs2010 as it will allow us to recover cp2035, only losing cp2037 on text and cp2038 on upload VS losing cp2035 and cp2037 on text with lvs2007
[11:16:27] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2007 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal
[11:16:41] <vgutierrez>	 yey that's expected
[11:16:44] <vgutierrez>	 let me downtime those :)
[11:17:23] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:17:37] <icinga-wm>	 PROBLEM - pybal on lvs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[11:17:43] <icinga-wm>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:17:53] <icinga-wm>	 PROBLEM - swift codfw container availability low on alert1001 is CRITICAL: cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw
[11:21:02] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[11:21:03] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[11:21:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:03] <elukey>	 thanks all!
[11:31:30] <elukey>	 (just acked some alerts in icinga as well)
[11:32:09] <vgutierrez>	 thx <3
[11:35:11] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100%
[11:38:23] <icinga-wm>	 PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[11:42:51] <icinga-wm>	 PROBLEM - swift codfw object availability low on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe2001 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw
[11:44:05] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 14.47 ms
[11:46:43] <icinga-wm>	 RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms
[13:16:09] <icinga-wm>	 RECOVERY - Host cp2037 is UP: PING WARNING - Packet loss = 77%, RTA = 30.16 ms
[13:16:09] <icinga-wm>	 RECOVERY - Host cp2038 is UP: PING WARNING - Packet loss = 66%, RTA = 30.44 ms
[13:16:09] <icinga-wm>	 RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms
[13:16:09] <icinga-wm>	 RECOVERY - Host kafka-main2003 is UP: PING OK - Packet loss = 0%, RTA = 30.68 ms
[13:16:09] <icinga-wm>	 RECOVERY - Host elastic2049 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms
[13:16:11] <icinga-wm>	 RECOVERY - Host ms-be2054 is UP: PING OK - Packet loss = 0%, RTA = 30.59 ms
[13:16:15] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw-c-codfw is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[13:16:17] <icinga-wm>	 RECOVERY - Host elastic2048 is UP: PING WARNING - Packet loss = 33%, RTA = 30.96 ms
[13:16:19] <icinga-wm>	 RECOVERY - Host thanos-be2003 is UP: PING OK - Packet loss = 0%, RTA = 32.57 ms
[13:16:19] <icinga-wm>	 RECOVERY - Host elastic2059 is UP: PING OK - Packet loss = 0%, RTA = 32.58 ms
[13:16:25] <icinga-wm>	 RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms
[13:16:25] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[13:16:37] <icinga-wm>	 RECOVERY - Host ms-be2049 is UP: PING OK - Packet loss = 0%, RTA = 33.32 ms
[13:17:27] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002
[13:17:45] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:18:07] <icinga-wm>	 RECOVERY - swift codfw object availability low on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw
[13:18:39] <icinga-wm>	 PROBLEM - Host cp2038 is DOWN: PING CRITICAL - Packet loss = 100%
[13:18:41] <icinga-wm>	 PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100%
[13:18:57] <icinga-wm>	 PROBLEM - Host elastic2048 is DOWN: PING CRITICAL - Packet loss = 100%
[13:19:03] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001
[13:19:19] <icinga-wm>	 PROBLEM - Host ms-be2054 is DOWN: PING CRITICAL - Packet loss = 100%
[13:19:29] <icinga-wm>	 PROBLEM - Host kafka-main2003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:19:35] <icinga-wm>	 PROBLEM - Host elastic2059 is DOWN: PING CRITICAL - Packet loss = 100%
[13:19:39] <icinga-wm>	 PROBLEM - Host elastic2049 is DOWN: PING CRITICAL - Packet loss = 100%
[13:20:29] <icinga-wm>	 PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100%
[13:20:29] <icinga-wm>	 PROBLEM - Host ms-be2049 is DOWN: PING CRITICAL - Packet loss = 100%
[13:20:29] <icinga-wm>	 PROBLEM - Host thanos-be2003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:20:53] <icinga-wm>	 PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100%
[13:21:33] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:22:47] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:22:59] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[13:23:13] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:31:48] <elukey>	 ufff
[13:33:11] <icinga-wm>	 PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[13:34:09] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2002 is CRITICAL: 305 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002
[13:34:41] <icinga-wm>	 PROBLEM - Host ms-be1022 is DOWN: PING CRITICAL - Packet loss = 100%
[13:35:45] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2001 is CRITICAL: 412 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001
[14:00:08] <cdanis>	 !log powercycling ms-be1022 via mgmt
[14:00:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:47] <icinga-wm>	 RECOVERY - Host ms-be1022 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms
[14:07:20] <wikibugs>	 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10CDanis)
[14:07:46] <cdanis>	 elukey: anything else needed right now wrt: codfw c7?
[14:11:32] <elukey>	 cdanis: goood morning! I think nothing, but if you have time please double check the task to make sure that I got all.. There is a swift/thanos alert that should be ok as well 
[14:11:45] <elukey>	 the weird thing is that it keeps flapping
[14:11:52] <cdanis>	 yeah, strange
[14:11:53] <elukey>	 (the c7 switch)
[14:25:45] <icinga-wm>	 PROBLEM - HP RAID on ms-be1022 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[14:25:47] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be1022 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T267872 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform
[14:26:13] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T267872 (10ops-monitoring-bot)
[15:01:57] <icinga-wm>	 PROBLEM - swift codfw container availability low on alert1001 is CRITICAL: cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw
[15:21:43] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:25:05] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:40:32] <wikibugs>	 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Ankry) >>! In T257066#6598861, @Beeswaxcandle wrote: >>>! In T257066#6576540, @FordPrefect42 wrote: >...
[15:42:05] <icinga-wm>	 PROBLEM - swift codfw object availability low on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe2001 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw
[17:09:29] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:11:09] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:38:28] <wikibugs>	 (03CR) 10Kaartic: "Just pointing out one change to the Hindi translation that I notice." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) (owner: 10Ladsgroup)
[19:25:18] <wikibugs>	 (03PS9) 10Ladsgroup: varnish: Improve wording of the browser security error a bit [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656)
[19:26:09] <wikibugs>	 (03CR) 10Ladsgroup: varnish: Improve wording of the browser security error a bit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) (owner: 10Ladsgroup)
[19:26:31] <wikibugs>	 (03CR) 10Ladsgroup: "This is good to go now. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) (owner: 10Ladsgroup)
[20:41:21] <icinga-wm>	 RECOVERY - Host elastic2049 is UP: PING OK - Packet loss = 0%, RTA = 30.20 ms
[20:41:21] <icinga-wm>	 RECOVERY - Host elastic2059 is UP: PING OK - Packet loss = 0%, RTA = 30.17 ms
[20:41:21] <icinga-wm>	 RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 30.15 ms
[20:41:21] <icinga-wm>	 RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms
[20:41:21] <icinga-wm>	 RECOVERY - Host thanos-be2003 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms
[20:41:21] <icinga-wm>	 RECOVERY - Host ms-be2054 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms
[20:41:21] <icinga-wm>	 RECOVERY - Host kafka-main2003 is UP: PING OK - Packet loss = 0%, RTA = 30.22 ms
[20:41:22] <icinga-wm>	 RECOVERY - Host cp2037 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms
[20:41:22] <icinga-wm>	 RECOVERY - Host cp2038 is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms
[20:41:29] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[20:41:43] <icinga-wm>	 RECOVERY - Host elastic2048 is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms
[20:41:47] <icinga-wm>	 RECOVERY - Host ms-be2049 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms
[20:41:53] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:41:53] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw-c-codfw is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[20:42:59] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002
[20:43:39] <icinga-wm>	 RECOVERY - swift codfw object availability low on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw
[20:44:37] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001
[20:47:03] <icinga-wm>	 RECOVERY - swift codfw container availability low on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw
[20:50:25] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp2037 is CRITICAL: cluster=cache_text instance=cp2037 job=purged layer=frontend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037
[20:53:29] <icinga-wm>	 RECOVERY - configured eth on lvs2007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[20:55:25] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp2037 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037
[20:55:55] <icinga-wm>	 PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100%
[20:55:59] <icinga-wm>	 PROBLEM - Host elastic2049 is DOWN: PING CRITICAL - Packet loss = 100%
[20:55:59] <icinga-wm>	 PROBLEM - Host cp2038 is DOWN: PING CRITICAL - Packet loss = 100%
[20:56:07] <icinga-wm>	 PROBLEM - Host thanos-be2003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:56:07] <icinga-wm>	 PROBLEM - Host elastic2059 is DOWN: PING CRITICAL - Packet loss = 100%
[20:56:23] <icinga-wm>	 PROBLEM - Host elastic2048 is DOWN: PING CRITICAL - Packet loss = 100%
[20:56:23] <icinga-wm>	 PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100%
[20:56:25] <icinga-wm>	 PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100%
[20:56:55] <icinga-wm>	 PROBLEM - Host ms-be2049 is DOWN: PING CRITICAL - Packet loss = 100%
[20:57:33] <icinga-wm>	 PROBLEM - Host kafka-main2003 is DOWN: PING CRITICAL - Packet loss = 100%
[20:57:39] <icinga-wm>	 PROBLEM - Host ms-be2054 is DOWN: PING CRITICAL - Packet loss = 100%
[20:58:39] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[21:00:19] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:02:09] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp4022 is CRITICAL: 3.33e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4022
[21:03:09] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp4025 is CRITICAL: 3.981e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4025
[21:03:15] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp4031 is CRITICAL: 4.47e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031
[21:03:23] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp2042 is CRITICAL: 4.281e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2042
[21:03:25] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp4028 is CRITICAL: 4.297e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028
[21:03:31] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp2036 is CRITICAL: 4.531e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2036
[21:03:45] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp2039 is CRITICAL: 4.753e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2039
[21:03:49] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp2028 is CRITICAL: 4.515e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2028
[21:05:03] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp2030 is CRITICAL: 5.34e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2030
[21:13:03] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2001 is CRITICAL: 412 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001
[21:13:05] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2002 is CRITICAL: 305 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002
[22:03:38] <cdanis>	 !log T267867 T267865 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕔🍺 sudo cumin -b2 -s10 'A:cp and A:codfw' 'systemctl restart purged'
[22:03:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:47] <stashbot>	 T267865: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865
[22:03:47] <stashbot>	 T267867: purged is not resilient to kafka main nodes going down - https://phabricator.wikimedia.org/T267867
[22:05:12] <cdanis>	 argh... the switch failure affected purgeds in ulsfo too??
[22:05:51] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp2028 is OK: (C)5000 gt (W)3000 gt 766.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2028
[22:07:07] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp2030 is OK: (C)5000 gt (W)3000 gt 189.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2030
[22:08:55] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp2036 is OK: (C)5000 gt (W)3000 gt 264.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2036
[22:10:27] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp2042 is OK: (C)5000 gt (W)3000 gt 1048 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2042
[22:10:34] <cdanis>	 !log restart some purgeds in ulsfo as well T267865 T267867
[22:10:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:42] <stashbot>	 T267865: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865
[22:10:42] <stashbot>	 T267867: purged is not resilient to kafka main nodes going down - https://phabricator.wikimedia.org/T267867
[22:10:49] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp2039 is OK: (C)5000 gt (W)3000 gt 2714 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2039
[22:11:55] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp4025 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4025
[22:12:37] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp4022 is OK: (C)5000 gt (W)3000 gt 637.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4022
[22:13:41] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp4031 is OK: (C)5000 gt (W)3000 gt 2874 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031
[22:14:00] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp4028 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028
[22:20:15] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp4031 is CRITICAL: cluster=cache_text instance=cp4031 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031
[22:20:59] <icinga-wm>	 PROBLEM - Number of messages locally queued by purged for processing on cp4028 is CRITICAL: cluster=cache_text instance=cp4028 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028
[22:22:01] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp4031 is CRITICAL: 1.055e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031
[22:22:13] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp4028 is CRITICAL: 1.131e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028
[22:23:41] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 108 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:25:21] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 14 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:30:27] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp4031 is OK: (C)5000 gt (W)3000 gt 101.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031
[22:30:37] <icinga-wm>	 RECOVERY - Time elapsed since the last kafka event processed by purged on cp4028 is OK: (C)5000 gt (W)3000 gt 151.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028
[22:31:07] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp4028 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028
[22:32:01] <icinga-wm>	 RECOVERY - Number of messages locally queued by purged for processing on cp4031 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031
[22:33:39] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:47:45] <icinga-wm>	 PROBLEM - swift codfw container availability low on alert1001 is CRITICAL: cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw
[23:00:27] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:12:47] <icinga-wm>	 PROBLEM - swift codfw object availability low on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe2001 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw
[23:31:03] <icinga-wm>	 RECOVERY - Host cp2038 is UP: PING WARNING - Packet loss = 66%, RTA = 30.25 ms
[23:31:05] <icinga-wm>	 RECOVERY - Host cp2037 is UP: PING OK - Packet loss = 0%, RTA = 31.04 ms
[23:31:05] <icinga-wm>	 RECOVERY - Host kafka-main2003 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms
[23:31:07] <icinga-wm>	 RECOVERY - Host thanos-be2003 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms
[23:31:07] <icinga-wm>	 RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 30.17 ms
[23:31:07] <icinga-wm>	 RECOVERY - Host elastic2048 is UP: PING OK - Packet loss = 0%, RTA = 30.21 ms
[23:31:07] <icinga-wm>	 RECOVERY - Host elastic2049 is UP: PING OK - Packet loss = 0%, RTA = 30.23 ms
[23:31:07] <icinga-wm>	 RECOVERY - Host ms-be2049 is UP: PING OK - Packet loss = 0%, RTA = 31.51 ms
[23:31:09] <icinga-wm>	 RECOVERY - Host ms-be2054 is UP: PING OK - Packet loss = 0%, RTA = 30.97 ms
[23:31:19] <icinga-wm>	 RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms
[23:31:19] <icinga-wm>	 RECOVERY - Host elastic2059 is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms
[23:31:47] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:31:47] <icinga-wm>	 RECOVERY - Juniper virtual chassis ports on asw-c-codfw is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[23:32:17] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001
[23:32:19] <icinga-wm>	 RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002
[23:32:59] <icinga-wm>	 RECOVERY - swift codfw object availability low on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw
[23:33:01] <icinga-wm>	 RECOVERY - swift codfw container availability low on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw
[23:35:49] <icinga-wm>	 PROBLEM - Host cp2038 is DOWN: PING CRITICAL - Packet loss = 100%
[23:35:49] <icinga-wm>	 PROBLEM - Host kafka-main2003 is DOWN: PING CRITICAL - Packet loss = 100%
[23:36:03] <icinga-wm>	 PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100%
[23:36:07] <icinga-wm>	 PROBLEM - Host elastic2059 is DOWN: PING CRITICAL - Packet loss = 100%
[23:36:09] <icinga-wm>	 PROBLEM - Host elastic2049 is DOWN: PING CRITICAL - Packet loss = 100%
[23:36:25] <icinga-wm>	 PROBLEM - Host ms-be2049 is DOWN: PING CRITICAL - Packet loss = 100%
[23:37:23] <icinga-wm>	 PROBLEM - Host elastic2048 is DOWN: PING CRITICAL - Packet loss = 100%
[23:37:25] <icinga-wm>	 PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100%
[23:37:25] <icinga-wm>	 PROBLEM - Host thanos-be2003 is DOWN: PING CRITICAL - Packet loss = 100%
[23:37:41] <icinga-wm>	 PROBLEM - Host ms-be2054 is DOWN: PING CRITICAL - Packet loss = 100%
[23:37:41] <icinga-wm>	 PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100%
[23:38:29] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:38:29] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[23:42:41] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp4028 is CRITICAL: 3.866e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028
[23:42:59] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp2039 is CRITICAL: 4.322e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2039
[23:43:01] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp2028 is CRITICAL: 4.084e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2028
[23:43:07] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp4022 is CRITICAL: 4.104e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4022
[23:44:05] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp4025 is CRITICAL: 4.759e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4025
[23:44:13] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp4031 is CRITICAL: 4.642e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031
[23:44:27] <icinga-wm>	 PROBLEM - Time elapsed since the last kafka event processed by purged on cp2036 is CRITICAL: 5.332e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2036
[23:52:17] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2001 is CRITICAL: 412 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001
[23:52:19] <icinga-wm>	 PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2002 is CRITICAL: 305 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002
[23:57:19] <wikibugs>	 10Operations, 10Traffic, 10HTTPS, 10Performance-Team (Radar): TLS certificates renewal process - https://phabricator.wikimedia.org/T196248 (10Seb35) FYI I opened [[https://github.com/certbot/certbot/issues/8456|a feature request on certbot]] to propose a delay before deployment as stated here, and will soo...