[00:01:37] RECOVERY - Check systemd state on centrallog1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:04:57] PROBLEM - Check systemd state on centrallog1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:26:17] PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100% [05:26:41] PROBLEM - Host kafka-main2003 is DOWN: PING CRITICAL - Packet loss = 100% [05:26:41] PROBLEM - Host elastic2049 is DOWN: PING CRITICAL - Packet loss = 100% [05:26:51] PROBLEM - Host elastic2059 is DOWN: PING CRITICAL - Packet loss = 100% [05:26:51] PROBLEM - Host thanos-be2003 is DOWN: PING CRITICAL - Packet loss = 100% [05:26:55] PROBLEM - Host elastic2048 is DOWN: PING CRITICAL - Packet loss = 100% [05:27:11] PROBLEM - Host cp2038 is DOWN: PING CRITICAL - Packet loss = 100% [05:27:55] PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100% [05:28:01] PROBLEM - Host ms-be2054 is DOWN: PING CRITICAL - Packet loss = 100% [05:28:05] PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100% [05:28:26] ? [05:28:27] PROBLEM - Host ms-be2049 is DOWN: PING CRITICAL - Packet loss = 100% [05:30:01] RECOVERY - Host elastic2048 is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms [05:30:01] RECOVERY - Host ms-be2049 is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms [05:30:01] RECOVERY - Host elastic2059 is UP: PING OK - Packet loss = 0%, RTA = 30.28 ms [05:30:01] RECOVERY - Host elastic2049 is UP: PING OK - Packet loss = 0%, RTA = 30.29 ms [05:30:03] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:30:03] RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 30.14 ms [05:30:03] RECOVERY - Host cp2038 is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms [05:30:03] RECOVERY - Host cp2037 is UP: PING OK - Packet loss = 0%, RTA = 30.20 ms [05:30:03] RECOVERY - Host thanos-be2003 is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms [05:30:03] RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms [05:30:05] RECOVERY - Host kafka-main2003 is UP: PING OK - Packet loss = 0%, RTA = 30.17 ms [05:30:05] RECOVERY - Host ms-be2054 is UP: PING OK - Packet loss = 0%, RTA = 30.17 ms [05:30:15] PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp2037 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection timed out https://wikitech.wikimedia.org/wiki/HTTPS [05:31:39] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:37:51] PROBLEM - Host kafka-main2003 is DOWN: PING CRITICAL - Packet loss = 100% [05:38:05] PROBLEM - Host elastic2048 is DOWN: PING CRITICAL - Packet loss = 100% [05:38:05] PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100% [05:38:05] PROBLEM - Host elastic2049 is DOWN: PING CRITICAL - Packet loss = 100% [05:38:17] PROBLEM - Host thanos-be2003 is DOWN: PING CRITICAL - Packet loss = 100% [05:38:17] PROBLEM - Host elastic2059 is DOWN: PING CRITICAL - Packet loss = 100% [05:38:21] PROBLEM - Host cp2038 is DOWN: PING CRITICAL - Packet loss = 100% [05:39:21] PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL 
- Packet loss = 100% [05:39:25] PROBLEM - Host ms-be2054 is DOWN: PING CRITICAL - Packet loss = 100% [05:39:33] PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100% [05:39:37] PROBLEM - Host ms-be2049 is DOWN: PING CRITICAL - Packet loss = 100% [05:41:17] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:41:23] PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [05:42:51] RECOVERY - Host thanos-be2003 is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms [05:42:51] RECOVERY - Host elastic2049 is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms [05:42:51] RECOVERY - Host cp2037 is UP: PING OK - Packet loss = 0%, RTA = 33.40 ms [05:42:51] RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 30.20 ms [05:42:51] RECOVERY - Host elastic2059 is UP: PING OK - Packet loss = 0%, RTA = 30.22 ms [05:42:51] RECOVERY - Host ms-be2049 is UP: PING OK - Packet loss = 0%, RTA = 30.14 ms [05:42:53] RECOVERY - Host cp2038 is UP: PING OK - Packet loss = 0%, RTA = 30.20 ms [05:42:53] RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms [05:42:55] RECOVERY - Host elastic2048 is UP: PING OK - Packet loss = 0%, RTA = 31.07 ms [05:42:55] RECOVERY - Host kafka-main2003 is UP: PING OK - Packet loss = 0%, RTA = 31.82 ms [05:42:57] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:42:57] RECOVERY - Host ms-be2054 is UP: PING OK - Packet loss = 0%, RTA = 31.78 ms [05:43:05] RECOVERY - Juniper virtual chassis ports on asw-c-codfw is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [05:52:48] 10Operations, 10ops-codfw, 10cloud-services-team (Hardware): Network flap on cloudbackup2002 - https://phabricator.wikimedia.org/T267864 (10Bstorm) [06:28:57] PROBLEM - ores on ores2003 is CRITICAL: connect to address 10.192.16.63 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:29:59] PROBLEM - ores on ores2005 is CRITICAL: connect to address 10.192.32.173 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:40:25] RECOVERY - ores on ores2003 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:41:29] RECOVERY - ores on ores2005 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [07:44:42] ing] [07:44:42] 14:37 @ elukey: yea it's an rsync in the stats.pp file within dumps (IIRC) [07:44:49] ahhaah [07:45:14] no idea why I pasted that sorry, weird combination of keys [07:46:15] so rack C7 blipped for a bit in codfw [07:59:09] fpc7 seems to be online since 2h ago.. opening a task [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201115T0800) [08:09:21] 10Operations, 10netops: Switch on rack C7 in codfw got rebooted - https://phabricator.wikimedia.org/T267865 (10elukey) [08:18:03] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [08:23:23] RECOVERY - Disk space on Hadoop worker on an-worker1098 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [08:24:16] !log sudo truncate -s 10g /var/lib/hadoop/data/c/yarn/logs/application_1601916545561_173219/container_e25_1601916545561_173219_01_000019/stderr on an-worker1098 [08:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:03] 10Operations, 10ops-codfw, 10cloud-services-team (Hardware): Network flap on cloudbackup2002 - https://phabricator.wikimedia.org/T267864 (10Peachey88) I suspect this may be {T267865} [08:27:20] !log truncate -s 10g /var/lib/hadoop/data/n/yarn/logs/application_1601916545561_173219/container_e25_1601916545561_173219_01_000177/stderr on an-worker1100 [08:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:21] RECOVERY - Disk space on Hadoop worker on an-worker1100 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [08:29:53] 10Operations, 10netops: Switch on rack C7 in codfw got rebooted - https://phabricator.wikimedia.org/T267865 (10elukey) [08:30:07] 10Operations, 10ops-codfw, 10cloud-services-team (Hardware): Network flap on cloudbackup2002 - https://phabricator.wikimedia.org/T267864 (10elukey) >>! In T267864#6622450, @Peachey88 wrote: > I suspect this may be {T267865} Good point @Peachey88, cloudbackup2002 is indeed in rack C7! @Bstorm closing this... 
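The two Hadoop worker disk-space recoveries above (an-worker1098 and an-worker1100) came from manually truncating runaway YARN container stderr files. A minimal sketch of that kind of cleanup, assuming the /var/lib/hadoop/data/*/yarn/logs layout seen in the !log entries; the +20G search threshold and the inclusion of stdout are illustrative additions, not the exact commands that were run, while the 10g target size matches what was logged:
  # Sweep the worker's data mounts for container logs that have grown huge and shrink
  # them in place; the writing container keeps its open file handle, so nothing restarts.
  sudo find /var/lib/hadoop/data/*/yarn/logs -type f \( -name stderr -o -name stdout \) -size +20G 2>/dev/null |
  while read -r f; do
      echo "truncating $f"
      sudo truncate -s 10g "$f"
  done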
[08:30:17] 10Operations, 10ops-codfw, 10cloud-services-team (Hardware): Network flap on cloudbackup2002 - https://phabricator.wikimedia.org/T267864 (10elukey) [08:44:35] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3064 is OK: HTTP OK: HTTP/1.0 200 OK - 23482 bytes in 0.255 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [09:07:06] !log Change email for SUL user Botopol via resetUserEmail.php (T267866) [09:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:13] T267866: User:Botopol has forgotten their password, need a reset via CLI - https://phabricator.wikimedia.org/T267866 [09:27:23] PROBLEM - Host kafka-main2003 is DOWN: PING CRITICAL - Packet loss = 100% [09:27:23] PROBLEM - Host ms-be2054 is DOWN: PING CRITICAL - Packet loss = 100% [09:27:35] PROBLEM - Host cp2038 is DOWN: PING CRITICAL - Packet loss = 100% [09:27:45] PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100% [09:27:45] PROBLEM - Host ms-be2049 is DOWN: PING CRITICAL - Packet loss = 100% [09:27:51] PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100% [09:28:03] PROBLEM - Host elastic2059 is DOWN: PING CRITICAL - Packet loss = 100% [09:28:15] PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100% [09:28:35] PROBLEM - Host elastic2048 is DOWN: PING CRITICAL - Packet loss = 100% [09:28:59] PROBLEM - Host thanos-be2003 is DOWN: PING CRITICAL - Packet loss = 100% [09:29:03] PROBLEM - Host elastic2049 is DOWN: PING CRITICAL - Packet loss = 100% [09:29:40] ouch this is c7 again [09:31:23] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:31:33] PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [09:33:34] ok this time the router is down down [09:34:05] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4022 is CRITICAL: 3.677e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4022 [09:34:21] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2028 is CRITICAL: 4.263e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2028 [09:35:13] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4031 is CRITICAL: 4.209e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [09:35:13] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4025 is CRITICAL: 4.328e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4025 [09:35:15] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4028 is CRITICAL: 4.643e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [09:35:33] PROBLEM - Time elapsed since the last kafka event processed by 
purged on cp2042 is CRITICAL: 4.625e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2042 [09:35:39] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2036 is CRITICAL: 4.9e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2036 [09:35:51] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2039 is CRITICAL: 5.097e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2039 [09:36:58] this is weird, is it due to kafka-main2003 down? [09:38:55] PROBLEM - configured eth on lvs2007 is CRITICAL: ens3f0np0 reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [09:39:44] yep purged[30904]: %4|1605433157.790|REQTMOUT|purged#consumer-1| [thrd:ssl://kafka-main2003.codfw.wmnet:909 [09:39:52] vgutierrez, ema around? [09:40:08] I think it is a matter of restarting purged [09:40:36] I can be online in 30m [09:40:57] vgutierrez: ok if I restart purged on a node? [09:41:07] Go ahead please [09:41:10] super :) [09:42:01] !log restart purged on cp2028 (kafka-main2003 is down and there are connect timeouts errors) [09:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:26] takes ages to restart [09:43:33] ok seems to work, waiting a bit and then doing it on other nodes [09:43:37] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2001 is CRITICAL: 412 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [09:43:37] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2002 is CRITICAL: 305 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [09:43:43] PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [09:44:02] ah really? We hit an ES master? [09:44:04] ahahaha [09:44:17] gehel, dcausse - around by any chance? [09:44:23] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2028 is OK: (C)5000 gt (W)3000 gt 184.2 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2028 [09:44:29] this is good --^ [09:47:17] 10Operations, 10netops: Switch on rack C7 in codfw got rebooted - https://phabricator.wikimedia.org/T267865 (10elukey) Went down again, but this time no recovery.. [09:47:32] XioNoX: around? [09:47:40] elukey: I am [09:47:46] gehel: hello! [09:47:56] So rack C7's switch decided to take holidays [09:48:15] in codfw [09:48:41] and it brought with it elastic2048 and 2049/2059 [09:49:06] Booting the PC [09:49:39] ack thanks :) [09:49:49] Losing a master should not be an issue short term.
And we should have more than enough capacity in codfw [09:49:58] perfect [09:50:06] For elastic at least [09:50:54] !log restart purged on cp4022 (consumer stuck due to kafka-main2003 down) [09:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:24] response times look OK on codfw search: https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&from=now-6h&to=now [09:52:29] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4022 is OK: (C)5000 gt (W)3000 gt 1304 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4022 [09:53:12] !log restart purged on cp4031 (consumer stuck due to kafka-main2003 down) [09:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:38] not looking too bad on the elasticsearch side, the cluster is recovering the lost shards [09:54:04] it should be back to green in a few, but even now it should be serving requests just fine [09:54:25] * gehel goes back to taking care of the kids, scream if you need me [09:55:02] thanks gehel ! [09:55:44] !log restart purged on cp4025 (consumer stuck due to kafka-main2003 down) [09:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:51] !log restart purged on cp4028 (consumer stuck due to kafka-main2003 down) [09:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:39] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4025 is OK: (C)5000 gt (W)3000 gt 201 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4025 [09:58:39] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4031 is OK: (C)5000 gt (W)3000 gt 153.4 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [10:00:47] !log cumin 'cp2042* or cp2036* or cp2039*' 'systemctl restart purged' -b 1 [10:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:15] all right cp nodes should be taken care, and we have also ES recovering [10:02:25] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2036 is OK: (C)5000 gt (W)3000 gt 621.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2036 [10:02:39] lvs2007 looks "staged" in netbox, and it is in row A [10:03:45] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4028 is OK: (C)5000 gt (W)3000 gt 287.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [10:04:19] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2039 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2039 [10:07:23] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2042 is OK: (C)5000 gt (W)3000 gt 329.5 https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2042 [10:11:46] ok so for lvs2007, I guess it is the link from the host to row C down? (possibly ending up on rack c7's switch) [10:12:18] elukey@asw-c-codfw> show interfaces descriptions | match lvs2007 [10:12:18] xe-7/0/45 lvs2007:enp175s0f0 [10:12:19] yep [10:12:28] okok makes sense [10:13:11] nothing is really exploding so I'd wait for vgutierrez to double check, and then possibly alert netops via sms (not a call) [10:15:07] 10Operations, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10elukey) [10:15:25] 10Operations, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10elukey) p:05Triage→03High [10:18:29] 10Operations, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10elukey) Current impact: * purged on some cp2/cp4 nodes got stuck while connecting to kafka-main2003, a manual restart was needed. * the kafka-main cluster is currently in reduced capacity (2 nodes instead... [10:20:11] brb [10:29:00] elukey: yo [10:30:00] XioNoX: hey! [10:30:17] elukey: did everything fail cleanly? [10:30:57] is it in a reboot loop or fully down? [10:30:57] XioNoX: afaics I think so, but if you could double check it would be great :) [10:31:14] XioNoX: rebooted a couple of times first, then it recovered and then it went down [10:31:25] fun times [10:32:44] 10Operations, 10Traffic: purged is not resilient to kafka main nodes going down - https://phabricator.wikimedia.org/T267867 (10elukey) [10:33:10] console is not responsive [10:33:47] 10Operations, 10Traffic: purged is not resilient to kafka main nodes going down - https://phabricator.wikimedia.org/T267867 (10elukey) [10:34:56] created the task for purged --^ [10:35:48] opening a Juniper ticket [10:41:06] 10Operations, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10ayounsi) Down around Nov 15 09:28:34 UTC. Console is unresponsive. Opening JTAC case for RMA. [10:45:04] XioNoX: the only thing that I have never seen is one lvs interface down (in this case, lvs2007) [10:45:39] elukey: are health checks going through that interface? [10:45:41] I also haven't checked if 2007 is the active or the backup [10:46:13] looks active https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?viewPanel=7&orgId=1 [10:46:14] I have no idea, I've only checked that it is the interface that connects to row-c [10:47:11] ah wow it is also a spine [10:47:17] lovely [10:47:52] anyway, the only thing that I'd check now is lvs, the rest looks fine [10:47:53] we're at codfw nadir, so that's nice [10:48:17] TIL nadir [10:48:18] elukey: can we see the state of the LVS2007 healthchecks somewhere? [10:48:45] if everything is fine they should be failing for row D appservers [10:48:50] er, row C [10:50:17] 10Operations, 10ops-codfw, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10ayounsi) We also have spares QFX5100, so on monday we can swap the dead one. [10:50:45] XioNoX: there is /var/log/pybal.log on lvs2007 [10:51:13] back online..
getting to the laptop after 100k on the bicycle is kinda painful [10:51:48] vgutierrez: hahaha I am so sorry, every time I ping you in the least preferable moment :( [10:51:52] elukey: https://grafana.wikimedia.org/d/000000421/pybal?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-server=lvs2007&var-service=All&from=now-6h&to=now [10:52:02] so looks like pooled servers went down [10:52:12] elukey: don't worry, that can be handled with some beer on the next IRL offsite [10:52:34] XioNoX: yes so cp2035.codfw.wmnet for example is in C2 and pybal shows failures [10:52:46] we can see the two reboots and the final down [10:53:22] XioNoX: mmm but in theory lvs2007 is row-c blind no? (trying to understand) [10:53:58] elukey: yeah [10:54:09] so it load balances traffic between the other rows [10:54:15] that's my guess at least [10:54:26] I'll let traffic decide if we should depool it [10:54:40] ok so either we leave as it is, or we failover to the standby [10:55:13] not even pybal... lvs2007 can't even ping cp2035 or 2037 [10:55:13] yep [10:55:23] and we have a fresh vgutierrez (with legs destroyed after 100k) that will decide! :D [10:55:32] vgutierrez: yeah that's expected [10:56:22] (be back in a few) [10:56:25] vgutierrez: yeah that's expected, one of its interfaces has row C IP config, so it tries to route through there [10:56:39] I'll be not too far from my laptop all day btw [10:56:47] yup, I'll be around as well [10:57:14] ok so consensus to leave things as they are, and take action only if needed? [10:57:41] if so let's write it in the task and then go back to our Sundays :) [10:58:04] (I have to step away for a bit but I'll do it after if nobody does it now) [10:59:12] +1 [11:04:56] 10Operations, 10ops-codfw, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10ayounsi) Netbox device and list of connected servers: https://netbox.wikimedia.org/dcim/devices/1892/ [11:12:35] !log depooling lvs2007, lvs2010 taking over text traffic on codfw - T267865 [11:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:43] T267865: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 [11:14:52] lvs2010 taking over as expected: https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1&viewPanel=7&from=now-30m&to=now&refresh=5s [11:16:03] 10Operations, 10ops-codfw, 10netops: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 (10Vgutierrez) switching over to lvs2010 as it will allow us to recover cp2035, only losing cp2037 on text and cp2038 on upload VS losing cp2035 and cp2037 on text with lvs2007 [11:16:27] PROBLEM - PyBal backends health check on lvs2007 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [11:16:41] yey that's expected [11:16:44] let me downtime those :) [11:17:23] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:17:37] PROBLEM - pybal on lvs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [11:17:43] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:17:53] PROBLEM - swift codfw container availability low
on alert1001 is CRITICAL: cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [11:21:02] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:21:03] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:03] thanks all! [11:31:30] (just acked some alerts in icinga as well) [11:32:09] thx <3 [11:35:11] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [11:38:23] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [11:42:51] PROBLEM - swift codfw object availability low on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe2001 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [11:44:05] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 14.47 ms [11:46:43] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [13:16:09] RECOVERY - Host cp2037 is UP: PING WARNING - Packet loss = 77%, RTA = 30.16 ms [13:16:09] RECOVERY - Host cp2038 is UP: PING WARNING - Packet loss = 66%, RTA = 30.44 ms [13:16:09] RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms [13:16:09] RECOVERY - Host kafka-main2003 is UP: PING OK - Packet loss = 0%, RTA = 30.68 ms [13:16:09] RECOVERY - Host elastic2049 is UP: PING OK - Packet loss = 0%, RTA = 30.34 ms [13:16:11] RECOVERY - Host ms-be2054 is UP: PING OK - Packet loss = 0%, RTA = 30.59 ms [13:16:15] RECOVERY - Juniper virtual chassis ports on asw-c-codfw is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [13:16:17] RECOVERY - Host elastic2048 is UP: PING WARNING - Packet loss = 33%, RTA = 30.96 ms [13:16:19] RECOVERY - Host thanos-be2003 is UP: PING OK - Packet loss = 0%, RTA = 32.57 ms [13:16:19] RECOVERY - Host elastic2059 is UP: PING OK - Packet loss = 0%, RTA = 32.58 ms [13:16:25] RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms [13:16:25] RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [13:16:37] RECOVERY - Host ms-be2049 is UP: PING OK - Packet loss = 0%, RTA = 33.32 ms [13:17:27] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [13:17:45] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:18:07] RECOVERY - swift codfw object availability low on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [13:18:39] PROBLEM - Host cp2038 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:41] PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100% [13:18:57] PROBLEM - Host elastic2048 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:03] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [13:19:19] PROBLEM - Host ms-be2054 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:29] PROBLEM - Host kafka-main2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:35] PROBLEM - Host elastic2059 is DOWN: PING CRITICAL - Packet loss = 100% [13:19:39] PROBLEM - Host elastic2049 is DOWN: PING CRITICAL - Packet loss = 100% [13:20:29] PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100% [13:20:29] PROBLEM - Host ms-be2049 is DOWN: PING CRITICAL - Packet loss = 100% [13:20:29] PROBLEM - Host thanos-be2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:20:53] PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100% [13:21:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:22:47] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:22:59] PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [13:23:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:31:48] ufff [13:33:11] PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [13:34:09] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2002 is CRITICAL: 305 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [13:34:41] PROBLEM - Host ms-be1022 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:45] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2001 is CRITICAL: 412 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [14:00:08] !log powercycling ms-be1022 via mgmt [14:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:47] RECOVERY - Host ms-be1022 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [14:07:20] 10Operations, 10ops-eqiad, 10SRE-swift-storage: ms-be1022 smart storage battery failure; disk sdb possibly bad - https://phabricator.wikimedia.org/T267870 (10CDanis) [14:07:46] elukey: anything else needed right now wrt: codfw c7? [14:11:32] cdanis: goood morning! I think nothing, but if you have time please double check the task to make sure that I got all.. There is a swift/thanos alert that should be ok as well [14:11:45] the weird thing is that it keeps flapping [14:11:52] yeah, strange [14:11:53] (the c7 switch) [14:25:45] PROBLEM - HP RAID on ms-be1022 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:25:47] ACKNOWLEDGEMENT - HP RAID on ms-be1022 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T267872 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Inform [14:26:13] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T267872 (10ops-monitoring-bot) [15:01:57] PROBLEM - swift codfw container availability low on alert1001 is CRITICAL: cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [15:21:43] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:25:05] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:40:32] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is 
disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Ankry) >>! In T257066#6598861, @Beeswaxcandle wrote: >>>! In T257066#6576540, @FordPrefect42 wrote: >... [15:42:05] PROBLEM - swift codfw object availability low on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe2001 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [17:09:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:11:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:38:28] (03CR) 10Kaartic: "Just pointing out one change to the Hindi translation that I notice." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) (owner: 10Ladsgroup) [19:25:18] (03PS9) 10Ladsgroup: varnish: Improve wording of the browser security error a bit [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) [19:26:09] (03CR) 10Ladsgroup: varnish: Improve wording of the browser security error a bit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) (owner: 10Ladsgroup) [19:26:31] (03CR) 10Ladsgroup: "This is good to go now. Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/637850 (https://phabricator.wikimedia.org/T241656) (owner: 10Ladsgroup) [20:41:21] RECOVERY - Host elastic2049 is UP: PING OK - Packet loss = 0%, RTA = 30.20 ms [20:41:21] RECOVERY - Host elastic2059 is UP: PING OK - Packet loss = 0%, RTA = 30.17 ms [20:41:21] RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 30.15 ms [20:41:21] RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms [20:41:21] RECOVERY - Host thanos-be2003 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms [20:41:21] RECOVERY - Host ms-be2054 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms [20:41:21] RECOVERY - Host kafka-main2003 is UP: PING OK - Packet loss = 0%, RTA = 30.22 ms [20:41:22] RECOVERY - Host cp2037 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms [20:41:22] RECOVERY - Host cp2038 is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms [20:41:29] RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [20:41:43] RECOVERY - Host elastic2048 is UP: PING OK - Packet loss = 0%, RTA = 33.42 ms [20:41:47] RECOVERY - Host ms-be2049 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms [20:41:53] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:41:53] RECOVERY - Juniper virtual chassis ports on asw-c-codfw is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [20:42:59] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [20:43:39] RECOVERY - swift codfw object availability low on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [20:44:37] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [20:47:03] RECOVERY - swift codfw container availability low on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [20:50:25] PROBLEM - Number of messages locally queued by purged for processing on cp2037 is CRITICAL: cluster=cache_text instance=cp2037 job=purged layer=frontend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037 [20:53:29] RECOVERY - configured eth on lvs2007 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [20:55:25] RECOVERY - Number of messages locally queued by purged for processing on cp2037 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037 [20:55:55] PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100% [20:55:59] PROBLEM - Host elastic2049 is DOWN: PING CRITICAL - Packet loss = 100% [20:55:59] PROBLEM - Host cp2038 is DOWN: PING CRITICAL - Packet loss = 100% [20:56:07] PROBLEM - Host thanos-be2003 is DOWN: PING CRITICAL - Packet loss = 100% [20:56:07] PROBLEM - Host elastic2059 is DOWN: PING CRITICAL - Packet loss = 100% [20:56:23] PROBLEM - Host elastic2048 is DOWN: PING CRITICAL - Packet loss = 100% [20:56:23] PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100% [20:56:25] PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100% [20:56:55] PROBLEM - Host ms-be2049 is DOWN: PING CRITICAL - Packet loss = 100% [20:57:33] PROBLEM - Host kafka-main2003 is DOWN: PING CRITICAL - Packet loss = 100% [20:57:39] PROBLEM - Host ms-be2054 is DOWN: PING CRITICAL - Packet loss = 100% [20:58:39] PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [21:00:19] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:02:09] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4022 is CRITICAL: 3.33e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4022 [21:03:09] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4025 is CRITICAL: 3.981e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4025 [21:03:15] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4031 is CRITICAL: 4.47e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [21:03:23] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2042 is CRITICAL: 4.281e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2042 [21:03:25] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4028 is CRITICAL: 4.297e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [21:03:31] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2036 is CRITICAL: 4.531e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2036 [21:03:45] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2039 is CRITICAL: 4.753e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2039 [21:03:49] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2028 is CRITICAL: 4.515e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2028 [21:05:03] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2030 is CRITICAL: 5.34e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2030 [21:13:03] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2001 is CRITICAL: 412 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [21:13:05] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2002 is CRITICAL: 305 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [22:03:38] !log T267867 T267865 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕔🍺 sudo cumin -b2 -s10 'A:cp and A:codfw' 'systemctl restart purged' [22:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:47] T267865: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 [22:03:47] T267867: purged is not resilient to kafka main nodes going down - https://phabricator.wikimedia.org/T267867 [22:05:12] argh... the switch failure affected purgeds in ulsfo too?? 
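As T267867 notes, purged does not recover on its own once its consumer wedges against an unreachable kafka-main broker, so each flap of the C7 switch above ends with a rolling restart from a cumin host. A sketch of the pattern used here: the cumin invocations are the ones elukey and cdanis logged, while the journalctl check is an assumption based on the REQTMOUT line pasted earlier (it should show librdkafka request timeouts in the purged unit's journal on an affected host):
  # Confirm a cp host is actually stuck: look for request timeouts against the dead broker.
  sudo journalctl -u purged --since "1 hour ago" | grep REQTMOUT

  # Roll through the codfw cache hosts two at a time with a 10s pause between batches,
  # so only a couple of purged consumers are ever restarting at once.
  sudo cumin -b2 -s10 'A:cp and A:codfw' 'systemctl restart purged'

  # Narrower variant for a handful of known-stuck hosts, one at a time.
  sudo cumin 'cp2042* or cp2036* or cp2039*' 'systemctl restart purged' -b 1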
[22:05:51] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2028 is OK: (C)5000 gt (W)3000 gt 766.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2028 [22:07:07] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2030 is OK: (C)5000 gt (W)3000 gt 189.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2030 [22:08:55] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2036 is OK: (C)5000 gt (W)3000 gt 264.1 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2036 [22:10:27] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2042 is OK: (C)5000 gt (W)3000 gt 1048 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2042 [22:10:34] !log restart some purgeds in ulsfo as well T267865 T267867 [22:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:42] T267865: Switch on rack C7 in codfw is down - https://phabricator.wikimedia.org/T267865 [22:10:42] T267867: purged is not resilient to kafka main nodes going down - https://phabricator.wikimedia.org/T267867 [22:10:49] RECOVERY - Time elapsed since the last kafka event processed by purged on cp2039 is OK: (C)5000 gt (W)3000 gt 2714 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2039 [22:11:55] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4025 is OK: (C)5000 gt (W)3000 gt 0 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4025 [22:12:37] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4022 is OK: (C)5000 gt (W)3000 gt 637.7 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4022 [22:13:41] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4031 is OK: (C)5000 gt (W)3000 gt 2874 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [22:14:00] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4028 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [22:20:15] PROBLEM - Number of messages locally queued by purged for processing on cp4031 is CRITICAL: cluster=cache_text instance=cp4031 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [22:20:59] PROBLEM - Number of messages locally queued by purged for processing on cp4028 is CRITICAL: cluster=cache_text instance=cp4028 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [22:22:01] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4031 is CRITICAL: 1.055e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [22:22:13] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4028 is CRITICAL: 1.131e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [22:23:41] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 108 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:25:21] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 14 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:30:27] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4031 is OK: (C)5000 gt (W)3000 gt 101.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [22:30:37] RECOVERY - Time elapsed since the last kafka event processed by purged on cp4028 is OK: (C)5000 gt (W)3000 gt 151.8 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [22:31:07] RECOVERY - Number of messages locally queued by purged for processing on cp4028 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [22:32:01] RECOVERY - Number of messages locally queued by purged for processing on cp4031 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [22:33:39] PROBLEM - Check systemd state on ms-be1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:47:45] PROBLEM - swift codfw container availability low on alert1001 is CRITICAL: cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [23:00:27] RECOVERY - Check systemd state on ms-be1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:12:47] PROBLEM - swift codfw object availability low on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe2001 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [23:31:03] RECOVERY - Host cp2038 is UP: PING WARNING - Packet loss = 66%, RTA = 30.25 ms [23:31:05] RECOVERY - Host cp2037 is UP: PING OK - Packet loss = 0%, RTA = 31.04 ms [23:31:05] RECOVERY - Host kafka-main2003 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms [23:31:07] RECOVERY - Host thanos-be2003 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms [23:31:07] RECOVERY - Host ms-be2036 is UP: PING OK - Packet loss = 0%, RTA = 30.17 ms [23:31:07] RECOVERY - Host elastic2048 is UP: PING OK - Packet loss = 0%, RTA = 30.21 ms [23:31:07] RECOVERY - Host elastic2049 is UP: PING OK - Packet loss = 0%, RTA = 30.23 ms [23:31:07] RECOVERY - Host ms-be2049 is UP: PING OK - Packet loss = 0%, RTA = 31.51 ms [23:31:09] RECOVERY - Host ms-be2054 is UP: PING OK - Packet loss = 0%, RTA = 30.97 ms [23:31:19] RECOVERY - Host ms-be2021 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms [23:31:19] RECOVERY - Host elastic2059 is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [23:31:47] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:31:47] RECOVERY - Juniper virtual chassis ports on asw-c-codfw is OK: OK: UP: 20 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [23:32:17] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [23:32:19] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [23:32:59] RECOVERY - swift codfw object availability low on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [23:33:01] RECOVERY - swift codfw container availability low on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=codfw [23:35:49] PROBLEM - Host cp2038 is DOWN: PING CRITICAL - Packet loss = 100% [23:35:49] PROBLEM - Host kafka-main2003 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:03] PROBLEM - Host cp2037 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:07] PROBLEM - Host elastic2059 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:09] PROBLEM - Host elastic2049 is DOWN: PING CRITICAL - Packet loss = 100% [23:36:25] PROBLEM - Host ms-be2049 is DOWN: PING CRITICAL - Packet loss = 100% [23:37:23] PROBLEM - Host elastic2048 is DOWN: PING CRITICAL - Packet loss = 100% [23:37:25] PROBLEM - Host ms-be2021 is DOWN: PING CRITICAL - Packet loss = 100% [23:37:25] PROBLEM - Host thanos-be2003 is DOWN: PING CRITICAL - Packet loss = 100% [23:37:41] PROBLEM - Host ms-be2054 is DOWN: PING CRITICAL - Packet loss = 100% [23:37:41] PROBLEM - Host ms-be2036 is DOWN: PING CRITICAL - Packet loss = 100% [23:38:29] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 127, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:38:29] PROBLEM - Juniper virtual chassis ports on asw-c-codfw is CRITICAL: CRIT: Down: 5 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [23:42:41] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4028 is CRITICAL: 3.866e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [23:42:59] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2039 is CRITICAL: 4.322e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2039 [23:43:01] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2028 is CRITICAL: 4.084e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2028 [23:43:07] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4022 is CRITICAL: 4.104e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4022 [23:44:05] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4025 is CRITICAL: 4.759e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4025 [23:44:13] PROBLEM - Time elapsed since the last kafka event processed by purged on cp4031 is CRITICAL: 4.642e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [23:44:27] PROBLEM - Time elapsed since the last kafka event processed by purged on cp2036 is CRITICAL: 5.332e+05 gt 5000 https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2036 [23:52:17] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2001 is CRITICAL: 412 ge 10 
https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [23:52:19] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2002 is CRITICAL: 305 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2002 [23:57:19] 10Operations, 10Traffic, 10HTTPS, 10Performance-Team (Radar): TLS certificates renewal process - https://phabricator.wikimedia.org/T196248 (10Seb35) FYI I opened [[https://github.com/certbot/certbot/issues/8456|a feature request on certbot]] to propose a delay before deployment as stated here, and will soo...
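The "Kafka Broker Under Replicated Partitions" alerts that fire on kafka-main2001/2002 each time kafka-main2003 drops off the network (412 and 305 partitions above) can be cross-checked from a surviving broker. A minimal sketch, assuming the stock kafka-topics.sh tool is available there and the standard plaintext listener port 9092; older Kafka releases take --zookeeper instead of --bootstrap-server:
  # List partitions whose ISR is currently short a replica; the count should fall back
  # to zero once kafka-main2003 rejoins, matching the RECOVERY lines in the log.
  kafka-topics.sh --bootstrap-server kafka-main2001.codfw.wmnet:9092 \
      --describe --under-replicated-partitions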