[00:04:05] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:07] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:35] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:41] PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:17] PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:50:13] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:52:13] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:58:03] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:08:01] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:32:01] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200823T0700)
[09:20:01] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001 job=burrow partition={2,3} prometheus=ops site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-
[09:20:01] ll&var-consumer_group=All
[09:38:49] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 54 probes of 564 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:41:51] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[09:44:51] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 562 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:18:35] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:20:33] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:23:38] !log repool wdqs1006 - catched up on lag
[11:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:47] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:20:08] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 36675912 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:22:07] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 15496 and 16 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:58:52] (PS1) Jayprakash12345: Added import sources for mlwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/621741
[13:21:59] (PS2) Jayprakash12345: Added import sources for mlwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/621741 (https://phabricator.wikimedia.org/T260716)
[13:40:50] (PS3) Jayprakash12345: Added import sources for mlwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/621741 (https://phabricator.wikimedia.org/T260716)
[13:42:52] (CR) Jayprakash12345: "This change is ready for review."
[mediawiki-config] - https://gerrit.wikimedia.org/r/621741 (https://phabricator.wikimedia.org/T260716) (owner: Jayprakash12345)
[14:25:13] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[14:54:45] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[14:55:23] PROBLEM - Disk space on Hadoop worker on an-worker1094 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[15:16:47] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:22:48] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 47 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:46:13] PROBLEM - Disk space on Hadoop worker on analytics1077 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[16:16:35] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:22:33] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 49 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:32:41] PROBLEM - Disk space on Hadoop worker on an-worker1078 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:47:31] Operations, TechCom-RFC, Traffic, MobileFrontend (Tracking), Readers-Web-Backlog (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Mobiledesktop) Have been somewhat expecting this to be fixed for the bett...
[19:14:13] PROBLEM - Disk space on Hadoop worker on an-worker1095 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 16 GB (0% inode=99%): /var/lib/hadoop/data/l 28 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[19:26:49] (PS1) Cmjohnson: Adding production dns for dbprov1003 [dns] - https://gerrit.wikimedia.org/r/621959 (https://phabricator.wikimedia.org/T258750)
[19:28:03] PROBLEM - Disk space on Hadoop worker on an-worker1095 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 17 GB (0% inode=99%): /var/lib/hadoop/data/l 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[19:31:04] (CR) Cmjohnson: [C: +2] Adding production dns for dbprov1003 [dns] - https://gerrit.wikimedia.org/r/621959 (https://phabricator.wikimedia.org/T258750) (owner: Cmjohnson)
[19:32:01] RECOVERY - Disk space on Hadoop worker on an-worker1095 is OK: DISK OK
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[19:36:24] (PS1) Cmjohnson: Adding dbprov1003 mac address to dhcp file [puppet] - https://gerrit.wikimedia.org/r/621960 (https://phabricator.wikimedia.org/T258750)
[19:37:14] (CR) Cmjohnson: [C: +2] Adding dbprov1003 mac address to dhcp file [puppet] - https://gerrit.wikimedia.org/r/621960 (https://phabricator.wikimedia.org/T258750) (owner: Cmjohnson)
[19:39:33] (PS1) Cmjohnson: Updating site.pp file for dbprov1003 switched it to role:insetup to be safe [puppet] - https://gerrit.wikimedia.org/r/621961 (https://phabricator.wikimedia.org/T258750)
[19:40:20] (CR) Cmjohnson: [C: +2] Updating site.pp file for dbprov1003 switched it to role:insetup to be safe [puppet] - https://gerrit.wikimedia.org/r/621961 (https://phabricator.wikimedia.org/T258750) (owner: Cmjohnson)
[19:41:21] PROBLEM - Disk space on Hadoop worker on an-worker1094 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 14 GB (0% inode=99%): /var/lib/hadoop/data/j 25 GB (0% inode=99%): /var/lib/hadoop/data/i 22 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[19:44:38] Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` dbprov1003.eqiad.wmnet ` The lo...
[19:45:47] PROBLEM - Disk space on Hadoop worker on an-worker1095 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 31 GB (0% inode=99%): /var/lib/hadoop/data/l 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[19:59:00] Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1003.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbprov1003.eqiad.wmnet'...
[20:00:33] PROBLEM - Disk space on Hadoop worker on analytics1077 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/l 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:00:41] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime
[20:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:06] Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` dbprov1003.eqiad.wmnet ` The lo...
[20:02:44] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:19] Operations, ops-eqiad, DBA, DC-Ops: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1003.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbprov1003.eqiad.wmnet'] `
[20:14:36] (PS1) Cmjohnson: Adding pki1001 to dhcp file and update netboot.cfg file [puppet] - https://gerrit.wikimedia.org/r/621962 (https://phabricator.wikimedia.org/T259826)
[20:15:02] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime
[20:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:18] (PS1) Cmjohnson: Adding pki1001 server to site.pp insetup role [puppet] - https://gerrit.wikimedia.org/r/621963 (https://phabricator.wikimedia.org/T259826)
[20:17:27] (CR) Cmjohnson: [C: +2] Adding pki1001 to dhcp file and update netboot.cfg file [puppet] - https://gerrit.wikimedia.org/r/621962 (https://phabricator.wikimedia.org/T259826) (owner: Cmjohnson)
[20:17:49] (CR) Cmjohnson: [C: +2] Adding pki1001 server to site.pp insetup role [puppet] - https://gerrit.wikimedia.org/r/621963 (https://phabricator.wikimedia.org/T259826) (owner: Cmjohnson)
[20:17:51] PROBLEM - Disk space on Hadoop worker on analytics1070 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:18:45] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:08] (PS1) Cmjohnson: Fixing dbprov1003 site.pp [puppet] - https://gerrit.wikimedia.org/r/621964 (https://phabricator.wikimedia.org/T258750)
[20:21:21] PROBLEM - Disk
space on Hadoop worker on an-worker1078 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 20 GB (0% inode=99%): /var/lib/hadoop/data/g 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:21:38] (CR) Cmjohnson: [C: +2] Fixing dbprov1003 site.pp [puppet] - https://gerrit.wikimedia.org/r/621964 (https://phabricator.wikimedia.org/T258750) (owner: Cmjohnson)
[20:22:31] Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` dbprov1003.eqiad.wmnet ` The lo...
[20:27:45] PROBLEM - Disk space on Hadoop worker on an-worker1083 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:28:11] Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` pki1001.eqiad.wmnet ` The log can be found in...
[20:36:28] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime
[20:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:32] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:39] Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (Cmjohnson)
[20:40:59] PROBLEM - Disk space on Hadoop worker on analytics1070 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:56:48] Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (Cmjohnson) I ran into a roadblock with the installer, the partman recipe that I used is apparently not working. I used the same as codfw but that may be w...
[20:58:38] Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1003.eqiad.wmnet'] ` and were **ALL** successful.
[20:59:32] Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (Cmjohnson) Open→Resolved @jcrespo This server is ready for you, I did update the site.pp role to insetup. I didn't want to install...
[21:01:21] PROBLEM - Disk space on Hadoop worker on an-worker1081 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/m 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:13:19] PROBLEM - Disk space on Hadoop worker on an-worker1084 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 12 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:26:46] Operations, Wikimedia-Mailing-lists: Disable google code in mailinglists - https://phabricator.wikimedia.org/T261084 (Urbanecm)
[22:30:07] RECOVERY - Disk space on Hadoop worker on analytics1077 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[22:31:45] Operations, observability, good first task: nagios-nrpe-server.service: systemd unit references path below legacy directory /var/run/ - https://phabricator.wikimedia.org/T252990 (Southparkfan) The PID directory has been changed to /run since Buster: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=9...
[22:31:51] Operations, observability, good first task: nagios-nrpe-server.service: systemd unit references path below legacy directory /var/run/ - https://phabricator.wikimedia.org/T252990 (Southparkfan) a:Southparkfan
[22:34:28] (PS1) Southparkfan: nagios-nrpe-server systemd unit: change PIDFile based on running OS In Buster, the PID directory was changed to /run. While using the legacy /var/run won't break anything, it generates warnings in the logs.
[puppet] - https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990)
[22:36:03] (PS2) Southparkfan: nagios-nrpe-server systemd unit: change PIDFile based on running OS [puppet] - https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990)
[23:17:25] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:19:22] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:23:21] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[23:25:09] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[23:40:31] https://gerrit.wikimedia.org/r/c/operations/puppet/+/621967 shouldn't there be some tests for this change?