[00:04:05] <icinga-wm>	 PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:07] <icinga-wm>	 PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:35] <icinga-wm>	 PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:41] <icinga-wm>	 PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:17] <icinga-wm>	 PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:50:13] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:52:13] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:58:03] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:08:01] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 571 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[05:32:01] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200823T0700)
[09:20:01] <icinga-wm>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001 job=burrow partition={2,3} prometheus=ops site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-
[09:20:01] <icinga-wm>	 ll&var-consumer_group=All
[09:38:49] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 54 probes of 564 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:41:51] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[09:44:51] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 562 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:18:35] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:20:33] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:23:38] <gehel>	 !log repool wdqs1006 - caught up on lag
[11:23:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:47] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:20:08] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 36675912 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:22:07] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 15496 and 16 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:58:52] <wikibugs>	 (PS1) Jayprakash12345: Added import sources for mlwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/621741
[13:21:59] <wikibugs>	 (PS2) Jayprakash12345: Added import sources for mlwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/621741 (https://phabricator.wikimedia.org/T260716)
[13:40:50] <wikibugs>	 (PS3) Jayprakash12345: Added import sources for mlwiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/621741 (https://phabricator.wikimedia.org/T260716)
[13:42:52] <wikibugs>	 (CR) Jayprakash12345: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/621741 (https://phabricator.wikimedia.org/T260716) (owner: Jayprakash12345)
[14:25:13] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[14:54:45] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[14:55:23] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1094 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[15:16:47] <icinga-wm>	 PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 52 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:22:48] <icinga-wm>	 RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 47 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:46:13] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on analytics1077 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[16:16:35] <icinga-wm>	 PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 51 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:22:33] <icinga-wm>	 RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 49 probes of 555 (alerts on 50) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:32:41] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1078 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:47:31] <wikibugs>	 Operations, TechCom-RFC, Traffic, MobileFrontend (Tracking), Readers-Web-Backlog (Tracking): RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL - https://phabricator.wikimedia.org/T214998 (Mobiledesktop) Have been somewhat expecting this to be fixed for the bett...
[19:14:13] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1095 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 16 GB (0% inode=99%): /var/lib/hadoop/data/l 28 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[19:26:49] <wikibugs>	 (PS1) Cmjohnson: Adding production dns for dbprov1003 [dns] - https://gerrit.wikimedia.org/r/621959 (https://phabricator.wikimedia.org/T258750)
[19:28:03] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1095 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 17 GB (0% inode=99%): /var/lib/hadoop/data/l 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[19:31:04] <wikibugs>	 (CR) Cmjohnson: [C: +2] Adding production dns for dbprov1003 [dns] - https://gerrit.wikimedia.org/r/621959 (https://phabricator.wikimedia.org/T258750) (owner: Cmjohnson)
[19:32:01] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on an-worker1095 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[19:36:24] <wikibugs>	 (PS1) Cmjohnson: Adding dbprov1003 mac address to dhcp file [puppet] - https://gerrit.wikimedia.org/r/621960 (https://phabricator.wikimedia.org/T258750)
[19:37:14] <wikibugs>	 (CR) Cmjohnson: [C: +2] Adding dbprov1003 mac address to dhcp file [puppet] - https://gerrit.wikimedia.org/r/621960 (https://phabricator.wikimedia.org/T258750) (owner: Cmjohnson)
[19:39:33] <wikibugs>	 (PS1) Cmjohnson: Updating site.pp file for dbprov1003 switched it to role:insetup to be safe [puppet] - https://gerrit.wikimedia.org/r/621961 (https://phabricator.wikimedia.org/T258750)
[19:40:20] <wikibugs>	 (CR) Cmjohnson: [C: +2] Updating site.pp file for dbprov1003 switched it to role:insetup to be safe [puppet] - https://gerrit.wikimedia.org/r/621961 (https://phabricator.wikimedia.org/T258750) (owner: Cmjohnson)
[19:41:21] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1094 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 14 GB (0% inode=99%): /var/lib/hadoop/data/j 25 GB (0% inode=99%): /var/lib/hadoop/data/i 22 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[19:44:38] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` dbprov1003.eqiad.wmnet ` The lo...
[19:45:47] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1095 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 31 GB (0% inode=99%): /var/lib/hadoop/data/l 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[19:59:00] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1003.eqiad.wmnet'] `  Of which those **FAILED**: ` ['dbprov1003.eqiad.wmnet'...
[20:00:33] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on analytics1077 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/l 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:00:41] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime
[20:00:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:01:06] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` dbprov1003.eqiad.wmnet ` The lo...
[20:02:44] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:02:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:13:19] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1003.eqiad.wmnet'] `  Of which those **FAILED**: ` ['dbprov1003.eqiad.wmnet'] `
[20:14:36] <wikibugs>	 (PS1) Cmjohnson: Adding pki1001 to dhcp file and update netboot.cfg file [puppet] - https://gerrit.wikimedia.org/r/621962 (https://phabricator.wikimedia.org/T259826)
[20:15:02] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime
[20:15:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:16:18] <wikibugs>	 (PS1) Cmjohnson: Adding pki1001 server to site.pp insetup role [puppet] - https://gerrit.wikimedia.org/r/621963 (https://phabricator.wikimedia.org/T259826)
[20:17:27] <wikibugs>	 (CR) Cmjohnson: [C: +2] Adding pki1001 to dhcp file and update netboot.cfg file [puppet] - https://gerrit.wikimedia.org/r/621962 (https://phabricator.wikimedia.org/T259826) (owner: Cmjohnson)
[20:17:49] <wikibugs>	 (CR) Cmjohnson: [C: +2] Adding pki1001 server to site.pp insetup role [puppet] - https://gerrit.wikimedia.org/r/621963 (https://phabricator.wikimedia.org/T259826) (owner: Cmjohnson)
[20:17:51] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on analytics1070 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:18:45] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:18:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:08] <wikibugs>	 (PS1) Cmjohnson: Fixing dbprov1003 site.pp [puppet] - https://gerrit.wikimedia.org/r/621964 (https://phabricator.wikimedia.org/T258750)
[20:21:21] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1078 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/e 20 GB (0% inode=99%): /var/lib/hadoop/data/g 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:21:38] <wikibugs>	 (CR) Cmjohnson: [C: +2] Fixing dbprov1003 site.pp [puppet] - https://gerrit.wikimedia.org/r/621964 (https://phabricator.wikimedia.org/T258750) (owner: Cmjohnson)
[20:22:31] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` dbprov1003.eqiad.wmnet ` The lo...
[20:27:45] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1083 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/k 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:28:11] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` pki1001.eqiad.wmnet ` The log can be found in...
[20:36:28] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime
[20:36:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:32] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[20:38:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:40:39] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (Cmjohnson)
[20:40:59] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on analytics1070 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/j 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[20:56:48] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (Cmjohnson) I ran into a roadblock with the installer, the partman recipe that I used is apparently not working. I used the same as codfw but that may be w...
[20:58:38] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1003.eqiad.wmnet'] `  and were **ALL** successful.
[20:59:32] <wikibugs>	 Operations, ops-eqiad, DBA, DC-Ops, Patch-For-Review: (2020-08-15) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (Cmjohnson) Open→Resolved @jcrespo This server is ready for you, I did update the site.pp role to insetup. I didn't want to install...
[21:01:21] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1081 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/m 15 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:13:19] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1084 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/d 12 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[21:26:46] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Disable google code in mailinglists - https://phabricator.wikimedia.org/T261084 (Urbanecm)
[22:30:07] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on analytics1077 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[22:31:45] <wikibugs>	 Operations, observability, good first task: nagios-nrpe-server.service: systemd unit references path below legacy directory /var/run/ - https://phabricator.wikimedia.org/T252990 (Southparkfan) The PID directory has been changed to /run since Buster: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=9...
[22:31:51] <wikibugs>	 Operations, observability, good first task: nagios-nrpe-server.service: systemd unit references path below legacy directory /var/run/ - https://phabricator.wikimedia.org/T252990 (Southparkfan) a:Southparkfan
[22:34:28] <wikibugs>	 (PS1) Southparkfan: nagios-nrpe-server systemd unit: change PIDFile based on running OS In Buster, the PID directory was changed to /run. While using the legacy /var/run won't break anything, it generates warnings in the logs. [puppet] - https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990)
[22:36:03] <wikibugs>	 (PS2) Southparkfan: nagios-nrpe-server systemd unit: change PIDFile based on running OS [puppet] - https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990)
[23:17:25] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:19:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:23:21] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[23:25:09] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[23:40:31] <SPF|Cloud>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/621967 shouldn't there be some tests for this change?