[00:00:04] twentyafterfour: That opportune time is upon us again. Time for a Phabricator update deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201015T0000). [00:01:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:13:21] !log updating phabricator [00:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:05] !log phabricator update was uneventful [00:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:39] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-upgrade [00:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:28] !log Beginning rolling upgrade for cirrussearch `eqiad`. Cookbook will restart elasticsearch on 36 nodes total, 3 nodes at a time [00:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:20:13] (CR) Dzahn: [C: +1] "new compiler output looks good to me" [puppet] - https://gerrit.wikimedia.org/r/632224 (https://phabricator.wikimedia.org/T264182) (owner: Muehlenhoff) [02:13:28] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=0) [02:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:14:18] !log Rolling upgrade for cirrussearch `eqiad` is complete [02:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:18:18] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-upgrade [02:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:18:35] !log Rolling upgrade for cirrussearch `codfw` beginning [02:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:10:32] !log
ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=0) [04:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:27:02] !log Rolling upgrade for cirrus `codfw` complete [04:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:27] (PS1) ArielGlenn: if dumps rsync produces output, it's raw bytes when chcking for errors [puppet] - https://gerrit.wikimedia.org/r/634144 [05:43:53] (PS2) ArielGlenn: if dumps rsync produces output, it's raw bytes when checking for errors [puppet] - https://gerrit.wikimedia.org/r/634144 [05:50:20] (CR) ArielGlenn: [C: +2] if dumps rsync produces output, it's raw bytes when checking for errors [puppet] - https://gerrit.wikimedia.org/r/634144 (owner: ArielGlenn) [06:06:42] (PS1) Elukey: Decommission analytics1050 from the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/634145 (https://phabricator.wikimedia.org/T255140) [06:08:45] (CR) Elukey: [C: +2] Decommission analytics1050 from the Hadoop cluster [puppet] - https://gerrit.wikimedia.org/r/634145 (https://phabricator.wikimedia.org/T255140) (owner: Elukey) [06:17:14] (CR) Elukey: [C: +1] "Very nice :) LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/633032 (https://phabricator.wikimedia.org/T264152) (owner: Razzi) [06:30:47] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 115 probes of 652 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:37:54] (PS1) Elukey: Enable admin list checks for Oozie in Analytics Hadoop [puppet] - https://gerrit.wikimedia.org/r/634152 (https://phabricator.wikimedia.org/T262660) [06:39:01] (CR) Elukey: [C: +2] Enable admin list checks for Oozie in Analytics Hadoop [puppet] - https://gerrit.wikimedia.org/r/634152 (https://phabricator.wikimedia.org/T262660) (owner: Elukey) [06:52:29] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 8 probes of 652 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [06:53:25] (PS1) Elukey: Enable admin list for Oozie in Analytics Hadoop - second attempt [puppet] - https://gerrit.wikimedia.org/r/634179 (https://phabricator.wikimedia.org/T262660) [07:00:41] (CR) Elukey: [C: +2] Enable admin list for Oozie in Analytics Hadoop - second attempt [puppet] - https://gerrit.wikimedia.org/r/634179 (https://phabricator.wikimedia.org/T262660) (owner: Elukey) [07:24:55] (Abandoned) Volans: Query: allow to extract random subset of hosts [software/cumin] - https://gerrit.wikimedia.org/r/409980 (https://phabricator.wikimedia.org/T186818) (owner: Volans) [07:40:10] (Abandoned) Hashar: Link to static libclang [debs/doxygen] (debian/buster-wikimedia) - https://gerrit.wikimedia.org/r/634080 (https://phabricator.wikimedia.org/T254465) (owner: Hashar) [07:40:21] (Abandoned) Volans: Tox: find and check Python files without extension [puppet] -
https://gerrit.wikimedia.org/r/357197 (https://phabricator.wikimedia.org/T144169) (owner: Volans) [07:45:20] (PS4) Hashar: Merge tag 'debian/1.8.19-1' into debian/buster-wikimedia [debs/doxygen] (debian/buster-wikimedia) - https://gerrit.wikimedia.org/r/621291 (https://phabricator.wikimedia.org/T254465) [07:55:29] (CR) JMeybohm: "PCC for this: https://puppet-compiler.wmflabs.org/compiler1003/25719/" [puppet] - https://gerrit.wikimedia.org/r/631720 (https://phabricator.wikimedia.org/T260917) (owner: JMeybohm) [07:59:12] (CR) Ayounsi: [C: +1] bump FNM mbps threshold [puppet] - https://gerrit.wikimedia.org/r/634115 (owner: CDanis) [08:03:50] (CR) Ayounsi: [C: +1] netmon: remove stretch PHP 7.2 support [puppet] - https://gerrit.wikimedia.org/r/633824 (owner: Dzahn) [08:04:17] (CR) Filippo Giunchedi: [C: +1] prometheus: ensure new prometheus-rsyslog-exporter version [puppet] - https://gerrit.wikimedia.org/r/634112 (https://phabricator.wikimedia.org/T210137) (owner: Cwhite) [08:10:12] (CR) Kormat: [C: +1] Add compiled Python files to gitignore [software] - https://gerrit.wikimedia.org/r/581993 (owner: Volans) [08:10:16] (CR) Kormat: [C: +1] Relax max-line-length for flake8 [software] - https://gerrit.wikimedia.org/r/581994 (owner: Volans) [08:11:55] (PS2) Volans: Add compiled Python files to gitignore [software] - https://gerrit.wikimedia.org/r/581993 [08:13:20] (CR) Volans: [C: +2] Add compiled Python files to gitignore [software] - https://gerrit.wikimedia.org/r/581993 (owner: Volans) [08:14:08] (Merged) jenkins-bot: Add compiled Python files to gitignore [software] - https://gerrit.wikimedia.org/r/581993 (owner: Volans) [08:14:40] (PS2) Volans: Relax max-line-length for flake8 [software] - https://gerrit.wikimedia.org/r/581994 [08:16:01] (CR) Volans: [C: +2] Relax max-line-length for flake8 [software] - https://gerrit.wikimedia.org/r/581994 (owner: Volans)
[08:16:26] (Merged) jenkins-bot: Relax max-line-length for flake8 [software] - https://gerrit.wikimedia.org/r/581994 (owner: Volans) [08:17:11] !log swift codfw-prod: bump object weight for ms-be2057 - T261633 [08:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:18] T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 [08:18:38] (CR) Ayounsi: [C: +1] netmon: ensure nmap and mtr-tiny are installed, add profile for tools [puppet] - https://gerrit.wikimedia.org/r/633827 (owner: Dzahn) [08:33:52] (CR) Filippo Giunchedi: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/25900/prometheus1003.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/633971 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi) [08:34:35] (PS5) Ayounsi: Add Z side device/interface/vlan and cable to PuppetDB importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/634017 (https://phabricator.wikimedia.org/T262899) [08:37:41] (PS3) Volans: sre.hosts.downtime: convert to class API [cookbooks] - https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) [08:38:51] (CR) jerkins-bot: [V: -1] sre.hosts.downtime: convert to class API [cookbooks] - https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) (owner: Volans) [09:04:03] PROBLEM - SSH on ms-be2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:04:23] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:05:19] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL:
/analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:05:37] RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:06:55] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:07:43] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:07:53] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:09:29] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [09:16:23] checking --^ [09:20:18] (CR) Hashar: [C: +2] "Tested and it works, I will get it build and uploaded to apt.wikimedia.org" [debs/doxygen] (debian/buster-wikimedia) - https://gerrit.wikimedia.org/r/621291 (https://phabricator.wikimedia.org/T254465) (owner: Hashar) [09:22:31] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:23:29] known ^ it is the session scope failing under high load, should recover shortly [09:24:13] RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:21] (PS1) Elukey: role::druid::public::worker: increase connection pool on historicals [puppet] - https://gerrit.wikimedia.org/r/634191 (https://phabricator.wikimedia.org/T226035) [09:30:07] (Merged) jenkins-bot: Merge tag 'debian/1.8.19-1' into debian/buster-wikimedia [debs/doxygen] (debian/buster-wikimedia) - https://gerrit.wikimedia.org/r/621291 (https://phabricator.wikimedia.org/T254465) (owner: Hashar) [09:32:15] PROBLEM - Number of messages locally queued by purged for processing on cp3052 is CRITICAL: cluster=cache_text instance=cp3052 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3052 [09:32:25] PROBLEM - Number of messages locally queued by purged for processing on cp5009 is CRITICAL: cluster=cache_text instance=cp5009 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [09:32:35] PROBLEM - Number of messages locally queued by purged for processing on cp5007 is CRITICAL: cluster=cache_text instance=cp5007 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [09:32:35] PROBLEM - Number of messages locally queued by purged for processing on cp1079 is CRITICAL: cluster=cache_text instance=cp1079 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [09:32:45] PROBLEM - Number of messages locally queued by purged for processing on cp1077 is CRITICAL: cluster=cache_text instance=cp1077 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077 [09:32:51] PROBLEM - Number of messages locally queued by purged for processing on cp1087 is CRITICAL: cluster=cache_text instance=cp1087 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [09:33:25] PROBLEM - Number of messages locally queued by purged for processing on cp3050 is CRITICAL: cluster=cache_text instance=cp3050 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [09:33:39] PROBLEM - Number of messages locally queued by purged for processing on cp3054 is CRITICAL: cluster=cache_text instance=cp3054 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [09:33:41] PROBLEM - Number of messages locally queued by purged for processing on cp3060 is CRITICAL: cluster=cache_text instance=cp3060 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [09:33:47] PROBLEM - Number of messages locally queued by purged for processing on cp3056 is CRITICAL: cluster=cache_text instance=cp3056 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [09:39:29] RECOVERY - Number of messages locally queued by purged for processing on cp5007 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [09:39:41] RECOVERY - Number of messages locally queued by purged for processing on cp1087 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [09:40:18] (PS1) Alexandros Kosiaris: sretest: Experiment with preserving docker rules [puppet] - https://gerrit.wikimedia.org/r/634192 [09:40:59] PROBLEM - Number of messages locally queued by purged for processing on cp5009 is CRITICAL: cluster=cache_text instance=cp5009 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [09:41:19] PROBLEM - Number of messages locally queued by purged for processing on cp1077 is CRITICAL: cluster=cache_text instance=cp1077 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077 [09:41:25] PROBLEM - Number of messages locally queued by purged for processing on cp2037 is CRITICAL: cluster=cache_text instance=cp2037 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037 [09:46:29] PROBLEM - Number of messages locally queued by purged for processing on cp5008 is CRITICAL: cluster=cache_text instance=cp5008 job=purged layer=backend site=eqsin
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [09:46:41] PROBLEM - Number of messages locally queued by purged for processing on cp2033 is CRITICAL: cluster=cache_text instance=cp2033 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033 [09:48:01] PROBLEM - Number of messages locally queued by purged for processing on cp5007 is CRITICAL: cluster=cache_text instance=cp5007 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [09:48:13] PROBLEM - Number of messages locally queued by purged for processing on cp1087 is CRITICAL: cluster=cache_text instance=cp1087 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [09:48:35] PROBLEM - Number of messages locally queued by purged for processing on cp2035 is CRITICAL: cluster=cache_text instance=cp2035 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2035 [09:49:51] PROBLEM - Number of messages locally queued by purged for processing on cp1081 is CRITICAL: cluster=cache_text instance=cp1081 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1081 [09:49:53] PROBLEM - Number of messages locally queued by purged for processing on cp5010 is CRITICAL: cluster=cache_text instance=cp5010 job=purged layer=backend site=eqsin 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5010 [09:51:59] (CR) Arturo Borrero Gonzalez: "This patch changes the catalog for basically every sever in the fleet, to include the autogenerated ferm define." [puppet] - https://gerrit.wikimedia.org/r/634050 (owner: Arturo Borrero Gonzalez) [09:53:13] RECOVERY - Number of messages locally queued by purged for processing on cp1077 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077 [09:53:15] RECOVERY - Number of messages locally queued by purged for processing on cp1081 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1081 [09:53:16] ema: --^ [09:53:19] RECOVERY - Number of messages locally queued by purged for processing on cp5010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5010 [09:53:29] RECOVERY - Number of messages locally queued by purged for processing on cp2033 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033 [09:54:59] PROBLEM - Number of messages locally queued by purged for processing on cp1087 is CRITICAL: cluster=cache_text instance=cp1087 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [09:55:05] (PS1) Elukey: sre.druid.roll-restart-workers: allow to restart only a subset of daemons [cookbooks] - https://gerrit.wikimedia.org/r/634193 [09:55:23] PROBLEM - Number of messages locally queued by purged for processing on cp2035 is CRITICAL: cluster=cache_text instance=cp2035 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2035 [09:55:23] (PS1) JMeybohm: admin: add maryana to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/634194 (https://phabricator.wikimedia.org/T265555) [09:55:39] (PS6) Ayounsi: Add Z side device/interface/vlan and cable to PuppetDB importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/634017 (https://phabricator.wikimedia.org/T262899) [09:56:04] (CR) jerkins-bot: [V: -1] Add Z side device/interface/vlan and cable to PuppetDB importer [software/netbox-extras] - https://gerrit.wikimedia.org/r/634017 (https://phabricator.wikimedia.org/T262899) (owner: Ayounsi) [09:56:23] (CR) JMeybohm: [C: +2] admin: add maryana to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/634194 (https://phabricator.wikimedia.org/T265555) (owner: JMeybohm) [09:56:43] PROBLEM - Number of messages locally queued by purged for processing on cp5008 is CRITICAL: cluster=cache_text instance=cp5008 job=purged layer=backend site=eqsin
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [09:58:49] RECOVERY - Number of messages locally queued by purged for processing on cp2035 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2035 [10:00:04] mvolz: (Dis)respected human, time to deploy Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201015T1000). Please do the needful. [10:00:13] RECOVERY - Number of messages locally queued by purged for processing on cp5008 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [10:00:14] the amount of messages to the kafka topic in codfw doesn't seem that high [10:00:17] https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&from=now-24h&to=now&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=codfw.resource-purge [10:00:26] !log T264209. Initiate a docker pull of docker-registry.discovery.wmnet/mwcachedir:0.0.1 from all kubernetes and kubernetes staging nodes. [10:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:35] T264209: Run stress tests on docker images infrastructure - https://phabricator.wikimedia.org/T264209 [10:01:21] PROBLEM - Freshness of OCSP Stapling files -ATS-TLS- on cp2030 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2019a-rsa-unified.ocsp is more than 259500 secs old! https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [10:01:53] RECOVERY - Number of messages locally queued by purged for processing on cp1087 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [10:03:15] (CR) Alexandros Kosiaris: "PCC https://puppet-compiler.wmflabs.org/compiler1002/25904/" [puppet] - https://gerrit.wikimedia.org/r/634192 (owner: Alexandros Kosiaris) [10:03:15] RECOVERY - Number of messages locally queued by purged for processing on cp5009 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [10:03:25] RECOVERY - Number of messages locally queued by purged for processing on cp5007 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [10:03:39] RECOVERY - Number of messages locally queued by purged for processing on cp2037 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037 [10:18:51] (PS2) Filippo Giunchedi: hieradata: re-enable compaction for prometheus[12]003 [puppet] - https://gerrit.wikimedia.org/r/633972 (https://phabricator.wikimedia.org/T261281) [10:18:53] (PS1) Filippo Giunchedi: thanos: add thanos-bucket-web explorer [puppet] - https://gerrit.wikimedia.org/r/634198 (https://phabricator.wikimedia.org/T261281) [10:18:55] (PS1) Filippo Giunchedi: role: add thanos bucket-web to frontend [puppet] - https://gerrit.wikimedia.org/r/634199 (https://phabricator.wikimedia.org/T261281) [10:20:02] (CR) jerkins-bot: [V: -1] thanos: add thanos-bucket-web explorer [puppet] - https://gerrit.wikimedia.org/r/634198 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi) [10:21:25] RECOVERY - Number of messages locally queued by purged for processing on cp3054 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [10:21:25] RECOVERY - Number of messages locally queued by purged for processing on cp3060 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [10:21:43] RECOVERY - Number of messages locally queued by purged for processing on cp3052 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3052 [10:22:11] PROBLEM - Freshness of OCSP Stapling files -ATS-TLS- on cp1079 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2019a-rsa-unified.ocsp is more than 259500 secs old!
https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [10:23:13] RECOVERY - Number of messages locally queued by purged for processing on cp3056 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [10:27:59] RECOVERY - Number of messages locally queued by purged for processing on cp3050 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [10:29:51] (PS2) Filippo Giunchedi: thanos: add thanos-bucket-web explorer [puppet] - https://gerrit.wikimedia.org/r/634198 (https://phabricator.wikimedia.org/T261281) [10:29:53] (PS2) Filippo Giunchedi: role: add thanos bucket-web to frontend [puppet] - https://gerrit.wikimedia.org/r/634199 (https://phabricator.wikimedia.org/T261281) [10:29:55] (PS3) Filippo Giunchedi: hieradata: re-enable compaction for prometheus[12]003 [puppet] - https://gerrit.wikimedia.org/r/633972 (https://phabricator.wikimedia.org/T261281) [10:29:59] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:33:19] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/25905/thanos-fe1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/634199 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi) [10:33:35] (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/25905/thanos-fe1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/634198 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi) [10:34:08] (CR) Filippo Giunchedi: [C: -1] "To be merged on Monday" [puppet] - https://gerrit.wikimedia.org/r/633972 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi) [10:36:21] ACKNOWLEDGEMENT - Freshness of OCSP Stapling files -ATS-TLS- on cp1079 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2019a-rsa-unified.ocsp is more than 259500 secs old! Valentin Gutierrez T265584 https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [10:36:21] ACKNOWLEDGEMENT - Freshness of OCSP Stapling files -ATS-TLS- on cp2030 is CRITICAL: CRITICAL: File /var/cache/ocsp/digicert-2019a-rsa-unified.ocsp is more than 259500 secs old! Valentin Gutierrez T265584 https://wikitech.wikimedia.org/wiki/HTTPS/Unified_Certificates [10:38:13] PROBLEM - Number of messages locally queued by purged for processing on cp3050 is CRITICAL: cluster=cache_text instance=cp3050 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [10:39:55] RECOVERY - Number of messages locally queued by purged for processing on cp3050 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [10:40:45] RECOVERY - Number of messages locally queued by purged for processing on cp1079 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [10:42:55] (PS1) Vgutierrez: ATS: Remove digicert-2019a cert definition [puppet] - https://gerrit.wikimedia.org/r/634202 (https://phabricator.wikimedia.org/T265584) [10:44:09] PROBLEM - Check systemd state on ms-be2049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:47:12] !log restart ats-backend on cp3050 [10:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:48] (CR) Volans: [C: +1] "LGTM, optional nits inline" (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/634193 (owner: Elukey) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201015T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:02:19] (CR) Vgutierrez: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1001/25907/ but we will need to wipe manually some files, especially the OCSP " [puppet] - https://gerrit.wikimedia.org/r/634202 (https://phabricator.wikimedia.org/T265584) (owner: Vgutierrez) [11:03:01] PROBLEM - SSH on ms-be2029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:04:09] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:45] PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:19] RECOVERY - SSH on ms-be2029 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:09:57] PROBLEM - Number of messages locally queued by purged for processing on cp1079 is CRITICAL: cluster=cache_text instance=cp1079 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [11:10:17] PROBLEM - Number of messages locally queued by purged for processing on cp2037 is CRITICAL: cluster=cache_text instance=cp2037 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037 [11:10:25] PROBLEM - Number of messages locally queued by purged for processing on cp5008 is CRITICAL: cluster=cache_text instance=cp5008 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [11:10:39] PROBLEM - Number of messages locally queued by purged for processing on cp2035 is CRITICAL: cluster=cache_text instance=cp2035 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2035 [11:10:45] vgutierrez: :( [11:11:15] PROBLEM - Number of messages locally queued by purged for processing on cp3056 is CRITICAL: cluster=cache_text instance=cp3056 job=purged layer=backend site=esams 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [11:11:25] PROBLEM - Number of messages locally queued by purged for processing on cp3052 is CRITICAL: cluster=cache_text instance=cp3052 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3052 [11:11:43] RECOVERY - Check systemd state on ms-be2049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:45] PROBLEM - Number of messages locally queued by purged for processing on cp5009 is CRITICAL: cluster=cache_text instance=cp5009 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [11:11:49] PROBLEM - Number of messages locally queued by purged for processing on cp1077 is CRITICAL: cluster=cache_text instance=cp1077 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077 [11:11:49] PROBLEM - Number of messages locally queued by purged for processing on cp1081 is CRITICAL: cluster=cache_text instance=cp1081 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1081 [11:11:53] PROBLEM - Number of messages locally queued by purged for processing on cp1087 is CRITICAL: cluster=cache_text instance=cp1087 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 
[11:11:55] PROBLEM - Number of messages locally queued by purged for processing on cp5007 is CRITICAL: cluster=cache_text instance=cp5007 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [11:12:09] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:11] PROBLEM - Number of messages locally queued by purged for processing on cp5010 is CRITICAL: cluster=cache_text instance=cp5010 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5010 [11:12:12] (03PS7) 10Ayounsi: Add Z side device/interface/vlan and cable to PuppetDB importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/634017 (https://phabricator.wikimedia.org/T262899) [11:12:35] PROBLEM - Number of messages locally queued by purged for processing on cp3050 is CRITICAL: cluster=cache_text instance=cp3050 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [11:12:51] PROBLEM - Number of messages locally queued by purged for processing on cp3054 is CRITICAL: cluster=cache_text instance=cp3054 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [11:12:51] PROBLEM - Number of messages locally queued by purged for processing on cp3060 is CRITICAL: cluster=cache_text instance=cp3060 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [11:13:07] PROBLEM - Number of messages locally queued by purged for processing on cp5012 is CRITICAL: cluster=cache_text instance=cp5012 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012 [11:13:45] PROBLEM - Number of messages locally queued by purged for processing on cp2039 is CRITICAL: cluster=cache_text instance=cp2039 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2039 [11:13:51] PROBLEM - Number of messages locally queued by purged for processing on cp2033 is CRITICAL: cluster=cache_text instance=cp2033 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033 [11:13:59] PROBLEM - Number of messages locally queued by purged for processing on cp4028 is CRITICAL: cluster=cache_text instance=cp4028 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [11:19:39] PROBLEM - Number of messages locally queued by purged for processing on cp4029 is CRITICAL: cluster=cache_text instance=cp4029 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4029 [11:19:39] PROBLEM - Number of messages locally queued by purged for processing on cp4032 is CRITICAL: cluster=cache_text instance=cp4032 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4032 [11:19:45] PROBLEM - Number of messages locally queued by purged for processing on cp4031 is CRITICAL: cluster=cache_text instance=cp4031 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [11:20:29] PROBLEM - Number of messages locally queued by purged for processing on cp4027 is CRITICAL: cluster=cache_text instance=cp4027 job=purged layer=backend site=ulsfo https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4027 [11:20:41] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:43] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:19] PROBLEM - Check systemd state on ms-be2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:37] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:28:09] RECOVERY - Number of messages locally queued by purged for processing on cp4029 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4029 [11:28:15] RECOVERY - Number of messages locally queued by purged for processing on cp4031 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4031 [11:29:12] (03PS25) 10Kormat: mariadb: core::multiinstance - generate sections. [puppet] - 10https://gerrit.wikimedia.org/r/620899 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [11:29:14] (03PS3) 10Kormat: mariadb: misc::multiinstance - generate sections. [puppet] - 10https://gerrit.wikimedia.org/r/634034 (https://phabricator.wikimedia.org/T256972) [11:29:16] (03PS3) 10Kormat: mariadb: sanitarium_multiinstance - generate sections. [puppet] - 10https://gerrit.wikimedia.org/r/634042 (https://phabricator.wikimedia.org/T256972) [11:29:18] (03PS3) 10Kormat: mariadb: dbstore_multiinstance - generate sections. [puppet] - 10https://gerrit.wikimedia.org/r/634044 (https://phabricator.wikimedia.org/T256972) [11:29:53] RECOVERY - Number of messages locally queued by purged for processing on cp4032 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4032 [11:30:45] RECOVERY - Number of messages locally queued by purged for processing on cp4027 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4027 [11:30:51] RECOVERY - Number of messages locally queued by purged for processing on cp2039 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2039 [11:31:03] RECOVERY - Number of messages locally queued by purged for processing on cp4028 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=ulsfo+prometheus/ops&var-instance=cp4028 [11:31:11] RECOVERY - Number of messages locally queued by purged for processing on cp2035 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2035 [11:31:51] (03CR) 10Joal: "Works for me - I don't know if it'll make a diff :S" [puppet] - 10https://gerrit.wikimedia.org/r/634191 (https://phabricator.wikimedia.org/T226035) (owner: 10Elukey) [11:31:59] (03CR) 10Joal: [C: 03+1] role::druid::public::worker: increase connection pool on historicals [puppet] - 10https://gerrit.wikimedia.org/r/634191 (https://phabricator.wikimedia.org/T226035) (owner: 10Elukey) [11:32:13] RECOVERY - Check systemd state on ms-be2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:21] RECOVERY - Number of messages locally queued by purged for processing on cp1081 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1081 [11:32:41] RECOVERY - Number of messages locally queued by purged for processing on cp2033 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033 [11:32:45] RECOVERY - Number of messages locally queued by purged for processing on cp5008 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5008 [11:32:45] RECOVERY - Number of messages locally queued by purged for processing on cp5010 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5010 [11:33:11] RECOVERY - Check systemd state on ms-be2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:41] RECOVERY - Number of messages locally queued by purged for processing on cp5012 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5012 [11:34:03] RECOVERY - Number of messages locally queued by purged for processing on cp1077 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077 [11:34:03] RECOVERY - Number of messages locally queued by purged for processing on cp5009 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [11:34:07] RECOVERY - Number of messages locally queued by purged for processing on cp1087 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [11:34:13] RECOVERY - Number of messages locally queued by purged for processing on cp5007 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [11:34:15] RECOVERY - Number of messages locally queued by purged for processing on cp2037 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037 [11:34:51] RECOVERY - Number of messages locally queued by purged for processing on cp3050 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [11:35:26] (03CR) 10Elukey: sre.druid.roll-restart-workers: allow to restart only a subset of daemons (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/634193 (owner: 10Elukey) [11:36:51] RECOVERY - Number of messages locally queued by purged for processing on cp3054 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [11:36:53] RECOVERY - Number of messages locally queued by purged for processing on cp3060 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [11:36:59] RECOVERY - Number of messages locally queued by purged for processing on cp3056 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [11:37:09] RECOVERY - Number of messages locally queued by purged for processing on cp3052 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3052 [11:39:05] RECOVERY - Number of messages locally queued by purged for processing on cp1079 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [11:42:01] (PS1) Hnowlan: mtail: create separate metrics histogram for REST API requests [puppet] - https://gerrit.wikimedia.org/r/634207 (https://phabricator.wikimedia.org/T263727) [11:44:10] (PS2) Hnowlan: mtail: create separate metrics histogram for REST API requests [puppet] - https://gerrit.wikimedia.org/r/634207 (https://phabricator.wikimedia.org/T263727) [11:45:50] (PS1) Kormat: mariadb: misc::analytics::multiinstance - generate sections [puppet] - https://gerrit.wikimedia.org/r/634208 (https://phabricator.wikimedia.org/T256972) [11:47:37] (PS1) Arturo Borrero Gonzalez: cloudgw: basefirewall: accept ICMP packets [puppet] - https://gerrit.wikimedia.org/r/634209 (https://phabricator.wikimedia.org/T261724) [11:49:02] (CR) Kormat: "PCC is a no-op: https://puppet-compiler.wmflabs.org/compiler1003/25908/" [puppet] - https://gerrit.wikimedia.org/r/634208 (https://phabricator.wikimedia.org/T256972) (owner: Kormat) [11:49:55] (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: basefirewall: accept ICMP packets [puppet] - https://gerrit.wikimedia.org/r/634209 (https://phabricator.wikimedia.org/T261724) (owner: Arturo Borrero Gonzalez) [11:57:58] (CR) Klausman: [C: +2] role::druid::public::worker: increase connection pool on historicals [puppet] - https://gerrit.wikimedia.org/r/634191 (https://phabricator.wikimedia.org/T226035) (owner: Elukey) [11:58:19] (CR) Klausman: [C: +2] profile::analytics::cluster::packages::statistics: remove stretch bits [puppet] -
https://gerrit.wikimedia.org/r/630578 (https://phabricator.wikimedia.org/T255028) (owner: Elukey) [12:00:50] (CR) Klausman: [C: +1] "Looks good to me, but Luca should definitely give this a once-over" [puppet] - https://gerrit.wikimedia.org/r/634208 (https://phabricator.wikimedia.org/T256972) (owner: Kormat) [12:12:36] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [12:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:07] PROBLEM - SSH on analytics1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:15:03] PROBLEM - Number of messages locally queued by purged for processing on cp1079 is CRITICAL: cluster=cache_text instance=cp1079 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [12:16:47] RECOVERY - Number of messages locally queued by purged for processing on cp1079 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [12:19:32] (PS1) Jbond: prometheus::rsyslog_exporter: update collector to listen on primary ip [puppet] - https://gerrit.wikimedia.org/r/634213 (https://phabricator.wikimedia.org/T265587) [12:19:57] (CR) BBlack: [C: +1] "Good for now, we can make new entries when we get a replacement" [puppet] - https://gerrit.wikimedia.org/r/634202 (https://phabricator.wikimedia.org/T265584) (owner: Vgutierrez) [12:22:25] (CR) Volans: [C: +1] "reply inline" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/634193 (owner: Elukey) [12:22:27] (CR) Jbond: "Ready for review" [puppet] - https://gerrit.wikimedia.org/r/634213 (https://phabricator.wikimedia.org/T265587) (owner: Jbond) [12:32:34] (PS1) BBlack: gdnsd: raise DoTLS timeout and conn tuneables [puppet] - https://gerrit.wikimedia.org/r/634217 [12:34:35] (PS1) Arturo Borrero Gonzalez: Revert "openstack: l3_agent: introduce dmz_cidr-only l3 agent custom hack" [puppet] - https://gerrit.wikimedia.org/r/634102 [12:34:49] (CR) jerkins-bot: [V: -1] Revert "openstack: l3_agent: introduce dmz_cidr-only l3 agent custom hack" [puppet] - https://gerrit.wikimedia.org/r/634102 (owner: Arturo Borrero Gonzalez) [12:35:52] (PS2) BBlack: gdnsd: raise DoTLS timeout and conn tuneables [puppet] - https://gerrit.wikimedia.org/r/634217 [12:36:57] (PS1) Arturo Borrero Gonzalez: neutron: drop l3_agent_only_dmz_cidr option [puppet] - https://gerrit.wikimedia.org/r/634219 (https://phabricator.wikimedia.org/T247505) [12:38:46] (Abandoned) Arturo Borrero Gonzalez: Revert "openstack: l3_agent: introduce dmz_cidr-only l3 agent custom hack" [puppet] - https://gerrit.wikimedia.org/r/634102 (owner: Arturo Borrero Gonzalez) [12:39:30] (PS10) Hashar: Explicitly mentions the repository in scap::sources [puppet] -
https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) [12:40:24] (CR) Hashar: "rebase to fix conflicts with 9458ff093a581fe4f6f4db06a7ebaa9c75b85e30 which removed electron, mobileapps and proton." [puppet] - https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) (owner: Hashar) [12:40:32] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) (owner: Hashar) [12:40:39] (PS1) Jbond: prometheus::nic_saturation_exporter: add ability to listen on specific addr [puppet] - https://gerrit.wikimedia.org/r/634220 (https://phabricator.wikimedia.org/T265587) [12:40:41] (CR) BBlack: [C: +2] gdnsd: raise DoTLS timeout and conn tuneables [puppet] - https://gerrit.wikimedia.org/r/634217 (owner: BBlack) [12:40:43] (PS1) Jbond: P:prometheus::nic_saturation_exporter: configure listen address to primary ip [puppet] - https://gerrit.wikimedia.org/r/634221 (https://phabricator.wikimedia.org/T265587) [12:41:34] (CR) Arturo Borrero Gonzalez: [C: +2] "PCC https://puppet-compiler.wmflabs.org/compiler1001/25910/" [puppet] - https://gerrit.wikimedia.org/r/634219 (https://phabricator.wikimedia.org/T247505) (owner: Arturo Borrero Gonzalez) [12:42:53] (PS8) Hashar: scap::sources stop assuming mediawiki/services as a prefix [puppet] - https://gerrit.wikimedia.org/r/610267 (https://phabricator.wikimedia.org/T257413) [12:43:09] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/610267 (https://phabricator.wikimedia.org/T257413) (owner: Hashar) [12:44:22] (CR) Elukey: sre.druid.roll-restart-workers: allow to restart only a subset of daemons (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/634193 (owner: Elukey) [12:45:36] (PS2) Elukey: sre.druid.roll-restart-workers: allow to restart only a subset of daemons [cookbooks] - https://gerrit.wikimedia.org/r/634193
[12:46:11] (CR) Hashar: "Compiler https://puppet-compiler.wmflabs.org/compiler1003/580/" [puppet] - https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) (owner: Hashar) [12:49:59] (PS4) Hashar: doc: switch to scap DocumentRoot [puppet] - https://gerrit.wikimedia.org/r/620368 (https://phabricator.wikimedia.org/T149924) [12:50:01] (PS2) Hashar: doc: relocate published documents to /srv/doc [puppet] - https://gerrit.wikimedia.org/r/625644 (https://phabricator.wikimedia.org/T149924) [12:50:03] (PS2) Hashar: doc: stop backup for old doc directory [puppet] - https://gerrit.wikimedia.org/r/625649 (https://phabricator.wikimedia.org/T149924) [12:50:05] (PS2) Hashar: doc: remove legacy doc directory [puppet] - https://gerrit.wikimedia.org/r/625650 (https://phabricator.wikimedia.org/T149924) [12:50:26] (PS1) Arturo Borrero Gonzalez: neutron: add option to disable customizations if cloudgw is enabled [puppet] - https://gerrit.wikimedia.org/r/634223 (https://phabricator.wikimedia.org/T261724) [12:51:01] (CR) jerkins-bot: [V: -1] doc: switch to scap DocumentRoot [puppet] - https://gerrit.wikimedia.org/r/620368 (https://phabricator.wikimedia.org/T149924) (owner: Hashar) [12:51:11] (CR) jerkins-bot: [V: -1] doc: relocate published documents to /srv/doc [puppet] - https://gerrit.wikimedia.org/r/625644 (https://phabricator.wikimedia.org/T149924) (owner: Hashar) [12:51:43] (PS1) Nikerabbit: Stop defining wmgULSCompactLinksForNewAccounts and wmgULSCompactLinksEnableAnon [mediawiki-config] - https://gerrit.wikimedia.org/r/634224 [12:51:46] (PS2) Jbond: prometheus::nic_saturation_exporter: add ability to listen on specific addr [puppet] - https://gerrit.wikimedia.org/r/634220 (https://phabricator.wikimedia.org/T265587) [12:53:10] (PS5) Hashar: doc: switch to scap DocumentRoot [puppet] - https://gerrit.wikimedia.org/r/620368 (https://phabricator.wikimedia.org/T149924) [12:53:12]
(PS3) Hashar: doc: relocate published documents to /srv/doc [puppet] - https://gerrit.wikimedia.org/r/625644 (https://phabricator.wikimedia.org/T149924) [12:53:14] (PS3) Hashar: doc: stop backup for old doc directory [puppet] - https://gerrit.wikimedia.org/r/625649 (https://phabricator.wikimedia.org/T149924) [12:53:16] (PS3) Hashar: doc: remove legacy doc directory [puppet] - https://gerrit.wikimedia.org/r/625650 (https://phabricator.wikimedia.org/T149924) [12:53:49] (Abandoned) Hashar: Run integration tests on CI [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/621762 (https://phabricator.wikimedia.org/T261098) (owner: Hashar) [12:54:03] PROBLEM - Number of messages locally queued by purged for processing on cp3056 is CRITICAL: cluster=cache_text instance=cp3056 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [12:54:11] PROBLEM - Number of messages locally queued by purged for processing on cp3052 is CRITICAL: cluster=cache_text instance=cp3052 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3052 [12:54:17] PROBLEM - Number of messages locally queued by purged for processing on cp1079 is CRITICAL: cluster=cache_text instance=cp1079 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [12:54:25] PROBLEM - Number of messages locally queued by purged for processing on cp1077 is CRITICAL: cluster=cache_text instance=cp1077 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077
[12:54:29] PROBLEM - Number of messages locally queued by purged for processing on cp1087 is CRITICAL: cluster=cache_text instance=cp1087 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [12:54:43] PROBLEM - Number of messages locally queued by purged for processing on cp5009 is CRITICAL: cluster=cache_text instance=cp5009 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [12:54:51] PROBLEM - Number of messages locally queued by purged for processing on cp5007 is CRITICAL: cluster=cache_text instance=cp5007 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [12:54:52] (03PS2) 10Arturo Borrero Gonzalez: neutron: add option to disable customizations if cloudgw is enabled [puppet] - 10https://gerrit.wikimedia.org/r/634223 (https://phabricator.wikimedia.org/T261724) [12:55:21] PROBLEM - Number of messages locally queued by purged for processing on cp3050 is CRITICAL: cluster=cache_text instance=cp3050 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [12:55:33] PROBLEM - Number of messages locally queued by purged for processing on cp3054 is CRITICAL: cluster=cache_text instance=cp3054 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [12:56:07] RECOVERY - Number of messages locally queued by purged for processing on cp1077 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1077 [12:56:11] RECOVERY - Number of messages locally queued by purged for processing on cp1087 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1087 [12:56:23] RECOVERY - Number of messages locally queued by purged for processing on cp5009 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009 [12:57:10] (03PS3) 10Arturo Borrero Gonzalez: neutron: add option to disable customizations if cloudgw is enabled [puppet] - 10https://gerrit.wikimedia.org/r/634223 (https://phabricator.wikimedia.org/T261724) [12:57:15] PROBLEM - Number of messages locally queued by purged for processing on cp3060 is CRITICAL: cluster=cache_text instance=cp3060 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [12:57:44] (03CR) 10Jbond: "LGTM but see question inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634050 (owner: 10Arturo Borrero Gonzalez) [12:58:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/25914/" [puppet] - 10https://gerrit.wikimedia.org/r/634223 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [12:58:15] RECOVERY - Number of messages locally queued by purged for processing on cp5007 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5007 [12:58:39] !log gilles@deploy1001 Started deploy [performance/navtiming@dff55f8]: (no justification provided) [12:58:44] !log gilles@deploy1001 Finished deploy [performance/navtiming@dff55f8]: (no justification provided) (duration: 00m 05s) [12:58:45] RECOVERY - Number of messages locally queued by purged for processing on cp3050 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [12:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:46] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/634220 (https://phabricator.wikimedia.org/T265587) (owner: 10Jbond) [12:58:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:55] RECOVERY - Number of messages locally queued by purged for processing on cp3054 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [12:58:59] RECOVERY - Number of messages locally queued by purged for processing on cp3060 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [12:59:15] RECOVERY - Number of messages locally queued by purged for processing on cp3052 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3052 [12:59:43] (03PS2) 10Jbond: P:prometheus::nic_saturation_exporter: configure listen address to primary ip [puppet] - 10https://gerrit.wikimedia.org/r/634221 (https://phabricator.wikimedia.org/T265587) [13:00:49] RECOVERY - Number of messages locally queued by purged for processing on cp3056 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [13:02:47] RECOVERY - Number of messages locally queued by purged for processing on cp1079 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [13:03:59] (03CR) 10CDanis: [C: 03+1] "looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/634220 (https://phabricator.wikimedia.org/T265587) (owner: 10Jbond) [13:06:26] (03CR) 10Jbond: "Ready" [puppet] - 10https://gerrit.wikimedia.org/r/634221 (https://phabricator.wikimedia.org/T265587) (owner: 10Jbond) [13:09:08] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/634023 (https://phabricator.wikimedia.org/T262647) (owner: 10Muehlenhoff) [13:10:55] PROBLEM - Number of messages locally queued by purged for processing on cp3054 is CRITICAL: cluster=cache_text instance=cp3054 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [13:11:21] PROBLEM - Number of messages locally queued by purged for processing on cp1079 is CRITICAL: cluster=cache_text instance=cp1079 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts 
https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [13:12:25] PROBLEM - Number of messages locally queued by purged for processing on cp3050 is CRITICAL: cluster=cache_text instance=cp3050 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [13:12:26] (03CR) 10Jbond: "Looks fine but see comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633825 (owner: 10Dzahn) [13:12:31] (03CR) 10Elukey: [C: 03+1] "Really nice thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/634208 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [13:12:41] PROBLEM - Number of messages locally queued by purged for processing on cp3060 is CRITICAL: cluster=cache_text instance=cp3060 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [13:12:51] PROBLEM - Number of messages locally queued by purged for processing on cp3056 is CRITICAL: cluster=cache_text instance=cp3056 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [13:12:59] PROBLEM - Number of messages locally queued by purged for processing on cp3052 is CRITICAL: cluster=cache_text instance=cp3052 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3052 [13:13:27] (03CR) 10CDanis: "Can you add some LVS hosts to the PCC? 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/634221 (https://phabricator.wikimedia.org/T265587) (owner: 10Jbond) [13:14:23] RECOVERY - Number of messages locally queued by purged for processing on cp3060 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [13:14:33] RECOVERY - Number of messages locally queued by purged for processing on cp3056 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [13:15:59] (03CR) 10CDanis: [C: 03+2] bump FNM mbps threshold [puppet] - 10https://gerrit.wikimedia.org/r/634115 (owner: 10CDanis) [13:17:11] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/634221 (https://phabricator.wikimedia.org/T265587) (owner: 10Jbond) [13:17:33] RECOVERY - Number of messages locally queued by purged for processing on cp3050 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050 [13:19:27] PROBLEM - Number of messages locally queued by purged for processing on cp3054 is CRITICAL: cluster=cache_text instance=cp3054 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [13:19:47] PROBLEM - Number of messages locally queued by purged for processing on cp3052 is CRITICAL: cluster=cache_text instance=cp3052 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3052 [13:22:55] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/633827 (owner: 10Dzahn) [13:23:03] PROBLEM - Number of messages locally queued by purged for processing on cp3056 is CRITICAL: cluster=cache_text instance=cp3056 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [13:23:20] (03CR) 10Kosta Harlan: labs: Disable EditorJourney (UnderstandingFirstDay) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633514 (https://phabricator.wikimedia.org/T252391) (owner: 10Kosta Harlan) [13:26:15] RECOVERY - Number of messages locally queued by purged for processing on cp3054 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [13:26:27] RECOVERY - Number of messages locally queued by purged for processing on cp3056 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [13:28:17] RECOVERY - Number of messages locally queued by purged for processing on cp3052 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3052 [13:30:03] RECOVERY - Number of messages locally queued by purged for processing on cp1079 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [13:35:33] Was there any recent deploy with CheckUserLog stuff? [13:36:53] (03CR) 10CDanis: [C: 03+1] P:prometheus::nic_saturation_exporter: configure listen address to primary ip [puppet] - 10https://gerrit.wikimedia.org/r/634221 (https://phabricator.wikimedia.org/T265587) (owner: 10Jbond) [13:37:08] I guess CUlog stuff should go to private bug? [13:41:02] (03CR) 10Jbond: "nice TIL @preserve; lg for sretest however see comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634192 (owner: 10Alexandros Kosiaris) [13:41:49] Hello, probably unbreak-now regression: https://phabricator.wikimedia.org/T265606 [13:43:03] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/620899 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [13:44:04] (03CR) 10Kormat: [C: 03+2] mariadb: core::multiinstance - generate sections. [puppet] - 10https://gerrit.wikimedia.org/r/620899 (https://phabricator.wikimedia.org/T257033) (owner: 10Jcrespo) [13:44:51] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/634034 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [13:45:36] (03CR) 10Kormat: [C: 03+2] mariadb: misc::multiinstance - generate sections. 
[puppet] - 10https://gerrit.wikimedia.org/r/634034 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [13:45:41] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/634042 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [13:47:50] (03CR) 10Kormat: [C: 03+2] mariadb: sanitarium_multiinstance - generate sections. [puppet] - 10https://gerrit.wikimedia.org/r/634042 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [13:50:17] (03CR) 10Jbond: "this is perhaps still WIP either-way see inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634044 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [13:51:32] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/634208 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [13:54:55] (03CR) 10Kormat: mariadb: dbstore_multiinstance - generate sections. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634044 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [14:01:08] (03PS4) 10Kormat: mariadb: dbstore_multiinstance - generate sections. 
[puppet] - 10https://gerrit.wikimedia.org/r/634044 (https://phabricator.wikimedia.org/T256972) [14:01:58] (03CR) 10Jbond: [C: 03+1] "LGTM assuming PCC is good" [puppet] - 10https://gerrit.wikimedia.org/r/634044 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [14:02:47] (03CR) 10Milimetric: role::druid::public::worker: increase connection pool on historicals (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634191 (https://phabricator.wikimedia.org/T226035) (owner: 10Elukey) [14:03:59] (03CR) 10Jbond: [C: 03+2] prometheus::nic_saturation_exporter: add ability to listen on specific addr [puppet] - 10https://gerrit.wikimedia.org/r/634220 (https://phabricator.wikimedia.org/T265587) (owner: 10Jbond) [14:04:04] (03CR) 10Jbond: [C: 03+2] P:prometheus::nic_saturation_exporter: configure listen address to primary ip [puppet] - 10https://gerrit.wikimedia.org/r/634221 (https://phabricator.wikimedia.org/T265587) (owner: 10Jbond) [14:04:55] jayme: given you're ops on duty, has there been anything recently deployed? I don't know how to properly read the logs [14:06:47] https://www.mediawiki.org/wiki/MediaWiki_1.36/wmf.13 might help? [14:07:58] revi: thanks [14:08:01] what exactly are you looking for? [14:08:03] Asking if "anything" has been deployed isn't helpful [14:08:23] PROBLEM - Check systemd state on mw2279 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:29] Numerous things get deployed throughout the week. From service changes, mediawiki code, mediawiki config... [14:08:31] probably something that might have caused T265606? [14:08:36] AmandaNP: apart from that https://sal.toolforge.org/ could give a clue. But anything is very unspecific [14:08:41] PROBLEM - Check systemd state on db2119 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:45] PROBLEM - Check systemd state on db1124 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:45] PROBLEM - Check systemd state on an-worker1085 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:49] sorry, yes what revi said [14:08:49] PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:55] PROBLEM - Check systemd state on analytics1066 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:55] PROBLEM - Check systemd state on ganeti1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:59] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:59] PROBLEM - Check systemd state on wtp1029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:01] PROBLEM - Check systemd state on db1074 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:03] PROBLEM - Check systemd state on lvs1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:05] PROBLEM - Check systemd state on druid1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:09] PROBLEM - Check systemd state on ganeti2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:13] PROBLEM - Check systemd state on mw2316 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:23] PROBLEM - Check systemd state on db2110 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:28] revi: AmandaNP: jayme: https://sal.toolforge.org/log/n-8sKXUBhxWNv8gI7edH possibly [14:09:29] PROBLEM - Check systemd state on mc1021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:31] PROBLEM - Check systemd state on db1123 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:31] PROBLEM - Check systemd state on db2138 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:31] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:31] PROBLEM - Check systemd state on an-worker1114 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:33] PROBLEM - Check systemd state on dbproxy1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:33] PROBLEM - Check systemd state on ganeti2024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:33] PROBLEM - Check systemd state on druid1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:35] PROBLEM - Check systemd state on elastic1035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:37] PROBLEM - Check systemd state on mw2331 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:37] PROBLEM - Check systemd state on elastic2050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:39] PROBLEM - Check systemd state on cp2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:39] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:41] PROBLEM - Check systemd state on mw2262 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:41] PROBLEM - Check systemd state on ganeti2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:43] PROBLEM - Check systemd state on mc2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:45] PROBLEM - Check systemd state on mw1400 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:45] PROBLEM - Check systemd state on restbase1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:47] PROBLEM - Check systemd state on wtp1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:49] PROBLEM - Check systemd state on mw2366 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:51] uh. [14:09:53] PROBLEM - Check systemd state on mw1350 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:55] looks like it came through https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201014T1900 [14:09:55] PROBLEM - Check systemd state on db2123 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:01] PROBLEM - Check systemd state on ganeti1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:03] PROBLEM - Check systemd state on ms-be1021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:03] PROBLEM - Check systemd state on mw2313 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:03] PROBLEM - Check systemd state on mc2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:04] not that we can read anything given the spam [14:10:05] PROBLEM - Check systemd state on mw2286 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:09] PROBLEM - Check systemd state on mw2293 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:09] PROBLEM - Check systemd state on db1120 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:11] PROBLEM - Check systemd state on dumpsdata1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:11] PROBLEM - Check systemd state on parse2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:13] PROBLEM - Check systemd state on aqs1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:13] PROBLEM - Check systemd state on aqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:17] PROBLEM - Check systemd state on ganeti1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:17] PROBLEM - Check systemd state on wtp1035 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:19] PROBLEM - Check systemd state on db1139 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:21] PROBLEM - Check systemd state on an-worker1113 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:21] PROBLEM - Check systemd state on restbase2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:23] PROBLEM - Check systemd state on analytics1074 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:23] PROBLEM - Check systemd state on cp1076 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:27] PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:29] PROBLEM - Check systemd state on mwlog2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:29] PROBLEM - Check systemd state on conf1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:37] PROBLEM - Check systemd state on cp2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:37] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:41] PROBLEM - Check systemd state on db2082 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:41] PROBLEM - Check systemd state on mw1411 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:41] PROBLEM - Check systemd state on ms-be2027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:41] PROBLEM - Check systemd state on elastic1056 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:43] PROBLEM - Check systemd state on db2125 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:45] PROBLEM - Check systemd state on mw1399 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:45] PROBLEM - Check systemd state on mw1357 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:49] PROBLEM - Check systemd state on mw2283 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:51] PROBLEM - Check systemd state on an-worker1117 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:51] PROBLEM - Check systemd state on mw2374 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:53] PROBLEM - Check systemd state on an-worker1095 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:53] moving to #wikimedia-sre [14:10:56] PROBLEM - Check systemd state on wtp2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:01] PROBLEM - Check systemd state on rdb1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:09] PROBLEM - Check systemd state on mw1297 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:09] PROBLEM - Check systemd state on mw2259 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:09] PROBLEM - Check systemd state on db2087 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:11] PROBLEM - Check systemd state on mw2236 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:11] PROBLEM - Check systemd state on db1122 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:13] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:15] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:19] PROBLEM - Check systemd state on sretest1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:19] PROBLEM - Check systemd state on an-worker1099 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:21] PROBLEM - Check systemd state on mw1277 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:21] PROBLEM - Check systemd state on thorium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:21] PROBLEM - Check systemd state on snapshot1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:23] PROBLEM - Check systemd state on htmldumper1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:23] PROBLEM - Check systemd state on dbproxy1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:25] PROBLEM - Check systemd state on db1088 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:27] PROBLEM - Check systemd state on analytics1076 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:29] PROBLEM - Check systemd state on dns4002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:29] PROBLEM - Check systemd state on ms-be2049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:29] PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:29] PROBLEM - Check systemd state on cp2042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:35] PROBLEM - Check systemd state on an-worker1116 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:35] PROBLEM - Check systemd state on parse2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:37] PROBLEM - Check systemd state on parse2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:37] PROBLEM - Check systemd state on mw2250 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:39] PROBLEM - Check systemd state on dbproxy1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:41] PROBLEM - Check systemd state on elastic1046 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:41] PROBLEM - Check systemd state on cp2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:42] <_joe_> nic-saturation-exporter.service [14:11:43] PROBLEM - Check systemd state on lvs4005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:44] <_joe_> it seems [14:11:45] PROBLEM - Check systemd state on cp3060 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:45] PROBLEM - Check systemd state on kubernetes2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:47] PROBLEM - Check systemd state on mw1378 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:49] PROBLEM - Check systemd state on wtp1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:49] PROBLEM - Check systemd state on mw2334 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:51] PROBLEM - Check systemd state on mw2230 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:51] PROBLEM - Check systemd state on restbase1026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:52] jbond42: ^^ [14:11:53] PROBLEM - Check systemd state on logstash1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:54] _joe_: yes, jbond42 is on it [14:11:55] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:55] PROBLEM - Check systemd state on an-worker1108 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:55] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:01] PROBLEM - Check systemd state on ms-be1045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:05] PROBLEM - Check systemd state on thanos-be2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:09] PROBLEM - Check systemd state on kubernetes2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:09] PROBLEM - Check systemd state on dbprov1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:11] PROBLEM - Check systemd state on authdns1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:11] PROBLEM - Check systemd state on backup1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:11] PROBLEM - Check systemd state on ms-be1052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:17] PROBLEM - Check systemd state on mw1331 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:21] PROBLEM - Check systemd state on wtp2014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:21] PROBLEM - Check systemd state on mw2320 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:23] PROBLEM - Check systemd state on ms-be1048 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:23] PROBLEM - Check systemd state on mw1267 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:23] PROBLEM - Check systemd state on kafka-jumbo1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:25] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:27] PROBLEM - Check systemd state on ms-be1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:29] PROBLEM - Check systemd state on ms-be2054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:31] PROBLEM - Check systemd state on mw1388 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:31] PROBLEM - Check systemd state on ganeti1015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:33] PROBLEM - Check systemd state on cp1079 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:33] PROBLEM - Check systemd state on mw1313 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:35] PROBLEM - Check systemd state on mw1386 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:35] PROBLEM - Check systemd state on pc1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:39] PROBLEM - Check systemd state on db1101 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:41] PROBLEM - Check systemd state on cp4027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:41] PROBLEM - Check systemd state on elastic1066 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:45] PROBLEM - Check systemd state on mw2215 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:45] PROBLEM - Check systemd state on ms-be2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:47] PROBLEM - Check systemd state on elastic1052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:49] PROBLEM - Check systemd state on logstash2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:49] PROBLEM - Check systemd state on ganeti2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:49] PROBLEM - Check systemd state on ms-fe1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:53] PROBLEM - Check systemd state on mw2238 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:53] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:53] PROBLEM - Check systemd state on an-worker1102 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:55] PROBLEM - Check systemd state on mw2226 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:57] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:59] PROBLEM - Check systemd state on ms-be1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:59] PROBLEM - Check systemd state on ganeti2018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:01] PROBLEM - Check systemd state on mc1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:03] PROBLEM - Check systemd state on ganeti2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:05] PROBLEM - Check systemd state on db2103 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:05] PROBLEM - Check systemd state on cp1087 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:07] PROBLEM - Check systemd state on elastic1057 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:11] PROBLEM - Check systemd state on mw2367 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:15] PROBLEM - Check systemd state on puppetmaster2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:15] PROBLEM - Check systemd state on mw2275 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:15] PROBLEM - Check systemd state on mc2036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:17] PROBLEM - Check systemd state on db2076 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:19] PROBLEM - Check systemd state on kubernetes1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:21] PROBLEM - Check systemd state on ganeti4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:21] (PS1) Jbond: prometheus-nic-saturation-exporter: fix param name listen vs addr [puppet] - https://gerrit.wikimedia.org/r/634232 [14:13:23] PROBLEM - Check systemd state on analytics1054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:23] PROBLEM - Check systemd state on ganeti1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:23] PROBLEM - Check systemd state on restbase1021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:31] PROBLEM - Check systemd state on mc1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:33] PROBLEM - Check systemd state on wtp1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:35] PROBLEM - Check systemd state on wtp2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:35] PROBLEM - Check systemd state on ganeti2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:35] PROBLEM - Check systemd state on ms-be2056 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:39] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:43] PROBLEM - Check systemd state on parse2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:45] PROBLEM - Check systemd state on ores2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:47] PROBLEM - Check systemd state on ganeti3002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:47] PROBLEM - Check systemd state on elastic2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:49] PROBLEM - Check systemd state on lvs5002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:53] PROBLEM - Check systemd state on db2121 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:55] PROBLEM - Check systemd state on db1143 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:55] PROBLEM - Check systemd state on restbase2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:55] PROBLEM - Check systemd state on kubernetes1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:59] PROBLEM - Check systemd state on prometheus2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:01] PROBLEM - Check systemd state on mw1389 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:03] (CR) Kormat: [C: +1] prometheus-nic-saturation-exporter: fix param name listen vs addr [puppet] - https://gerrit.wikimedia.org/r/634232 (owner: Jbond) [14:14:03] PROBLEM - Check systemd state on analytics1058 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:03] PROBLEM - Check systemd state on cp3065 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:03] PROBLEM - Check systemd state on db2112 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:07] PROBLEM - Check systemd state on ms-be1042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:07] PROBLEM - Check systemd state on mw2375 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:09] PROBLEM - Check systemd state on db1110 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:09] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:09] PROBLEM - Check systemd state on ms-be2042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:11] PROBLEM - Check systemd state on elastic1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:13] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:15] PROBLEM - Check systemd state on mc2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:18] (CR) Jbond: [C: +2] prometheus-nic-saturation-exporter: fix param name listen vs addr [puppet] - https://gerrit.wikimedia.org/r/634232 (owner: Jbond) [14:14:21] PROBLEM - Check systemd state on db1146 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:21] PROBLEM - Check systemd state on db1109 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:23] PROBLEM - Check systemd state on elastic1042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:23] PROBLEM - Check systemd state on es1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:23] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:23] PROBLEM - Check systemd state on ms-be1025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:23] PROBLEM - Check systemd state on mw1391 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:25] PROBLEM - Check systemd state on ms-be1059 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:25] PROBLEM - Check systemd state on db2137 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:25] PROBLEM - Check systemd state on druid1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:26] thx kormat [14:14:29] PROBLEM - Check systemd state on ores1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:29] PROBLEM - Check systemd state on an-worker1094 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:29] PROBLEM - Check systemd state on kubernetes1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:31] PROBLEM - Check systemd state on restbase1029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:31] PROBLEM - Check systemd state on db1136 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:33] PROBLEM - Check systemd state on ores2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:33] PROBLEM - Check systemd state on mw2312 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:36] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:36] PROBLEM - Check systemd state on parse2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:36] PROBLEM - Check systemd state on wtp1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:37] PROBLEM - Check systemd state on restbase2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:39] PROBLEM - Check systemd state on mw1321 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:43] PROBLEM - Check systemd state on lvs2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:45] PROBLEM - Check systemd state on mw1406 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:45] PROBLEM - Check systemd state on db2093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:45] PROBLEM - Check systemd state on mw2235 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:47] PROBLEM - Check systemd state on elastic2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:47] PROBLEM - Check systemd state on elastic1065 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:49] PROBLEM - Check systemd state on cp4026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:49] PROBLEM - Check systemd state on mc1033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:55] PROBLEM - Check systemd state on mw1286 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:59] PROBLEM - Check systemd state on mw2352 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:59] PROBLEM - Check systemd state on mc2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:01] PROBLEM - Check systemd state on ganeti2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:01] PROBLEM - Check systemd state on ms-be2057 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:03] PROBLEM - Check systemd state on db2079 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:07] PROBLEM - Check systemd state on mw1344 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:07] PROBLEM - Check systemd state on mw2291 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:09] PROBLEM - Check systemd state on mc2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:09] PROBLEM - Check systemd state on parse2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:09] PROBLEM - Check systemd state on rdb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:09] PROBLEM - Check systemd state on mw1315 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:11] PROBLEM - Check systemd state on cp3054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:11] PROBLEM - Check systemd state on restbase1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:13] PROBLEM - Check systemd state on restbase1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:17] PROBLEM - Check systemd state on db1119 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:19] PROBLEM - Check systemd state on db1103 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:21] PROBLEM - Check systemd state on elastic2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:25] PROBLEM - Check systemd state on mw2294 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:25] PROBLEM - Check systemd state on ganeti5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:29] PROBLEM - Check systemd state on cp1078 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:31] PROBLEM - Check systemd state on mw1410 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:31] PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:33] PROBLEM - Check systemd state on db1081 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:35] PROBLEM - Check systemd state on prometheus2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:35] PROBLEM - Check systemd state on wdqs2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:35] PROBLEM - Check systemd state on parse2019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:36] PROBLEM - Check systemd state on wtp1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:39] PROBLEM - Check systemd state on cp4029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:39] PROBLEM - Check systemd state on lvs5003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:39] PROBLEM - Check systemd state on cp5007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:41] PROBLEM - Check systemd state on elastic1034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:45] PROBLEM - Check systemd state on mw1366 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:45] PROBLEM - Check systemd state on ms-be1041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:47] PROBLEM - Check systemd state on mw2314 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:49] PROBLEM - Check systemd state on rdb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:49] PROBLEM - Check systemd state on db1084 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:51] PROBLEM - Check systemd state on elastic2038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:53] PROBLEM - Check systemd state on db1111 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:55] PROBLEM - Check systemd state on mw2368 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:55] PROBLEM - Check systemd state on restbase2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:55] PROBLEM - Check systemd state on cloudweb2001-dev is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:57] PROBLEM - Check systemd state on kubestage1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:57] PROBLEM - Check systemd state on mc1023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:59] PROBLEM - Check systemd state on an-worker1091 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:59] PROBLEM - Check systemd state on mw2278 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:01] PROBLEM - Check systemd state on elastic2031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:03] PROBLEM - Check systemd state on mw2249 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:03] PROBLEM - Check systemd state on ores2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:03] PROBLEM - Check systemd state on mw2273 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:03] PROBLEM - Check systemd state on mw1290 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:03] PROBLEM - Check systemd state on mw1285 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:03] PROBLEM - Check systemd state on mw2311 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:05] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:05] PROBLEM - Check systemd state on db2124 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:07] PROBLEM - Check systemd state on flerovium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:09] PROBLEM - Check systemd state on mw1405 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:09] PROBLEM - Check systemd state on mw2269 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:09] RECOVERY - Check systemd state on db1146 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:11] RECOVERY - Check systemd state on es1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:11] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:13] PROBLEM - Check systemd state on mw1320 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:13] PROBLEM - Check systemd state on an-worker1090 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:13] PROBLEM - Check systemd state on ms-be2041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:15] PROBLEM - Check systemd state on wtp2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:19] RECOVERY - Check systemd state on restbase1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:19] RECOVERY - Check systemd state on db1136 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:19] PROBLEM - Check systemd state on elastic1044 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:21] PROBLEM - Check systemd state on mw1360 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:21] PROBLEM - Check systemd state on mw2254 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:23] PROBLEM - Check systemd state on elastic1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:25] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:27] PROBLEM - Check systemd state on mc1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:27] PROBLEM - Check systemd state on db2091 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:31] PROBLEM - Check systemd state on db2078 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:33] PROBLEM - Check systemd state on kubernetes1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:35] PROBLEM - Check systemd state on ganeti2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:35] PROBLEM - Check systemd state on mw1330 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:37] PROBLEM - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:39] PROBLEM - Check systemd state on lvs1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:39] RECOVERY - Check systemd state on ganeti2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:39] RECOVERY - Check systemd state on dbproxy1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:39] PROBLEM - Check systemd state on druid1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:39] PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:39] PROBLEM - Check systemd state on pc2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:41] PROBLEM - Check systemd state on ores1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:41] PROBLEM - Check systemd state on ganeti1017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:41] PROBLEM - Check systemd state on mw1271 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:43] PROBLEM - Check systemd state on kubernetes2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:43] PROBLEM - Check systemd state on thumbor1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:45] PROBLEM - Check systemd state on wtp1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:45] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:47] PROBLEM - Check systemd state on ms-fe1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:51] PROBLEM - Check systemd state on mw2282 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:51] PROBLEM - Check systemd state on mw2339 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:51] PROBLEM - Check systemd state on restbase2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:51] RECOVERY - Check systemd state on db2076 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:53] PROBLEM - Check systemd state on db1116 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:55] PROBLEM - Check systemd state on mw2306 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:55] PROBLEM - Check systemd state on thanos-be1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:55] PROBLEM - Check systemd state on ganeti2022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:59] RECOVERY - Check systemd state on dbproxy1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:59] PROBLEM - Check systemd state on elastic2025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:59] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:01] PROBLEM - Check systemd state on an-worker1088 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:05] PROBLEM - Check systemd state on cp2041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:07] PROBLEM - Check systemd state on ores1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:07] PROBLEM - Check systemd state on db1125 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:09] PROBLEM - Check systemd state on mw2227 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:09] PROBLEM - Check systemd state on mw1361 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:09] PROBLEM - Check systemd state on conf1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:11] PROBLEM - Check systemd state on kafka-main2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:11] PROBLEM - Check systemd state on mc1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:13] PROBLEM - Check systemd state on wtp2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:13] PROBLEM - Check systemd state on mw2335 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:13] PROBLEM - Check systemd state on mw2351 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:13] PROBLEM - Check systemd state on elastic1058 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:15] PROBLEM - Check systemd state on bast3004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:16] PROBLEM - Check systemd state on mc1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:17] PROBLEM - Check systemd state on db1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:17] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:17] PROBLEM - Check systemd state on stat1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:21] RECOVERY - Check systemd state on ores2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:23] PROBLEM - Check systemd state on mw1333 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:25] RECOVERY - Check systemd state on cp4029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:25] RECOVERY - Check systemd state on db1139 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:27] PROBLEM - Check systemd state on an-worker1092 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:27] PROBLEM - Check systemd state on mw1273 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:29] RECOVERY - Check systemd state on db1143 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:31] RECOVERY - Check systemd state on dbprov1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:31] PROBLEM - Check systemd state on cp1081 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:35] PROBLEM - Check systemd state on mw2307 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:37] PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:37] PROBLEM - Check systemd state on elastic2027 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:39] PROBLEM - Check systemd state on an-worker1097 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:39] PROBLEM - Check systemd state on lvs3005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:41] PROBLEM - Check systemd state on mw2297 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:43] PROBLEM - Check systemd state on mc1029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:45] PROBLEM - Check systemd state on an-worker1081 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:47] PROBLEM - Check systemd state on mw2228 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:49] RECOVERY - Check systemd state on db2082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:49] RECOVERY - Check systemd state on ores2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:49] RECOVERY - Check systemd state on mw1290 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:51] PROBLEM - Check systemd state on kafka-jumbo1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:53] PROBLEM - Check systemd state on mc-gp2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:53] PROBLEM - Check systemd state on mw2219 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:53] PROBLEM - Check systemd state on thanos-fe2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:53] PROBLEM - Check systemd state on kafka-main1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:53] PROBLEM - Check systemd state on cp2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:55] RECOVERY - Check systemd state on mw1313 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:55] RECOVERY - Check systemd state on mw2269 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:55] RECOVERY - Check systemd state on lvs1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:55] PROBLEM - Check systemd state on ms-be1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:57] PROBLEM - Check systemd state on kafka-jumbo1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:57] PROBLEM - Check systemd state on an-worker1111 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:59] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:59] RECOVERY - Check systemd state on mw1320 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:01] PROBLEM - Check systemd state on thumbor2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:01] PROBLEM - Check systemd state on mw1269 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:03] RECOVERY - Check systemd state on cp4027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:03] RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:07] PROBLEM - Check systemd state on mc1025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:07] RECOVERY - Check systemd state on ores2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:09] PROBLEM - Check systemd state on ganeti1021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:09] RECOVERY - Check systemd state on mw2254 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:09] PROBLEM - Check systemd state on analytics1052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:11] PROBLEM - Check systemd state on mw2360 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:13] RECOVERY - Check systemd state on mc1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:13] RECOVERY - Check systemd state on mw1321 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:13] RECOVERY - Check systemd state on db2091 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:15] PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:15] RECOVERY - Check systemd state on mw1297 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:15] PROBLEM - Check systemd state on elastic2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:17] PROBLEM - Check systemd state on db1115 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:17] RECOVERY - Check systemd state on mw2259 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:17] RECOVERY - Check systemd state on db2078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:17] RECOVERY - Check systemd state on db2087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:19] PROBLEM - Check systemd state on elastic2059 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:19] RECOVERY - Check systemd state on db2093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:19] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:21] PROBLEM - Check systemd state on elastic1049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:21] PROBLEM - Check systemd state on mw1268 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:23] RECOVERY - Check systemd state on cp4026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:23] RECOVERY - Check systemd state on mc1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:27] PROBLEM - Check systemd state on ms-be2033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:27] PROBLEM - Check systemd state on db2120 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:27] RECOVERY - Check systemd state on ores1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:35] RECOVERY - Check systemd state on mw2262 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:37] RECOVERY - Check systemd state on mw2282 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:37] RECOVERY - Check systemd state on mw2275 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:37] RECOVERY - Check systemd state on db2079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:41] PROBLEM - Check systemd state on mw2323 is CRITICAL: CRITICAL - degraded: The system is 
operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:41] RECOVERY - Check systemd state on mw1315 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:43] PROBLEM - Check systemd state on kubernetes1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:43] PROBLEM - Check systemd state on ganeti1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:45] RECOVERY - Check systemd state on mw2250 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:45] PROBLEM - Check systemd state on puppetmaster1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:49] PROBLEM - Check systemd state on cp5004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:51] RECOVERY - Check systemd state on mc1034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:53] RECOVERY - Check systemd state on ores1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:57] PROBLEM - Check systemd state on ms-be2043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:59] PROBLEM - Check systemd state on dns4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:59] PROBLEM - Check systemd state on cp3059 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:18:59] RECOVERY - Check systemd state on mw2286 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:01] PROBLEM - Check systemd state on ms-be1050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:03] RECOVERY - Check systemd state on mw2279 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:09] PROBLEM - Check systemd state on db2113 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:27] RECOVERY - Check systemd state on an-worker1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:29] PROBLEM - Check systemd state on analytics1075 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:31] PROBLEM - Check systemd state on mw2292 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:31] PROBLEM - Check systemd state on mc2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:31] RECOVERY - Check systemd state on an-master1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:31] PROBLEM - Check systemd state on ganeti1019 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:31] PROBLEM - Check systemd state on db1076 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:33] RECOVERY - Check systemd state on mc1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:35] RECOVERY - Check systemd state on an-worker1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:35] RECOVERY - Check systemd state on mw2278 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:35] PROBLEM - Check systemd state on db1098 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:37] PROBLEM - Check systemd state on elastic1060 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:39] RECOVERY - Check systemd state on ganeti1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:39] RECOVERY - Check systemd state on ms-be2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:39] RECOVERY - Check systemd state on ms-be2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:39] RECOVERY - Check systemd state on mw2273 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:41] RECOVERY - Check systemd state on mc-gp2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:43] RECOVERY - Check systemd state on ganeti1015 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:43] PROBLEM - Check systemd state on lvs2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:45] PROBLEM - Check systemd state on mw1338 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:47] RECOVERY - Check systemd state on mw2283 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:51] RECOVERY - Check systemd state on ganeti2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:55] RECOVERY - Check systemd state on mw2215 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:57] RECOVERY - Check systemd state on ganeti1021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:57] RECOVERY - Check systemd state on ms-be2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:59] RECOVERY - Check systemd state on ganeti2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:19:59] RECOVERY - Check systemd state on logstash2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:09] RECOVERY - Check systemd state on ganeti2018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:09] RECOVERY - Check systemd state on ganeti2016 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:11] RECOVERY - Check systemd state on thanos-fe1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:15] RECOVERY - Check systemd state on ganeti2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:15] RECOVERY - Check systemd state on sretest1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:17] RECOVERY - Check systemd state on ganeti1017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:19] RECOVERY - Check systemd state on thorium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:19] RECOVERY - Check systemd state on snapshot1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:21] RECOVERY - Check systemd state on dbproxy1020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:23] RECOVERY - Check systemd state on ganeti2023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:25] RECOVERY - Check systemd state on ganeti2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:25] RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:31] RECOVERY - Check systemd state on thanos-be1001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:31] RECOVERY - Check systemd state on ganeti1018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:31] RECOVERY - Check systemd state on ganeti1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:33] RECOVERY - Check systemd state on ganeti2022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:35] RECOVERY - Check systemd state on elastic2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:35] RECOVERY - Check systemd state on an-worker1088 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:41] RECOVERY - Check systemd state on ganeti1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:43] RECOVERY - Check systemd state on ganeti2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:46] PROBLEM - Check systemd state on ms-be2034 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:46] RECOVERY - Check systemd state on dns4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:46] RECOVERY - Check systemd state on ganeti5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:51] RECOVERY - Check systemd state on parse2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:51] RECOVERY - Check systemd state on stat1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:51] RECOVERY - Check systemd state on dumpsdata1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:55] RECOVERY - Check systemd state on parse2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:55] RECOVERY - Check systemd state on elastic2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:55] RECOVERY - Check systemd state on parse2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:57] RECOVERY - Check systemd state on ganeti1020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:01] RECOVERY - Check systemd state on thanos-be2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:09] RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:11] RECOVERY - Check systemd state on elastic2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:11] !log imported doxygen_1.8.19-1~deb10+wmf1 to component/ci buster-wikimedia - T265579 [14:21:15] RECOVERY - Check systemd state on ganeti1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:18] T265579: Build and release Doxygen 1.8.19 to apt.wikimedia.org - https://phabricator.wikimedia.org/T265579 [14:21:21] RECOVERY - Check systemd state on db2125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:23] RECOVERY - Check systemd state on db2124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:23] RECOVERY - Check systemd state on thanos-fe2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:25] RECOVERY - Check systemd state on flerovium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:29] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:29] RECOVERY - Check systemd state on druid1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:29] RECOVERY - Check systemd state on pc1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:31] RECOVERY - Check systemd state on druid1006 is 
OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:31] RECOVERY - Check systemd state on thumbor2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:39] RECOVERY - Check systemd state on analytics1052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:41] RECOVERY - Check systemd state on parse2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:43] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:47] RECOVERY - Check systemd state on db2110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:57] RECOVERY - Check systemd state on druid1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:57] RECOVERY - Check systemd state on druid1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:57] RECOVERY - Check systemd state on analytics1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:57] RECOVERY - Check systemd state on pc2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:57] RECOVERY - Check systemd state on db2103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:57] RECOVERY - Check systemd state on ms-be2033 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:57] RECOVERY - Check systemd state on db2120 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:05] RECOVERY - Check systemd state on analytics1076 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:07] RECOVERY - Check systemd state on dns4002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:11] RECOVERY - Check systemd state on mw2291 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:11] RECOVERY - Check systemd state on parse2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:13] RECOVERY - Check systemd state on analytics1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:13] RECOVERY - Check systemd state on parse2013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:15] RECOVERY - Check systemd state on parse2015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:19] RECOVERY - Check systemd state on db2123 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:25] RECOVERY - Check systemd state on ms-be2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:26] RECOVERY - Check systemd state on mw2294 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:27] 
RECOVERY - Check systemd state on ms-be2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:31] RECOVERY - Check systemd state on mw2293 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:37] RECOVERY - Check systemd state on db2113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:39] RECOVERY - Check systemd state on mw1333 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:43] RECOVERY - Check systemd state on db2121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:45] RECOVERY - Check systemd state on analytics1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:49] RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:49] RECOVERY - Check systemd state on db2119 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:53] RECOVERY - Check systemd state on analytics1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:53] RECOVERY - Check systemd state on mw1331 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:53] RECOVERY - Check systemd state on db2112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:55] RECOVERY - Check systemd state on analytics1075 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:55] RECOVERY - Check systemd state on mw2297 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:55] RECOVERY - Check systemd state on mw2292 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:59] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:59] RECOVERY - Check systemd state on ms-be2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:03] RECOVERY - Check systemd state on analytics1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:05] RECOVERY - Check systemd state on mw2311 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:09] RECOVERY - Check systemd state on mw1338 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:13] RECOVERY - Check systemd state on ms-be2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:21] RECOVERY - Check systemd state on mw2312 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:21] RECOVERY - Check systemd state on mw2316 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:35] RECOVERY - Check systemd state on ms-be1020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[14:23:35] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:37] RECOVERY - Check systemd state on mw1330 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:49] RECOVERY - Check systemd state on ms-be2057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:51] RECOVERY - Check systemd state on ms-be2049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:53] RECOVERY - Check systemd state on mw1344 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:55] RECOVERY - Check systemd state on mw2306 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:56] RECOVERY - Check systemd state on ganeti4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:59] RECOVERY - Check systemd state on mw1350 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:09] RECOVERY - Check systemd state on ms-be1021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:09] RECOVERY - Check systemd state on mw1361 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:11] RECOVERY - Check systemd state on ms-be2056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:11] RECOVERY - Check systemd state on mw2313 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:13] RECOVERY - Check systemd state on logstash1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:21] RECOVERY - Check systemd state on ganeti3002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:25] RECOVERY - Check systemd state on an-worker1092 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:27] RECOVERY - Check systemd state on an-worker1113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:33] RECOVERY - Check systemd state on mw2314 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:33] RECOVERY - Check systemd state on mw2307 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:37] RECOVERY - Check systemd state on an-worker1097 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:39] RECOVERY - Check systemd state on lvs3005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:41] RECOVERY - Check systemd state on cloudweb2001-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:43] RECOVERY - Check systemd state on mw2320 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:43] RECOVERY - Check systemd state on an-worker1091 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:43] RECOVERY - Check systemd state on kafka-jumbo1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:47] RECOVERY - Check systemd state on kafka-jumbo1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:49] RECOVERY - Check systemd state on ms-be2054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:51] RECOVERY - Check systemd state on mw1357 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:51] RECOVERY - Check systemd state on lvs2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:55] RECOVERY - Check systemd state on kafka-jumbo1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:55] RECOVERY - Check systemd state on an-worker1111 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:55] RECOVERY - Check systemd state on ms-be1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:55] RECOVERY - Check systemd state on an-worker1117 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:56] RECOVERY - Check systemd state on an-worker1095 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:24:57] RECOVERY - Check systemd state on an-worker1090 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:01] RECOVERY - Check systemd state on an-worker1094 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:03] RECOVERY - Check systemd state on mw1360 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:11] RECOVERY - Check systemd state on an-worker1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:13] RECOVERY - Check systemd state on lvs2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:21] RECOVERY - Check systemd state on lvs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:21] RECOVERY - Check systemd state on an-worker1114 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:23] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:25] RECOVERY - Check systemd state on an-worker1099 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:27] RECOVERY - Check systemd state on thumbor1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:27] RECOVERY - Check systemd state on htmldumper1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:33] RECOVERY - Check systemd state on puppetmaster2003 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:39] RECOVERY - Check systemd state on rdb2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:39] RECOVERY - Check systemd state on an-worker1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:43] RECOVERY - Check systemd state on puppetmaster1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:49] RECOVERY - Check systemd state on lvs4005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:51] RECOVERY - Check systemd state on elastic2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:57] RECOVERY - Check systemd state on an-worker1108 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:59] (CR) Herron: [C: +1] "sgtm" [puppet] - https://gerrit.wikimedia.org/r/633972 (https://phabricator.wikimedia.org/T261281) (owner: Filippo Giunchedi) [14:25:59] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:03] RECOVERY - Check systemd state on prometheus2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:03] RECOVERY - Check systemd state on wdqs2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:09] RECOVERY - Check systemd state on lvs5002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:09] RECOVERY - Check systemd state on lvs5003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:09] RECOVERY - Check systemd state on elastic1034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:13] RECOVERY - Check systemd state on mw1366 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:15] RECOVERY - Check systemd state on prometheus2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:17] RECOVERY - Check systemd state on db1084 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:17] RECOVERY - Check systemd state on rdb2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:21] RECOVERY - Check systemd state on elastic2038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:23] RECOVERY - Check systemd state on db1076 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:25] RECOVERY - Check systemd state on elastic1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:29] RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:29] RECOVERY - Check systemd state on elastic2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[14:26:33] RECOVERY - Check systemd state on mw1388 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:33] RECOVERY - Check systemd state on db1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:39] RECOVERY - Check systemd state on mw1391 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:41] RECOVERY - Check systemd state on db2137 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:49] RECOVERY - Check systemd state on rdb1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:49] RECOVERY - Check systemd state on ms-fe1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:53] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:57] RECOVERY - Check systemd state on elastic2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:59] RECOVERY - Check systemd state on elastic2059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:01] RECOVERY - Check systemd state on elastic2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:05] RECOVERY - Check systemd state on db2138 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:05] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system 
is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:09] RECOVERY - Check systemd state on elastic1035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:11] RECOVERY - Check systemd state on elastic2050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:13] RECOVERY - Check systemd state on db1088 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:21] RECOVERY - Check systemd state on mw2323 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:35] RECOVERY - Check systemd state on mw1378 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:35] RECOVERY - Check systemd state on conf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:35] RECOVERY - Check systemd state on mw2227 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:35] RECOVERY - Check systemd state on mw2334 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:39] RECOVERY - Check systemd state on mw2230 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:39] RECOVERY - Check systemd state on mw2335 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:41] (CR) Ema: [V: +2 C: +2] 6.0.6-1wm2: clear vut->sighup even if sighup_f is not defined [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/634018 
(https://phabricator.wikimedia.org/T264074) (owner: Ema) [14:27:41] RECOVERY - Check systemd state on db1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:45] RECOVERY - Check systemd state on db1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:57] RECOVERY - Check systemd state on mw1389 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:01] RECOVERY - Check systemd state on conf1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:09] RECOVERY - Check systemd state on cp2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:09] RECOVERY - Check systemd state on db1098 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:11] RECOVERY - Check systemd state on mw2228 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:13] RECOVERY - Check systemd state on mw1399 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:15] RECOVERY - Check systemd state on mw2219 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:15] RECOVERY - Check systemd state on cp1079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:17] RECOVERY - Check systemd state on cp2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:17] RECOVERY - Check systemd state on 
ms-be1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:19] RECOVERY - Check systemd state on mw1386 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:31] RECOVERY - Check systemd state on ms-be1036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:33] RECOVERY - Check systemd state on mw2360 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:34] (CR) Ema: [V: +2 C: +2] Do not explicitly create varnish-dbg [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/634028 (owner: Ema) [14:28:35] RECOVERY - Check systemd state on mw2238 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:39] RECOVERY - Check systemd state on mw2226 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:41] RECOVERY - Check systemd state on mw2235 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:41] RECOVERY - Check systemd state on mw2236 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:53] RECOVERY - Check systemd state on mw2331 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:53] RECOVERY - Check systemd state on mw2367 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:53] RECOVERY - Check systemd state on mw2352 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:55] RECOVERY - Check systemd state on ms-fe1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:55] RECOVERY - Check systemd state on cp2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:59] RECOVERY - Check systemd state on mw2339 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:01] RECOVERY - Check systemd state on cp2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:05] RECOVERY - Check systemd state on mw2366 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:11] RECOVERY - Check systemd state on cp2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:15] RECOVERY - Check systemd state on wtp1028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:15] RECOVERY - Check systemd state on cp2041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:15] RECOVERY - Check systemd state on kubernetes2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:17] RECOVERY - Check systemd state on wtp2019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:17] RECOVERY - Check systemd state on wtp1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:19] RECOVERY - 
Check systemd state on kafka-main2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:21] RECOVERY - Check systemd state on ms-be1050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:21] RECOVERY - Check systemd state on wtp2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:21] RECOVERY - Check systemd state on mw2351 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:21] RECOVERY - Check systemd state on cp1078 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:25] RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:29] RECOVERY - Check systemd state on wtp1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:29] RECOVERY - Check systemd state on ms-be1045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:31] RECOVERY - Check systemd state on wtp1035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:37] RECOVERY - Check systemd state on cp1081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:37] RECOVERY - Check systemd state on kubernetes2013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:37] RECOVERY - Check systemd state on cp1076 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:39] RECOVERY - Check systemd state on ms-be1041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:39] RECOVERY - Check systemd state on ms-be1052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:45] RECOVERY - Check systemd state on db1111 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:45] RECOVERY - Check systemd state on ms-be1042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:49] RECOVERY - Check systemd state on wtp2014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:49] RECOVERY - Check systemd state on restbase2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:49] RECOVERY - Check systemd state on mw2368 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:49] RECOVERY - Check systemd state on db1110 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:49] RECOVERY - Check systemd state on ms-be1048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:51] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:53] RECOVERY - Check systemd state on ms-be1038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[14:29:55] RECOVERY - Check systemd state on elastic1056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:55] RECOVERY - Check systemd state on mw2249 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:59] RECOVERY - Check systemd state on wtp1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:59] RECOVERY - Check systemd state on kafka-main1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:01] RECOVERY - Check systemd state on db1109 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:03] RECOVERY - Check systemd state on elastic1042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:03] RECOVERY - Check systemd state on ms-be1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:05] RECOVERY - Check systemd state on db1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:09] RECOVERY - Check systemd state on wtp2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:11] (CR) Bartosz Dziewoński: "> I would have assumed that wgDiscussionToolsBeta would enable DT as beta rather than wgDiscussionToolsEnable which is used here." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/633980 (https://phabricator.wikimedia.org/T260624) (owner: Hnowlan) [14:30:11] RECOVERY - Check systemd state on wtp2013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:11] RECOVERY - Check systemd state on elastic1044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:13] RECOVERY - Check systemd state on elastic1052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:13] RECOVERY - Check systemd state on elastic1037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:15] RECOVERY - Check systemd state on wtp1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:17] RECOVERY - Check systemd state on restbase2015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:21] RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:21] RECOVERY - Check systemd state on db1115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:25] RECOVERY - Check systemd state on db1122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:25] RECOVERY - Check systemd state on kubernetes1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:25] RECOVERY - Check systemd state on elastic1049 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:25] RECOVERY - Check systemd state on elastic1065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:29] RECOVERY - Check systemd state on db1123 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:33] RECOVERY - Check systemd state on elastic1057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:35] RECOVERY - Check systemd state on wtp1036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:35] RECOVERY - Check systemd state on kubernetes2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:41] RECOVERY - Check systemd state on mw1400 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:43] RECOVERY - Check systemd state on restbase1018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:43] RECOVERY - Check systemd state on restbase2023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:43] RECOVERY - Check systemd state on kubernetes1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:43] RECOVERY - Check systemd state on wtp1038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:43] RECOVERY - Check systemd state on db1116 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[14:30:49] RECOVERY - Check systemd state on restbase1021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:49] RECOVERY - Check systemd state on restbase1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:53] RECOVERY - Check systemd state on elastic1046 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:53] RECOVERY - Check systemd state on db1119 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:53] RECOVERY - Check systemd state on db1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:57] RECOVERY - Check systemd state on db1125 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:03] RECOVERY - Check systemd state on restbase1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:03] RECOVERY - Check systemd state on elastic1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:05] RECOVERY - Check systemd state on db1120 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:05] RECOVERY - Check systemd state on mw1410 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:21] RECOVERY - Check systemd state on restbase2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:21] RECOVERY - Check systemd state on restbase2017 is OK: OK - 
running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:27] RECOVERY - Check systemd state on db1124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:29] RECOVERY - Check systemd state on mwlog2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:31] RECOVERY - Check systemd state on mw2375 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:37] RECOVERY - Check systemd state on mw1411 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:37] RECOVERY - Check systemd state on elastic1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:43] RECOVERY - Check systemd state on mw1405 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:48] (CR) Kormat: "PCC is clean: https://puppet-compiler.wmflabs.org/compiler1001/25924/" [puppet] - https://gerrit.wikimedia.org/r/634044 (https://phabricator.wikimedia.org/T256972) (owner: Kormat) [14:31:49] RECOVERY - Check systemd state on mw2374 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:49] RECOVERY - Check systemd state on elastic1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:49] RECOVERY - Check systemd state on mw1269 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:05] RECOVERY - Check systemd state on mw1406 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:09] RECOVERY - Check systemd state on mw1268 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:12] (CR) Kormat: [C: +2] mariadb: dbstore_multiinstance - generate sections. [puppet] - https://gerrit.wikimedia.org/r/634044 (https://phabricator.wikimedia.org/T256972) (owner: Kormat) [14:32:13] RECOVERY - Check systemd state on cp1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:15] RECOVERY - Check systemd state on mw1286 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:15] RECOVERY - Check systemd state on mw1271 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:15] RECOVERY - Check systemd state on mw1277 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:35] RECOVERY - Check systemd state on cp3054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:39] RECOVERY - Check systemd state on cp5004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:43] RECOVERY - Check systemd state on cp3060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:43] RECOVERY - Check systemd state on mc1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:45] RECOVERY - Check systemd state on mc2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[14:32:49] RECOVERY - Check systemd state on cp3059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:49] RECOVERY - Check systemd state on mc1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:51] RECOVERY - Check systemd state on bast3004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:53] RECOVERY - Check systemd state on aqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:53] RECOVERY - Check systemd state on aqs1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:01] RECOVERY - Check systemd state on mw1273 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:03] RECOVERY - Check systemd state on authdns1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:03] RECOVERY - Check systemd state on backup1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:05] RECOVERY - Check systemd state on cp5007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:15] RECOVERY - Check systemd state on cp3065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:15] RECOVERY - Check systemd state on mc2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:15] RECOVERY - Check systemd state on kubestage1001 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:15] RECOVERY - Check systemd state on mc1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:17] RECOVERY - Check systemd state on mw1267 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:23] RECOVERY - Check systemd state on mw1285 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:23] RECOVERY - Check systemd state on mc2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:35] RECOVERY - Check systemd state on kubernetes1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:39] RECOVERY - Check systemd state on mc1025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:53] RECOVERY - Check systemd state on mc1020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:33:53] RECOVERY - Check systemd state on mc1021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:05] RECOVERY - Check systemd state on mc2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:09] RECOVERY - Check systemd state on mc2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:11] RECOVERY - Check systemd state on mc2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:13] 
RECOVERY - Check systemd state on mc2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:13] RECOVERY - Check systemd state on kubernetes1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:45] RECOVERY - Check systemd state on kubernetes1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:34:54] (03PS2) 10Kormat: mariadb: misc::analytics::multiinstance - generate sections [puppet] - 10https://gerrit.wikimedia.org/r/634208 (https://phabricator.wikimedia.org/T256972) [14:35:57] hmm.. what's that flood of RECOVERY messages about? [14:36:43] dancy: bad deploy of a service which runs on all systems [14:36:45] wrong patch for a prometheus exporter, nothing terrible :) [14:37:00] (03CR) 10Kormat: [C: 03+2] mariadb: misc::analytics::multiinstance - generate sections [puppet] - 10https://gerrit.wikimedia.org/r/634208 (https://phabricator.wikimedia.org/T256972) (owner: 10Kormat) [14:37:36] ah. Thanks! [14:38:05] !log disable puppet to deploy puppetdb change blacklisting dynamic facts [14:38:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:44] (03CR) 10Jbond: [C: 03+2] puppetdb: blacklist dynamicly generated facts [puppet] - 10https://gerrit.wikimedia.org/r/634043 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [14:41:13] !log varnish 6.0.6-1wm2 uploaded to apt.wikimedia.org component/varnish6 T264074 [14:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:19] T264074: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 [14:41:39] PROBLEM - Check systemd state on ms-be2036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:12] (03CR) 10Herron: [C: 03+1] thanos: add thanos-bucket-web explorer [puppet] - 10https://gerrit.wikimedia.org/r/634198 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi) [14:43:45] (03CR) 10Herron: [C: 03+1] role: add thanos bucket-web to frontend [puppet] - 10https://gerrit.wikimedia.org/r/634199 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi) [14:44:08] (03CR) 10Elukey: [C: 03+2] "Merging to test and see how it goes :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/634193 (owner: 10Elukey) [14:45:31] (03CR) 10Herron: [C: 03+1] prometheus: ensure new prometheus-rsyslog-exporter version [puppet] - 10https://gerrit.wikimedia.org/r/634112 (https://phabricator.wikimedia.org/T210137) (owner: 10Cwhite) [14:45:41] !log enable puppet post deploy puppetdb change blacklisting dynamic facts [14:45:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:15] (03CR) 10Herron: [C: 03+2] mailman: Set default charset in mailman2 configs [puppet] - 10https://gerrit.wikimedia.org/r/632837 (https://phabricator.wikimedia.org/T261031) (owner: 10Ladsgroup) [14:51:01] !log elukey@cumin1001 START - Cookbook sre.druid.roll-restart-workers [14:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:39] !log roll restart druid-historical daemons on druid1004-1008 to pick up new conn pooling changes [14:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:51] RECOVERY - Check systemd state on ms-be2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:55] (03PS1) 10Urbanecm: Revert "Validate username input before constructing subpage links" [extensions/CheckUser] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634247 (https://phabricator.wikimedia.org/T265606) [14:52:06] (03CR) 10Urbanecm: 
[C: 03+2] "train blocker" [extensions/CheckUser] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634247 (https://phabricator.wikimedia.org/T265606) (owner: 10Urbanecm) [14:57:14] (03PS1) 10Elukey: Remove an-scheduler1001 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/634244 (https://phabricator.wikimedia.org/T265620) [14:57:49] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:58:45] this is me --^ [14:59:11] PROBLEM - Number of messages locally queued by purged for processing on cp3054 is CRITICAL: cluster=cache_text instance=cp3054 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [14:59:11] PROBLEM - Number of messages locally queued by purged for processing on cp3060 is CRITICAL: cluster=cache_text instance=cp3060 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [14:59:25] PROBLEM - Number of messages locally queued by purged for processing on cp1079 is CRITICAL: cluster=cache_text instance=cp1079 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [15:00:33] (03PS1) 10Ema: varnish: add SystemTap script to trace VSM_Status/VSLQ_Dispatch [puppet] - 10https://gerrit.wikimedia.org/r/634245 (https://phabricator.wikimedia.org/T264074) [15:01:03] PROBLEM - Number of messages locally queued by purged for processing on cp3056 is CRITICAL: cluster=cache_text 
instance=cp3056 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [15:01:09] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:02:39] RECOVERY - Number of messages locally queued by purged for processing on cp3054 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3054 [15:02:39] RECOVERY - Number of messages locally queued by purged for processing on cp3060 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3060 [15:02:53] (03CR) 10Ema: [C: 03+2] varnish: add SystemTap script to trace VSM_Status/VSLQ_Dispatch [puppet] - 10https://gerrit.wikimedia.org/r/634245 (https://phabricator.wikimedia.org/T264074) (owner: 10Ema) [15:02:59] PROBLEM - Number of messages locally queued by purged for processing on cp3052 is CRITICAL: cluster=cache_text instance=cp3052 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3052 [15:04:24] (03CR) 10Elukey: [C: 03+2] Remove an-scheduler1001 from puppet [puppet] - 10https://gerrit.wikimedia.org/r/634244 (https://phabricator.wikimedia.org/T265620) (owner: 10Elukey) [15:04:31] RECOVERY - Number of messages locally queued by purged for processing on cp3056 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3056 [15:04:37] RECOVERY - Number of messages locally queued by purged for processing on cp1079 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079 [15:04:41] RECOVERY - Number of messages locally queued by purged for processing on cp3052 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3052 [15:05:05] (03PS1) 10Ottomata: camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) [15:05:14] (03CR) 10jerkins-bot: [V: 04-1] camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:06:21] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@500bdad]: spark: correctly parse non-partitioned partition specs [15:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:05] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: add thanos-bucket-web explorer [puppet] - 10https://gerrit.wikimedia.org/r/634198 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi) [15:07:14] (03CR) 10Filippo Giunchedi: [C: 03+2] role: add thanos bucket-web to frontend [puppet] - 10https://gerrit.wikimedia.org/r/634199 (https://phabricator.wikimedia.org/T261281) (owner: 10Filippo Giunchedi) [15:07:21] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@500bdad]: spark: correctly parse non-partitioned partition specs (duration: 00m 59s) [15:07:26] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) [15:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:56] (03PS2) 10Ottomata: camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) [15:13:05] (03CR) 10jerkins-bot: [V: 04-1] camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:15:13] PROBLEM - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:31] !log elukey@cumin1001 START - Cookbook sre.hosts.decommission [15:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:03] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:24:29] (03PS3) 10Ottomata: camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) [15:24:37] (03CR) 10jerkins-bot: [V: 04-1] camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:25:08] (03CR) 10jerkins-bot: [V: 04-1] Revert "Validate username input before constructing subpage links" [extensions/CheckUser] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634247 (https://phabricator.wikimedia.org/T265606) (owner: 10Urbanecm) [15:25:41] (03CR) 10Urbanecm: [C: 03+2] Revert "Validate username input before constructing subpage links" 
[extensions/CheckUser] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634247 (https://phabricator.wikimedia.org/T265606) (owner: 10Urbanecm) [15:26:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:27:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:29:05] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [15:29:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:53] (03PS4) 10Ottomata: camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) [15:34:01] (03CR) 10jerkins-bot: [V: 04-1] camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:35:46] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [15:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:51] RECOVERY - Check systemd state on thanos-fe1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:34] (03Merged) 10jenkins-bot: Revert "Validate username input before constructing subpage links" [extensions/CheckUser] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634247 (https://phabricator.wikimedia.org/T265606) (owner: 10Urbanecm) [15:47:54] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:50:08] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [15:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:22] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.13/extensions/CheckUser/includes/specials/: fd94002cf6070180a289296ec65ad224e5a0ae67: Revert "Validate username input before constructing subpage links" (T265606) (duration: 02m 48s) [15:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:53:49] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:48] (03PS1) 10Jbond: stdlib: update to v6.5.0 [puppet] - 10https://gerrit.wikimedia.org/r/634278 [15:57:13] (03CR) 10Alexandros Kosiaris: sretest: Experiment with preserving docker rules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634192 (owner: 10Alexandros Kosiaris) [15:57:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:58:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:58:57] (03PS1) 10Herron: thanos: rename hiera setting "enable_thanos_upload" & enable in pops [puppet] - 10https://gerrit.wikimedia.org/r/634279 [16:00:04] jbond42 and cdanis: That opportune time is upon us again. Time for a Puppet request window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201015T1600). [16:01:12] (03CR) 10Hashar: [C: 03+1] "Great, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/632224 (https://phabricator.wikimedia.org/T264182) (owner: 10Muehlenhoff) [16:01:15] (03PS2) 10Jbond: stdlib: update to v6.5.0 [puppet] - 10https://gerrit.wikimedia.org/r/634278 [16:01:17] (03PS1) 10Jbond: ferm::filter_log: update type to modern version [puppet] - 10https://gerrit.wikimedia.org/r/634280 [16:01:50] PROBLEM - SSH on ms-be2036 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:02:32] (03CR) 10Jbond: [C: 03+2] ferm::filter_log: update type to modern version [puppet] - 10https://gerrit.wikimedia.org/r/634280 (owner: 10Jbond) [16:04:02] (03CR) 10Herron: "https://puppet-compiler.wmflabs.org/compiler1002/25930/" [puppet] - 10https://gerrit.wikimedia.org/r/634279 (owner: 10Herron) [16:04:56] PROBLEM - parsoid on wtp2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [16:06:06] RECOVERY - parsoid on wtp2010 is OK: HTTP OK: HTTP/1.1 200 OK - 1022 bytes in 4.607 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [16:06:39] (03PS1) 10Bartosz Dziewoński: Correctly generate timezone abbreviations for parsing [extensions/DiscussionTools] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/634249 (https://phabricator.wikimedia.org/T265500) [16:06:54] (03PS1) 10Bartosz Dziewoński: Correctly generate timezone abbreviations for parsing [extensions/DiscussionTools] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634250 (https://phabricator.wikimedia.org/T265500) [16:07:28] (03CR) 10Filippo Giunchedi: [C: 03+1] thanos: rename hiera setting "enable_thanos_upload" & enable in pops [puppet] - 10https://gerrit.wikimedia.org/r/634279 (owner: 10Herron) [16:07:44] (03CR) 10Herron: [C: 03+2] thanos: rename hiera setting "enable_thanos_upload" & enable in pops [puppet] - 10https://gerrit.wikimedia.org/r/634279 (owner: 10Herron) [16:08:42] PROBLEM - very high 
load average likely xfs on ms-be2036 is CRITICAL: CRITICAL - load average: 164.46, 154.87, 97.03 https://wikitech.wikimedia.org/wiki/Swift [16:09:20] RECOVERY - SSH on ms-be2036 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:11:36] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [16:11:36] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [16:11:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:45] !log elukey@cumin1001 START - Cookbook sre.dns.netbox [16:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:09] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [16:14:09] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . 
[16:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:20] RECOVERY - very high load average likely xfs on ms-be2036 is OK: OK - load average: 23.50, 67.18, 75.36 https://wikitech.wikimedia.org/wiki/Swift [16:14:23] (03PS3) 10Jbond: stdlib: update to v6.5.0 [puppet] - 10https://gerrit.wikimedia.org/r/634278 [16:14:25] (03PS1) 10Jbond: stdlib: drop legacy Stdlib::Ipv{4,6} types [puppet] - 10https://gerrit.wikimedia.org/r/634282 [16:14:49] !log elukey@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:28] PROBLEM - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops [16:18:18] PROBLEM - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus site={eqsin,esams,ulsfo} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [16:18:28] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus site={eqsin,esams,ulsfo} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [16:18:28] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on 
prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus site={eqsin,esams,ulsfo} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [16:18:34] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 44968520 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:18:36] (03PS1) 10Elukey: Rename an-scheduler1001 to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/634283 (https://phabricator.wikimedia.org/T265620) [16:19:03] (03CR) 10jerkins-bot: [V: 04-1] Rename an-scheduler1001 to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/634283 (https://phabricator.wikimedia.org/T265620) (owner: 10Elukey) [16:19:20] (03CR) 10Jbond: [C: 03+2] stdlib: drop legacy Stdlib::Ipv{4,6} types [puppet] - 10https://gerrit.wikimedia.org/r/634282 (owner: 10Jbond) [16:20:08] (03PS2) 10Elukey: Rename an-scheduler1001 to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/634283 (https://phabricator.wikimedia.org/T265620) [16:20:12] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 401616 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:20:48] PROBLEM - parsoid on wtp2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [16:20:55] (03CR) 10Elukey: [C: 03+2] Rename an-scheduler1001 to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/634283 (https://phabricator.wikimedia.org/T265620) (owner: 10Elukey) [16:21:18] PROBLEM - Prometheus prometheus4001/ops restarted: beware possible monitoring artifacts on prometheus4001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted 
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=ulsfo+prometheus/ops [16:21:44] PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus site={esams,ulsfo} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [16:22:20] RECOVERY - parsoid on wtp2010 is OK: HTTP OK: HTTP/1.1 200 OK - 1022 bytes in 1.138 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [16:23:44] PROBLEM - Prometheus prometheus3001/ops restarted: beware possible monitoring artifacts on prometheus3001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [16:25:40] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [16:25:40] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[16:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:22] (03CR) 10Cwhite: [C: 03+2] prometheus: ensure new prometheus-rsyslog-exporter version [puppet] - 10https://gerrit.wikimedia.org/r/634112 (https://phabricator.wikimedia.org/T210137) (owner: 10Cwhite) [16:27:32] (03PS1) 10Cwhite: Revert "prometheus: ensure new prometheus-rsyslog-exporter version" [puppet] - 10https://gerrit.wikimedia.org/r/634251 [16:27:34] PROBLEM - parsoid on wtp2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [16:29:04] RECOVERY - parsoid on wtp2010 is OK: HTTP OK: HTTP/1.1 200 OK - 1022 bytes in 1.007 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/parsoid [16:29:08] (03PS4) 10Jbond: stdlib: update to v6.5.0 [puppet] - 10https://gerrit.wikimedia.org/r/634278 [16:29:10] (03PS1) 10Jbond: get_ips_for_service: update stdlib::ip type [puppet] - 10https://gerrit.wikimedia.org/r/634284 [16:31:32] (03CR) 10Jbond: [C: 03+2] get_ips_for_service: update stdlib::ip type [puppet] - 10https://gerrit.wikimedia.org/r/634284 (owner: 10Jbond) [16:32:37] (03PS5) 10Jbond: stdlib: update to v6.5.0 [puppet] - 10https://gerrit.wikimedia.org/r/634278 [16:40:01] (03PS1) 10Elukey: Add an-coord1002 basic settings to puppet [puppet] - 10https://gerrit.wikimedia.org/r/634289 (https://phabricator.wikimedia.org/T265620) [16:40:24] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[16:40:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:37] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25933/" [puppet] - 10https://gerrit.wikimedia.org/r/634278 (owner: 10Jbond) [16:40:56] RECOVERY - Prometheus prometheus5001/ops restarted: beware possible monitoring artifacts on prometheus5001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops [16:40:56] (03CR) 10Elukey: [C: 03+2] Add an-coord1002 basic settings to puppet [puppet] - 10https://gerrit.wikimedia.org/r/634289 (https://phabricator.wikimedia.org/T265620) (owner: 10Elukey) [16:44:50] RECOVERY - Prometheus prometheus4001/ops restarted: beware possible monitoring artifacts on prometheus4001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=ulsfo+prometheus/ops [16:46:17] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [16:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:46] RECOVERY - Prometheus prometheus1003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [16:46:56] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [16:46:58] RECOVERY - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [16:47:00] RECOVERY - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [16:47:20] RECOVERY - Prometheus prometheus3001/ops restarted: beware possible monitoring artifacts on prometheus3001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=esams+prometheus/ops [16:48:00] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [16:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:35] (03CR) 10Dzahn: netmon: move webserver setup to profile and pass PHP version as param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633825 (owner: 10Dzahn) [16:50:21] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [16:50:21] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . 
[16:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:44] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [16:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:11] (03PS1) 10Jforrester: Work around LESS calculating `calc()` values wrong [extensions/UploadWizard] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634252 (https://phabricator.wikimedia.org/T265560) [16:54:52] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' . [16:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:25] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [16:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:32] (03CR) 10jerkins-bot: [V: 04-1] Work around LESS calculating `calc()` values wrong [extensions/UploadWizard] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634252 (https://phabricator.wikimedia.org/T265560) (owner: 10Jforrester) [16:57:28] (03PS1) 10Jdlrobson: Vertically align personal tools [skins/Vector] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634253 (https://phabricator.wikimedia.org/T264339) [16:57:30] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . [16:57:34] (03PS1) 10Jdlrobson: Drop text indent in modern Vector [extensions/Echo] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634254 (https://phabricator.wikimedia.org/T264339) [16:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:28] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'push-notifications' for release 'main' . 
[16:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] chrisalbon and accraze: How many deployers does it take to do Services – Graphoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201015T1700). [17:00:39] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'termbox' for release 'staging' . [17:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:35] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [17:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:38] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . [17:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:45] (03PS1) 10Jdlrobson: Revert "clientError: Adds 'is_logged_in' tag to aid filtering" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634256 (https://phabricator.wikimedia.org/T256173) [17:08:56] (03CR) 10Jdlrobson: [C: 03+1] "this will unblock wmf13." [extensions/WikimediaEvents] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634256 (https://phabricator.wikimedia.org/T256173) (owner: 10Jdlrobson) [17:10:06] (03CR) 10Dzahn: [C: 03+2] "I'll be a bit bold and move ahead, we can still discuss a generic profile if there is a large enough overlap with other httpd setups." 
[puppet] - 10https://gerrit.wikimedia.org/r/633825 (owner: 10Dzahn) [17:12:45] !log volans@cumin1001 START - Cookbook sre.dns.netbox [17:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:38] (03PS5) 10Ottomata: camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) [17:14:47] (03CR) 10jerkins-bot: [V: 04-1] camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [17:17:30] !log deleting old pcc reports in compiler1002 to free disk space [17:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:03] (03CR) 10Ottomata: "Looks correct!" [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [17:18:42] (03PS4) 10Dzahn: netmon: move webserver setup to profile and pass PHP version as param [puppet] - 10https://gerrit.wikimedia.org/r/633825 [17:19:08] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:43] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:23:14] (03PS5) 10Dzahn: netmon: move webserver setup to profile and pass PHP version as param [puppet] - 10https://gerrit.wikimedia.org/r/633825 [17:24:00] netbox uncommitted changes was me, already fixed, sorry for the noise [17:24:37] (03CR) 10Dzahn: [C: 03+2] netmon: move webserver setup to profile and pass PHP version as param [puppet] - 10https://gerrit.wikimedia.org/r/633825 (owner: 10Dzahn) [17:26:54] (03PS1) 10Dzahn: netmon: PHP version is float, not string, fix error [puppet] - 
10https://gerrit.wikimedia.org/r/634297 [17:28:31] (03CR) 10Dzahn: [C: 03+2] netmon: PHP version is float, not string, fix error [puppet] - 10https://gerrit.wikimedia.org/r/634297 (owner: 10Dzahn) [17:29:05] (03PS4) 10Dzahn: netmon: ensure nmap and mtr-tiny are installed, add profile for tools [puppet] - 10https://gerrit.wikimedia.org/r/633827 [17:30:04] (03CR) 10Dzahn: [C: 03+2] netmon: ensure nmap and mtr-tiny are installed, add profile for tools [puppet] - 10https://gerrit.wikimedia.org/r/633827 (owner: 10Dzahn) [17:31:29] (03CR) 10Dzahn: "grrr "Duplicate declaration: Package[mtr-tiny] is already declared" exactly this was supposed to NOT happen using ensure_packages!" [puppet] - 10https://gerrit.wikimedia.org/r/633827 (owner: 10Dzahn) [17:34:48] (03PS1) 10Dzahn: librenms: use ensure_packages to avoid duplicate declarations [puppet] - 10https://gerrit.wikimedia.org/r/634300 [17:36:16] (03PS2) 10Razzi: Add envoy on unprivileged port 8080 for yarn.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/633227 (https://phabricator.wikimedia.org/T240439) [17:38:21] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:39:06] (03PS3) 10Razzi: yarn: add envoy on unprivileged port 8443 for yarn.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/633227 (https://phabricator.wikimedia.org/T240439) [17:40:08] (03CR) 10Ottomata: [C: 03+1] yarn: add envoy on unprivileged port 8443 for yarn.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/633227 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [17:40:16] (03CR) 10Razzi: [C: 03+2] yarn: add envoy on unprivileged port 8443 for yarn.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/633227 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [17:45:13] (03CR) 10CRusnov: "This change is ready for review." 
[dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [17:45:22] (03CR) 10CRusnov: "This change is ready for review." [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [17:46:18] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [17:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:20] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25937/netmon1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/634300 (owner: 10Dzahn) [17:48:57] (03CR) 10Dzahn: "this made puppet on netmon1002 happy again and we can have flexible roles for librenms and general netmon" [puppet] - 10https://gerrit.wikimedia.org/r/634300 (owner: 10Dzahn) [17:49:34] (03PS6) 10Jbond: stdlib: update to v6.5.0 [puppet] - 10https://gerrit.wikimedia.org/r/634278 [17:49:51] (03PS1) 10Razzi: yarn: replace nginx with envoy for tls [puppet] - 10https://gerrit.wikimedia.org/r/634306 (https://phabricator.wikimedia.org/T240439) [17:50:29] (03CR) 10Ottomata: [C: 03+1] yarn: replace nginx with envoy for tls [puppet] - 10https://gerrit.wikimedia.org/r/634306 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [17:51:13] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:29] (03PS2) 10Dzahn: netmon: remove stretch PHP 7.2 support [puppet] - 10https://gerrit.wikimedia.org/r/633824 [17:51:49] (03PS1) 10Mholloway: Update Wikifeeds to 2020-10-15-103140-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/634307 (https://phabricator.wikimedia.org/T251900) [17:52:35] (03Abandoned) 10Dzahn: netmon: remove stretch PHP 7.2 support [puppet] - 10https://gerrit.wikimedia.org/r/633824 (owner: 10Dzahn) [17:55:36] (03CR) 10Razzi: [C: 03+2] yarn: replace nginx with envoy for tls [puppet] - 
10https://gerrit.wikimedia.org/r/634306 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [17:56:44] (03CR) 10Mholloway: [C: 03+2] Update Wikifeeds to 2020-10-15-103140-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/634307 (https://phabricator.wikimedia.org/T251900) (owner: 10Mholloway) [17:59:27] (03Merged) 10jenkins-bot: Update Wikifeeds to 2020-10-15-103140-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/634307 (https://phabricator.wikimedia.org/T251900) (owner: 10Mholloway) [18:00:04] (03PS1) 10Jforrester: ApiFlickrBlacklistTest: Don't try to access HTTP in integration tests [extensions/UploadWizard] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634259 (https://phabricator.wikimedia.org/T265628) [18:00:04] RoanKattouw, Niharika, and Urbanecm: (Dis)respected human, time to deploy Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201015T1800). Please do the needful. [18:00:04] MatmaRex and Jdlrobson: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:17] yeah hi [18:00:22] (03PS2) 10Jforrester: Work around LESS calculating `calc()` values wrong [extensions/UploadWizard] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634252 (https://phabricator.wikimedia.org/T265560) [18:00:29] (03PS3) 10Jforrester: Work around LESS calculating `calc()` values wrong [extensions/UploadWizard] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634252 (https://phabricator.wikimedia.org/T265560) [18:02:16] (o/) [18:06:34] anyone deploying? [18:07:30] !log mx1001/mx2001: made previous live hack official and added benefactors@wikipedia alias, re-enabling puppet [18:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:14] marxarelli: thcipriani are either of you around? 
(one of the patches is a deploy blocker) [18:10:41] Jdlrobson: yep [18:10:56] i'm aware of https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/634256/ [18:10:59] anything else? [18:11:03] that's the only one [18:11:18] right on. i'll make sure it goes out [18:11:20] i don't know if you have time to run the morning backport window. if not, that's fine, i can reschedule for monday [18:11:38] (for the non blockers) [18:15:45] hmm, hadn't planned on it. are there no backport deployers around? [18:15:56] marxarelli: MatmaRex looks like roan might be around in 5 mins [18:16:19] so hopefully he can run it [18:16:28] ack [18:17:05] okay [18:20:26] Hey, sorry for the delay [18:20:30] Jdlrobson: i would be more game to step in, but somehow i've managed to never run a backport deploy window before. conceptually i know how to do it and have read the docs [18:20:38] I'm not usually around during this deployment, but today I came home early [18:21:13] thanks, RoanKattouw [18:21:55] marxarelli: I can walk you through it if you like, but no worries if you don't have the time or inclination to be trained right now with no notice :) [18:22:14] (03CR) 10Catrope: [C: 03+2] Correctly generate timezone abbreviations for parsing [extensions/DiscussionTools] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/634249 (https://phabricator.wikimedia.org/T265500) (owner: 10Bartosz Dziewoński) [18:22:17] haha.
i'll observe :) [18:22:17] (03CR) 10Catrope: [C: 03+2] Correctly generate timezone abbreviations for parsing [extensions/DiscussionTools] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634250 (https://phabricator.wikimedia.org/T265500) (owner: 10Bartosz Dziewoński) [18:22:33] (03CR) 10Catrope: [C: 03+2] Vertically align personal tools [skins/Vector] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634253 (https://phabricator.wikimedia.org/T264339) (owner: 10Jdlrobson) [18:22:45] (03CR) 10Catrope: [C: 03+2] Drop text indent in modern Vector [extensions/Echo] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634254 (https://phabricator.wikimedia.org/T264339) (owner: 10Jdlrobson) [18:22:55] (03CR) 10Catrope: [C: 03+2] Revert "clientError: Adds 'is_logged_in' tag to aid filtering" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634256 (https://phabricator.wikimedia.org/T256173) (owner: 10Jdlrobson) [18:23:10] Alright, I'll narrate here [18:24:06] The first thing I just did is +2 all the changes to get CI working on them. For config changes, I would +2 then deploy them one at a time, but these backports are in independent repos and take longer to CI [18:25:57] (03Merged) 10jenkins-bot: Correctly generate timezone abbreviations for parsing [extensions/DiscussionTools] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/634249 (https://phabricator.wikimedia.org/T265500) (owner: 10Bartosz Dziewoński) [18:27:01] * marxarelli nods [18:27:45] (now waiting for them to all merge) [18:28:14] thanks [18:29:09] (03Merged) 10jenkins-bot: Correctly generate timezone abbreviations for parsing [extensions/DiscussionTools] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634250 (https://phabricator.wikimedia.org/T265500) (owner: 10Bartosz Dziewoński) [18:29:24] RoanKattouw: do you want to do two more? i thought James_F would be doing them, but he seems to have disappeared. 
https://gerrit.wikimedia.org/r/q/project:mediawiki%252Fextensions%252FUploadWizard+branch:wmf%252F1.36.0-wmf.13 [18:29:34] Sure [18:29:37] (i'll add them to the calendar if it's okay) [18:30:21] Hmm we should try to make a stylelint rule for unprotected calc() in LESS files [18:30:54] (03CR) 10Catrope: [C: 03+2] ApiFlickrBlacklistTest: Don't try to access HTTP in integration tests [extensions/UploadWizard] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634259 (https://phabricator.wikimedia.org/T265628) (owner: 10Jforrester) [18:31:04] (03CR) 10Catrope: [C: 03+2] Work around LESS calculating `calc()` values wrong [extensions/UploadWizard] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634252 (https://phabricator.wikimedia.org/T265560) (owner: 10Jforrester) [18:31:24] RoanKattouw: yes, i was going to file a task [18:31:38] there is actually an option in the LESS compiler for it [18:31:43] Wait what [18:31:53] To have it not do its own math inside calc()? [18:32:06] That seems like such a useful feature that I don't understand why it would ever be disabled [18:32:09] RoanKattouw: not quite, to fail when trying to add incompatible units [18:32:14] and yes, i also don't understand [18:32:15] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,name=wtp200[6-9].codfw.wmnet [18:32:15] Oh I see [18:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:31] Yeah that would be better than making 32px + 1em equal 33em [18:32:57] !log depooling wtp2005 through wtp2009 (parsoid, old server generation) T265558 [18:32:58] (Also, FWIW I've more typically seen calc( 32px ~'+' 1em ) as the armored version) [18:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:03] T265558: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 [18:33:18] at least, there is an option in the JS compiler (https://gerrit.wikimedia.org/r/c/oojs/ui/+/535932), hopefully in
the PHP one too… [18:33:48] i copied the "armor" convention in the UploadWizard code from other files in the same repo [18:38:44] Right. I think different repos have different armoring conventions for this [18:42:04] (03Merged) 10jenkins-bot: Vertically align personal tools [skins/Vector] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634253 (https://phabricator.wikimedia.org/T264339) (owner: 10Jdlrobson) [18:47:00] (Still waiting on CI...) [18:47:54] waiting on Echo selenium tests it seems [18:48:48] o_o [18:49:39] (03Merged) 10jenkins-bot: Drop text indent in modern Vector [extensions/Echo] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634254 (https://phabricator.wikimedia.org/T264339) (owner: 10Jdlrobson) [18:49:42] (03Merged) 10jenkins-bot: Revert "clientError: Adds 'is_logged_in' tag to aid filtering" [extensions/WikimediaEvents] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634256 (https://phabricator.wikimedia.org/T256173) (owner: 10Jdlrobson) [18:49:45] (03Merged) 10jenkins-bot: ApiFlickrBlacklistTest: Don't try to access HTTP in integration tests [extensions/UploadWizard] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634259 (https://phabricator.wikimedia.org/T265628) (owner: 10Jforrester) [18:49:48] (03Merged) 10jenkins-bot: Work around LESS calculating `calc()` values wrong [extensions/UploadWizard] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/634252 (https://phabricator.wikimedia.org/T265560) (owner: 10Jforrester) [18:50:04] (03PS1) 10Ottomata: eventlogging-processor - skip events for schemas that have been fully migrated to Event Platform [puppet] - 10https://gerrit.wikimedia.org/r/634314 (https://phabricator.wikimedia.org/T262304) [18:51:02] (03CR) 10jerkins-bot: [V: 04-1] eventlogging-processor - skip events for schemas that have been fully migrated to Event Platform [puppet] - 10https://gerrit.wikimedia.org/r/634314 (https://phabricator.wikimedia.org/T262304) (owner: 10Ottomata) [18:52:02] (03PS2) 
10Ottomata: eventlogging-processor - skip events for schemas that have been fully migrated to Event Platform [puppet] - 10https://gerrit.wikimedia.org/r/634314 (https://phabricator.wikimedia.org/T262304) [18:52:14] Ugh finally [18:52:30] (i filed https://phabricator.wikimedia.org/T265650 for the less strict units stuff) [18:53:00] (03CR) 10jerkins-bot: [V: 04-1] eventlogging-processor - skip events for schemas that have been fully migrated to Event Platform [puppet] - 10https://gerrit.wikimedia.org/r/634314 (https://phabricator.wikimedia.org/T262304) (owner: 10Ottomata) [18:53:06] Alright, now I'm going to do: ssh deployment.eqiad.wmnet ; cd /srv/mediawiki-staging/php-1.36.0-wmf.13 ; git pull [18:53:11] Then the relevant git submodule update commands [18:53:12] And the same for wmf.11 [18:53:38] why wmf.11? are there wmf.11 backports? [18:55:07] Yeah there was one, it merged earlier [18:55:14] The other ones are all wmf.13 though [18:55:15] ah, ok [18:55:41] OK I've just finished doing that. Next, I'll ssh into mwdebug2001.codfw.wmnet and run scap pull [18:56:21] Normally I'd use mwdebug1001 or 1002, but right now codfw is the active data center, so we have to use 2001 (and if you try to ssh into 1001 or 1002, you'll get a big shouty motd telling you you're in the wrong place) [18:56:22] (03PS3) 10Ottomata: eventlogging-processor - skip events for schemas migrated to Event Platform [puppet] - 10https://gerrit.wikimedia.org/r/634314 (https://phabricator.wikimedia.org/T262304) [18:56:31] Jdlrobson: MatmaRex: Your patches are on mwdebug2001, please test [18:57:06] * marxarelli nods [18:57:15] thanks, looking [18:57:28] At this stage, the requestors will use the WikimediaDebug browser extension to point their browsers at mwdebug2001 and test their patches [18:58:26] looking :) [18:58:33] DiscussionTools looks good at a glance [18:58:46] Note: for config patches this happens in series, one patch at a time: +2, merge, test, deploy, then +2 the next one. 
For backports, I usually do it in parallel because 1) it's possible to deploy them separately since they're all in separate repos; 2) CI is much slower for them (CI for a config patch is usually under a minute) [18:58:55] UploadWizard as well [18:59:30] RoanKattouw: k i think we're good [19:00:04] marxarelli and longma: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Mediawiki train - American Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201015T1900). [19:00:35] MatmaRex, RoanKattouw: Hey, sorry, was in meetings. [19:01:00] Alright great [19:01:20] Please hold off on the train for a few minutes, I'm finishing up the backports window (CI took 35 minutes) [19:01:38] okay [19:02:25] marxarelli is also here and is spectating my backports deploy :) [19:02:34] oh :) [19:02:36] RoanKattouw: i'm doing train (w/ longma) so no worries [19:02:50] So at this point it's fairly straightforward, I'm just syncing each of the things that need to be deployed in turn [19:02:51] thanks RoanKattouw [19:03:01] For extensions, we usually just sync the entire extension directory [19:03:39] For core patches, we'll try to figure out a sync path or combination of syncs that makes sense (based on git show --stat), but if all else fails you can also just sync the entire php-1.36.0-wmf.13 directory [19:03:42] RoanKattouw: unless the changed files somehow depend on each other, that could cause error spikes [19:03:57] Yes, which is common for config files but less so for software [19:04:18] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.13/extensions/UploadWizard/: Work around LESS calculating calc() values wrong (T265560) (duration: 02m 07s) [19:04:20] We also try to do one sync per patch / phab task, and include the task number in the sync message.
logmsgbot then picks that up and comments on the phab task [19:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:24] T265560: Text on UploadWizard's "Describe" page is too narrow (due to .mwe-upwiz-data defining "width: calc(-132.5%)") - https://phabricator.wikimedia.org/T265560 [19:04:26] Like so --^^ [19:04:30] * marxarelli nods [19:04:36] sure, but it did happen a few times for me, so I'm noting that :) [19:05:16] (03PS1) 10Dduvall: ci: Install docker-credential-environment credHelper [puppet] - 10https://gerrit.wikimedia.org/r/634316 (https://phabricator.wikimedia.org/T265177) [19:05:36] (03CR) 10jerkins-bot: [V: 04-1] ci: Install docker-credential-environment credHelper [puppet] - 10https://gerrit.wikimedia.org/r/634316 (https://phabricator.wikimedia.org/T265177) (owner: 10Dduvall) [19:06:00] (03PS1) 10Ottomata: Remove eventlogging-valid-mixed output for eventlogging-processor [puppet] - 10https://gerrit.wikimedia.org/r/634317 (https://phabricator.wikimedia.org/T265651) [19:07:01] btw, RoanKattouw, what does git show --stat do? 
never saw that command [19:07:11] (03CR) 10jerkins-bot: [V: 04-1] Remove eventlogging-valid-mixed output for eventlogging-processor [puppet] - 10https://gerrit.wikimedia.org/r/634317 (https://phabricator.wikimedia.org/T265651) (owner: 10Ottomata) [19:07:16] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.13/extensions/WikimediaEvents/: Revert "clientError: Adds is_logged_in tag to aid filtering" (T256173) (duration: 01m 58s) [19:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:22] T256173: Allow filtering of errors from logged in users - https://phabricator.wikimedia.org/T256173 [19:07:24] Try it in any git repo :) it shows the commit message of the latest commit and a summary of which files were modified [19:07:45] nice [19:07:58] i have the gerrit page opened when I'm deploying, so I use that instead [19:09:52] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.13/skins/Vector/: Vertically align personal tools (T264339) (duration: 01m 43s) [19:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:58] T264339: "Notices" hitbox overlaps "alerts" - https://phabricator.wikimedia.org/T264339 [19:14:20] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.13/extensions/Echo/: Drop text indent in modern Vector (T264339) (duration: 01m 51s) [19:14:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:57] OK, 2 more and then I'll be done [19:15:20] k [19:15:22] (03PS2) 10Dduvall: ci: Install docker-credential-environment credHelper [puppet] - 10https://gerrit.wikimedia.org/r/634316 (https://phabricator.wikimedia.org/T265177) [19:15:56] RoanKattouw: thanks for teaching me the process [19:16:09] Sure thing! 
Sorry this was a cumbersome one [19:16:22] (03CR) 10jerkins-bot: [V: 04-1] ci: Install docker-credential-environment credHelper [puppet] - 10https://gerrit.wikimedia.org/r/634316 (https://phabricator.wikimedia.org/T265177) (owner: 10Dduvall) [19:16:42] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.13/extensions/DiscussionTools/: Correctly generate timezone abbreviations for parsing (T265500) (duration: 01m 51s) [19:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:47] T265500: Reply tool doesn't appear at some wikis, possibly due to timestamp formatting - https://phabricator.wikimedia.org/T265500 [19:18:49] meanwhile, flake hates my python. will investigate later [19:20:04] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.11/extensions/DiscussionTools/: Correctly generate timezone abbreviations for parsing (T265500) (duration: 01m 29s) [19:20:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:04] Alright, all done! [19:22:12] Thank you marxarelli and longma for your patience [19:22:15] Sorry it took so long [19:22:25] nice. thanks again, RoanKattouw! [19:22:47] hashar: happy birthday? :) [19:23:25] thanks ;) [19:23:36] !log robh@cumin1001 START - Cookbook sre.dns.netbox [19:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:45] longma: just getting my logstash dashboards ready... 
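The convention narrated during the backport window (one sync per patch, with the Phabricator task number in the sync message so that logmsgbot can comment on the task) could be sketched roughly as follows. This is only an illustration in Python: the `task_ids` helper and its regular expression are hypothetical and are not taken from logmsgbot's actual code; the sample message is the UploadWizard sync line logged at 19:04:18.

```python
import re

# Hypothetical pattern: a Phabricator task ID is assumed to look like "T"
# followed by digits, e.g. "T265560". This is an assumption for illustration,
# not the bot's real matching rule.
TASK_ID = re.compile(r"\bT\d+\b")


def task_ids(sync_message):
    """Return the distinct task IDs mentioned in a message, in order of appearance."""
    seen = []
    for match in TASK_ID.findall(sync_message):
        if match not in seen:
            seen.append(match)
    return seen


# Sync message copied from the log above.
message = (
    "Synchronized php-1.36.0-wmf.13/extensions/UploadWizard/: "
    "Work around LESS calculating calc() values wrong (T265560) "
    "(duration: 02m 07s)"
)
print(task_ids(message))
```

A message without a task number yields an empty list, which is presumably why the deployers make a point of including one in every sync message.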
[19:26:22] standing by [19:27:07] (03PS1) 10Alexandros Kosiaris: akosiaris: Update no_proxy vars [puppet] - 10https://gerrit.wikimedia.org/r/634320 [19:28:18] (03PS1) 10Dduvall: all wikis to 1.36.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634321 [19:28:20] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.36.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634321 (owner: 10Dduvall) [19:29:03] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634321 (owner: 10Dduvall) [19:30:11] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:50] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.13 [19:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:04] (03CR) 10Milimetric: [C: 03+1] varnish: check for debug=1 value in X-Analytics header [puppet] - 10https://gerrit.wikimedia.org/r/629735 (https://phabricator.wikimedia.org/T263683) (owner: 10Effie Mouzeli) [19:35:05] RECOVERY - Ensure local MW versions match expected deployment on mw2279 is OK: OKAY: Not alerting due to fresh production wikiversions: 956 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [19:35:06] hmm, seeing a new funny smelling error "InvalidArgumentException from line 2493 of /srv/mediawiki/php-1.36.0-wmf.13/includes/libs/rdbms/database/Database.php: Wikimedia\Rdbms\Database::makeList: empty input for field wbtl_type_id" [19:35:14] (03PS3) 10Krinkle: Improve error message if wikiversions.php has wrong format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622408 (owner: 10Ahmon Dancy) [19:35:52] (03PS4) 10Krinkle: multiversion: Improve error message for wikiversions.php in wrong format [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622408 (owner: 10Ahmon Dancy) [19:35:55] nm. 
those happened a while ago [19:36:15] yeah I'm not seeing it [19:36:18] (03CR) 10Krinkle: [C: 04-1] multiversion: Improve error message for wikiversions.php in wrong format (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622408 (owner: 10Ahmon Dancy) [19:36:28] (03PS15) 10Krinkle: Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [19:37:38] (03PS3) 10Krinkle: wmf-config/env.php: Add dcs and servicesFile info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632769 (owner: 10Ahmon Dancy) [19:38:01] (03PS4) 10Krinkle: wmf-config/env.php: Add dcs and servicesFile info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632769 (owner: 10Ahmon Dancy) [19:38:16] (03CR) 10Krinkle: [C: 04-1] "Needs to be rebased on I245e84e0b8." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [19:39:43] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@88e1283]: spark: fix handling of unpartitioned data sources [19:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:30] longma: k. 
calling it good for now [19:41:50] 👍 [19:41:50] (03PS5) 10Krinkle: wmf-config/env.php: Add dcs and servicesFile info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632769 (owner: 10Ahmon Dancy) [19:42:04] (03PS6) 10Krinkle: wmf-config/env.php: Add dcs and servicesFile info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632769 (owner: 10Ahmon Dancy) [19:42:26] (03CR) 10Krinkle: [C: 03+1] wmf-config/env.php: Add dcs and servicesFile info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632769 (owner: 10Ahmon Dancy) [19:43:10] !log all wikis promoted to 1.36.0-wmf.13 (T263179) [19:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:16] T263179: 1.36.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T263179 [19:43:37] (03PS16) 10Krinkle: Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [19:43:47] (03PS17) 10Krinkle: Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [19:45:00] (03CR) 10jerkins-bot: [V: 04-1] Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [19:46:05] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@88e1283]: spark: fix handling of unpartitioned data sources (duration: 06m 22s) [19:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:01] (03PS3) 10Dduvall: ci: Install docker-credential-environment credHelper [puppet] - 10https://gerrit.wikimedia.org/r/634316 (https://phabricator.wikimedia.org/T265177) [20:04:56] (03PS1) 10RobH: deploy1002 mac info [puppet] - 10https://gerrit.wikimedia.org/r/634333 (https://phabricator.wikimedia.org/T265653) [20:16:03] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/25940/eventlog1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/634314 
(https://phabricator.wikimedia.org/T262304) (owner: 10Ottomata) [20:16:42] (03CR) 10Aklapper: "Ping. How to get a review here?" [puppet] - 10https://gerrit.wikimedia.org/r/619130 (https://phabricator.wikimedia.org/T259979) (owner: 10Aklapper) [20:27:09] !log cdanis@cumin1001 START - Cookbook sre.network.cf [20:27:10] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [20:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:47] PROBLEM - Ensure local MW versions match expected deployment on mw2279 is CRITICAL: CRITICAL: 956 mismatched wikiversions https://wikitech.wikimedia.org/wiki/Application_servers [20:44:18] (03CR) 10Cwhite: [C: 03+2] Revert "prometheus: ensure new prometheus-rsyslog-exporter version" [puppet] - 10https://gerrit.wikimedia.org/r/634251 (owner: 10Cwhite) [21:00:28] (03PS1) 10Ottomata: Migrate ContentTranslationAbuseFilter event stream to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634339 (https://phabricator.wikimedia.org/T259163) [21:01:05] (03PS2) 10Ottomata: Migrate ContentTranslationAbuseFilter event stream to EventGate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634339 (https://phabricator.wikimedia.org/T259163) [21:02:20] (03CR) 10Ottomata: [C: 04-1] "Just prepping patches, not ready for merge yet." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/634339 (https://phabricator.wikimedia.org/T259163) (owner: 10Ottomata) [21:10:30] 10Operations, 10LDAP-Access-Requests: Access to the Logstash for John Bolorinos - https://phabricator.wikimedia.org/T264918 (10jbolorinos-ctr) Staff contact: @spatton Contract end date: June 30, 2021 [21:16:09] (03PS18) 10Krinkle: Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [21:16:58] (03CR) 10jerkins-bot: [V: 04-1] Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [21:25:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) setup/install deploy1002 - https://phabricator.wikimedia.org/T265653 (10RobH) [21:32:17] (03PS1) 10Legoktm: Add buildpack images ("stacks") [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/634349 (https://phabricator.wikimedia.org/T265686) [21:34:25] (03Abandoned) 10Legoktm: [WIP] Add buildpack base images [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/633036 (owner: 10Legoktm) [21:35:30] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,name=wtp201[0-5].codfw.wmnet [21:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:50] (03PS2) 10Cicalese: apiportal: enable discussion tools [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633980 (https://phabricator.wikimedia.org/T260624) (owner: 10Hnowlan) [21:36:00] (03PS2) 10Legoktm: Add buildpack images ("stacks") [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/634349 (https://phabricator.wikimedia.org/T265686) [21:36:19] (03PS3) 10Cicalese: apiportal: enable discussion tools [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633980 (https://phabricator.wikimedia.org/T260624) (owner: 10Hnowlan) [21:39:55] (03PS1) 10Jeena Huneidi: [DNM] Experimental King helmfile [deployment-charts] - 
10https://gerrit.wikimedia.org/r/634354 [21:40:09] (03CR) 10Bstorm: [C: 03+2] "Thanks looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/634113 (https://phabricator.wikimedia.org/T219501) (owner: 10MichaelSchoenitzer) [21:40:34] (03CR) 10Bstorm: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/634113 (https://phabricator.wikimedia.org/T219501) (owner: 10MichaelSchoenitzer) [21:41:37] (03CR) 10jerkins-bot: [V: 04-1] Add neovim – a modern fork of vim – to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/634113 (https://phabricator.wikimedia.org/T219501) (owner: 10MichaelSchoenitzer) [21:43:23] (03CR) 10Bstorm: "Jenkins is complaining about the commit message. I'll update it." [puppet] - 10https://gerrit.wikimedia.org/r/634113 (https://phabricator.wikimedia.org/T219501) (owner: 10MichaelSchoenitzer) [21:43:38] (03PS3) 10Bstorm: Add neovim – a modern fork of vim – to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/634113 (https://phabricator.wikimedia.org/T219501) (owner: 10MichaelSchoenitzer) [21:44:12] (03CR) 10Dzahn: [C: 03+2] Switch gerrit to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/632224 (https://phabricator.wikimedia.org/T264182) (owner: 10Muehlenhoff) [21:45:01] (03CR) 10Bstorm: [C: 03+2] Add neovim – a modern fork of vim – to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/634113 (https://phabricator.wikimedia.org/T219501) (owner: 10MichaelSchoenitzer) [21:45:44] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/634213 (https://phabricator.wikimedia.org/T265587) (owner: 10Jbond) [21:47:35] (03CR) 10Dzahn: "/etc/apt/sources.list.d/repository_openjdk-8-jdk.list was removed and then recreated by puppet" [puppet] - 10https://gerrit.wikimedia.org/r/632224 (https://phabricator.wikimedia.org/T264182) (owner: 10Muehlenhoff) [21:49:35] PROBLEM - Check systemd state on ms-be2043 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:50:21] (03CR) 10Dzahn: [C: 03+1] "> The error is related to facter. .." [puppet] - 10https://gerrit.wikimedia.org/r/628459 (owner: 10Dzahn) [21:53:54] (03PS1) 10Cicalese: Configuration for user menu and sidebar special pages. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634356 (https://phabricator.wikimedia.org/T264246) [21:54:47] (03PS3) 10Dzahn: wmcs::postgres: hiera->lookup and add data types [puppet] - 10https://gerrit.wikimedia.org/r/628459 [21:55:35] bstorm: ^ new compiler output from John looking good. noop on clouddb*. was going to merge it unless that's one you would like to veto [21:56:31] 👀 [21:57:58] mutante: sure, go for it [21:58:07] bstorm: thanks! [21:58:17] 👍🏻 [21:58:18] (03CR) 10Dzahn: [C: 03+2] wmcs::postgres: hiera->lookup and add data types [puppet] - 10https://gerrit.wikimedia.org/r/628459 (owner: 10Dzahn) [21:58:58] (03CR) 10jerkins-bot: [V: 04-1] Configuration for user menu and sidebar special pages. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634356 (https://phabricator.wikimedia.org/T264246) (owner: 10Cicalese) [21:59:06] (03PS2) 10Cicalese: Configuration for user menu and sidebar special pages. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634356 (https://phabricator.wikimedia.org/T264246) [22:00:00] (03PS3) 10Cicalese: Configuration for user menu and sidebar special pages. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634356 (https://phabricator.wikimedia.org/T264246) [22:01:24] (03CR) 10Reedy: [C: 04-1] Configuration for user menu and sidebar special pages. 
(034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634356 (https://phabricator.wikimedia.org/T264246) (owner: 10Cicalese) [22:05:13] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,name=wtp201[6-9].codfw.wmnet [22:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:24] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,name=wtp2020.codfw.wmnet [22:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:16] !log depooled remaining wtp* servers in codfw. old parsoid servers, new servers are parse2* (T265558) [22:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:06:22] T265558: decommission wtp2001 through wtp2020 - https://phabricator.wikimedia.org/T265558 [22:06:43] PROBLEM - ping-offload grafana alert on alert1001 is CRITICAL: CRITICAL: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is alerting: target IP missing on hosts loopback. https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert https://grafana.wikimedia.org/d/000000513/ [22:08:27] RECOVERY - ping-offload grafana alert on alert1001 is OK: OK: Ping offload ( https://grafana.wikimedia.org/d/000000513/ping-offload ) is not alerting. 
https://wikitech.wikimedia.org/wiki/Ping_offload%23InAddrErrors_alert https://grafana.wikimedia.org/d/000000513/ [22:08:33] !log cdanis@cumin1001 START - Cookbook sre.network.cf [22:08:33] !log cdanis@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [22:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:12] (03CR) 10Dzahn: [C: 03+2] DHCP: remove wtp2001 through wtp2020 [puppet] - 10https://gerrit.wikimedia.org/r/634125 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [22:09:18] (03PS2) 10Dzahn: DHCP: remove wtp2001 through wtp2020 [puppet] - 10https://gerrit.wikimedia.org/r/634125 (https://phabricator.wikimedia.org/T265558) [22:09:29] (03CR) 10Cicalese: Configuration for user menu and sidebar special pages. (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634356 (https://phabricator.wikimedia.org/T264246) (owner: 10Cicalese) [22:09:47] !log previous sre.network.cf invocation was a no-op; just checking status [22:09:47] PROBLEM - Thanos query has high gRPC client errors on alert1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [22:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:22] (03PS5) 10Bstorm: Add fd and ripgrep to toolforge [puppet] - 10https://gerrit.wikimedia.org/r/633583 (https://phabricator.wikimedia.org/T265689) (owner: 10MichaelSchoenitzer) [22:12:00] (03PS2) 10Dzahn: scap/cumin: switch parsoid codfw canaries from wtp2001/2002 to parse2001/2002 [puppet] - 10https://gerrit.wikimedia.org/r/634128 (https://phabricator.wikimedia.org/T265558) [22:12:21] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [22:13:09] (03CR) 10Dzahn: "@subbu fyi, wtp2* is being removed. the new parse2* servers are staying of course." [puppet] - 10https://gerrit.wikimedia.org/r/634128 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [22:13:20] (03CR) 10Bstorm: [C: 04-1] "Leaving -1 review so this is not merged until/unless the attached task is unblocked one way or another." [puppet] - 10https://gerrit.wikimedia.org/r/633583 (https://phabricator.wikimedia.org/T265689) (owner: 10MichaelSchoenitzer) [22:14:27] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 22.83 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:14:57] RECOVERY - Thanos query has high gRPC client errors on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query [22:16:11] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 98.83 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:17:56] (03PS1) 10Dzahn: remove wtp2001-wtp2020 from LinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634361 (https://phabricator.wikimedia.org/T265558) [22:18:04] (03CR) 10jerkins-bot: [V: 04-1] remove wtp2001-wtp2020 from LinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634361 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [22:18:29] (03PS2) 10Dzahn: remove wtp2001-wtp2020 from LinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634361 (https://phabricator.wikimedia.org/T265558) [22:19:51] (03CR) 10Dzahn: [C: 03+2] scap/cumin: switch parsoid codfw canaries from wtp2001/2002 to parse2001/2002 [puppet] - 10https://gerrit.wikimedia.org/r/634128 (https://phabricator.wikimedia.org/T265558) (owner: 10Dzahn) [22:23:27] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=misc file=smartmon.prom instance=relforge1004 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [22:25:21] (03PS1) 10Dzahn: site: remove wtp2001 through wtp2020 [puppet] - 10https://gerrit.wikimedia.org/r/634362 (https://phabricator.wikimedia.org/T265558) [22:28:13] (03PS19) 10Krinkle: Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [22:29:01] (03CR) 10jerkins-bot: [V: 04-1] Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) 
[22:29:33] (03CR) 10Krinkle: [C: 03+2] wmf-config/env.php: Add dcs and servicesFile info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632769 (owner: 10Ahmon Dancy) [22:30:13] (03Merged) 10jenkins-bot: wmf-config/env.php: Add dcs and servicesFile info [mediawiki-config] - 10https://gerrit.wikimedia.org/r/632769 (owner: 10Ahmon Dancy) [22:30:42] (03CR) 10Krinkle: "OK, I'll leave this for you to update. I'm missing something clearly :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [22:31:12] dancy: staging now, you check WmD/ logstash mwdebug? [22:32:10] OK. Digging up my notes on that. [22:32:27] RECOVERY - Check systemd state on ms-be2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:34:37] (03PS1) 10Dzahn: cassandra: add data types, hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/634363 [22:35:21] Logstash ok. WmD ok [22:36:15] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [22:37:22] (03PS1) 10Dzahn: archiva: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/634364 [22:38:55] (03CR) 10Bstorm: "I wonder if we could base it purely on the upstream image that buster-sssd uses as its base (docker-registry.wikimedia.org/wikimedia-buste" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/634349 (https://phabricator.wikimedia.org/T265686) (owner: 10Legoktm) [22:39:00] (03CR) 10Esanders: [C: 03+1] apiportal: enable discussion tools [mediawiki-config] - 10https://gerrit.wikimedia.org/r/633980 (https://phabricator.wikimedia.org/T260624) (owner: 10Hnowlan) [22:46:18] (03PS1) 10Dzahn: puppetmaster: pass $servers parameter to gitclone class [puppet] - 10https://gerrit.wikimedia.org/r/634368 [22:49:37] (03PS1) 10Dzahn: mirrors: fix FIXME about including sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/634369 [22:50:26] (03CR) 10jerkins-bot: [V: 04-1] mirrors: fix FIXME about including sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/634369 (owner: 10Dzahn) [22:52:12] (03PS1) 10Dzahn: statistics::user: remove lint-ignore that ignores nothing [puppet] - 10https://gerrit.wikimedia.org/r/634370 [22:54:38] Krinkle: Ready when you are. [22:55:16] dancy: sorry, haven't pulled yet I think. I was waiting for you to be ready and check current state, and I forgot after that. [22:55:29] dancy: now live on mwdebug2001 [22:55:40] running some tests. [22:57:00] Looks good. [22:57:15] `mw.config.set({"wgBackendResponseTime":160,"wgHostname":"mwdebug2001"}` [22:57:47] (03PS1) 10Dzahn: phabricator: remove lint-ignore, fix alignments [puppet] - 10https://gerrit.wikimedia.org/r/634371 [22:58:25] I see some unexpected "IP change within the same session" log messages [22:58:33] but I'm confident those are not caused by this change. 
[22:59:19] there are generally no type:mediawiki messages though when just browsing. I've informed t.gr about this and will report later if needed. [22:59:21] okay, rolling out [22:59:37] Excellent. [23:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201015T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:00:55] !log krinkle@deploy1001 Synchronized wmf-config/env.php: I245e84e0b8c (duration: 01m 10s) [23:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:05] 10Operations, 10Research, 10Wikimedia-Apache-configuration, 10Patch-For-Review: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (10EBernhardson) I think this would typically go in https://wikitech.wikimedia.org/wik... [23:03:15] (03CR) 10Ahmon Dancy: [C: 04-2] "Marking -2 to avoid confusion. Will probably abandon." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [23:03:46] (03CR) 10Ebernhardson: [C: 03+1] "Patch and pcc seem reasonable: https://puppet-compiler.wmflabs.org/compiler1001/25941/" [puppet] - 10https://gerrit.wikimedia.org/r/619130 (https://phabricator.wikimedia.org/T259979) (owner: 10Aklapper) [23:05:00] (03PS1) 10Dzahn: ores::web: remove lint-ignore that ignores nothing [puppet] - 10https://gerrit.wikimedia.org/r/634373 [23:06:27] (03PS1) 10Dzahn: base::environment: remove lint-ignore that ignores nothing [puppet] - 10https://gerrit.wikimedia.org/r/634374 [23:06:29] (03PS2) 10Legoktm: Don't install apt-transport-https for buster [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/631998 [23:13:11] (03PS1) 10Dzahn: remove more lint-ignores that don't do anything [puppet] - 10https://gerrit.wikimedia.org/r/634376 [23:14:19] (03CR) 10jerkins-bot: [V: 04-1] remove more lint-ignores that don't do anything [puppet] - 10https://gerrit.wikimedia.org/r/634376 (owner: 10Dzahn) [23:16:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) setup/install deploy1002 - https://phabricator.wikimedia.org/T265653 (10Dzahn) @RobH Please leave it on stretch for now. It relies on Mediawiki classes that are not ready for buster just yet. 
[23:17:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) setup/install deploy1002 - https://phabricator.wikimedia.org/T265653 (10Dzahn) a:05Dzahn→03RobH [23:19:44] (03Abandoned) 10Dzahn: remove more lint-ignores that don't do anything [puppet] - 10https://gerrit.wikimedia.org/r/634376 (owner: 10Dzahn) [23:20:38] (03Abandoned) 10Dzahn: mirrors: fix FIXME about including sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/634369 (owner: 10Dzahn) [23:29:37] 10Operations, 10CheckUser, 10Traffic: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368 (10jrbs) [23:44:40] 10Operations, 10Wikimedia-Mailing-lists: Please close unused mailing list fkc-l - https://phabricator.wikimedia.org/T265659 (10Dzahn) 05Open→03Resolved a:03Dzahn Done! "No such list fkc-l" [23:45:48] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Tacsipacsi) [23:48:50] 10Operations, 10Mail, 10Epic: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144 (10Dzahn) We will remove remaining personal aliases for non-staff once 2020 is over. The users will be contacted that they have time until end of year to switch to an alternative. This was... [23:49:21] !log Began in-place reindex of `eqiad`, `codfw`, and `cloudelastic`. 
Running on `ryankemper@mwmaint2001` under tmux sessions `inplace_reindex_[eqiad, codfw, cloudelastic]` [23:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:11] 10Operations, 10Mail: Forwarding or alias for fundraising@ - https://phabricator.wikimedia.org/T252932 (10Dzahn) a:05JGulingan→03MBeat33 re-assigning based on the last comment [23:50:17] (03PS3) 10Legoktm: Add buildpack images ("stacks") [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/634349 (https://phabricator.wikimedia.org/T265686) [23:56:46] 10Operations, 10Parsoid, 10Parsoid-Tests, 10serviceops, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) I am not sure what exactly needs to be done here as the next step. I have created a VM and both rt client/ser...