[00:48:05] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s4 on db2140 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 959.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:53:21] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8246641400 and 453 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:31] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4550001296 and 221 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:37] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1805426296 and 82 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:19] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 739475792 and 58 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:19] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1235530216 and 80 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:53] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4976185352 and 302 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:59] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 152504 and 120 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:59] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 152504 and 120 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:57] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 464624 and 176 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:11] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 10344 and 252 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:29] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 200376 and 270 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:01] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 24144 and 361 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:16:33] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:07] <icinga-wm>	 PROBLEM - HP RAID on ms-be2055 is CRITICAL: CRITICAL: Slot 0: Failed: 3I:3:1 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:2, 3I:3:3, 3I:3:4, 4I:5:1, 4I:5:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[02:30:10] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be2055 is CRITICAL: CRITICAL: Slot 0: Failed: 3I:3:1 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:2, 3I:3:3, 3I:3:4, 4I:5:1, 4I:5:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T271055 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[02:30:15] <wikibugs>	 Operations, ops-codfw: Degraded RAID on ms-be2055 - https://phabricator.wikimedia.org/T271055 (ops-monitoring-bot)
[02:40:49] <wikibugs>	 Operations, ops-codfw, SRE-swift-storage: Degraded RAID on ms-be2055 - https://phabricator.wikimedia.org/T271055 (Peachey88)
[02:46:36] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:51:22] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:14:36] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:18:12] <wikibugs>	 (CR) Luke081515: [C: +1] hrwiki: Restrict changetags permissions to sysop and bot group [mediawiki-config] - https://gerrit.wikimedia.org/r/653173 (https://phabricator.wikimedia.org/T270996) (owner: Urbanecm)
[03:20:42] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[03:22:20] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[04:14:46] <wikibugs>	 (PS1) Andrew Bogott: Nova: modernize api-paste.ini.erb [puppet] - https://gerrit.wikimedia.org/r/653667 (https://phabricator.wikimedia.org/T261134)
[04:16:53] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Nova: modernize api-paste.ini.erb [puppet] - https://gerrit.wikimedia.org/r/653667 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[05:40:58] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:51:17] <wikibugs>	 Operations, SRE-swift-storage, netops: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041 (elukey) It is very strange since from /var/log/swift I see the host logging requests, and pings to other ms-be in codfw work, but TCP conns to the puppet master for example fail:  ` eluk...
[08:58:31] <wikibugs>	 Operations, SRE-swift-storage, netops: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041 (elukey) Something might be messed up in the network config, I see a strange routing for v6 (no G flags for example):  ` elukey@ms-be2050:~$ sudo route -n -6 Kernel IPv6 routing table Des...
[09:07:12] <elukey>	 !log reboot ms-be2050 as an attempt to recover/fix its broken networking state (broken since Dec 30th) - T271041
[09:07:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:17] <stashbot>	 T271041: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041
[09:26:08] <wikibugs>	 Operations, SRE-swift-storage, netops: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041 (elukey) Something changed:  * puppet now runs on ipv4 * swift container availability [[ https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?viewPanel=8&orgId=1&var-DC=codfw&var-prometheus=co...
[09:30:10] <elukey>	 weird, no idea why ipv6 doesn't work on ms-be2050, will check later..
[09:48:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:50:04] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:02:42] <icinga-wm>	 PROBLEM - SSH on an-worker1114 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:00:48] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[11:01:46] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[11:38:59] <elukey>	 !log powercycle an-worker1114 (kernel errors in the serial console)
[11:39:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:02] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1114 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:02] <icinga-wm>	 RECOVERY - SSH on an-worker1114 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:43:34] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1114 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:50] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[13:11:40] <icinga-wm>	 RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[14:07:58] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:11:56] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:12] <andrewbogott>	 !log restarting slapd on serpens and seaborgium
[14:42:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:18] <andrewbogott>	 !log disabling puppet fleet-wide to avert potential disaster from acme-chief cert rotation T271063
[15:30:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:23] <stashbot>	 T271063: acme-chief just generated invalid ldap certs - https://phabricator.wikimedia.org/T271063
[15:49:17] <vgutierrez>	 !log reenable puppet on ldap-replica2004.wm.o
[15:49:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:38] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openldap: use chained certificate for slapd service [puppet] - https://gerrit.wikimedia.org/r/653871 (https://phabricator.wikimedia.org/T271063)
[15:50:57] <wikibugs>	 (CR) David Caro: [C: +2] openldap: use chained certificate for slapd service [puppet] - https://gerrit.wikimedia.org/r/653871 (https://phabricator.wikimedia.org/T271063) (owner: Arturo Borrero Gonzalez)
[16:01:05] <wikibugs>	 (CR) Vgutierrez: [C: +1] openldap: use chained certificate for slapd service [puppet] - https://gerrit.wikimedia.org/r/653871 (https://phabricator.wikimedia.org/T271063) (owner: Arturo Borrero Gonzalez)
[16:02:16] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] openldap: use chained certificate for slapd service [puppet] - https://gerrit.wikimedia.org/r/653871 (https://phabricator.wikimedia.org/T271063) (owner: Arturo Borrero Gonzalez)
[16:17:26] <arturo>	 !log merged change to TLS cert used by slapd/openldap servers https://gerrit.wikimedia.org/r/c/operations/puppet/+/653871
[16:17:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:32] <wikibugs>	 Operations, SRE-swift-storage, netops: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041 (elukey) Some notes after tests:  1) I don't see Router Advertisements using tcpdumps on ms-be2050, but I see them on all other nodes. I don't recall if the default gw settings are set vi...
[18:49:10] <wikibugs>	 Operations, SRE-swift-storage, netops: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041 (elukey) I am out of ideas, the next thing that I'd check is if the fiber between the switch and the host needs to be replaced..
[19:44:18] <wikibugs>	 (CR) Luke081515: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[19:50:23] <wikibugs>	 (PS3) Luke081515: Add wgImportSources for zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[19:57:39] <wikibugs>	 (PS4) Luke081515: Add wgImportSources for zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[21:42:02] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[21:43:42] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:12:04] <wikibugs>	 (PS1) Andrew Bogott: Nova: generate vendor_data from a puppet hash rather than a template [puppet] - https://gerrit.wikimedia.org/r/653972 (https://phabricator.wikimedia.org/T271056)
[22:14:09] <wikibugs>	 (CR) Andrew Bogott: "pcc results: https://puppet-compiler.wmflabs.org/compiler1001/27323/" [puppet] - https://gerrit.wikimedia.org/r/653972 (https://phabricator.wikimedia.org/T271056) (owner: Andrew Bogott)
[22:36:35] <wikibugs>	 (CR) Jforrester: [C: +1] Temporarily add alternative path for AbuseFilter script [puppet] - https://gerrit.wikimedia.org/r/653539 (owner: Daimona Eaytoy)
[22:44:39] <wikibugs>	 (PS1) Andrew Bogott: Nova: move our injected userdata into vendor_data, where it belongs [puppet] - https://gerrit.wikimedia.org/r/653976 (https://phabricator.wikimedia.org/T271056)
[22:45:08] <wikibugs>	 (CR) jerkins-bot: [V: -1] Nova: move our injected userdata into vendor_data, where it belongs [puppet] - https://gerrit.wikimedia.org/r/653976 (https://phabricator.wikimedia.org/T271056) (owner: Andrew Bogott)
[22:46:53] <wikibugs>	 (PS2) Andrew Bogott: Nova: move our injected userdata into vendor_data, where it belongs [puppet] - https://gerrit.wikimedia.org/r/653976 (https://phabricator.wikimedia.org/T271056)
[22:47:25] <wikibugs>	 (CR) jerkins-bot: [V: -1] Nova: move our injected userdata into vendor_data, where it belongs [puppet] - https://gerrit.wikimedia.org/r/653976 (https://phabricator.wikimedia.org/T271056) (owner: Andrew Bogott)
[22:50:07] <wikibugs>	 (PS3) Andrew Bogott: Nova: move our injected userdata into vendor_data, where it belongs [puppet] - https://gerrit.wikimedia.org/r/653976 (https://phabricator.wikimedia.org/T271056)
[22:53:27] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Nova: generate vendor_data from a puppet hash rather than a template [puppet] - https://gerrit.wikimedia.org/r/653972 (https://phabricator.wikimedia.org/T271056) (owner: Andrew Bogott)
[22:57:28] <wikibugs>	 (PS4) Andrew Bogott: Nova: move our injected userdata into vendor_data, where it belongs [puppet] - https://gerrit.wikimedia.org/r/653976 (https://phabricator.wikimedia.org/T271056)
[22:59:30] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Nova: move our injected userdata into vendor_data, where it belongs [puppet] - https://gerrit.wikimedia.org/r/653976 (https://phabricator.wikimedia.org/T271056) (owner: Andrew Bogott)
[23:11:48] <wikibugs>	 (PS1) Andrew Bogott: Revert "Nova: move our injected userdata into vendor_data, where it belongs" [puppet] - https://gerrit.wikimedia.org/r/653964
[23:12:51] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Revert "Nova: move our injected userdata into vendor_data, where it belongs" [puppet] - https://gerrit.wikimedia.org/r/653964 (owner: Andrew Bogott)
[23:53:02] <wikibugs>	 Operations, SRE-swift-storage: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". - https://phabricator.wikimedia.org/T270994 (JGHowes) Apparently now fixed ... successfully deleted a...