[00:48:05] PROBLEM - MariaDB Replica Lag: s4 on db2140 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 959.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:53:21] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8246641400 and 453 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:31] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4550001296 and 221 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:53:37] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1805426296 and 82 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:19] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 739475792 and 58 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:19] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1235530216 and 80 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:54:53] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4976185352 and 302 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:59] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 152504 and 120 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:59] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 152504 and 120 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:57] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 464624 and 176 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:11] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 10344 and 252 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:29] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 200376 and 270 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:01] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 24144 and 361 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:16:33] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:30:07] PROBLEM - HP RAID on ms-be2055 is CRITICAL: CRITICAL: Slot 0: Failed: 3I:3:1 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:2, 3I:3:3, 3I:3:4, 4I:5:1, 4I:5:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[02:30:10] ACKNOWLEDGEMENT - HP RAID on ms-be2055 is CRITICAL: CRITICAL: Slot 0: Failed: 3I:3:1 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:2, 3I:3:3, 3I:3:4, 4I:5:1, 4I:5:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T271055 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[02:30:15] Operations, ops-codfw: Degraded RAID on ms-be2055 - https://phabricator.wikimedia.org/T271055 (ops-monitoring-bot)
[02:40:49] Operations, ops-codfw, SRE-swift-storage: Degraded RAID on ms-be2055 - https://phabricator.wikimedia.org/T271055 (Peachey88)
[02:46:36] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:51:22] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:14:36] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:18:12] (CR) Luke081515: [C: +1] hrwiki: Restrict changetags permissions to sysop and bot group [mediawiki-config] - https://gerrit.wikimedia.org/r/653173 (https://phabricator.wikimedia.org/T270996) (owner: Urbanecm)
[03:20:42] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[03:22:20] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[04:14:46] (PS1) Andrew Bogott: Nova: modernize api-paste.ini.erb [puppet] - https://gerrit.wikimedia.org/r/653667 (https://phabricator.wikimedia.org/T261134)
[04:16:53] (CR) Andrew Bogott: [C: +2] Nova: modernize api-paste.ini.erb [puppet] - https://gerrit.wikimedia.org/r/653667 (https://phabricator.wikimedia.org/T261134) (owner: Andrew Bogott)
[05:40:58] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 100 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:51:17] Operations, SRE-swift-storage, netops: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041 (elukey) It is very strange since from /var/log/swift I see the host logging requests, and pings to other ms-be in codfw work, but TCP conns to the puppet master for example fail: ` eluk...
[08:58:31] Operations, SRE-swift-storage, netops: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041 (elukey) Something might be messed up in the network config, I see a strange routing for v6 (no G flags for example): ` elukey@ms-be2050:~$ sudo route -n -6 Kernel IPv6 routing table Des...
[09:07:12] !log reboot ms-be2050 as attempt to recover/fix its broken networking state (started from Dec 30th) - T271041
[09:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:17] T271041: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041
[09:26:08] Operations, SRE-swift-storage, netops: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041 (elukey) Something changed: * puppet now runs on ipv4 * swift container availability [[ https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?viewPanel=8&orgId=1&var-DC=codfw&var-prometheus=co...
[09:30:10] weird, no idea why ipv6 doesn't work on ms-be2050, will check later..
[09:48:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:50:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:02:42] PROBLEM - SSH on an-worker1114 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:00:48] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[11:01:46] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[11:38:59] !log powercycle an-worker1114 (kernel errors in the serial console)
[11:39:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:02] PROBLEM - Check systemd state on an-worker1114 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:43:02] RECOVERY - SSH on an-worker1114 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[11:43:34] RECOVERY - Check systemd state on an-worker1114 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:50] RECOVERY - mediawiki originals uploads -hourly- for eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[13:11:40] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[14:07:58] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:11:56] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:12] !log restarting slapd on serpens and seaborgium
[14:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:18] !log disabling puppet fleet-wide to avert potential disaster from acme-chief cert rotation T271063
[15:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:23] T271063: acme-chief just generated invalid ldap certs - https://phabricator.wikimedia.org/T271063
[15:49:17] !log reenable puppet on ldap-replica2004.wm.o
[15:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:38] (PS1) Arturo Borrero Gonzalez: openldap: use chained certificate for slapd service [puppet] - https://gerrit.wikimedia.org/r/653871 (https://phabricator.wikimedia.org/T271063)
[15:50:57] (CR) David Caro: [C: +2] openldap: use chained certificate for slapd service [puppet] - https://gerrit.wikimedia.org/r/653871 (https://phabricator.wikimedia.org/T271063) (owner: Arturo Borrero Gonzalez)
[16:01:05] (CR) Vgutierrez: [C: +1] openldap: use chained certificate for slapd service [puppet] - https://gerrit.wikimedia.org/r/653871 (https://phabricator.wikimedia.org/T271063) (owner: Arturo Borrero Gonzalez)
[16:02:16] (CR) Arturo Borrero Gonzalez: [C: +2] openldap: use chained certificate for slapd service [puppet] - https://gerrit.wikimedia.org/r/653871 (https://phabricator.wikimedia.org/T271063) (owner: Arturo Borrero Gonzalez)
[16:17:26] !log merged change to TLS cert used by slapd/openldap servers https://gerrit.wikimedia.org/r/c/operations/puppet/+/653871
[16:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:32] Operations, SRE-swift-storage, netops: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041 (elukey) Some notes after tests: 1) I don't see Router Advertisements using tcpdumps on ms-be2050, but I see them on all other nodes. I don't recall if the default gw settings are set vi...
[18:49:10] Operations, SRE-swift-storage, netops: ms-be2050 shows network errors - https://phabricator.wikimedia.org/T271041 (elukey) I am out of ideas, the next thing that I'd check is if the fiber between the switch and the host needs to be replaced..
[19:44:18] (CR) Luke081515: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[19:50:23] (PS3) Luke081515: Add wgImportSources for zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[19:57:39] (PS4) Luke081515: Add wgImportSources for zhwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/637869 (https://phabricator.wikimedia.org/T266388) (owner: Hamish)
[21:42:02] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[21:43:42] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:12:04] (PS1) Andrew Bogott: Nova: generate vendor_data from a puppet hash rather than a template [puppet] - https://gerrit.wikimedia.org/r/653972 (https://phabricator.wikimedia.org/T271056)
[22:14:09] (CR) Andrew Bogott: "pcc results: https://puppet-compiler.wmflabs.org/compiler1001/27323/" [puppet] - https://gerrit.wikimedia.org/r/653972 (https://phabricator.wikimedia.org/T271056) (owner: Andrew Bogott)
[22:36:35] (CR) Jforrester: [C: +1] Temporarily add alternative path for AbuseFilter script [puppet] - https://gerrit.wikimedia.org/r/653539 (owner: Daimona Eaytoy)
[22:44:39] (PS1) Andrew Bogott: Nova: move our injected userdata into vendor_data, where it belongs [puppet] - https://gerrit.wikimedia.org/r/653976 (https://phabricator.wikimedia.org/T271056)
[22:45:08] (CR) jerkins-bot: [V: -1] Nova: move our injected userdata into vendor_data, where it belongs [puppet] - https://gerrit.wikimedia.org/r/653976 (https://phabricator.wikimedia.org/T271056) (owner: Andrew Bogott)
[22:46:53] (PS2) Andrew Bogott: Nova: move our injected userdata into vendor_data, where it belongs [puppet] - https://gerrit.wikimedia.org/r/653976 (https://phabricator.wikimedia.org/T271056)
[22:47:25] (CR) jerkins-bot: [V: -1] Nova: move our injected userdata into vendor_data, where it belongs [puppet] - https://gerrit.wikimedia.org/r/653976 (https://phabricator.wikimedia.org/T271056) (owner: Andrew Bogott)
[22:50:07] (PS3) Andrew Bogott: Nova: move our injected userdata into vendor_data, where it belongs [puppet] - https://gerrit.wikimedia.org/r/653976 (https://phabricator.wikimedia.org/T271056)
[22:53:27] (CR) Andrew Bogott: [C: +2] Nova: generate vendor_data from a puppet hash rather than a template [puppet] - https://gerrit.wikimedia.org/r/653972 (https://phabricator.wikimedia.org/T271056) (owner: Andrew Bogott)
[22:57:28] (PS4) Andrew Bogott: Nova: move our injected userdata into vendor_data, where it belongs [puppet] - https://gerrit.wikimedia.org/r/653976 (https://phabricator.wikimedia.org/T271056)
[22:59:30] (CR) Andrew Bogott: [C: +2] Nova: move our injected userdata into vendor_data, where it belongs [puppet] - https://gerrit.wikimedia.org/r/653976 (https://phabricator.wikimedia.org/T271056) (owner: Andrew Bogott)
[23:11:48] (PS1) Andrew Bogott: Revert "Nova: move our injected userdata into vendor_data, where it belongs" [puppet] - https://gerrit.wikimedia.org/r/653964
[23:12:51] (CR) Andrew Bogott: [C: +2] Revert "Nova: move our injected userdata into vendor_data, where it belongs" [puppet] - https://gerrit.wikimedia.org/r/653964 (owner: Andrew Bogott)
[23:53:02] Operations, SRE-swift-storage: Re-deleting a Commons file: "Error deleting file: The file "mwstore://local-multiwrite/local-deleted/..." is in an inconsistent state within the internal storage backends". - https://phabricator.wikimedia.org/T270994 (JGHowes) Apparently now fixed ... successfully deleted a...