[00:02:24] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:09:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:09:52] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:12:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:23:00] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:41:48] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100%
[00:43:30] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms
[00:44:10] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:38] SRE, Mail: Mail to root@lists1001.wikimedia.org from noreply@lists1001.wikimedia.org doesn't work - https://phabricator.wikimedia.org/T280744 (Legoktm)
[00:48:08] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:48:32] SRE, Mail: Mail to root@lists1001.wikimedia.org from noreply@lists1001.wikimedia.org doesn't work -
https://phabricator.wikimedia.org/T280744 (Legoktm) I copied the code out of systemd-timer-mail-wrapper and ran it interactively to see if I could get it to work. ` >>> from email.message import EmailMess...
[00:50:34] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100%
[00:52:21] (PS1) Legoktm: systemd-timer-mail-wrapper: Set "Sender" header on emails [puppet] - https://gerrit.wikimedia.org/r/685980 (https://phabricator.wikimedia.org/T280744)
[00:53:44] (CR) Legoktm: "I only tested this on lists1001" [puppet] - https://gerrit.wikimedia.org/r/685980 (https://phabricator.wikimedia.org/T280744) (owner: Legoktm)
[00:55:44] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:55:46] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org
[01:00:46] (Traffic bill over quota) resolved: Traffic bill over quota - https://alerts.wikimedia.org
[01:07:27] (PS1) Cwhite: profile: turn off scap duplication [puppet] - https://gerrit.wikimedia.org/r/685987 (https://phabricator.wikimedia.org/T234565)
[01:08:06] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:08:38] (PS2) Cwhite: profile: turn off scap duplication [puppet] - https://gerrit.wikimedia.org/r/685987 (https://phabricator.wikimedia.org/T234565)
[01:09:07] (PS3) Cwhite: profile: turn off scap duplication [puppet] - https://gerrit.wikimedia.org/r/685987 (https://phabricator.wikimedia.org/T234565)
[01:11:54] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms
[01:17:55] (CR) Cwhite: [C: +2] profile: turn off scap duplication [puppet] -
https://gerrit.wikimedia.org/r/685987 (https://phabricator.wikimedia.org/T234565) (owner: Cwhite)
[01:18:32] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:20:18] SRE, ops-codfw, DC-Ops, Discovery-Search: hw troubleshooting: failure to power up for elastic2043.codfw.wmnet - https://phabricator.wikimedia.org/T281327 (Papaul) Open→Resolved Tested both power supplies by running the server on only one PSU, the server works fine. I also upgrade the BI...
[01:25:20] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:31:00] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:32:28] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:38:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:40:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:44:12] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:44:54] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:46:12] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:46:54] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:48:03] (PS3) Legoktm: lists: Add Apache configuration for pipermail redirects [puppet] - https://gerrit.wikimedia.org/r/685711
[01:48:05] (PS3) Legoktm: mailman3: Script to generate pipermail redirects [puppet] - https://gerrit.wikimedia.org/r/685723 (https://phabricator.wikimedia.org/T280731)
[01:48:07] (CR) Legoktm: lists: Add Apache configuration for pipermail redirects (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685711 (owner: Legoktm)
[01:49:47] (CR) Ladsgroup: [C: +1] lists: Add Apache configuration for pipermail redirects [puppet] - https://gerrit.wikimedia.org/r/685711 (owner: Legoktm)
[01:51:12] (CR) Ladsgroup: [C: +1] systemd-timer-mail-wrapper: Set "Sender" header on emails [puppet] - https://gerrit.wikimedia.org/r/685980 (https://phabricator.wikimedia.org/T280744) (owner: Legoktm)
[02:03:16] (PS4) Legoktm: lists: Add Apache configuration for pipermail redirects [puppet] - https://gerrit.wikimedia.org/r/685711
[02:04:26] (PS5) Legoktm: lists:
Add Apache configuration for pipermail redirects [puppet] - https://gerrit.wikimedia.org/r/685711
[02:16:04] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:18:16] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[02:34:38] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:46:54] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:55:10] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:00:02] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:01:42] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:11:28] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:17:28] (CR) Dmaza: [C: +1] Enable Wikimedia OCR on Beta Wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/685643 (https://phabricator.wikimedia.org/T282080) (owner: Samwilson)
[03:19:34] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:37:00] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.072 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:49:28] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:52:06] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:13:40] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:18:42] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:23:52] legoktm, Amir1: I think mailman3 migration somehow lost my cloud-admin@ subscription :/
[04:24:13] I check
[04:24:14] Uhoh
[04:24:25] that list has some filters preventing subscription requests, but I was a subscriber yesterday
[04:24:27] possibly because of different email address
[04:25:05] I think the address should be same
[04:25:31] and the archives
say "this is you" on my message
[04:25:48] I checked, no address of you is there
[04:25:57] maybe because you were too fast? :D
[04:26:18] let me check the old mailing list members
[04:26:40] wdym by "too fast"?
[04:27:26] hmm, list_members shows you
[04:27:56] yeah and yuvi and some other
[04:28:41] it's the ban list again
[04:28:45] basically every non-wikimedia has been kicked out
[04:28:46] https://lists.wikimedia.org/postorius/lists/cloud-admin.lists.wikimedia.org/bans/
[04:29:03] ugh
[04:30:16] I add them back
[04:30:45] if you want more strangeness, mm3 says that I'm subscribed to mediawiki-commits@ as "nonmember", never been subscribed to that nor do I receive those emails
[04:31:15] https://usercontent.irccloud-cdn.com/file/p2UGwVcl/image.png
[04:31:30] You got to be f... kidding me
[04:32:03] Amir1: did you remove the ban first?
[04:32:07] Majavah: I'm there too, it just accepts emails from you right away
[04:32:19] legoktm: I do it now, didn't want to
[04:33:12] Majavah: the non-member interface is just a mess
[04:33:24] to put it mildly
[04:33:26] basically any address MM3 knows about with relationship to that list, is a non-member
[04:33:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1130 for schema change', diff saved to https://phabricator.wikimedia.org/P15837 and previous config saved to /var/cache/conftool/dbconfig/20210507-043350-marostegui.json
[04:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:35:02] it should be fixed now
[04:35:20] thanks
[04:35:30] I added the ban back hoping it doesn't remove Taavi again
[04:35:44] but that ban is not right
[04:35:59] * Majavah hopes this isn't a wider problem
[04:36:35] at least it still shows me as subscribed
[04:37:53] Majavah: oh, it's a wider problem: https://phabricator.wikimedia.org/T280322#7058759
[04:38:31] legoktm: cloud-admin@ is not even on that lid
[04:38:43] s/lid/list!/
[04:39:51] I'm looking through the db now
[04:40:46] multiple lists
that banned all of @aol.com and @yahoo.com
[04:41:24] in total ninety mailing lists have some sort of wild ban. I have them in my home
[04:42:10] twenty have a total ban of every email. I avoided upgrading them but well, seventy have some sorta of middle ground ban like what lego said
[04:44:24] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:46:56] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:48:10] * legoktm back to afk
[04:50:11] (PS1) Marostegui: install_server: Do not format db1158. [puppet] - https://gerrit.wikimedia.org/r/686150
[04:56:50] (CR) Marostegui: [C: +2] install_server: Do not format db1158. [puppet] - https://gerrit.wikimedia.org/r/686150 (owner: Marostegui)
[05:02:31] Majavah: btw.
I meant "your uid is too low"
[05:03:22] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:05] it's the flappy one I think
[05:05:35] https://gerrit.wikimedia.org/r/c/operations/puppet/+/685980 not merged yet
[05:06:19] I go write a script to handle this ban mess
[05:08:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Repool db1130', diff saved to https://phabricator.wikimedia.org/P15839 and previous config saved to /var/cache/conftool/dbconfig/20210507-050839-root.json
[05:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1087 T282093', diff saved to https://phabricator.wikimedia.org/P15840 and previous config saved to /var/cache/conftool/dbconfig/20210507-051519-marostegui.json
[05:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:29] T282093: decommission db1087.eqiad.wmnet - https://phabricator.wikimedia.org/T282093
[05:16:01] (PS1) Marostegui: db1087: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/686160 (https://phabricator.wikimedia.org/T282093)
[05:16:52] (CR) Marostegui: [C: +2] db1087: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/686160 (https://phabricator.wikimedia.org/T282093) (owner: Marostegui)
[05:23:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 50%: Repool db1130', diff saved to https://phabricator.wikimedia.org/P15841 and previous config saved to /var/cache/conftool/dbconfig/20210507-052343-root.json
[05:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Repool db1130', diff saved to https://phabricator.wikimedia.org/P15842 and
previous config saved to /var/cache/conftool/dbconfig/20210507-053847-root.json
[05:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:21] (PS1) Tim Starling: ApiQueryLogEvents: when user is specified, omit STRAIGHT_JOIN [core] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/685896 (https://phabricator.wikimedia.org/T282122)
[05:42:50] (PS1) Tim Starling: ApiQueryLogEvents: when user is specified, omit STRAIGHT_JOIN [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685897 (https://phabricator.wikimedia.org/T282122)
[05:43:17] (CR) Tim Starling: [C: +2] ApiQueryLogEvents: when user is specified, omit STRAIGHT_JOIN [core] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/685896 (https://phabricator.wikimedia.org/T282122) (owner: Tim Starling)
[05:43:20] (CR) Tim Starling: [C: +2] ApiQueryLogEvents: when user is specified, omit STRAIGHT_JOIN [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685897 (https://phabricator.wikimedia.org/T282122) (owner: Tim Starling)
[05:47:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:50:40] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:53:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Repool db1130', diff saved to https://phabricator.wikimedia.org/P15844 and previous config saved to /var/cache/conftool/dbconfig/20210507-055350-root.json
[05:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:54:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1161 for schema change', diff saved to https://phabricator.wikimedia.org/P15845 and previous config saved to /var/cache/conftool/dbconfig/20210507-055425-marostegui.json
[05:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:59:36] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:00:50] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:04:22] (Merged) jenkins-bot: ApiQueryLogEvents: when user is specified, omit STRAIGHT_JOIN [core] (wmf/1.37.0-wmf.3) - https://gerrit.wikimedia.org/r/685896 (https://phabricator.wikimedia.org/T282122) (owner: Tim Starling)
[06:04:58] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:06:38] (Merged) jenkins-bot: ApiQueryLogEvents: when user is specified, omit STRAIGHT_JOIN [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685897 (https://phabricator.wikimedia.org/T282122) (owner: Tim Starling)
[06:08:04] RECOVERY - Router
interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:08] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:11] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.3/includes/api/ApiQueryLogEvents.php: fix UBN T282122 (duration: 01m 06s)
[06:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:20] T282122: Frequent 504s while using logevents API on Commons - https://phabricator.wikimedia.org/T282122
[06:11:01] !log tstarling@deploy1002 Synchronized php-1.37.0-wmf.4/includes/api/ApiQueryLogEvents.php: fix UBN T282122 (duration: 01m 10s)
[06:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:16] (CR) Elukey: "Left a minor comment, the rest LGTM!"
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/685923 (https://phabricator.wikimedia.org/T282185) (owner: Razzi)
[06:17:28] !log Deploy schema change on s2 codfw, lag will appear T266486 T268392 T273360
[06:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:38] T268392: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392
[06:17:39] T273360: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360
[06:17:39] T266486: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486
[06:22:07] SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Ladsgroup)
[06:25:34] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:33:47] (Traffic bill over quota) firing: Traffic bill over quota - https://alerts.wikimedia.org
[06:38:47] (Traffic bill over quota) firing: (2) Traffic bill over quota - https://alerts.wikimedia.org
[06:43:47] (Traffic bill over quota) resolved: Traffic bill over quota - https://alerts.wikimedia.org
[06:58:00] SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Nemo_bis) I see that WikiIT-l is in group F. I talked with co-admin @Sannita and we agreed we can offer it as guinea pig before the others in the same group, if it's helpful. Jus...
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210507T0700)
[07:02:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: Repool db1161', diff saved to https://phabricator.wikimedia.org/P15846 and previous config saved to /var/cache/conftool/dbconfig/20210507-070214-root.json
[07:02:18] SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Ladsgroup) Thanks. I'm already upgrading group F atm. It'll take an hour or so to reach wikiit-l. We are fairly confident this is stable now :)
[07:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:04:46] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:06:53] SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Nemo_bis) Ah, good to know! Great.
[07:11:50] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:09] SRE, Security-Team, Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (Nemo_bis) > the plan is to serve both mailman2 and mailman3 from the same server [...] FWIW the mailman mailing lists preserved their old pipermail archives and just left...
[07:17:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: Repool db1161', diff saved to https://phabricator.wikimedia.org/P15847 and previous config saved to /var/cache/conftool/dbconfig/20210507-071718-root.json
[07:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:56] (CR) Muehlenhoff: [C: +1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/685980 (https://phabricator.wikimedia.org/T280744) (owner: Legoktm)
[07:18:50] (PS1) Ladsgroup: prometheus: Remove absented crons [puppet] - https://gerrit.wikimedia.org/r/686351 (https://phabricator.wikimedia.org/T273673)
[07:18:53] (PS1) Ladsgroup: prometheus: Migrate cron in node_amd_rocm to systemd timer [puppet] - https://gerrit.wikimedia.org/r/686352 (https://phabricator.wikimedia.org/T273673)
[07:18:55] (PS1) Ladsgroup: prometheus: Migrate node_ssh_open_sessions cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/686353 (https://phabricator.wikimedia.org/T273673)
[07:21:03] (CR) Muehlenhoff: [C: +1] "Looks good" [cookbooks] - https://gerrit.wikimedia.org/r/685855 (owner: Volans)
[07:22:52] (CR) Muehlenhoff: "> Patch Set 4:" [puppet] - https://gerrit.wikimedia.org/r/685923 (https://phabricator.wikimedia.org/T282185) (owner: Razzi)
[07:25:38] SRE, Security-Team, Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (Legoktm) >>! In T52864#7068604, @Nemo_bis wrote: >> the plan is to serve both mailman2 and mailman3 from the same server [...] FWIW the mailman mailing lists preserved the...
[07:25:58] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:27:42] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:28:18] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:28] SRE, Security-Team, Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (Ladsgroup) hyperkitty allows you to download gzip files (click on "Download" in https://lists.wikimedia.org/hyperkitty/list/listadmins@lists.wikimedia.org/2021/5/) Thankfu...
[07:30:00] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:31:14] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:32:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: Repool db1161', diff saved to https://phabricator.wikimedia.org/P15848 and previous config saved to /var/cache/conftool/dbconfig/20210507-073222-root.json
[07:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:32] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:36:39] (CR) Filippo Giunchedi: [C: +1] prometheus: Remove absented crons [puppet] - https://gerrit.wikimedia.org/r/686351 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:37:04] (CR) Filippo Giunchedi: [C: +1] prometheus: Migrate cron in node_amd_rocm to systemd timer [puppet] - https://gerrit.wikimedia.org/r/686352 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:37:31] (CR) Filippo Giunchedi: [C: +1] prometheus: Migrate node_ssh_open_sessions cron to systemd timer [puppet] - https://gerrit.wikimedia.org/r/686353 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:37:32] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:42:06] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:47:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: Repool db1161', diff saved to https://phabricator.wikimedia.org/P15849 and previous config saved to /var/cache/conftool/dbconfig/20210507-074725-root.json
[07:47:33] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:10] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:51:34] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:52:30] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/686352 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[07:59:56] (PS1) Vgutierrez: trafficserver: Clear outbound TLS cacert_path for ats@eqsin [puppet] - https://gerrit.wikimedia.org/r/686384 (https://phabricator.wikimedia.org/T281673)
[08:02:18] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29449/console" [puppet] - https://gerrit.wikimedia.org/r/686384 (https://phabricator.wikimedia.org/T281673) (owner: Vgutierrez)
[08:03:01] (PS11) Legoktm: Add shellbox chart [deployment-charts] - https://gerrit.wikimedia.org/r/667047
[08:04:52] (PS6) Muehlenhoff: Add a cookbook to delete hosts from debmonitor [cookbooks] - https://gerrit.wikimedia.org/r/684819
[08:05:44] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:06:24] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs.
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:07:03] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Implement static redirects from pipermail archives to hyperkitty archives - https://phabricator.wikimedia.org/T280731 (Legoktm) >>! In T52864#7068619, @Legoktm wrote: >>>! In T52864#7068604, @Nemo_bis wrote: >>> the plan is to serve both mailman2 and ma...
[08:07:19] SRE, Security-Team, Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (Legoktm) Let's move the redirect conversation to {T280731}. I replied there.
[08:08:06] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:08:48] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:10:23] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2057.codfw.wmnet
[08:10:25] (CR) Muehlenhoff: Add a cookbook to delete hosts from debmonitor (6 comments) [cookbooks] - https://gerrit.wikimedia.org/r/684819 (owner: Muehlenhoff)
[08:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:55] (CR) Vgutierrez: [V: +1 C: +2] trafficserver: Clear outbound TLS cacert_path for ats@eqsin [puppet] - https://gerrit.wikimedia.org/r/686384 (https://phabricator.wikimedia.org/T281673) (owner: Vgutierrez)
[08:15:07] !log Enforce Puppet Internal CA validation on trafficserver@eqsin - T281673
[08:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:24] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0:
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:19:08] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2057.codfw.wmnet [08:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:30] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:22:46] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [08:27:46] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [08:31:52] 10SRE, 10observability, 10Documentation, 10Service-Architecture, 10Services (later): Create a doc explaining the SLA between services and the monitoring tool - https://phabricator.wikimedia.org/T105780 (10Aklapper) [08:32:19] (03PS1) 10Muehlenhoff: Add ldap-replica2005 to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/686392 [08:38:14] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:39:18] (03PS1) 10Jcrespo: mariadb: Install 10.5 client on host with only client packages [puppet] - 10https://gerrit.wikimedia.org/r/686393 (https://phabricator.wikimedia.org/T276589) [08:39:32] (03PS2) 10Jcrespo: mariadb: Install 10.5 client on host with only client packages [puppet] - 10https://gerrit.wikimedia.org/r/686393 (https://phabricator.wikimedia.org/T276589) [08:40:52] (03CR) 10Jcrespo: "This will need upload of the preliminary 10.5 client packages." 
[puppet] - 10https://gerrit.wikimedia.org/r/686393 (https://phabricator.wikimedia.org/T276589) (owner: 10Jcrespo) [08:42:16] (03CR) 10Muehlenhoff: [C: 03+1] "If DBAs are onboard, fine with me!" [puppet] - 10https://gerrit.wikimedia.org/r/686393 (https://phabricator.wikimedia.org/T276589) (owner: 10Jcrespo) [08:43:07] (03CR) 10Muehlenhoff: [C: 03+2] Add ldap-replica2005 to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/686392 (owner: 10Muehlenhoff) [08:45:28] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:49:43] (03CR) 10Elukey: [C: 03+1] "Thanks! I have always some fear that systemd can mess up non trivial command lines (so I like to use bash -c "etc//"), but we can always f" [puppet] - 10https://gerrit.wikimedia.org/r/686352 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [08:50:04] !log jmm@puppetmaster1001 conftool action : set/pooled=yes; selector: name=ldap-replica2005.wikimedia.org [08:50:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:00] (03PS1) 10David Caro: wmcs.prometheus: increase retention to 500GB on cloudmetrics [puppet] - 10https://gerrit.wikimedia.org/r/686395 [09:05:06] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:05:28] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/686395 (owner: 10David Caro) [09:14:48] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:29:00] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:30:12] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Implement static redirects from pipermail archives to hyperkitty archives - https://phabricator.wikimedia.org/T280731 (10Ladsgroup) oh I see, thanks. [09:32:00] (03PS1) 10Muehlenhoff: Apply LDAP replica role to ldap-replica1003/1004/2006 [puppet] - 10https://gerrit.wikimedia.org/r/686434 [09:33:23] (03PS1) 10Vgutierrez: trafficserver: Clear outbound TLS cacert_path for ats@codfw [puppet] - 10https://gerrit.wikimedia.org/r/686435 (https://phabricator.wikimedia.org/T281673) [09:33:52] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:34:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/685936 (https://phabricator.wikimedia.org/T282087) (owner: 10Bstorm) [09:36:19] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29450/console" [puppet] - 10https://gerrit.wikimedia.org/r/686435 (https://phabricator.wikimedia.org/T281673) (owner: 10Vgutierrez) [09:39:16] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:42:52] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Clear outbound TLS cacert_path for ats@codfw [puppet] - 10https://gerrit.wikimedia.org/r/686435 (https://phabricator.wikimedia.org/T281673) (owner: 10Vgutierrez) [09:44:06] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:44:51] !log Enforce Puppet Internal CA validation on trafficserver@codfw - T281673 [09:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:20] (03PS1) 10Zabe: Change namespace name and aliases on jawikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686437 (https://phabricator.wikimedia.org/T262155) [09:54:24] (03PS2) 10Muehlenhoff: package_builder: Remove support for jessie [puppet] - 10https://gerrit.wikimedia.org/r/671093 [09:54:56] (03Abandoned) 10Muehlenhoff: Support password changes in manage_principals (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/559765 (https://phabricator.wikimedia.org/T237605) (owner: 10Muehlenhoff) [09:55:34] !log depooling wdqs1012 T280382, T282222 [09:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:43] T280382: WDQS hosts low on /srv disk space - https://phabricator.wikimedia.org/T280382 [09:55:44] T282222: SPARQL query for all painting stopped returning results - https://phabricator.wikimedia.org/T282222 [09:58:29] (03CR) 10Muehlenhoff: Initial control files for 10.5 mariadb packages (031 comment) [software] - 10https://gerrit.wikimedia.org/r/685524 (owner: 10Jcrespo) [10:08:14] (03CR) 10Jcrespo: Initial control files for 10.5 mariadb packages (031 comment) [software] - 10https://gerrit.wikimedia.org/r/685524 (owner: 10Jcrespo) [10:14:14] (03PS2) 10David Caro: wmcs.prometheus: increase retention to 500GB on cloudmetrics [puppet] - 10https://gerrit.wikimedia.org/r/686395 
[10:15:44] (03CR) 10jerkins-bot: [V: 04-1] wmcs.prometheus: increase retention to 500GB on cloudmetrics [puppet] - 10https://gerrit.wikimedia.org/r/686395 (owner: 10David Caro) [10:17:32] (03PS1) 10Arturo Borrero Gonzalez: openstack: cleanup neutron hacks [puppet] - 10https://gerrit.wikimedia.org/r/686457 (https://phabricator.wikimedia.org/T270704) [10:18:32] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:21:00] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:36:06] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:37:15] (03PS2) 10Arturo Borrero Gonzalez: openstack: cleanup neutron hacks [puppet] - 10https://gerrit.wikimedia.org/r/686457 (https://phabricator.wikimedia.org/T270704) [10:38:34] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:39:33] (03CR) 10jerkins-bot: [V: 04-1] openstack: cleanup neutron hacks [puppet] - 10https://gerrit.wikimedia.org/r/686457 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [11:03:31] (03PS3) 10David Caro: wmcs.prometheus: increase retention to 500GB on cloudmetrics [puppet] - 10https://gerrit.wikimedia.org/r/686395 [11:10:08] 10SRE, 10RESTBase: Restbase: traffic to 3050/udp dropped by iptables - https://phabricator.wikimedia.org/T249699 (10hnowlan) From the outside it seems to me like we don't need this feature - given how long it has been non-functional I would almost be scared to enable it as it might 
cause an unfamiliar series o... [11:10:57] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/686395 (owner: 10David Caro) [11:17:56] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:19:39] (03CR) 10David Caro: [C: 03+2] "The changes are the expected changes, the 500GB is set as they have both ~1T free." [puppet] - 10https://gerrit.wikimedia.org/r/686395 (owner: 10David Caro) [11:22:21] (03PS1) 10Jbond: O:nagios_common: Pass ssl expiry constraints to https checks [puppet] - 10https://gerrit.wikimedia.org/r/686495 [11:23:12] (03CR) 10Jbond: "> Patch Set 6:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [11:25:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29453/console" [puppet] - 10https://gerrit.wikimedia.org/r/686495 (owner: 10Jbond) [11:28:02] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:28:47] (03PS1) 10Kormat: Revert "db1173: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/685900 [11:30:30] (03CR) 10Kormat: [C: 03+2] Revert "db1173: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/685900 (owner: 10Kormat) [11:33:56] !log kormat@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15853 and previous config saved to /var/cache/conftool/dbconfig/20210507-113355-kormat.json [11:34:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:07] T280751: Upgrade s6 to Debian 
Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 [11:44:47] PROBLEM - Disk space on backup2002 is CRITICAL: DISK CRITICAL - free space: /srv 2998160 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=backup2002&var-datasource=codfw+prometheus/ops [11:46:20] checking [11:48:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:49:01] !log kormat@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15854 and previous config saved to /var/cache/conftool/dbconfig/20210507-114859-kormat.json [11:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:10] T280751: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 [11:49:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:49:55] (03PS7) 10Ssingh: P:wikidough: Add TCP connect check for DoH and DoT [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [11:53:06] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29454/console" [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [11:57:50] (03CR) 10VolkerE: [C: 03+1] [wikitech] Enable VE desktop section edit links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679940 (https://phabricator.wikimedia.org/T280291) (owner: 10Jforrester) [11:58:41] (03CR) 10VolkerE: [C: 03+1] "No +2 rights here, therefore only…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679940 (https://phabricator.wikimedia.org/T280291) (owner: 10Jforrester) [11:59:43] (03CR) 10Ssingh: [V: 03+1] "> Patch Set 6:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [11:59:49] (03PS1) 10Kormat: mariadb: Promote db1173 to s6 eqiad master. [puppet] - 10https://gerrit.wikimedia.org/r/686505 (https://phabricator.wikimedia.org/T282124) [12:01:53] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:02:01] (03CR) 10Kormat: [C: 04-2] "Don't merge until maintenance window." 
[puppet] - 10https://gerrit.wikimedia.org/r/686505 (https://phabricator.wikimedia.org/T282124) (owner: 10Kormat) [12:02:58] (03PS1) 10Kormat: wmnet: Update s6-master to db1173 [dns] - 10https://gerrit.wikimedia.org/r/686513 (https://phabricator.wikimedia.org/T282124) [12:03:20] (03CR) 10Jbond: "> Patch Set 7:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [12:03:32] (03CR) 10Jbond: [C: 03+1] P:wikidough: Add TCP connect check for DoH and DoT [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [12:03:46] (03CR) 10Kormat: [C: 04-2] "Don't merge before maintenance window." [dns] - 10https://gerrit.wikimedia.org/r/686513 (https://phabricator.wikimedia.org/T282124) (owner: 10Kormat) [12:04:04] !log kormat@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15855 and previous config saved to /var/cache/conftool/dbconfig/20210507-120404-kormat.json [12:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:13] T280751: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 [12:06:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:07:46] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, all dependencies match for the library versions in bullseye." 
[software] - 10https://gerrit.wikimedia.org/r/685524 (owner: 10Jcrespo) [12:11:01] (03CR) 10Filippo Giunchedi: [C: 04-1] "The idea LGTM, but check_http wants -C for expiration days not -w/-c" [puppet] - 10https://gerrit.wikimedia.org/r/686495 (owner: 10Jbond) [12:12:34] (03CR) 10Muehlenhoff: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685825 (owner: 10Muehlenhoff) [12:12:50] (03CR) 10Muehlenhoff: [C: 03+2] Unconditionally install spicerack from "main" [puppet] - 10https://gerrit.wikimedia.org/r/685825 (owner: 10Muehlenhoff) [12:19:08] !log kormat@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: reimaged to buster T280751', diff saved to https://phabricator.wikimedia.org/P15856 and previous config saved to /var/cache/conftool/dbconfig/20210507-121908-kormat.json [12:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:19] T280751: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 [12:21:23] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:24:01] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:31:09] (03CR) 10Jbond: [C: 03+1] P:wikidough: Add TCP connect check for DoH and DoT [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [12:33:48] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:36:54] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:37:55] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29457/console" [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [12:38:09] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:wikidough: Add TCP connect check for DoH and DoT [puppet] - 10https://gerrit.wikimedia.org/r/685800 (https://phabricator.wikimedia.org/T252132) (owner: 10Jbond) [12:46:04] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:47:19] (03PS18) 10DCausse: rdf-streaming-updater: create helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/671204 (https://phabricator.wikimedia.org/T264006) (owner: 10Mstyles) [12:47:21] (03PS9) 10DCausse: rdf-streaming-updater: enable HA capability [deployment-charts] - 10https://gerrit.wikimedia.org/r/679519 (https://phabricator.wikimedia.org/T273098) (owner: 10Mstyles) [12:47:23] (03PS1) 10DCausse: [WIP] rdf-streaming-updater application mode experiment [deployment-charts] - 10https://gerrit.wikimedia.org/r/686550 [12:47:34] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:49:30] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:54:30] (03PS2) 10DCausse: [WIP] rdf-streaming-updater application mode experiment [deployment-charts] - 
10https://gerrit.wikimedia.org/r/686550 [12:55:54] (03PS3) 10Jbond: thumbor/mwmaint: add periodic job to pull fc-list file [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [12:57:41] 10SRE, 10Patch-For-Review: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [12:57:56] 10SRE, 10Patch-For-Review: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [12:58:26] 10SRE, 10CFSSL-PKI, 10Patch-For-Review: Additional CFSSL tasks - https://phabricator.wikimedia.org/T281369 (10jbond) [13:00:31] (03CR) 10Jcrespo: [C: 03+2] Initial control files for 10.5 mariadb packages [software] - 10https://gerrit.wikimedia.org/r/685524 (owner: 10Jcrespo) [13:04:06] (03PS8) 10DCausse: rdf-streaming-updater: use session mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/681497 (https://phabricator.wikimedia.org/T280166) (owner: 10Mstyles) [13:04:28] !log Start server-side upload for 1 video file (T281927) [13:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:39] T281927: Server side upload for Auréola to Wikimedia Commons - https://phabricator.wikimedia.org/T281927 [13:05:09] (03CR) 10jerkins-bot: [V: 04-1] rdf-streaming-updater: use session mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/681497 (https://phabricator.wikimedia.org/T280166) (owner: 10Mstyles) [13:08:55] (03PS9) 10DCausse: rdf-streaming-updater: use session mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/681497 (https://phabricator.wikimedia.org/T280166) (owner: 10Mstyles) [13:09:34] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:13:12] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:14:10] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Set resource requests and limits for calico PODs - https://phabricator.wikimedia.org/T277877 (10JMeybohm) I tried to verify the above assumption by collecting metrics more frequently (per second) from the docker API (see P15857). This paints a more clea... [13:15:06] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685597 (https://phabricator.wikimedia.org/T282112) (owner: 10Rubin) [13:15:14] (03PS5) 10Urbanecm: Disabling Education Program namespaces in Russian Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/685597 (https://phabricator.wikimedia.org/T282112) (owner: 10Rubin) [13:15:24] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! I was thinking about the case of paging alerts, but even if there are that's fine as the change is easy to understand and revert" [puppet] - 10https://gerrit.wikimedia.org/r/686495 (owner: 10Jbond) [13:18:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,pdu_sentry4} site={eqiad,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:18:40] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:23:02] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:30:58] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:37:13] (03CR) 10Jbond: [C: 04-1] "-1 mainly relates to thumbor_memcached_servers and the curl options" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [13:37:44] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:38:58] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:42:47] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review: Ensure Puppet checks types as part of the build - https://phabricator.wikimedia.org/T261693 (10jbond) > However this depends heavily on the PS i had to revert as that takes care of dependencies configuring, hiera custom functions, the private repo and... 
[13:42:52] (03PS1) 10Muehlenhoff: Switch puppetdb to profile::java [puppet] - 10https://gerrit.wikimedia.org/r/686583 (https://phabricator.wikimedia.org/T264178) [13:44:59] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/686583 (https://phabricator.wikimedia.org/T264178) (owner: 10Muehlenhoff) [13:49:28] (03CR) 10Jbond: [C: 03+2] "lgtm will merge" [puppet] - 10https://gerrit.wikimedia.org/r/685980 (https://phabricator.wikimedia.org/T280744) (owner: 10Legoktm) [13:57:06] (03CR) 10Jbond: "lgtm (inline comment/question/confusion)" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/685855 (owner: 10Volans) [14:07:05] (03PS1) 10Majavah: P::toolforge::k8s::haproxy: Add keepalived [puppet] - 10https://gerrit.wikimedia.org/r/686607 [14:08:07] (03CR) 10Marostegui: [C: 03+2] Add control file for wmf-mariadb-client104 for Bullseye [software] - 10https://gerrit.wikimedia.org/r/685512 (owner: 10Muehlenhoff) [14:08:38] (03Merged) 10jenkins-bot: Add control file for wmf-mariadb-client104 for Bullseye [software] - 10https://gerrit.wikimedia.org/r/685512 (owner: 10Muehlenhoff) [14:12:43] (03PS2) 10Majavah: P::toolforge::k8s::haproxy: Add keepalid [puppet] - 10https://gerrit.wikimedia.org/r/686607 [14:13:15] (03CR) 10Muehlenhoff: [C: 03+1] sre.deploy.python-code: add new cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/685855 (owner: 10Volans) [14:16:48] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/686495 (owner: 10Jbond) [14:19:22] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:29:58] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I like the idea of using HTTP to fetch this and we can even use the service itself probably. However, I do have a few inlines comments." 
(036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [14:33:15] (03CR) 10Effie Mouzeli: "It needs a little bit of work, but I think it is getting there" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [14:34:14] 10SRE, 10ops-codfw, 10ops-eqiad, 10DC-Ops: Dc-Ops Commands for Cumin - https://phabricator.wikimedia.org/T279721 (10MoritzMuehlenhoff) Thanks all! I'll write an email to you all next week with some context and next steps. [14:37:28] !log andrew@deploy1002 Started deploy [horizon/deploy@71f273c]: updated trove -> codfw1dev [14:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:18] !log andrew@deploy1002 Finished deploy [horizon/deploy@71f273c]: updated trove -> codfw1dev (duration: 00m 50s) [14:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:49] !log andrew@deploy1002 Started deploy [horizon/deploy@71f273c]: updated trove -> codfw1dev [14:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:08] !log andrew@deploy1002 Finished deploy [horizon/deploy@71f273c]: updated trove -> codfw1dev (duration: 01m 19s) [14:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:34] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp203[34].codfw.wmnet [14:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:56:06] !log andrew@deploy1002 Started deploy [horizon/deploy@71f273c]: updated trove -> codfw1dev [14:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:28] !log andrew@deploy1002 Finished deploy [horizon/deploy@71f273c]: updated trove -> codfw1dev (duration: 01m 22s) [14:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:41] !log andrew@deploy1002 Started deploy [horizon/deploy@71f273c]: updated trove -> codfw1dev [14:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:10] !log andrew@deploy1002 Finished deploy [horizon/deploy@71f273c]: updated trove -> codfw1dev (duration: 01m 29s) [15:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:43] !log andrew@deploy1002 Started deploy [horizon/deploy@71f273c]: updated trove -> codfw1dev [15:00:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:09] !log andrew@deploy1002 Finished deploy [horizon/deploy@71f273c]: updated trove -> codfw1dev (duration: 01m 26s) [15:02:11] !log andrew@deploy1002 Started deploy [horizon/deploy@71f273c]: updated trove -> codfw1dev [15:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:22] !log andrew@deploy1002 Finished deploy [horizon/deploy@71f273c]: updated trove -> codfw1dev (duration: 01m 11s) [15:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:42] (03PS1) 10Ssingh: nagios_common: add check_https_url_custom_ip [puppet] - 10https://gerrit.wikimedia.org/r/686622 (https://phabricator.wikimedia.org/T252132) [15:08:38] !log andrew@deploy1002 Started deploy [horizon/deploy@71f273c]: updated trove -> codfw1dev 
[15:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:02] !log andrew@deploy1002 Finished deploy [horizon/deploy@71f273c]: updated trove -> codfw1dev (duration: 01m 24s) [15:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:13] 10SRE, 10serviceops: Support Canary releases on Kubernetes - https://phabricator.wikimedia.org/T282148 (10jijiki) [15:12:11] 10SRE, 10serviceops, 10User-WDoran, 10User-brennen: Canaries canaries canaries - https://phabricator.wikimedia.org/T210143 (10jijiki) [15:12:16] 10SRE, 10serviceops: Support Canary releases on Kubernetes - https://phabricator.wikimedia.org/T282148 (10jijiki) [15:12:38] (03PS1) 10Ssingh: wikidough: use check_https_url_custom_ip for DoH check [puppet] - 10https://gerrit.wikimedia.org/r/686625 (https://phabricator.wikimedia.org/T252132) [15:16:51] (03CR) 10Jforrester: "This seems to have broken local builds for me?" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/661919 (owner: 10Giuseppe Lavagetto) [15:17:04] (03PS2) 10Effie Mouzeli: WIP: Add canary support in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) [15:17:36] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add canary support in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) (owner: 10Effie Mouzeli) [15:19:07] (03PS3) 10Effie Mouzeli: WIP: Add canary support in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) [15:19:35] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add canary support in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) (owner: 10Effie Mouzeli) [15:28:04] (03PS4) 10Effie Mouzeli: Add canary support in scaffolding [deployment-charts] - 10https://gerrit.wikimedia.org/r/685748 
(https://phabricator.wikimedia.org/T282148) [15:28:33] (CR) jerkins-bot: [V: -1] Add canary support in scaffolding [deployment-charts] - https://gerrit.wikimedia.org/r/685748 (https://phabricator.wikimedia.org/T282148) (owner: Effie Mouzeli) [15:30:32] SRE, serviceops, Patch-For-Review: Support Canary releases on Kubernetes - https://phabricator.wikimedia.org/T282148 (jijiki) [15:37:35] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/686351 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup) [15:49:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:51:37] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:53:52] (CR) Cwhite: "Added WMCS folks. This changeset appears to only apply to WMCS instances." [puppet] - https://gerrit.wikimedia.org/r/686353 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup) [15:54:26] (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/686633 [15:55:00] (CR) Ladsgroup: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/686353 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup) [15:56:23] (CR) Cwhite: [C: +1] "This change appears to affect 8 hosts in Analytics.
(an-worker1*, stat100*)" [puppet] - https://gerrit.wikimedia.org/r/686352 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup) [16:11:05] (CR) Elukey: [C: +2] prometheus: Migrate cron in node_amd_rocm to systemd timer [puppet] - https://gerrit.wikimedia.org/r/686352 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup) [16:11:24] (CR) Elukey: "Ah no there is a parent change!" [puppet] - https://gerrit.wikimedia.org/r/686352 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup) [16:12:32] (CR) Ladsgroup: "> Patch Set 1: -Code-Review" [puppet] - https://gerrit.wikimedia.org/r/686352 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup) [16:17:10] (PS1) Jforrester: LinkBatch: skip bad input [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685901 (https://phabricator.wikimedia.org/T282180) [16:38:28] (CR) Jforrester: "recheck" [software/mailman-templates] - https://gerrit.wikimedia.org/r/685937 (owner: Legoktm) [16:38:41] James_F: hrm, does "LinkBatch: skip bad input" look sufficient to unblock wmf.4? [16:40:11] (sorry, i should just ask this on the tickets.) [16:40:16] brennen: It's one of the UBNs. [16:40:23] Did the other get fixed? [16:40:24] * James_F looks. [16:41:01] i was just trying to get a read on whether https://gerrit.wikimedia.org/r/c/mediawiki/core/+/686591/ was necessary to unblock T282070 [16:41:02] T282070: After unblocking autoblock, Special:Log and Special:RecentChanges gives ParameterAssertionException: Bad value for parameter $dbKey - https://phabricator.wikimedia.org/T282070 [16:41:06] Sorry, it's two of them, but not the third. [16:41:09] Yeah. [16:41:55] I imagine it might be sufficient, but comments on Phabricator are sparse and unclear.
[16:42:20] (CR) Jforrester: [C: +1] Add Debian packaging [software/mailman-templates] - https://gerrit.wikimedia.org/r/685937 (owner: Legoktm) [16:43:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:44:16] James_F: ack, thx. i'll ask on ticket for clarity. [16:45:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:05:39] brennen: Should we deploy the wmf.4 back-port now? [17:09:13] (PS3) Majavah: P::toolforge::k8s::haproxy: Add keepalived [puppet] - https://gerrit.wikimedia.org/r/686607 [17:09:19] James_F: yes, let's [17:09:38] Want me to, or should I let you? [17:10:51] (CR) Legoktm: [C: +2] Add Debian packaging [software/mailman-templates] - https://gerrit.wikimedia.org/r/685937 (owner: Legoktm) [17:11:00] James_F: i'll go ahead. [17:11:24] Cool.
[17:11:52] (Merged) jenkins-bot: Add Debian packaging [software/mailman-templates] - https://gerrit.wikimedia.org/r/685937 (owner: Legoktm) [17:13:54] (CR) Brennen Bearnes: [C: +2] LinkBatch: skip bad input [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685901 (https://phabricator.wikimedia.org/T282180) (owner: Jforrester) [17:14:33] (CR) Brennen Bearnes: [C: +2] Remove harmful validation regex in PageReferenceValue [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685596 (https://phabricator.wikimedia.org/T282070) (owner: Tim Starling) [17:16:58] (gah, ignore that second +2) [17:23:23] !log andrew@deploy1002 Started deploy [horizon/deploy@20f479e]: updated trove -> codfw1dev [17:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:19] !log andrew@deploy1002 Finished deploy [horizon/deploy@20f479e]: updated trove -> codfw1dev (duration: 01m 55s) [17:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:47] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:52] SRE, Mail: Mail to root@lists1001.wikimedia.org from noreply@lists1001.wikimedia.org doesn't work - https://phabricator.wikimedia.org/T280744 (Legoktm) Open→Resolved a:Legoktm Triggered the unit and it successfully sent the email, yay!
[17:35:53] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: check_exclude_backups.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:46] (Merged) jenkins-bot: LinkBatch: skip bad input [core] (wmf/1.37.0-wmf.4) - https://gerrit.wikimedia.org/r/685901 (https://phabricator.wikimedia.org/T282180) (owner: Jforrester) [17:43:55] PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:45:05] (CR) Razzi: kerberos: add reset-password action to manage_principals.py (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685923 (https://phabricator.wikimedia.org/T282185) (owner: Razzi) [17:45:30] (PS1) Legoktm: mailman: Fix check_exclude_backups to ignore MM3-only lists [puppet] - https://gerrit.wikimedia.org/r/686664 [17:46:00] SRE, Cloud-VPS: Monitor certificate validity for Cloud VPS - https://phabricator.wikimedia.org/T282264 (Sascha) [17:47:40] (CR) Bstorm: "Since this is a functional noop, I think this should be fine to merge (especially since it is tested in toolsbeta)."
[puppet] - https://gerrit.wikimedia.org/r/686607 (owner: Majavah) [17:49:01] (PS2) Legoktm: hieradata: Add pywikibot-bugs to mailman2_exclude_backups [puppet] - https://gerrit.wikimedia.org/r/683988 (owner: RLazarus) [17:50:51] !log brennen@deploy1002 Synchronized php-1.37.0-wmf.4/includes/cache/LinkBatch.php: Backport: [[gerrit:685901|LinkBatch: skip bad input (T282180 T282070)]] (duration: 01m 06s) [17:50:58] (CR) RLazarus: [C: +1] mailman: Fix check_exclude_backups to ignore MM3-only lists (1 comment) [puppet] - https://gerrit.wikimedia.org/r/686664 (owner: Legoktm) [17:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:03] T282180: Special:Log/move: ParameterTypeException: Bad value for parameter $link: must be a MediaWiki\Linker\LinkTarget|MediaWiki\Page\PageReference - https://phabricator.wikimedia.org/T282180 [17:51:03] T282070: After unblocking autoblock, Special:Log and Special:RecentChanges gives ParameterAssertionException: Bad value for parameter $dbKey - https://phabricator.wikimedia.org/T282070 [18:05:38] (CR) Legoktm: [C: +2] hieradata: Add pywikibot-bugs to mailman2_exclude_backups [puppet] - https://gerrit.wikimedia.org/r/683988 (owner: RLazarus) [18:07:51] (PS2) Legoktm: mailman: Fix check_exclude_backups to ignore MM3-only lists [puppet] - https://gerrit.wikimedia.org/r/686664 [18:07:53] (CR) Legoktm: mailman: Fix check_exclude_backups to ignore MM3-only lists (1 comment) [puppet] - https://gerrit.wikimedia.org/r/686664 (owner: Legoktm) [18:08:12] (PS3) Legoktm: mailman: Fix check_exclude_backups to ignore MM3-only lists [puppet] - https://gerrit.wikimedia.org/r/686664 [18:09:37] (CR) RLazarus: [C: +1] mailman: Fix check_exclude_backups to ignore MM3-only lists [puppet] - https://gerrit.wikimedia.org/r/686664 (owner: Legoktm) [18:10:49] (CR) Legoktm: [C: +2] mailman: Fix check_exclude_backups to ignore MM3-only lists [puppet] -
https://gerrit.wikimedia.org/r/686664 (owner: Legoktm) [18:16:43] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:34] (PS2) Bstorm: cloudmetrics: fail over to cloudmetrics1001 [puppet] - https://gerrit.wikimedia.org/r/684990 (https://phabricator.wikimedia.org/T281881) [18:19:48] (CR) Bstorm: [C: +2] P::toolforge::k8s::haproxy: Add keepalived [puppet] - https://gerrit.wikimedia.org/r/686607 (owner: Majavah) [18:20:22] brennen: Train looks unblocked from my POV. [18:22:46] James_F: yep, thanks. going ahead. [18:23:45] !log 1.37.0-wmf.4 train status (T281145): blockers appear resolved, going ahead in the interest of not having a split deploy over weekend [18:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:54] T281145: 1.37.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T281145 [18:25:11] (PS1) Brennen Bearnes: group1 wikis to 1.37.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/686674 [18:25:13] (CR) Brennen Bearnes: [C: +2] group1 wikis to 1.37.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/686674 (owner: Brennen Bearnes) [18:25:55] (Merged) jenkins-bot: group1 wikis to 1.37.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/686674 (owner: Brennen Bearnes) [18:27:21] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.4 [18:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:29] !log brennen@deploy1002 Synchronized php: group1 wikis to 1.37.0-wmf.4 (duration: 01m 07s) [18:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:02] (PS1) Brennen Bearnes: all wikis to 1.37.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/686675 [18:31:04] (CR) Brennen Bearnes: [C: +2] all wikis to
1.37.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/686675 (owner: Brennen Bearnes) [18:31:46] (Merged) jenkins-bot: all wikis to 1.37.0-wmf.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/686675 (owner: Brennen Bearnes) [18:33:09] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.4 [18:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:42] !log deleted daily-article-l from mailman3 after failed import [18:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:20] (CR) Bstorm: [C: +2] toolforge kubernetes: change class for the new cinder environment [puppet] - https://gerrit.wikimedia.org/r/685936 (https://phabricator.wikimedia.org/T282087) (owner: Bstorm) [19:13:12] (CR) Andrew Bogott: [C: +1] "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/684115 (https://phabricator.wikimedia.org/T198673) (owner: Krinkle) [19:17:31] ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install mirror1001 - https://phabricator.wikimedia.org/T282272 (RobH) [19:18:26] ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install mirror1001 - https://phabricator.wikimedia.org/T282272 (RobH) [19:26:47] (CR) Andrew Bogott: [C: +2] openstack: Change default testing value for pageeditor.py to beta testwiki [puppet] - https://gerrit.wikimedia.org/r/684115 (https://phabricator.wikimedia.org/T198673) (owner: Krinkle) [19:42:36] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [19:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:13] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:45] (CR) Jforrester: [C: +2] Reparse deploy page before announcing an event [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/680033
(https://phabricator.wikimedia.org/T243394) (owner: BryanDavis) [20:11:21] (CR) jerkins-bot: [V: -1] Reparse deploy page before announcing an event [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/680033 (https://phabricator.wikimedia.org/T243394) (owner: BryanDavis) [20:15:09] (CR) Elukey: kerberos: add reset-password action to manage_principals.py (1 comment) [puppet] - https://gerrit.wikimedia.org/r/685923 (https://phabricator.wikimedia.org/T282185) (owner: Razzi) [20:39:29] (PS1) Jdlrobson: Drop unused configuration on labs instances [mediawiki-config] - https://gerrit.wikimedia.org/r/686756 (https://phabricator.wikimedia.org/T277955) [20:40:31] (CR) Jdlrobson: "Blocking voice and tone fix ups to PageImages - labs only config change. Is it possible to apply this sometime European Monday?" [mediawiki-config] - https://gerrit.wikimedia.org/r/686756 (https://phabricator.wikimedia.org/T277955) (owner: Jdlrobson) [20:41:01] (PS2) BryanDavis: Reparse deploy page before announcing an event [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/680033 (https://phabricator.wikimedia.org/T243394) [20:41:03] (PS1) BryanDavis: Ignore some flake8 whines [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/686757 [20:44:16] (CR) Clare Ming: [C: +1] "I don't have permissions to +2 this patch but it lgtm."
[mediawiki-config] - https://gerrit.wikimedia.org/r/686756 (https://phabricator.wikimedia.org/T277955) (owner: Jdlrobson) [20:48:52] (CR) BryanDavis: [C: +2] Ignore some flake8 whines [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/686757 (owner: BryanDavis) [20:49:09] (CR) BryanDavis: [C: +2] Reparse deploy page before announcing an event [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/680033 (https://phabricator.wikimedia.org/T243394) (owner: BryanDavis) [20:49:31] (Merged) jenkins-bot: Ignore some flake8 whines [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/686757 (owner: BryanDavis) [20:49:33] (Merged) jenkins-bot: Reparse deploy page before announcing an event [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/680033 (https://phabricator.wikimedia.org/T243394) (owner: BryanDavis) [20:52:02] jouncebot: now [20:52:03] No deployments scheduled for the next 10 hour(s) and 7 minute(s) [20:52:06] jouncebot: next [20:52:06] In 10 hour(s) and 7 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210508T0700) [21:04:49] (PS6) Rubin: Disabling Education Program namespaces in Russian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/685597 (https://phabricator.wikimedia.org/T282112) [21:09:12] (PS1) Razzi: kerberos: require --email_address for create and reset-password [puppet] - https://gerrit.wikimedia.org/r/686766 (https://phabricator.wikimedia.org/T282185) [21:13:33] (CR) Razzi: "Quick refactor to check arguments."
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/686766 (https://phabricator.wikimedia.org/T282185) (owner: Razzi) [21:19:24] SRE, fr-email-preference-center, fundraising-tech-ops, netops, WMF-NDA: Deploy pfw policy1620422079 for T268501 and T281320 - https://phabricator.wikimedia.org/T282286 (Dwisehaupt) [21:33:23] !log fixed owner for wdqs-gui-build list [21:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:26] !log deleted festivalsommer-teilnehmer from MM3, didn't import properly [21:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:27] !log deleted education@ from MM3, didn't import properly [21:40:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:45] SRE, Wikimedia-Mailing-lists: Find list owners for lists without them - https://phabricator.wikimedia.org/T281779 (Legoktm) Saving this query for later: ` mysql:mailman3@m5-master.eqiad.wmnet [mailman3]> select mailinglist.list_id from mailinglist where not exists(select * from member where member.list_i... [21:47:25] SRE, Wikimedia-Mailing-lists: daily-article-l@, education@ import to Mailman3 failed because of unicode characters in display name - https://phabricator.wikimedia.org/T282271 (Legoktm) This also affected the education@ list.
[21:47:34] SRE, Wikimedia-Mailing-lists: daily-article-l@, education@ import to Mailman3 failed because of unicode characters in display name - https://phabricator.wikimedia.org/T282271 (Legoktm) [22:36:16] SRE, Wikimedia-Mailing-lists: Archive destacado-l - https://phabricator.wikimedia.org/T282291 (MarcoAurelio) [22:51:05] RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:07:13] SRE, Wikimedia-Mailing-lists, serviceops: Allow list admins to train spam filters - https://phabricator.wikimedia.org/T244241 (Legoktm) Declined→Open There's some new interest in this, see https://lists.wikimedia.org/hyperkitty/list/listadmins@lists.wikimedia.org/message/I4NRRS23N3KMR7XH4GWYT... [23:33:14] (PS1) Cwhite: logstash: collect kaios_app.error stream into logstash clienterror input [puppet] - https://gerrit.wikimedia.org/r/686803 (https://phabricator.wikimedia.org/T281507)