[00:01:25] (CR) jerkins-bot: [V: -1] profile::wmcs::instance: replace hiera_include with lookup [puppet] - https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[00:09:30] (CR) Dzahn: "well.. I also can't seem to lint ignore it fully.. have to get back to it later. comments welcome though" [puppet] - https://gerrit.wikimedia.org/r/665462 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[00:11:06] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1342.eqiad.wmnet
[00:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:15] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1342.eqiad.wmnet
[00:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:32] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1339.eqiad.wmnet
[00:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:55] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1333.eqiad.wmnet
[00:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:13] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1333.eqiad.wmnet
[00:15:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:22] !log start batch processing images through MachineVision fetchSuggestions.php for T274220 on mwmaint1002
[00:15:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:15:28] T274220: Populate MachineVision databases for images commonly returned by search - https://phabricator.wikimedia.org/T274220
[00:16:46] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1317.eqiad.wmnet
[00:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1317.eqiad.wmnet
[00:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:31:58] (CR) Legoktm: [C: +1] "LGTM, we can roll this out on Monday, probably testing it with https://test.wikipedia.org/w/fatal-error.php" [puppet] - https://gerrit.wikimedia.org/r/664903 (https://phabricator.wikimedia.org/T275075) (owner: Krinkle)
[00:36:03] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.0.144:7001 on sessionstore1001 is CRITICAL: SSL CRITICAL - Certificate sessionstore1001-a valid until 2021-03-13 12:44:09 +0000 (expires in 21 days) daniel_zahn https://phabricator.wikimedia.org/T274564 https://phabricator.wikimedia.org/T120662
[00:36:03] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.32.85:7001 on sessionstore1002 is CRITICAL: SSL CRITICAL - Certificate sessionstore1002-a valid until 2021-03-13 12:44:10 +0000 (expires in 21 days) daniel_zahn https://phabricator.wikimedia.org/T274564 https://phabricator.wikimedia.org/T120662
[00:36:03] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.48.178:7001 on sessionstore1003 is CRITICAL: SSL CRITICAL - Certificate sessionstore1003-a valid until 2021-03-13 12:44:11 +0000 (expires in 21 days) daniel_zahn https://phabricator.wikimedia.org/T274564 https://phabricator.wikimedia.org/T120662
[00:36:03] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.16.95:7001 on sessionstore2001 is CRITICAL: SSL CRITICAL - Certificate sessionstore2001-a valid until 2021-03-13 12:44:12 +0000 (expires in 21 days) daniel_zahn https://phabricator.wikimedia.org/T274564 https://phabricator.wikimedia.org/T120662
[00:36:03] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.32.101:7001 on sessionstore2002 is CRITICAL: SSL CRITICAL - Certificate sessionstore2002-a valid until 2021-03-13 12:44:13 +0000 (expires in 21 days) daniel_zahn https://phabricator.wikimedia.org/T274564 https://phabricator.wikimedia.org/T120662
[00:36:04] ACKNOWLEDGEMENT - cassandra-a SSL 10.192.48.132:7001 on sessionstore2003 is CRITICAL: SSL CRITICAL - Certificate sessionstore2003-a valid until 2021-03-13 12:44:14 +0000 (expires in 21 days) daniel_zahn https://phabricator.wikimedia.org/T274564 https://phabricator.wikimedia.org/T120662
[00:44:57] SRE, ops-codfw, DC-Ops, observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (wiki_willy)
[00:49:39] SRE, ops-codfw, DC-Ops, observability: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 (wiki_willy) Updating task with the new single row Chatsworth design. It's not already supported by Librenms, so it looks like we would have to add it in. A few other notes I took f...
[03:37:50] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:24] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:20:16] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring
[05:29:20] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 87697 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring
[05:38:44] (CR) Ladsgroup: [C: +1] ldap::config::labs: replace hiera_hash with lookup [puppet] - https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[05:47:11] (CR) Ladsgroup: [C: +1] "Can say for sure but shouldn't "hash" have quotes? Can't find examples in the documentation :/ https://puppet.com/docs/puppet/7.4/hiera_au" [puppet] - https://gerrit.wikimedia.org/r/665461 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[06:06:34] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:32:34] RECOVERY - Check systemd state on analytics1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210220T0800)
[09:13:18] PROBLEM - MegaRAID on db1103 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:13:19] ACKNOWLEDGEMENT - MegaRAID on db1103 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T275266 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[09:13:23] SRE, ops-eqiad: Degraded RAID on db1103 - https://phabricator.wikimedia.org/T275266 (ops-monitoring-bot)
[10:07:40] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:15:02] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:42:06] (PS1) Ladsgroup: quarry: Replace query-killer cron with systemd timer [puppet] - https://gerrit.wikimedia.org/r/665471 (https://phabricator.wikimedia.org/T273673)
[11:43:35] (CR) jerkins-bot: [V: -1] quarry: Replace query-killer cron with systemd timer [puppet] - https://gerrit.wikimedia.org/r/665471 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup)
[11:45:21] (PS2) Ladsgroup: quarry: Replace query-killer cron with systemd timer [puppet] - https://gerrit.wikimedia.org/r/665471 (https://phabricator.wikimedia.org/T273673)
[12:02:04] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:38:24] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:45:54] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:15:17] SRE, ops-eqiad, DBA: Degraded RAID on db1103 - https://phabricator.wikimedia.org/T275266 (Marostegui) p:Triage→High This is X1 primary master, @wiki_willy can we give it some priority? Thanks
[13:42:15] (PS1) Zoranzoki21: Add a throttle rule for for edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/665474 (https://phabricator.wikimedia.org/T275237)
[13:44:46] (CR) Zoranzoki21: "@Urbanecm Please correct me if I counted some range wrongly." [mediawiki-config] - https://gerrit.wikimedia.org/r/665474 (https://phabricator.wikimedia.org/T275237) (owner: Zoranzoki21)
[14:54:11] PROBLEM - Stale file for node-exporter textfile in eqiad on alert1001 is CRITICAL: cluster=analytics file=debian_version.prom instance=an-worker1101 job=node site=eqiad https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[15:31:19] (CR) Urbanecm: [C: -1] "included ranges are too big, see inline" (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/665474 (https://phabricator.wikimedia.org/T275237) (owner: Zoranzoki21)
[15:40:00] Urbanecm: the raised amount is per IP, right?
[15:40:46] Majavah: per IP and day
[15:41:00] it changes the value of wgAccountCreationThrottle when the IP connects
[15:42:15] see https://noc.wikimedia.org/conf/highlight.php?file=throttle-analyze.php
[15:42:52] the event lasts for multiple days, given how many IPs they have I'd be really suprised if all attendees used just one and am wondering how much we actually need to raise the limit
[15:43:23] not sure
[15:43:42] feel free to ask for clarification, I'm unaware about how the university actually assigns IPs to users
[15:44:00] but from my side, as long as it's university-owned IPs, it's pretty low risk change
[15:44:31] I asked on the task, but not sure if the event organizer are aware of the technical details of how the university assigns addresses
[15:44:40] probably not :/
[15:45:21] I agree with it being low-risk but I still don't like making overly broad exemptions
[15:45:31] i understand that
[15:51:40] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:11:26] PROBLEM - SSH on analytics1058.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:16:28] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:12:38] RECOVERY - SSH on analytics1058.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[17:38:26] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:45:32] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:27:54] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:30:24] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:36:09] (PS1) Urbanecm: ukwikivoyage: Enable block AbuseFilter action [mediawiki-config] - https://gerrit.wikimedia.org/r/665526 (https://phabricator.wikimedia.org/T275271)
[20:37:52] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:45:18] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:38:24] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:45:56] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:20:14] PROBLEM - SSH on analytics1058.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:24:06] (PS2) Zoranzoki21: Add a throttle rule for for edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/665474 (https://phabricator.wikimedia.org/T275237)
[23:25:09] (CR) Zoranzoki21: Add a throttle rule for for edit-a-thon (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/665474 (https://phabricator.wikimedia.org/T275237) (owner: Zoranzoki21)
[23:27:40] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:28:26] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:35:04] SRE, Wikimedia-Site-requests, Serbian-Sites, Wikimedia-maintenance-script-run: Drop FlaggedRevs rights from users at srwikinews - https://phabricator.wikimedia.org/T212058 (FriedrickMILBarbarossa)
[23:37:38] (PS1) Urbanecm: fiwiki: Assign stablesettings to reviewers in IS.php rather than FR-specific file [mediawiki-config] - https://gerrit.wikimedia.org/r/665548 (https://phabricator.wikimedia.org/T275017)
[23:44:33] (CR) Urbanecm: [C: -1] Add a throttle rule for for edit-a-thon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/665474 (https://phabricator.wikimedia.org/T275237) (owner: Zoranzoki21)
[23:45:12] (CR) Urbanecm: [C: +1] Adjust CX MT threshold to 90 for Vietnamese Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/665238 (https://phabricator.wikimedia.org/T275121) (owner: KartikMistry)
[23:50:31] (PS3) Zoranzoki21: Add a throttle rule for for edit-a-thon [mediawiki-config] - https://gerrit.wikimedia.org/r/665474 (https://phabricator.wikimedia.org/T275237)
[23:51:10] (CR) Zoranzoki21: Add a throttle rule for for edit-a-thon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/665474 (https://phabricator.wikimedia.org/T275237) (owner: Zoranzoki21)