[00:00:21] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:41] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:45] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:47] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:15] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:00:17] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [01:02:15] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:18:58] 10Operations, 10Platform Engineering, 10Product-Infrastructure-Team-Backlog, 10RESTBase, and 4 others: High numbers of HTTP 429 errors - https://phabricator.wikimedia.org/T262691 (10Shizhao) [02:22:57] 10Operations, 10Platform Engineering, 10Product-Infrastructure-Team-Backlog, 10RESTBase, and 4 others: High numbers of HTTP 429 errors - https://phabricator.wikimedia.org/T262691 (10cooltey) >>! In T262691#6456973, @Charlotte wrote: >>>! In T262691#6456908, @cooltey wrote: >> We have received a large numbe... 
[04:20:44] (03PS1) 10Brian Wolff: Add performance settings for DPL and re-enable on ruwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626919 (https://phabricator.wikimedia.org/T262240) [04:45:23] (03CR) 10Brian Wolff: "Note, the values here were chosen a bit arbitrarily. I think they are reasonable, but we may have to adjust once its live depending on how" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626919 (https://phabricator.wikimedia.org/T262240) (owner: 10Brian Wolff) [05:14:08] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Marostegui) [05:28:55] (03PS1) 10Marostegui: es2026: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/626920 (https://phabricator.wikimedia.org/T261717) [05:29:08] 10Operations, 10MW-on-K8s, 10serviceops: Sandbox/limit child processes within a container runtime - https://phabricator.wikimedia.org/T252745 (10Joe) [05:29:29] (03CR) 10Marostegui: [C: 03+2] es2026: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/626920 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [05:29:47] 10Operations, 10MW-on-K8s, 10serviceops: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe) [05:32:03] (03PS1) 10Marostegui: instances.yaml: Add es2026 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/626921 (https://phabricator.wikimedia.org/T261717) [05:32:51] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add es2026 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/626921 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [05:34:27] (03CR) 10Giuseppe Lavagetto: cxserver: enable the service proxy everywhere (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/626110 (https://phabricator.wikimedia.org/T255879) (owner: 10Giuseppe Lavagetto) [05:34:34] (03PS4) 10Giuseppe Lavagetto: cxserver: enable the service proxy everywhere 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/626110 (https://phabricator.wikimedia.org/T255879) [05:34:36] (03PS1) 10Giuseppe Lavagetto: Rakefile: uniform function commenting style [deployment-charts] - 10https://gerrit.wikimedia.org/r/626922 [05:35:09] <_joe_> kart_: FYI, I am going to deploy adding the service proxy to cxserver now [05:35:42] <_joe_> I have some urls to verify the results, but lmk if you think there is stuff I should look into [05:35:55] (03CR) 10jerkins-bot: [V: 04-1] cxserver: enable the service proxy everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/626110 (https://phabricator.wikimedia.org/T255879) (owner: 10Giuseppe Lavagetto) [05:36:01] <_joe_> heh [05:36:42] (03CR) 10Giuseppe Lavagetto: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/626110 (https://phabricator.wikimedia.org/T255879) (owner: 10Giuseppe Lavagetto) [05:38:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2026 on es2 for the first time with minimum weight T261717', diff saved to https://phabricator.wikimedia.org/P12569 and previous config saved to /var/cache/conftool/dbconfig/20200914-053844-marostegui.json [05:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:38:53] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [05:39:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cxserver: enable the service proxy everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/626110 (https://phabricator.wikimedia.org/T255879) (owner: 10Giuseppe Lavagetto) [05:41:09] (03Merged) 10jenkins-bot: cxserver: enable the service proxy everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/626110 (https://phabricator.wikimedia.org/T255879) (owner: 10Giuseppe Lavagetto) [05:43:27] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . 
[05:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:03] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [05:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:49:59] _joe_: sure. [05:51:10] <_joe_> kart_: I'm running some translations and it seems to work pretty well [05:53:19] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [05:53:29] _joe_: cx loads fine, rechecking workflow and Grafana looks good as of now. [05:54:03] <_joe_> kart_: thanks [05:54:25] !log Truncate tendril.general_log_sampled on db1115 - T262782 [05:54:26] !log execute "gnt-instance modify -B vcpus=4 an-tool1009.eqiad.wmnet" on ganeti1011 - T258768 [05:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:31] T262782: db1115's (tendril) disk filling up - https://phabricator.wikimedia.org/T262782 [05:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:54:36] T258768: Move Hue to a Buster VM - https://phabricator.wikimedia.org/T258768 [05:54:50] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Rakefile: uniform function commenting style [deployment-charts] - 10https://gerrit.wikimedia.org/r/626922 (owner: 10Giuseppe Lavagetto) [05:56:08] (03Merged) 10jenkins-bot: Rakefile: uniform function commenting style [deployment-charts] - 10https://gerrit.wikimedia.org/r/626922 (owner: 10Giuseppe Lavagetto) [05:59:15] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [06:05:57] (03PS2) 10Giuseppe Lavagetto: mobileapps: use restbase-for-services in staging 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/626269 [06:09:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: use restbase-for-services in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/626269 (owner: 10Giuseppe Lavagetto) [06:09:44] (03Merged) 10jenkins-bot: mobileapps: use restbase-for-services in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/626269 (owner: 10Giuseppe Lavagetto) [06:17:01] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Sudhanshu Gautam - https://phabricator.wikimedia.org/T262785 (10SGautam_WMF) [06:19:01] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [06:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:20:10] (03CR) 10Giuseppe Lavagetto: "@Mholloway just as a FYI, we're switching mobileapps to use another proxy port for restbase that ensures longer timeouts." [deployment-charts] - 10https://gerrit.wikimedia.org/r/626270 (owner: 10Giuseppe Lavagetto) [06:24:55] (03PS2) 10Giuseppe Lavagetto: wikifeeds: use the service proxy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/626132 (https://phabricator.wikimedia.org/T255878) [06:38:05] (03PS8) 10Elukey: Add custom ferm srange to Kafka Jumbo brokers [puppet] - 10https://gerrit.wikimedia.org/r/611168 (https://phabricator.wikimedia.org/T204957) [06:54:52] (03CR) 10Elukey: [C: 03+2] Add custom ferm srange to Kafka Jumbo brokers [puppet] - 10https://gerrit.wikimedia.org/r/611168 (https://phabricator.wikimedia.org/T204957) (owner: 10Elukey) [06:56:26] !log slowly rollout ferm rules on Kafka-Jumbo hosts (see https://gerrit.wikimedia.org/r/611168) [06:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:12] 10Operations, 10Language-Team, 10Wikimedia-Mailing-lists: localisation-team mailing list to be archived and made read-only - https://phabricator.wikimedia.org/T262788 (10Majavah) 
[07:04:00] (03PS1) 10Elukey: role::kafka::jumbo::broker: add kafkamon1001 to the ferm list [puppet] - 10https://gerrit.wikimedia.org/r/626928 (https://phabricator.wikimedia.org/T204957) [07:05:09] (03CR) 10Giuseppe Lavagetto: wikifeeds: use the service proxy in staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/626132 (https://phabricator.wikimedia.org/T255878) (owner: 10Giuseppe Lavagetto) [07:05:49] (03PS2) 10Elukey: role::kafka::jumbo::broker: add kafkamon1001 to the ferm list [puppet] - 10https://gerrit.wikimedia.org/r/626928 (https://phabricator.wikimedia.org/T204957) [07:07:07] (03CR) 10Muehlenhoff: "Didn't 623902 switch the active server to kafkamon1002? And what about the codfw counterpart, this would be icky to notice in case of a fa" [puppet] - 10https://gerrit.wikimedia.org/r/626928 (https://phabricator.wikimedia.org/T204957) (owner: 10Elukey) [07:07:28] moritzm: yes I noticed now about 1002, going to add it [07:07:32] thanks for the check :) [07:07:58] ok :-) [07:08:52] (03PS4) 10Giuseppe Lavagetto: citoid: add TLS LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625600 (https://phabricator.wikimedia.org/T255868) [07:08:54] (03CR) 10Elukey: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/626928 (https://phabricator.wikimedia.org/T204957) (owner: 10Elukey) [07:10:41] (03CR) 10JMeybohm: [C: 03+1] citoid: add TLS LVS endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625600 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [07:11:49] (03PS3) 10Elukey: role::kafka::jumbo::broker: add kafkamon1001 to the ferm list [puppet] - 10https://gerrit.wikimedia.org/r/626928 (https://phabricator.wikimedia.org/T204957) [07:11:52] moritzm: --^ [07:14:30] no I think I added one ) more [07:15:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (commit message needs updating, though)" [puppet] - 10https://gerrit.wikimedia.org/r/626928 (https://phabricator.wikimedia.org/T204957) (owner: 
10Elukey) [07:16:08] (03PS4) 10Elukey: role::kafka::jumbo::broker: add kafkamon* to the ferm list [puppet] - 10https://gerrit.wikimedia.org/r/626928 (https://phabricator.wikimedia.org/T204957) [07:17:03] (03CR) 10Elukey: [C: 03+2] role::kafka::jumbo::broker: add kafkamon* to the ferm list [puppet] - 10https://gerrit.wikimedia.org/r/626928 (https://phabricator.wikimedia.org/T204957) (owner: 10Elukey) [07:29:05] moritzm: there might be a blocker for the ferm rules, namely not all the hosts in the list (like centralog1001, netflowXXXX, etc..) have AAAA records [07:29:15] so I see ipv6 traffic blocked [07:29:29] in syslog, but there's not much I can do [07:30:50] I'll try to send a patch to add AAAA records for netflow, should be ok [07:36:38] (03PS1) 10Elukey: Add AAAA/PTR records for netflow[345]001 [dns] - 10https://gerrit.wikimedia.org/r/627195 (https://phabricator.wikimedia.org/T204957) [07:37:02] (03CR) 10jerkins-bot: [V: 04-1] Add AAAA/PTR records for netflow[345]001 [dns] - 10https://gerrit.wikimedia.org/r/627195 (https://phabricator.wikimedia.org/T204957) (owner: 10Elukey) [07:39:20] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [07:39:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2026 on es2 with more weight T261717', diff saved to https://phabricator.wikimedia.org/P12570 and previous config saved to /var/cache/conftool/dbconfig/20200914-073919-marostegui.json [07:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:27] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [07:39:34] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [07:39:36] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:39:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:07] !log shutting down etcd100[1-3] (scheduled for decommission, replaced by kubetcd100[4-6]) [07:40:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:48] sounds good [07:48:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] citoid: add TLS LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625600 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [07:48:29] (03PS2) 10Elukey: Add AAAA/PTR records for netflow[345]001 [dns] - 10https://gerrit.wikimedia.org/r/627195 (https://phabricator.wikimedia.org/T204957) [07:51:27] IIUC these can be done without going through netbox, but I'll double check [07:52:58] <_joe_> !log restarting pybal on lvs1016 [07:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:18] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:55:07] <_joe_> XioNoX: uh? [07:55:44] <_joe_> I restarted one pybal [07:55:49] <_joe_> the other session should be up [07:56:28] _joe_: let me check [07:57:03] _joe_: they're all established now [07:57:24] <_joe_> so... bad luck with the check time? [07:57:32] most likely yeah [07:58:01] <_joe_> !log restarting pybal on lvs1015 [07:58:01] (03PS1) 10Marostegui: mariadb: Productinize es2027 into es3 [puppet] - 10https://gerrit.wikimedia.org/r/627198 (https://phabricator.wikimedia.org/T261717) [07:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:30] <_joe_> XioNoX: but still, one of the two must be established when I restart pybal on one server [07:58:47] _joe_: what do you mean by one of the two? 
[07:59:26] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - citoid_4003: Servers kubernetes1014.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:00:08] (03PS1) 10Muehlenhoff: piwik: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627199 [08:00:10] (03PS1) 10Muehlenhoff: yarn: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627200 [08:00:12] (03PS1) 10Muehlenhoff: icinga: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627201 [08:00:15] (03PS1) 10Muehlenhoff: turnilo: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627202 [08:00:17] (03PS1) 10Muehlenhoff: superset: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627203 [08:00:19] (03PS1) 10Muehlenhoff: thanos: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627204 [08:00:22] (03PS1) 10Muehlenhoff: hue: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627205 [08:00:24] (03PS1) 10Muehlenhoff: alerts: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627226 [08:00:38] <_joe_> uhm interesting, what did I get wrong there [08:02:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set es2018 as es3 codfw master T261717', diff saved to https://phabricator.wikimedia.org/P12571 and previous config saved to /var/cache/conftool/dbconfig/20200914-080239-marostegui.json [08:02:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:46] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [08:03:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2017 to clone es2027 - T261717', diff saved to https://phabricator.wikimedia.org/P12572 and previous config saved to /var/cache/conftool/dbconfig/20200914-080344-marostegui.json [08:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:08] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - citoid_4003: Servers kubernetes1014.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:04:12] (03CR) 10Marostegui: [C: 03+2] mariadb: Productinize es2027 into es3 [puppet] - 10https://gerrit.wikimedia.org/r/627198 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [08:04:14] <_joe_> that's me ^^ [08:04:17] <_joe_> known issue [08:04:22] (03CR) 10Hashar: "To be fair, I am not even sure we still require that gitconfig statement:" [puppet] - 10https://gerrit.wikimedia.org/r/626757 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [08:04:53] !log Stop MySQL on es2017 to clone es2027 [08:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:05] !log prometheus codfw ops, extend the lv by 100G [08:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:14] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:05:30] (03CR) 10Ayounsi: [C: 03+1] "LTGM!" 
[dns] - 10https://gerrit.wikimedia.org/r/627195 (https://phabricator.wikimedia.org/T204957) (owner: 10Elukey) [08:05:34] (03PS1) 10Giuseppe Lavagetto: citoid: check https when connecting to https [puppet] - 10https://gerrit.wikimedia.org/r/627229 [08:05:57] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] citoid: check https when connecting to https [puppet] - 10https://gerrit.wikimedia.org/r/627229 (owner: 10Giuseppe Lavagetto) [08:06:55] FYI I'll merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/626186 shortly, no impact expected but it'll trigger an rsyslog restart at the next puppet run [08:07:18] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.19:4003]) https://wikitech.wikimedia.org/wiki/PyBal [08:07:22] PROBLEM - PyBal connections to etcd on lvs2010 is CRITICAL: CRITICAL: 81 connections established with conf2001.codfw.wmnet:2379 (min=82) https://wikitech.wikimedia.org/wiki/PyBal [08:09:42] <_joe_> !log restarting pybal on lvs1015 [08:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:50] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:09:58] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:10:09] <_joe_> again, that's just one pybal [08:10:12] <_joe_> not both [08:10:55] (03CR) 10Filippo Giunchedi: [C: 03+2] base: add remote syslog queues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/626186 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [08:11:38] (03CR) 10Elukey: [C: 03+2] Add AAAA/PTR records for netflow[345]001 [dns] - 10https://gerrit.wikimedia.org/r/627195 (https://phabricator.wikimedia.org/T204957) (owner: 10Elukey) [08:12:48] <_joe_> !log restarting pybal on lvs2010 [08:12:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:12] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:13:18] RECOVERY - PyBal connections to etcd on lvs2010 is OK: OK: 82 connections established with conf2001.codfw.wmnet:2379 (min=82) https://wikitech.wikimedia.org/wiki/PyBal [08:13:38] 10Operations, 10netops: every Sunday at 00:00 UTC, logrotate fails on netflow hosts - https://phabricator.wikimedia.org/T257128 (10ayounsi) 05Open→03Resolved a:03ayounsi T262751 had a more verbose error log: ` Sep 13 00:00:01 netflow1001 systemd[1]: Starting Rotate log files... Sep 13 00:00:01 netflow100... [08:14:05] !log Stop MySQL on db2125 for on-site maintenance - T260670 [08:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:11] T260670: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 [08:14:44] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) @Papaul mysql stopped, you can proceed with this host whenever you want. 
[08:14:53] <_joe_> !log restarting php on mw2297, php-fpm stuck in SIGILL [08:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:20] RECOVERY - PHP7 rendering on mw2297 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:15:28] RECOVERY - Apache HTTP on mw2297 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:16:24] <_joe_> !log repooling mw2297 [08:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:08] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.19:4003]) https://wikitech.wikimedia.org/wiki/PyBal [08:17:32] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 61 connections established with conf2001.codfw.wmnet:2379 (min=62) https://wikitech.wikimedia.org/wiki/PyBal [08:17:50] <_joe_> !log restarting pybal on lvs2009 [08:17:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:00] (03PS1) 10Ayounsi: Add ##PRIMARY## to allowed interfaces names [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627231 [08:19:54] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:22:16] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [08:22:38] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 62 connections established with conf2001.codfw.wmnet:2379 (min=62) https://wikitech.wikimedia.org/wiki/PyBal [08:23:24] (03PS1) 10Filippo Giunchedi: hieradata: enable rsyslog queues in ulsfo/eqsin [puppet] - 10https://gerrit.wikimedia.org/r/627232 (https://phabricator.wikimedia.org/T226703) [08:27:22] (03CR) 10JMeybohm: [C: 03+1] citoid: promote https lvs to production status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/625601 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [08:32:41] (03PS4) 10Hashar: deployment: remove obsolete Trebuchet git config [puppet] - 10https://gerrit.wikimedia.org/r/626757 (https://phabricator.wikimedia.org/T262244) [08:33:34] (03CR) 10Hashar: "Repurposed to no more generate the obsolete config. 
The deployment servers will still have the obsolete /etc/gitconfig left behind but we " [puppet] - 10https://gerrit.wikimedia.org/r/626757 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [08:33:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] citoid: promote https lvs to production status [puppet] - 10https://gerrit.wikimedia.org/r/625601 (https://phabricator.wikimedia.org/T255868) (owner: 10Giuseppe Lavagetto) [08:34:07] (03PS4) 10Giuseppe Lavagetto: citoid: promote https lvs to production status [puppet] - 10https://gerrit.wikimedia.org/r/625601 (https://phabricator.wikimedia.org/T255868) [08:35:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2026 on es2 with more weight T261717', diff saved to https://phabricator.wikimedia.org/P12573 and previous config saved to /var/cache/conftool/dbconfig/20200914-083525-marostegui.json [08:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:33] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [08:36:51] (03PS4) 10Hashar: phabricator: migrate to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/626758 (https://phabricator.wikimedia.org/T262244) [08:37:03] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/626758 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [08:40:01] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1001/25055/" [puppet] - 10https://gerrit.wikimedia.org/r/627232 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [08:40:23] seeking kind souls for (easy) review ^ [08:41:07] (03CR) 10Hashar: [C: 03+1] "Rebased to no more depends on the unrelated change https://gerrit.wikimedia.org/r/c/operations/puppet/+/626757/" [puppet] - 10https://gerrit.wikimedia.org/r/626758 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [08:42:39] !log upgrading remaining stretch systems to git 2.20 T262244 
[08:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:45] T262244: Upgrade git fleet wide to git 2.20 - https://phabricator.wikimedia.org/T262244 [08:45:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool es2026 on es2 with more weight T261717', diff saved to https://phabricator.wikimedia.org/P12574 and previous config saved to /var/cache/conftool/dbconfig/20200914-084509-marostegui.json [08:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:16] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [08:47:44] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1003/25057/" [puppet] - 10https://gerrit.wikimedia.org/r/623966 (https://phabricator.wikimedia.org/T261633) (owner: 10Filippo Giunchedi) [08:47:53] (03PS2) 10Filippo Giunchedi: statsd_exporter: stop tracking local statsd connections [puppet] - 10https://gerrit.wikimedia.org/r/623966 (https://phabricator.wikimedia.org/T261633) [08:49:38] !log start the OTRS upgrade to 6.0.29 T187984 [08:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:45] T187984: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 [08:49:50] PROBLEM - Check systemd state on mendelevium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:50:04] ignore ^ [08:50:50] should I stop the replication? [08:51:35] (03PS3) 10Giuseppe Lavagetto: service_proxy: switch citoid to TLS [puppet] - 10https://gerrit.wikimedia.org/r/625602 (https://phabricator.wikimedia.org/T255868) [08:54:38] PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:54:46] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:54:50] PROBLEM - Check systemd state on ms-be2053 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:54:57] yeah that's me, due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/623966 [08:55:00] fixing [08:55:13] sorry for the noise [08:56:00] PROBLEM - Check systemd state on ms-be1033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:56:24] PROBLEM - Check systemd state on ms-be1026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:57:18] PROBLEM - Check systemd state on ms-be1023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:57:30] PROBLEM - Check systemd state on ores2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:57:52] PROBLEM - Check systemd state on ms-fe2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:58:18] PROBLEM - Check systemd state on ms-be2045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:58:34] PROBLEM - Check systemd state on ores1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:58:34] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.0124 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [08:58:36] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:58:40] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:58:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully pool es2026 on es2 T261717', diff saved to https://phabricator.wikimedia.org/P12575 and previous config saved to /var/cache/conftool/dbconfig/20200914-085842-marostegui.json [08:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:50] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [08:58:56] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:10] PROBLEM - Check systemd state on ores1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:18] PROBLEM - Check systemd state on ms-be1049 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:38] PROBLEM - Check systemd state on ms-be1044 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:59:54] (CR) Giuseppe Lavagetto: [C: +2] service_proxy: switch citoid to TLS [puppet] - https://gerrit.wikimedia.org/r/625602 (https://phabricator.wikimedia.org/T255868) (owner: Giuseppe Lavagetto) [09:00:00] PROBLEM - Check systemd state on ms-be2032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:00:58] PROBLEM - Check systemd state on thanos-be1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:00] PROBLEM - Check systemd state on ms-be1056 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:01:06] <_joe_> uh [09:01:51] <_joe_> that's ferm on ms-be1056 [09:01:58] <_joe_> any idea what happened? [09:02:16] yes, see above [09:02:17] 08:54 yeah that's me, due to https://gerrit.wikimedia.org/r/c/operations/puppet/+/623966 [09:02:24] I'll revert for now [09:02:44] <_joe_> heh sorry it was too far back in the scroll [09:03:00] PROBLEM - Check systemd state on ores2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:04] PROBLEM - Check systemd state on ms-be1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:10] PROBLEM - Check systemd state on ms-fe2007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:16] PROBLEM - Check systemd state on ms-be2025 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:40] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:06] yeah super spammy :( [09:04:16] PROBLEM - Check systemd state on ms-be1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:23] (PS1) Filippo Giunchedi: Revert "statsd_exporter: stop tracking local statsd connections" [puppet] - https://gerrit.wikimedia.org/r/627234 [09:04:30] PROBLEM - Check systemd state on ms-fe1006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:34] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:46] PROBLEM - Check systemd state on ms-be2023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:52] PROBLEM - Check systemd state on ms-be2047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:58] PROBLEM - Check systemd state on ores1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:58] PROBLEM - Check systemd state on thanos-be2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:18] PROBLEM - Check systemd state on ms-be2021 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:39] (CR) Filippo Giunchedi: [C: +2] Revert "statsd_exporter: stop tracking local statsd connections" [puppet] - https://gerrit.wikimedia.org/r/627234 (owner: Filippo Giunchedi) [09:05:40] PROBLEM - Check systemd state on ms-be1058 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:06:14] PROBLEM - Check systemd state on logstash2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:06:24] PROBLEM - Check systemd state on ores2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:06:24] PROBLEM - Check systemd state on ores2008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:06:25] godog: it's ferm failing AFAICT with Unrecognized keyword: srange [09:06:34] PROBLEM - Check systemd state on thumbor2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:06:36] shout if you need a hand [09:06:44] PROBLEM - Check systemd state on ms-be1047 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:06] PROBLEM - Check systemd state on ms-fe1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:10] volans: thanks, yeah I thought I could (hot)fix it but no, reverting instead [09:07:10] PROBLEM - Check systemd state on ms-be1029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:12] PROBLEM - Check systemd state on ms-be2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:17] k [09:08:04] recoveries shortly [09:08:06] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:08] PROBLEM - Check systemd state on ores2009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:30] RECOVERY - Check systemd state on ms-fe2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:32] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:34] RECOVERY - Check systemd state on ms-be2021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:08:40] apparently ferm::client with 'drange' argument wasn't used anywhere yet [09:08:52] RECOVERY - Check systemd state on ms-be2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:06] RECOVERY - Check systemd state on ms-be2045 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:07] !log db1077. 
stop slave ; show slave status > /home/akosiaris/show_slave_status; reset slave all T187984 [09:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:14] RECOVERY - Check systemd state on ms-be1032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:14] T187984: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 [09:09:27] (CR) Alexandros Kosiaris: [C: +2] Promote otrs1001 as the main otrs host [puppet] - https://gerrit.wikimedia.org/r/626631 (https://phabricator.wikimedia.org/T187984) (owner: Alexandros Kosiaris) [09:09:28] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:28] RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:30] RECOVERY - Check systemd state on ms-be2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:32] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:36] RECOVERY - Check systemd state on ms-be1033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:36] RECOVERY - Check systemd state on ms-be1028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:36] RECOVERY - Check systemd state on ms-be1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:38] RECOVERY - Check systemd state on logstash2006 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:40] RECOVERY - Check systemd state on ms-be1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:44] RECOVERY - Check systemd state on ms-fe2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:46] RECOVERY - Check systemd state on ms-be2023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:48] RECOVERY - Check systemd state on ms-be2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:48] RECOVERY - Check systemd state on ores2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:48] RECOVERY - Check systemd state on ores2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:50] RECOVERY - Check systemd state on ms-be2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:50] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:52] RECOVERY - Check systemd state on ms-be2047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:54] RECOVERY - Check systemd state on ores2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:54] RECOVERY - Check systemd state on ores2009 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:54] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:58] RECOVERY - Check systemd state on ores1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:58] RECOVERY - Check systemd state on thanos-be2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:04] RECOVERY - Check systemd state on ms-be1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:08] RECOVERY - Check systemd state on ms-be1047 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:10] RECOVERY - Check systemd state on ores1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:18] RECOVERY - Check systemd state on ms-be1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:30] RECOVERY - Check systemd state on ms-fe1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:38] RECOVERY - Check systemd state on ms-be1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:42] RECOVERY - Check systemd state on thanos-be1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:44] RECOVERY - Check systemd state on ms-be1056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[09:10:46] RECOVERY - Check systemd state on ms-be1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:10:46] RECOVERY - Check systemd state on ms-be1044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:10] RECOVERY - Check systemd state on ores1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:16] RECOVERY - Check systemd state on ms-fe1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:18] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:20] RECOVERY - Check systemd state on ores2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:46] RECOVERY - Check systemd state on thumbor2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:12:09] (CR) Volans: [C: +1] "LGTM, thanks for the patch." [software/netbox-extras] - https://gerrit.wikimedia.org/r/627231 (owner: Ayounsi) [09:13:00] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.00124 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [09:15:40] RECOVERY - Check systemd state on otrs1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:16:26] RECOVERY - OTRS SMTP on otrs1001 is OK: SMTP OK - 0.005 sec. 
response time https://wikitech.wikimedia.org/wiki/OTRS%23Troubleshooting [09:17:58] Operations, Patch-For-Review, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), git-protocol-v2: Upgrade git fleet wide to git 2.20 - https://phabricator.wikimedia.org/T262244 (MoritzMuehlenhoff) All production hosts using Stretch h... [09:20:14] (CR) Muehlenhoff: [C: +1] "Ack, sounds good, then. All of prod running Stretch now has 2.20" [puppet] - https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [09:25:12] Operations, netops: Upgrade Fastnetmon to 1.1.7 - https://phabricator.wikimedia.org/T257035 (ayounsi) [09:26:17] !log T187984 migration script on otrs1001 now in step 8/41 [09:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:24] T187984: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 [09:27:36] !log T187984 migration script on otrs1001 now in step 8/44 (correction) [09:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:07] jouncebot: next [09:28:08] In 1 hour(s) and 1 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200914T1030) [09:29:56] (PS1) Filippo Giunchedi: ferm: fix keyword in NO_TRACK_R_CLIENT [puppet] - https://gerrit.wikimedia.org/r/627237 (https://phabricator.wikimedia.org/T261633) [09:30:39] moritzm: ^ for your eyes [09:31:30] having a look in a bit [09:37:36] ok, thanks! 
[09:37:52] (CR) Ayounsi: [C: +2] Add ##PRIMARY## to allowed interfaces names [software/netbox-extras] - https://gerrit.wikimedia.org/r/627231 (owner: Ayounsi) [09:41:10] (CR) Jbond: [C: +1] scap: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/624343 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn) [09:41:35] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/624376 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn) [09:42:05] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/624369 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn) [09:43:38] PROBLEM - MariaDB Replica Lag: m2 on db2133 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 803.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:45:45] jynus: do you want me to silence this? ^ [09:46:04] sure, @meeting [09:46:14] jynus: I will take care of it [09:46:56] Silenced db1117, db2133 and db2078 for 24h [09:48:14] (CR) Jbond: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/624340 (owner: Dzahn) [09:49:49] (CR) Filippo Giunchedi: "Haven't tested it but LGTM" [puppet] - https://gerrit.wikimedia.org/r/626405 (https://phabricator.wikimedia.org/T251293) (owner: Cwhite) [09:51:13] (PS1) Elukey: Refactor ferm rules in hiera for Kafka Jumbo to increase maintainability [puppet] - https://gerrit.wikimedia.org/r/627241 (https://phabricator.wikimedia.org/T204957) [09:53:29] (CR) Jbond: puppetmaster::backend: replace hiera with lookup, data types (1 comment) [puppet] - https://gerrit.wikimedia.org/r/624342 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn) [09:55:06] (PS6) Hashar: base: enable git protocol version2 fleet wide [puppet] - https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) [09:56:09] (PS2) Elukey: Refactor ferm rules in hiera for Kafka 
Jumbo to increase maintainability [puppet] - https://gerrit.wikimedia.org/r/627241 (https://phabricator.wikimedia.org/T204957) [09:57:11] (CR) Hashar: [C: +1] "Rebased to resolve Gerrit flagging this as being in merge conflict due to "add remote syslog queues" which is actually does not conflict b" [puppet] - https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [09:57:38] (CR) Jbond: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/624335 (owner: Dzahn) [09:59:45] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/25059/" [puppet] - https://gerrit.wikimedia.org/r/627241 (https://phabricator.wikimedia.org/T204957) (owner: Elukey) [10:03:52] (CR) Elukey: [C: +2] Refactor ferm rules in hiera for Kafka Jumbo to increase maintainability [puppet] - https://gerrit.wikimedia.org/r/627241 (https://phabricator.wikimedia.org/T204957) (owner: Elukey) [10:08:39] (CR) Filippo Giunchedi: [C: +1] Streamline Hiera settings [puppet] - https://gerrit.wikimedia.org/r/626647 (owner: Muehlenhoff) [10:10:46] Operations, Wikidata, Wikimedia-Mailing-lists: Stop archiving the wikidata-bugs mailinglist in pipermail - https://phabricator.wikimedia.org/T262773 (Lydia_Pintscher) Outsch. I didn't realize it's that much... From my side it's fine not to archive this one. I don't think anyone should rely on the arch... [10:11:03] Operations, Wikidata, Wikimedia-Mailing-lists: Stop archiving the wikidata-bugs mailinglist in pipermail - https://phabricator.wikimedia.org/T262773 (Lydia_Pintscher) Deleting the archive should also be fine. 
[10:11:19] (CR) Filippo Giunchedi: [C: +1] Add IDP service registration for grafana [puppet] - https://gerrit.wikimedia.org/r/626639 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [10:12:42] (CR) Hashar: "The debian-glue job fails because it can not find the upstream tag to generate the tarball from (upstream/4.7.1). That comes from the def" [debs/hue] - https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: Elukey) [10:13:50] (CR) Filippo Giunchedi: [C: +1] Add CAS-enabled vhost for editors/admins [puppet] - https://gerrit.wikimedia.org/r/626627 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [10:14:21] (CR) Muehlenhoff: [C: +1] "Looks good, clearly no one used this since it was added in 2017 :-)" [puppet] - https://gerrit.wikimedia.org/r/627237 (https://phabricator.wikimedia.org/T261633) (owner: Filippo Giunchedi) [10:14:24] (CR) Elukey: "> Patch Set 7:" [debs/hue] - https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: Elukey) [10:17:18] (CR) Muehlenhoff: [C: +2] Streamline Hiera settings [puppet] - https://gerrit.wikimedia.org/r/626647 (owner: Muehlenhoff) [10:18:19] (CR) Filippo Giunchedi: [C: +2] ferm: fix keyword in NO_TRACK_R_CLIENT [puppet] - https://gerrit.wikimedia.org/r/627237 (https://phabricator.wikimedia.org/T261633) (owner: Filippo Giunchedi) [10:20:58] (PS1) Filippo Giunchedi: statsd_exporter: stop tracking local statsd connections [puppet] - https://gerrit.wikimedia.org/r/627246 (https://phabricator.wikimedia.org/T261633) [10:22:36] (CR) Muehlenhoff: [C: +2] Add IDP service registration for grafana [puppet] - https://gerrit.wikimedia.org/r/626639 (https://phabricator.wikimedia.org/T262512) (owner: Muehlenhoff) [10:25:07] (PS1) Elukey: role::kafka::jumbo::brokers: add webperf nodes to the ferm list [puppet] - https://gerrit.wikimedia.org/r/627247 
(https://phabricator.wikimedia.org/T204957) [10:26:30] (CR) Filippo Giunchedi: "Take #2, PCC https://puppet-compiler.wmflabs.org/compiler1003/25060/" [puppet] - https://gerrit.wikimedia.org/r/627246 (https://phabricator.wikimedia.org/T261633) (owner: Filippo Giunchedi) [10:26:32] (CR) Elukey: [C: +2] role::kafka::jumbo::brokers: add webperf nodes to the ferm list [puppet] - https://gerrit.wikimedia.org/r/627247 (https://phabricator.wikimedia.org/T204957) (owner: Elukey) [10:28:23] Operations, Page Content Service, Platform Engineering, Product-Infrastructure-Team-Backlog, and 2 others: page/summary and page/mobile-html keeps responding code 429 in zh.wiki - https://phabricator.wikimedia.org/T262705 (Jgiannelos) Hi @Pchelolo >>! In T262705#6454901, @Pchelolo wrote: > cc @... [10:28:39] (CR) Jbond: nginx: add data types (2 comments) [puppet] - https://gerrit.wikimedia.org/r/624357 (owner: Dzahn) [10:28:55] jouncebot: noq [10:28:59] jouncebot: now [10:28:59] No deployments scheduled for the next 0 hour(s) and 1 minute(s) [10:29:03] (CR) Jbond: [C: +1] service::node: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/624346 (owner: Dzahn) [10:29:50] (CR) Jbond: ntp::daemon: replace hiera() with lookup(), lint (1 comment) [puppet] - https://gerrit.wikimedia.org/r/624332 (owner: Dzahn) [10:30:04] jan_drewniak: May I have your attention please! Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200914T1030) [10:30:10] (CR) Hashar: "> Yep I am aware, it is still WIP since I am fixing some bugs with upstream. 
Since I'll have to also include some recent commits on top of" [debs/hue] - https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: Elukey) [10:30:17] jan is not here :( [10:30:59] (PS1) Arturo Borrero Gonzalez: cloudgw: bonding: specify original NIC [puppet] - https://gerrit.wikimedia.org/r/627248 (https://phabricator.wikimedia.org/T261724) [10:31:17] maybe portals are postponed? [10:31:31] (PS2) Muehlenhoff: piwik: Also enforce access setting in the IDP service definition [puppet] - https://gerrit.wikimedia.org/r/627199 [10:31:32] Operations, ops-codfw: ps1-b3-codfw AB feed current > 12A - https://phabricator.wikimedia.org/T262809 (ayounsi) p:Triage→Medium [10:33:50] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/627249 (https://phabricator.wikimedia.org/T128546) [10:35:39] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/627249 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [10:36:06] Urbanecm: there ^ [10:36:17] I see, but I don't see him in this chan :) [10:36:26] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/627249 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [10:36:31] nope, but the work is being done :D [10:36:36] true [10:37:16] not sure what https://gerrit.wikimedia.org/r/c/wikimedia/portals/deploy/+/627235 serves for then [10:37:18] (CR) Arturo Borrero Gonzalez: [C: +2] cloudgw: bonding: specify original NIC [puppet] - https://gerrit.wikimedia.org/r/627248 (https://phabricator.wikimedia.org/T261724) (owner: Arturo Borrero Gonzalez) [10:37:31] jan seems not to be using that repo for the updates? 
[10:38:08] it's a submodule IIRC [10:38:24] Operations, Research, Wikimedia-Apache-configuration, Patch-For-Review: Redirect wikimedia.org/research to research.wikimedia.org instead of some external closed survey - https://phabricator.wikimedia.org/T259979 (Aklapper) After no feedback for one month, how to receive a review from #Operations... [10:38:41] then portals/deploy is outdated? [10:38:45] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:622116| Bumping portals to master (T128546)]] (duration: 01m 00s) [10:38:49] oh here's jan_drewniak [10:38:51] hi [10:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:52] hi! [10:38:53] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:39:14] jan_drewniak: is wikimedia/portals/deploy still used? [10:39:32] sorry, didn't realize I was logged out of this channel. yes, it is used! [10:39:42] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:622116| Bumping portals to master (T128546)]] (duration: 00m 56s) [10:39:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:02] jan_drewniak: ack, it's because of T262152 [10:40:03] T262152: Updating www.wikinews.org portal (2020-09-06) - https://phabricator.wikimedia.org/T262152 [10:41:01] right, the situation is still a little bit odd with the non-wikipedia project portals [10:41:09] yup [10:41:23] well, for me it's easier to upgrade them at meta :) [10:42:01] https://gerrit.wikimedia.org/r/c/wikimedia/portals/deploy/+/627235 <-- so jan_drewniak - what do we do with these kind of patches? [10:42:14] shall we merge them? 
[10:43:27] !log deploy scap 3.15.0-1 to canaries - T261234 [10:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:34] T261234: Deploy Scap version 3.15.0-1 - https://phabricator.wikimedia.org/T261234 [10:43:53] I think we can abandon them now. I just uploaded and deployed a "manually" built patch (i.e. built on my machine). That job that creates those patches was broken for a while, but I see it's been fixed now. So in the future, yeah, you can merge patches like that if you see them. [10:44:30] jan_drewniak: ack, and then the submodule in operations/mediawiki-config will use those? [10:44:53] yup, that is basically how everything works :) [10:45:05] Okay! [10:45:22] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/627200 (owner: Muehlenhoff) [10:45:30] I did update the views.json file on Meta last week as well [10:45:40] which caused some sister-projects updates [10:45:53] looks all good now [10:46:01] (CR) Jbond: [C: +1] icinga: Also enforce access setting in the IDP service definition (1 comment) [puppet] - https://gerrit.wikimedia.org/r/627201 (owner: Muehlenhoff) [10:46:16] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/627202 (owner: Muehlenhoff) [10:46:42] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:46:42] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/627203 (owner: Muehlenhoff) [10:47:14] hauskatze: great, thanks for doing that. The portals repository actually fetches the sister project portals and copies them into the repo as part of the build step, so the updates to those pages are still deployed. 
[10:47:22] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/627204 (owner: Muehlenhoff) [10:47:38] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/627205 (owner: Muehlenhoff) [10:47:45] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/627226 (owner: Muehlenhoff) [10:48:36] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [10:48:37] hauskatze: It's been a bit of a resourcing issue to port the sister-project portals to the git repo, which is why they are separate from the wikipedia.org portal. Ideally they should all be in the same place :/ [10:49:16] jan_drewniak: yup; but for now it sorta works and it's easy to upgrade for us meta-admins [10:49:33] gulp indeed seems to fetch the meta templates [10:49:38] (CR) Jbond: [C: +2] base: enable git protocol version2 fleet wide [puppet] - https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [10:49:50] and the Module:Project_portals is very handy to update those pages [10:49:55] (CR) Jbond: [C: +2] "Thanks for the info, will deploy now" [puppet] - https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) (owner: Hashar) [10:49:58] just {{subst:}} and done :D [10:50:33] !log enable git protocol version2 fleet wide [10:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:47] [!!] the IP allocation migration to Netbox is happening in ~10 minutes. Please do not merge DNS patches until it's over. 
https://wikitech.wikimedia.org/wiki/DNS/Netbox for context if you missed my email :) [10:51:04] (03PS1) 10Hnowlan: api-gateway: migrate to new helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/627250 [10:51:19] (03CR) 10Muehlenhoff: icinga: Also enforce access setting in the IDP service definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/627201 (owner: 10Muehlenhoff) [10:52:53] (03CR) 10Jbond: [C: 03+2] deployment: remove obsolete Trebuchet git config [puppet] - 10https://gerrit.wikimedia.org/r/626757 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [10:53:08] (03CR) 10Hnowlan: [C: 03+2] changeprop-jobqueue: convert to new helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/625632 (owner: 10Hnowlan) [10:53:52] <_joe_> hnowlan: <3 [10:54:08] (03CR) 10Jbond: [C: 03+2] phabricator: migrate to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/626758 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [10:54:22] (03Merged) 10jenkins-bot: changeprop-jobqueue: convert to new helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/625632 (owner: 10Hnowlan) [10:59:02] !log create LACP bundle to labtestvirt2003 [10:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:47] (03PS1) 10Jbond: git: remove the creates line in update-gitconfig as its refresh only [puppet] - 10https://gerrit.wikimedia.org/r/627251 [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: May I have your attention please! European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200914T1100) [11:00:04] hauskatze: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:12] (03CR) 10Jbond: [C: 03+2] git: remove the creates line in update-gitconfig as its refresh only [puppet] - 10https://gerrit.wikimedia.org/r/627251 (owner: 10Jbond) [11:00:16] hauskatze: i can deploy today [11:00:16] Presente [11:00:23] !log Mass importing IPs from PuppetDB into Netbox T244153 [11:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:29] T244153: Import IP addresses, interfaces and DNS names into Netbox for Primary Interfaces - https://phabricator.wikimedia.org/T244153 [11:01:10] (03CR) 10Muehlenhoff: [C: 03+2] piwik: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627199 (owner: 10Muehlenhoff) [11:01:25] (03CR) 10Urbanecm: [C: 03+2] [frwiktionary] Create new namespace "Conjugaison" & associated talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626877 (https://phabricator.wikimedia.org/T262298) (owner: 10MarcoAurelio) [11:02:13] (03Merged) 10jenkins-bot: [frwiktionary] Create new namespace "Conjugaison" & associated talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626877 (https://phabricator.wikimedia.org/T262298) (owner: 10MarcoAurelio) [11:03:00] hauskatze: straight syncing the namespace one [11:03:08] ack [11:04:09] (03PS2) 10Urbanecm: [itwiki] Increase $wgAutoConfirmAge and $wgAutoConfirmCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626876 (https://phabricator.wikimedia.org/T262738) (owner: 10MarcoAurelio) [11:04:13] (03CR) 10Urbanecm: [C: 03+2] [itwiki] Increase $wgAutoConfirmAge and $wgAutoConfirmCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626876 (https://phabricator.wikimedia.org/T262738) (owner: 10MarcoAurelio) [11:04:28] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 0ee0d8f7422afe9c4ce215613c1dd212da85a466: [frwiktionary] Create new namespace "Conjugaison" & associated talk (T262298) (duration: 00m 56s) [11:04:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:35] T262298: New Namespace for the French Wiktionary : "Conjugaison" - https://phabricator.wikimedia.org/T262298 [11:04:59] (03Merged) 10jenkins-bot: [itwiki] Increase $wgAutoConfirmAge and $wgAutoConfirmCount [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626876 (https://phabricator.wikimedia.org/T262738) (owner: 10MarcoAurelio) [11:05:31] !log [urbanecm@mwmaint2001 ~]$ mwscript namespaceDupes.php --wiki=frwiktionary --fix # T262298 # P12576 [11:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:03] hauskatze: namespace is live, and should work [11:06:06] (lmk if it doesn't) [11:06:07] (03PS1) 10Jbond: git: fix template and phab configueration [puppet] - 10https://gerrit.wikimedia.org/r/627253 [11:06:09] Urbanecm: adding | phaste after at the end of a command triggers ProdPasteBot right? [11:06:39] (03CR) 10Jbond: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/627251 (owner: 10Jbond) [11:06:42] hauskatze: yup, phaste is a script that pastes its input to phabricator [11:06:49] (03CR) 10Jbond: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/627253 (owner: 10Jbond) [11:06:55] i use that to make cmd output sharing easier :) [11:07:09] hauskatze: second patch is at mwdebug2001 [11:07:40] checking apisandbox on 2001, only way I can think of testing that one [11:08:34] (03PS2) 10Jbond: git: fix template and phab configueration [puppet] - 10https://gerrit.wikimedia.org/r/627253 [11:09:46] (03CR) 10Jbond: [C: 03+2] git: fix template and phab configueration [puppet] - 10https://gerrit.wikimedia.org/r/627253 (owner: 10Jbond) [11:09:56] !log Stop MySQL on s5 and s8 eqiad primary master - lag will show up on labsdb hosts T261455 [11:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:02] T261455: Mon, Sept 14 PDU Upgrade 12pm-4pm UTC- Racks C2 and C3 - https://phabricator.wikimedia.org/T261455 [11:10:07] hauskatze: my main account 
(340 itwiki edits) is autoconfirmed, WMF one (0 edits) is not [11:10:56] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Mon, Sept 14 PDU Upgrade 12pm-4pm UTC- Racks C2 and C3 - https://phabricator.wikimedia.org/T261455 (10Marostegui) >>! In T261455#6423007, @Marostegui wrote: > Please take extra care with db1087, db1100 and db1109, they are an eqiad masters and lots of slaves hang... [11:11:06] Urbanecm: it looks it cannot be checked [11:11:12] via api sandbox [11:11:14] (03CR) 10Hnowlan: [C: 03+2] changeprop: migrate to using the new simplified helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/625626 (owner: 10Hnowlan) [11:11:18] okay, so syncing anyway - at least it doesn't break anything :) [11:11:30] but if you tested it yourself and found it to work, then okay [11:12:37] (03Merged) 10jenkins-bot: changeprop: migrate to using the new simplified helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/625626 (owner: 10Hnowlan) [11:12:51] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 47fe87c5756f9e4d1aad059925a5b289322460c5: [itwiki] Increase $wgAutoConfirmAge and $wgAutoConfirmCount (T262738) (duration: 00m 56s) [11:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:56] hauskatze: done [11:12:57] T262738: Change autoconfirmed settings on itwiki - https://phabricator.wikimedia.org/T262738 [11:13:03] anything else? 
[11:13:09] not from me [11:13:14] closing then :) [11:13:19] !log EU B&C window done [11:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:53] (03CR) 10Volans: [C: 03+2] sre.ganeti.makevm: adapt to Netbox DNS automation [cookbooks] - 10https://gerrit.wikimedia.org/r/623545 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [11:15:16] 10Operations, 10ops-codfw: ps1-b3-codfw AB feed current > 12A - https://phabricator.wikimedia.org/T262809 (10ayounsi) Same for ps1-a6-codfw [11:15:18] (03CR) 10Volans: [C: 03+2] scripts: enable primary IPs options [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/626712 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [11:16:36] (03Merged) 10jenkins-bot: sre.ganeti.makevm: adapt to Netbox DNS automation [cookbooks] - 10https://gerrit.wikimedia.org/r/623545 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [11:17:09] (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: add Netbox DNS automation [cookbooks] - 10https://gerrit.wikimedia.org/r/626738 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [11:17:25] hashar: i have deployed your git config changes. 
I had to send a couple of follow-up patches to fix some minor issues. I tagged you on both; let me know if you need more info/context [11:17:44] jbond42: awesome thanks [11:18:01] np [11:18:14] (03Merged) 10jenkins-bot: sre.hosts.decommission: add Netbox DNS automation [cookbooks] - 10https://gerrit.wikimedia.org/r/626738 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [11:19:04] (03CR) 10Hashar: "Ahhhhh thank you :)" [puppet] - 10https://gerrit.wikimedia.org/r/627251 (owner: 10Jbond) [11:19:16] (03PS1) 10Majavah: Add arbcom-ru.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/627256 (https://phabricator.wikimedia.org/T262812) [11:19:44] !log Deploy MCR schema change on s1, this will generate lag on s1 labsdb - T238966 [11:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:51] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 [11:20:15] jbond42: sorry you had to do all those follow-ups. Clearly I was not focused on Friday when I made those changes :\ [11:20:29] volans: can dns patches be merged? [11:20:34] I will write the thankful announcement later this afternoon :] [11:20:37] meh, I reviewed them, so enough blame to share ;) [11:20:39] !log Remove triggers from db1124:3311 - T238966 [11:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:47] (03PS1) 10Majavah: Add arbcom_ruwiki to $private_wikis [puppet] - 10https://gerrit.wikimedia.org/r/627257 (https://phabricator.wikimedia.org/T262812) [11:20:57] hauskatze: which one? 
[11:21:02] depends a bit on the content :) [11:21:22] volans: you said 'no' a while ago, but https://gerrit.wikimedia.org/r/627256 [11:21:35] yes, thx for asking, still in the process [11:21:47] this one can go, no problem at all [11:22:00] it's unrelated to the current work [11:22:22] (03CR) 10MarcoAurelio: "Volans would prefer if we waited to merge this one after a process running now is finished." [dns] - 10https://gerrit.wikimedia.org/r/627256 (https://phabricator.wikimedia.org/T262812) (owner: 10Majavah) [11:22:38] oh, too quick to write <_< [11:22:50] lol [11:23:54] (03PS1) 10Marostegui: db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/627258 [11:24:26] (03CR) 10Marostegui: [C: 03+2] db1106: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/627258 (owner: 10Marostegui) [11:24:50] !log volans@cumin1001 START - Cookbook sre.hosts.decommission [11:24:51] (03CR) 10MarcoAurelio: "> Patch Set 1:" [dns] - 10https://gerrit.wikimedia.org/r/627256 (https://phabricator.wikimedia.org/T262812) (owner: 10Majavah) [11:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:14] fixed, somewhat [11:26:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 for MCR', diff saved to https://phabricator.wikimedia.org/P12578 and previous config saved to /var/cache/conftool/dbconfig/20200914-112648-marostegui.json [11:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:10] Urbanecm: so apparently I forgot to add an alias :| [11:27:24] hauskatze: I don't mind deploying a followup :) [11:27:25] (03PS2) 10Hnowlan: api-gateway: migrate to new helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/627250 [11:27:26] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [11:27:27] push a patch and ping me ;) [11:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:32] 10Operations, 
10Patch-For-Review: Upgrade debmonitor to Buster - https://phabricator.wikimedia.org/T261489 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: `debmonitor2001.codfw.wmnet` - debmonitor2001.codfw.wmnet (**WARN**) - Downtimed host on Icinga - Found... [11:27:36] ok [11:27:42] gerrit ui edit [11:28:37] (03CR) 10Hnowlan: "pcc https://puppet-compiler.wmflabs.org/compiler1003/25009/" [puppet] - 10https://gerrit.wikimedia.org/r/626119 (owner: 10Hnowlan) [11:29:38] Urbanecm: hmm, they requested `conj` in lowercase [11:29:42] is that possible at all? [11:29:50] shouldn't it be `CONJ` [11:30:00] wiktionary allows lowercases though [11:30:10] good question, I never tried to set a namespace with first letter lowercase [11:30:17] it may or may not work [11:30:27] hauskatze: we can just try that and see? :-) [11:30:50] okay [11:33:05] (03CR) 10MarcoAurelio: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627212 (https://phabricator.wikimedia.org/T262298) (owner: 10MarcoAurelio) [11:33:19] Urbanecm: ^ [11:33:28] [!!] IP allocation migration to Netbox completed. You can resume normal merges in the DNS repo, but keep in mind https://wikitech.wikimedia.org/wiki/DNS/Netbox#Transition_FAQ [11:35:08] I'm still not sure 'conj' would work [11:35:42] !log jmm@cumin1001 START - Cookbook sre.hosts.decommission [11:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:36] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [11:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:41] 10Operations, 10Patch-For-Review: Upgrade debmonitor to Buster - https://phabricator.wikimedia.org/T261489 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1001 for hosts: `debmonitor1001.eqiad.wmnet` - debmonitor1001.eqiad.wmnet (**WARN**) - Downtimed host on Icinga - Found Gan... 
[11:38:20] I'm not sure who's in charge of the beta cluster, but looks like the certificate is expired: The certificate for en.wikipedia.beta.wmflabs.org expired on 9/14/2020 https://phabricator.wikimedia.org/T262816 [11:38:47] zeljkof: I'm not sure who is in charge either [11:38:54] is there anything you need done? [11:39:05] probably Krenair knows best [11:39:08] extend the certificate? :D [11:39:14] I've never run acme-chief [11:39:37] (and ACME reminds me of Wile E. Coyote and The Roadrunner fwiw) [11:39:59] !log volans@cumin1001 START - Cookbook sre.ganeti.makevm [11:39:59] !log volans@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [11:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:25] hauskatze: speaking of acme, I've just stumbled upon this today :) https://www.acme.com/catalog/acme.html [11:41:02] lolol [11:41:02] !log volans@cumin1001 START - Cookbook sre.ganeti.makevm [11:41:03] !log volans@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [11:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:33] (03PS2) 10Muehlenhoff: Remove debmonitor1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/625841 (https://phabricator.wikimedia.org/T261489) [11:41:49] zeljkof: is there any docs? 
[11:41:56] re cert renewal [11:42:09] hauskatze: I don't know [11:42:17] (03CR) 10Filippo Giunchedi: [C: 03+2] statsd_exporter: stop tracking local statsd connections [puppet] - 10https://gerrit.wikimedia.org/r/627246 (https://phabricator.wikimedia.org/T261633) (owner: 10Filippo Giunchedi) [11:42:24] just noticed the problem, I have no clue how to fix it [11:42:28] (03PS3) 10MarcoAurelio: Follow-up 0ee0d8f: [frwiktionary] Create `conj` alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627212 (https://phabricator.wikimedia.org/T262298) [11:43:19] (03PS4) 10MarcoAurelio: Follow-up 0ee0d8f: [frwiktionary] Create `conj` alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627212 (https://phabricator.wikimedia.org/T262298) [11:43:23] hauskatze: okay, so you ready? [11:43:51] Urbanecm: "I was born ready" (joking) [11:44:04] yes I am [11:44:28] Okay, let's test it :) [11:44:32] (03CR) 10Muehlenhoff: [C: 03+2] Remove debmonitor1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/625841 (https://phabricator.wikimedia.org/T261489) (owner: 10Muehlenhoff) [11:44:52] (03CR) 10Urbanecm: [C: 03+2] Follow-up 0ee0d8f: [frwiktionary] Create `conj` alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627212 (https://phabricator.wikimedia.org/T262298) (owner: 10MarcoAurelio) [11:45:40] (03Merged) 10jenkins-bot: Follow-up 0ee0d8f: [frwiktionary] Create `conj` alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627212 (https://phabricator.wikimedia.org/T262298) (owner: 10MarcoAurelio) [11:45:56] !log volans@cumin1001 START - Cookbook sre.ganeti.makevm [11:45:56] !log volans@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [11:45:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:07] I hope they don't ask me to create an uppercase alias as well [11:46:11] hauskatze: pulled onto mwdebug2001, test and let me know [11:46:14] 
:D [11:46:18] checking [11:46:56] looks like it works [11:47:42] so, I should sync, I guess? [11:47:59] running further tests [11:48:02] please hold on [11:48:11] okay [11:48:46] !log volans@cumin1001 START - Cookbook sre.ganeti.makevm [11:48:47] !log volans@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [11:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:49] Urbanecm: so it looks like the only page in the Conjugaison NS is espéranto/gladi [11:49:58] conj:espéranto/gladi redirects to Conjugaison:espéranto/gladi [11:50:00] in mwdebug [11:50:06] so I guess... okay? [11:50:10] probably [11:50:20] I was looking at the phaste and Conjugaison:aimer does not exist [11:50:29] !log volans@cumin1001 START - Cookbook sre.ganeti.makevm [11:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:33] I was smh about that [11:50:40] easy way to test is to go to https://fr.wiktionary.org/wiki/conj:foo [11:50:45] it should redirect you to the canonical namespace [11:50:47] it wfm [11:50:51] yup that too [11:51:21] let's sync [11:51:30] and wait for Conj and CONJ requests later [11:51:35] yup [11:51:39] or cOnj [11:51:42] bet you 10 wiki-euros [11:52:08] syncing [11:53:01] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: fea8861db550746bfef496df2ef522dffc580a7d: Follow-up 0ee0d8f: [frwiktionary] Create `conj` alias (T262298) (duration: 00m 56s) [11:53:06] here you go hauskatze [11:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:10] T262298: New Namespace for the French Wiktionary : "Conjugaison" - https://phabricator.wikimedia.org/T262298 [11:53:26] Urbanecm: namespaceDupes needed just in case? [11:53:31] good point [11:53:48] 0 links to fix, 0 were resolvable, 0 were deleted. 
[11:53:52] we're good [11:55:14] awesome [11:55:29] Okay then so I'll close it again [11:56:15] cool [11:56:39] and off for lunch [11:56:50] thanks for the deploys [11:57:21] np [11:58:06] (03PS1) 10Filippo Giunchedi: pontoon: add thanos variables to o11y [puppet] - 10https://gerrit.wikimedia.org/r/627261 [11:58:08] (03PS1) 10Filippo Giunchedi: pontoon: assing thanos hosts [puppet] - 10https://gerrit.wikimedia.org/r/627262 [11:59:15] zeljkof: added a comment on-task [11:59:26] hauskatze: thanks, just saw it [11:59:32] ok [11:59:36] lunch, etc [12:01:18] (03CR) 10JMeybohm: [C: 03+1] wikifeeds: use the service proxy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/626132 (https://phabricator.wikimedia.org/T255878) (owner: 10Giuseppe Lavagetto) [12:01:36] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add thanos variables to o11y [puppet] - 10https://gerrit.wikimedia.org/r/627261 (owner: 10Filippo Giunchedi) [12:01:45] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: assing thanos hosts [puppet] - 10https://gerrit.wikimedia.org/r/627262 (owner: 10Filippo Giunchedi) [12:03:10] !log volans@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [12:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:54] !log T187984 migration script on otrs1001 now in step 31/44 [12:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:01] T187984: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 [12:08:15] !log volans@cumin1001 START - Cookbook sre.hosts.decommission [12:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:08] !log volans@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [12:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:17] (03PS1) 10Volans: sre.ganeti.makevm: fix IP allocation [cookbooks] - 10https://gerrit.wikimedia.org/r/627263 
(https://phabricator.wikimedia.org/T244153) [12:10:43] 10Operations, 10Patch-For-Review: Upgrade debmonitor to Buster - https://phabricator.wikimedia.org/T261489 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by volans@cumin1001 for hosts: `debmonitor2001.codfw.wmnet` - debmonitor2001.codfw.wmnet (**FAIL**) - Failed downtime host on Icinga (lik... [12:13:16] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: 168 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:13:27] (03CR) 10Volans: [C: 03+2] "Tested on real life." [cookbooks] - 10https://gerrit.wikimedia.org/r/627263 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [12:14:44] (03Merged) 10jenkins-bot: sre.ganeti.makevm: fix IP allocation [cookbooks] - 10https://gerrit.wikimedia.org/r/627263 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [12:16:33] (03PS1) 10JMeybohm: lvs: Remove mobileapps non-TLS endpoint from LVS [puppet] - 10https://gerrit.wikimedia.org/r/627265 (https://phabricator.wikimedia.org/T255876) [12:16:35] (03PS1) 10JMeybohm: lvs: Completely remove mobileapps-http service stanza [puppet] - 10https://gerrit.wikimedia.org/r/627266 (https://phabricator.wikimedia.org/T255876) [12:22:39] (03CR) 10Alexandros Kosiaris: [C: 03+1] lvs: Completely remove mobileapps-http service stanza [puppet] - 10https://gerrit.wikimedia.org/r/627266 (https://phabricator.wikimedia.org/T255876) (owner: 10JMeybohm) [12:23:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] lvs: Remove mobileapps non-TLS endpoint from LVS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/627265 (https://phabricator.wikimedia.org/T255876) (owner: 10JMeybohm) [12:24:53] !log rebooting sodium for kernel update [12:24:56] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:24:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:40] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:29:42] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), 10git-protocol-v2: Upgrade git fleet wide to git 2.20 - https://phabricator.wikimedia.org/T262244 (10hashar) 05Open→03Resolved a:03hashar Thank you @M... [12:30:45] !log rotate SNMP community on all the PDUs - T246890 [12:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:52] !log ayounsi@cumin1001 START - Cookbook sre.pdus.rotate-snmp [12:30:53] (03PS1) 10JMeybohm: lvs: Remove blubberoid non-TLS endpoint from LVS 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/627268 (https://phabricator.wikimedia.org/T236017) [12:30:55] (03PS1) 10JMeybohm: lvs: Remove blubberoid non-TLS endpoint from LVS 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/627269 (https://phabricator.wikimedia.org/T236017) [12:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:57] (03PS1) 10JMeybohm: lvs: Remove blubberoid non-TLS endpoint from LVS 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/627270 (https://phabricator.wikimedia.org/T236017) [12:32:07] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.pdus.rotate-snmp (exit_code=1) [12:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:38:20] (03PS2) 10JMeybohm: lvs: Remove mobileapps non-TLS endpoint from LVS 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/627265 (https://phabricator.wikimedia.org/T255876) [12:38:22] (03PS2) 10JMeybohm: lvs: Remove mobileapps non-TLS endpoint from LVS 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/627266 (https://phabricator.wikimedia.org/T255876) [12:38:24] (03PS1) 10JMeybohm: lvs: Remove mobileapps non-TLS endpoint from LVS 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/627271 (https://phabricator.wikimedia.org/T255876) [12:39:55] (03PS2) 10Muehlenhoff: Remove debmonitor1001/2001 [dns] - 10https://gerrit.wikimedia.org/r/625840 (https://phabricator.wikimedia.org/T261489) [12:40:21] (03CR) 10JMeybohm: lvs: Remove mobileapps non-TLS endpoint from LVS 2/3 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/627265 (https://phabricator.wikimedia.org/T255876) (owner: 10JMeybohm) [12:41:43] 10Operations, 10MW-on-K8s, 10TechCom-RFC, 10serviceops, 10Patch-For-Review: RFC: PHP microservice for containerized shell execution - https://phabricator.wikimedia.org/T260330 (10tstarling) An open question is what to do about shell pipelines. Currently if you do `Shell::command('foo|bar')` then foo will... 
[12:42:56] (03PS1) 10Jbond: pdu.rotate-snmp disable urllib warnings and add ability to coninue on errors [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 [12:44:13] (03CR) 10Muehlenhoff: [C: 03+2] Remove debmonitor1001/2001 [dns] - 10https://gerrit.wikimedia.org/r/625840 (https://phabricator.wikimedia.org/T261489) (owner: 10Muehlenhoff) [12:45:52] (03PS1) 10Muehlenhoff: debmonitor: Remove now obsolete compat code for stretch [puppet] - 10https://gerrit.wikimedia.org/r/627273 [12:47:04] 10Operations, 10Patch-For-Review: Upgrade debmonitor to Buster - https://phabricator.wikimedia.org/T261489 (10MoritzMuehlenhoff) 05Open→03Resolved The old stretch instances (debmonitor1001/2001) have been removed. [12:47:06] (03CR) 10Volans: "some comment inline" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (owner: 10Jbond) [12:47:33] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/627273 (owner: 10Muehlenhoff) [12:47:49] (03CR) 10Muehlenhoff: [C: 03+2] debmonitor: Remove now obsolete compat code for stretch [puppet] - 10https://gerrit.wikimedia.org/r/627273 (owner: 10Muehlenhoff) [12:49:32] !log replacing pdu's in racks d4 and d5 eqiad [12:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:49] !log ladsgroup@mwmaint2001:~$ mwscript extensions/Wikibase/repo/maintenance/changePropertyDataType.php --wiki=wikidatawiki --property-id P1438 --new-data-type external-id (T262198) [12:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:55] T262198: Convert P1438 from string to external ID - https://phabricator.wikimedia.org/T262198 [12:51:25] !log correction it's replacing the pdu's in racks d5 and d6 [12:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:53] (03PS2) 10Muehlenhoff: yarn: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627200 [12:57:04] (03PS1) 10Elukey: 
role::kafka::jumbo::broker: allow logstash and centrallog (ipv6) addresses [puppet] - 10https://gerrit.wikimedia.org/r/627274 (https://phabricator.wikimedia.org/T204957) [12:59:33] (03PS2) 10JMeybohm: Remove etcd100[123] hosts [dns] - 10https://gerrit.wikimedia.org/r/626337 (https://phabricator.wikimedia.org/T239835) [13:00:24] PROBLEM - Host cloudcephosd1012.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [13:03:00] (03CR) 10Muehlenhoff: [C: 03+2] yarn: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627200 (owner: 10Muehlenhoff) [13:03:10] PROBLEM - Juniper alarms on asw2-d-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:06:21] RECOVERY - Host cloudcephosd1012.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [13:07:04] RECOVERY - Juniper alarms on asw2-d-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:10:11] !log volans@cumin1001 START - Cookbook sre.dns.netbox [13:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:07] (03PS1) 10Majavah: Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627278 (https://phabricator.wikimedia.org/T262812) [13:16:09] (03PS2) 10Majavah: Add arbcom-ru.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/627256 (https://phabricator.wikimedia.org/T262812) [13:17:54] (03CR) 10Majavah: "This is missing the logo, but I believe everything else needed is present." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/627278 (https://phabricator.wikimedia.org/T262812) (owner: 10Majavah) [13:19:31] (03CR) 10Elukey: "> Patch Set 7:" [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: 10Elukey) [13:22:54] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10HTTPS: Beta cluster certificates have expired - https://phabricator.wikimedia.org/T262806 (10Aklapper) Same as {T262816}? [13:23:06] (03CR) 10Muehlenhoff: [C: 03+1] "I think applying the cherrypicked patches via quilt/source format 3 is actually the easiest and cleanest solution? Simply apply those as d" [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: 10Elukey) [13:24:28] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10HTTPS: Beta cluster certificates have expired - https://phabricator.wikimedia.org/T262806 (10Zoranzoki21) >>! In T262806#6458765, @Aklapper wrote: > Same as {T262816}? Looks so, I'm closing this one as duplicate. 
[13:24:49] 10Operations, 10Beta-Cluster-Infrastructure, 10Traffic, 10HTTPS: Beta cluster certificates have expired - https://phabricator.wikimedia.org/T262806 (10Zoranzoki21) [13:27:31] (03PS1) 10Giuseppe Lavagetto: varnish: raise the zhwiki-restbase rate-limit by 5x [puppet] - 10https://gerrit.wikimedia.org/r/627279 (https://phabricator.wikimedia.org/T262691) [13:27:34] (03CR) 10Elukey: "> Patch Set 7:" [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: 10Elukey) [13:28:23] (03CR) 10CDanis: [C: 03+1] varnish: raise the zhwiki-restbase rate-limit by 5x [puppet] - 10https://gerrit.wikimedia.org/r/627279 (https://phabricator.wikimedia.org/T262691) (owner: 10Giuseppe Lavagetto) [13:29:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] varnish: raise the zhwiki-restbase rate-limit by 5x [puppet] - 10https://gerrit.wikimedia.org/r/627279 (https://phabricator.wikimedia.org/T262691) (owner: 10Giuseppe Lavagetto) [13:32:24] !log installing websockify stretch updates [13:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:34] 10Operations, 10ops-codfw, 10netops: (Need by: ) codfw:rack/setup/new management switches - https://phabricator.wikimedia.org/T253154 (10Papaul) [13:32:44] (03PS2) 10Muehlenhoff: icinga: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627201 [13:34:10] (03PS3) 10Muehlenhoff: icinga: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627201 [13:34:15] (03PS1) 10Klausman: Icinga: Authorize myself on assorted Icinga bits [puppet] - 10https://gerrit.wikimedia.org/r/627282 [13:34:18] (03CR) 10Muehlenhoff: "Amended to drop cas-icinga.w.o" [puppet] - 10https://gerrit.wikimedia.org/r/627201 (owner: 10Muehlenhoff) [13:36:33] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/627282 (owner: 10Klausman) [13:36:36] 10Operations, 10Platform 
Engineering, 10Product-Infrastructure-Team-Backlog, 10RESTBase, and 5 others: High numbers of HTTP 429 errors - https://phabricator.wikimedia.org/T262691 (10Mholloway) I believe the search issue is separate from the zhwiki rate limiting. [13:36:49] (03CR) 10Klausman: [C: 03+2] Icinga: Authorize myself on assorted Icinga bits [puppet] - 10https://gerrit.wikimedia.org/r/627282 (owner: 10Klausman) [13:36:58] Thanks, Moritz [13:40:47] yw, note that the Icinga auth mechanism is a bit of a mess, the authentication at LDAP/IDP is case-insensitive, while the CGIs expect the exact casing in the conffile, so make sure to log it as "Klausman", otherwise the downtime options will remain greyed out [13:41:12] Roger [13:41:20] (casefolding yaaaaay) [13:42:09] !log volans@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:13] !log installing dbus security updates on stretch [13:42:16] tbf tho, the new ext4 casefolding is a) very very well thought-out, and b) let me bin six million crufty scripts I had written for my gaming group's setup [13:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:25] moritzm: the netbox alias that is including netbox-dev breaks stuff can I split it? [13:43:43] (03CR) 10JMeybohm: [C: 04-1] "I do see diffs for eqiad and codfw and I'm missing some deletes." (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/627250 (owner: 10Hnowlan) [13:43:49] volans: ofc! [13:44:15] let's simply split to netbox-tes and then we can do netbox-all which combines both? [13:45:37] 10Operations, 10ops-eqiad, 10DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (10faidon) >>! In T250053#6455534, @wiki_willy wrote: > In general, I haven't been a big fan of how the Netbox errors are reported. An onsite engineer could install a bunch of new hardware... 
[13:45:58] (03PS1) 10Volans: cumin aliases: split Netbox and add a catchall [puppet] - 10https://gerrit.wikimedia.org/r/627284 [13:46:00] moritzm: lol, did you read my mind? :D [13:46:35] as you want for the name test/canary/dev/standalone, no prefeernce here [13:48:21] (03PS2) 10Elukey: role::kafka::jumbo::broker: allow logstash and centrallog (ipv6) addresses [puppet] - 10https://gerrit.wikimedia.org/r/627274 (https://phabricator.wikimedia.org/T204957) [13:48:24] hehe :-) test is fine I think [13:48:46] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/627284 (owner: 10Volans) [13:48:54] or canary, +1d [13:49:06] thx! [13:49:18] (03CR) 10Volans: [C: 03+2] cumin aliases: split Netbox and add a catchall [puppet] - 10https://gerrit.wikimedia.org/r/627284 (owner: 10Volans) [13:49:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/627274 (https://phabricator.wikimedia.org/T204957) (owner: 10Elukey) [13:50:06] (03CR) 10Elukey: [C: 03+2] role::kafka::jumbo::broker: allow logstash and centrallog (ipv6) addresses [puppet] - 10https://gerrit.wikimedia.org/r/627274 (https://phabricator.wikimedia.org/T204957) (owner: 10Elukey) [13:50:27] !log volans@cumin1001 START - Cookbook sre.dns.netbox [13:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:12] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:15] (03PS2) 10Muehlenhoff: hue: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627205 [13:58:47] cccccclkivgggichlhudtdgtjjjtufuvfiedkruegtcl [13:59:10] (03CR) 10Mholloway: [C: 03+1] wikifeeds: use the service proxy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/626132 (https://phabricator.wikimedia.org/T255878) (owner: 10Giuseppe Lavagetto) [14:01:21] (03CR) 10Mholloway: [C: 03+1] "ACK" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/626270 (owner: 10Giuseppe Lavagetto) [14:01:24] (03CR) 10Milimetric: [V: 03+2 C: 03+2] eventstreams - bump to image version 2020-09-09-201733-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/626223 (https://phabricator.wikimedia.org/T261556) (owner: 10Ottomata) [14:02:02] marostegui: as long as you've got channel ops, do you mind updating the topic to say "Clinic Duty: rzl"? [14:04:24] (03PS1) 10Volans: sre.dns.netbox: add --force argument [cookbooks] - 10https://gerrit.wikimedia.org/r/627287 [14:05:07] rzl: sure [14:05:21] rzl: congrats [14:05:35] thanks 🙃 [14:06:04] (03PS2) 10Volans: sre.dns.netbox: add --force argument [cookbooks] - 10https://gerrit.wikimedia.org/r/627287 [14:07:38] (03CR) 10Volans: [C: 03+2] "SElf-merging to unblock migration, please review it anyway and I'll address any comment in a follow up patch." [cookbooks] - 10https://gerrit.wikimedia.org/r/627287 (owner: 10Volans) [14:08:08] (03CR) 10Muehlenhoff: [C: 03+2] hue: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627205 (owner: 10Muehlenhoff) [14:08:58] (03Merged) 10jenkins-bot: sre.dns.netbox: add --force argument [cookbooks] - 10https://gerrit.wikimedia.org/r/627287 (owner: 10Volans) [14:09:26] !log milimetric@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [14:09:26] !log milimetric@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [14:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:56] !log volans@cumin1001 START - Cookbook sre.dns.netbox [14:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:00] Ahh Urbanecm sorry I missed your question before! 
:o [14:12:10] yes indeed we can sync file without the i18n [14:12:17] that can come later as part of whatever deploy [14:12:25] i guess i'll schedule my change for today's backport again [14:12:25] ottomata: tbh I already forgot what question it was :D [14:12:52] (03PS1) 10Ottomata: Revert "Revert "Default to using API json formatversion=2"" [extensions/EventStreamConfig] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/627215 [14:12:53] sure, let's try it again :D [14:13:19] (03CR) 10Ottomata: "The i18n change does not need a full deploy; it can go out later as part of the regular chain." [extensions/EventStreamConfig] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/627215 (owner: 10Ottomata) [14:14:42] !log milimetric@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [14:14:42] !log milimetric@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [14:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:57] !log volans@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:01] (03PS1) 10Esanders: DiscussionTool: Fix task comments for second round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627288 [14:18:03] (03PS1) 10Esanders: Enable DiscussionTools beta on jawiki & viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627289 (https://phabricator.wikimedia.org/T261654) [14:20:41] PROBLEM - DPKG on elastic2040 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:22:14] ^ elastic2040 should recover soonish [14:22:23] PROBLEM - ps1-d8-codfw-infeed-load-tower-A-phase-Y on ps1-d8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:22:24] PROBLEM - Host ps1-d5-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:22:49] PROBLEM - ps1-d8-codfw-infeed-load-tower-B-phase-X on ps1-d8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:22:55] PROBLEM - Host ps1-d6-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [14:23:15] PROBLEM - ps1-d8-codfw-infeed-load-tower-B-phase-Y on ps1-d8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:24:02] !log milimetric@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [14:24:02] !log milimetric@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [14:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:58] 10Operations, 10Goal, 10Patch-For-Review: FY2020-2021 Q1 DC switchover and switchback - https://phabricator.wikimedia.org/T243314 (10Trizek-WMF) [14:25:04] (03PS4) 10Cicalese: Allow public access to API Portal main page for private launch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626229 (https://phabricator.wikimedia.org/T262480) [14:25:12] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) 05Open→03Resolved Let's warp-up and close: **Planning**: the planning set, based on previous switchover, is now stable.... 
[14:25:17] PROBLEM - ps1-d8-codfw-infeed-load-tower-B-phase-Z on ps1-d8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:26:12] (03PS1) 10Volans: sre.dns.netbox: fix --force argument [cookbooks] - 10https://gerrit.wikimedia.org/r/627290 [14:26:13] PROBLEM - Juniper alarms on asw2-d-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:26:56] juniper alarm addressed [14:27:03] (03CR) 10jerkins-bot: [V: 04-1] sre.dns.netbox: fix --force argument [cookbooks] - 10https://gerrit.wikimedia.org/r/627290 (owner: 10Volans) [14:27:27] RECOVERY - Juniper alarms on asw2-d-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:28:24] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Move mobileapps to use TLS only - https://phabricator.wikimedia.org/T255876 (10JMeybohm) a:03JMeybohm [14:28:48] 10Operations, 10serviceops, 10Kubernetes, 10Patch-For-Review, 10Release Pipeline (Blubber): Move blubberoid to use TLS only. 
- https://phabricator.wikimedia.org/T236017 (10JMeybohm) a:05Joe→03JMeybohm [14:30:19] (03PS8) 10Elukey: Add basic Debian packaging [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) [14:30:21] PROBLEM - DPKG on mw1381 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:30:59] PROBLEM - DPKG on elastic2057 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:31:41] PROBLEM - DPKG on ms-be1059 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:32:51] (03PS9) 10Elukey: Add basic Debian packaging [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) [14:33:44] (03PS1) 10Ema: varnish: remove zhwiki-restbase special case [puppet] - 10https://gerrit.wikimedia.org/r/627291 (https://phabricator.wikimedia.org/T262691) [14:33:57] PROBLEM - DPKG on elastic2047 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:36:15] (03CR) 10BBlack: [C: 03+1] varnish: remove zhwiki-restbase special case [puppet] - 10https://gerrit.wikimedia.org/r/627291 (https://phabricator.wikimedia.org/T262691) (owner: 10Ema) [14:37:27] RECOVERY - DPKG on ms-be1059 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:37:27] RECOVERY - DPKG on elastic2057 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:37:27] RECOVERY - DPKG on elastic2047 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:37:27] RECOVERY - DPKG on elastic2040 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:37:37] (03CR) 10Ema: [C: 03+2] varnish: remove zhwiki-restbase special case [puppet] - 10https://gerrit.wikimedia.org/r/627291 (https://phabricator.wikimedia.org/T262691) (owner: 10Ema) [14:38:33] 
RECOVERY - DPKG on mw1381 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:39:07] (03CR) 10Herron: [C: 03+2] hieradata: enable rsyslog queues in ulsfo/eqsin [puppet] - 10https://gerrit.wikimedia.org/r/627232 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [14:39:27] (03CR) 10Herron: [C: 03+1] hieradata: enable rsyslog queues in ulsfo/eqsin [puppet] - 10https://gerrit.wikimedia.org/r/627232 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [14:40:55] PROBLEM - DPKG on elastic2059 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:41:54] (03CR) 10Herron: [C: 03+2] nagios-nrpe-server systemd unit: use /run for PID files [puppet] - 10https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) (owner: 10Southparkfan) [14:42:04] (03PS7) 10Herron: nagios-nrpe-server systemd unit: use /run for PID files [puppet] - 10https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) (owner: 10Southparkfan) [14:42:09] PROBLEM - DPKG on db2127 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:44:16] (03PS2) 10Muehlenhoff: thanos: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627204 [14:45:53] RECOVERY - DPKG on db2127 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:45:54] RECOVERY - DPKG on elastic2059 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:46:10] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/624328 (owner: 10Dzahn) [14:49:34] (03PS1) 10Ema: varnish: make Accept-Language lowercase [puppet] - 10https://gerrit.wikimedia.org/r/627295 [14:50:35] PROBLEM - DPKG on acrab is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:50:43] 
PROBLEM - ps1-d8-codfw-infeed-load-tower-A-phase-Z on ps1-d8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:50:53] PROBLEM - DPKG on elastic1046 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:52:04] herron: ain't moderator (not admin) ML passwords not working? [14:52:07] PROBLEM - ps1-d8-codfw-infeed-load-tower-A-phase-X on ps1-d8-codfw is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:52:41] RECOVERY - DPKG on elastic1046 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:52:41] RECOVERY - DPKG on acrab is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:53:15] PROBLEM - DPKG on db2112 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [14:55:14] !log ferm rules added to kafka-jumbo1009, 1006 and 1008 up to now [14:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:59] I don't see any drop registered from ferm [14:56:09] (wrong chan but ok in here too :) [14:56:12] (03CR) 10Muehlenhoff: [C: 03+2] thanos: Also enforce access setting in the IDP service definition [puppet] - 10https://gerrit.wikimedia.org/r/627204 (owner: 10Muehlenhoff) [14:56:30] (03CR) 10CDanis: [C: 03+1] codfw-prod: add ms-be2057 at object weight 100 [software/swift-ring] - 10https://gerrit.wikimedia.org/r/625604 (https://phabricator.wikimedia.org/T261633) (owner: 10Filippo Giunchedi) [14:58:47] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by 
NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f4b26b824e0: Failed to establish a new connection: [Errno 111] Connection [14:58:47] ://wikitech.wikimedia.org/wiki/Search%23Administration [14:58:53] (03PS1) 10JMeybohm: lvs: Rename termbox-https to termbox [puppet] - 10https://gerrit.wikimedia.org/r/627297 (https://phabricator.wikimedia.org/T254581) [14:58:57] (03PS1) 10JMeybohm: lvs: Remove termbox non-TLS endpoint from LVS 1/3 [puppet] - 10https://gerrit.wikimedia.org/r/627298 (https://phabricator.wikimedia.org/T254581) [14:58:59] (03PS1) 10JMeybohm: lvs: Remove termbox non-TLS endpoint from LVS 2/3 [puppet] - 10https://gerrit.wikimedia.org/r/627299 (https://phabricator.wikimedia.org/T254581) [14:59:01] (03PS1) 10JMeybohm: lvs: Remove blubberoid non-TLS endpoint from LVS 3/3 [puppet] - 10https://gerrit.wikimedia.org/r/627300 (https://phabricator.wikimedia.org/T254581) [14:59:31] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:00:57] RECOVERY - DPKG on db2112 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:02:17] PROBLEM - DPKG on aqs1007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:02:17] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:41] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [15:02:53] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, timed_out: False, number_of_pending_tasks: 0, initializing_shards: 0, active_shards: 976, active_shards_percent_as_number: 100.0, number_of_nodes: 6, relocating_shards: 0, unassigned_shards: 0, status: green, delayed_unass [15:02:53] number_of_in_flight_fetch: 0, active_primary_shards: 543, number_of_data_nodes: 3 https://wikitech.wikimedia.org/wiki/Search%23Administration [15:04:11] I'm hoping to do a quick config deploy now. Any objections? 
[15:05:18] (03CR) 10Cicalese: [C: 03+2] Allow public access to API Portal main page for private launch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626229 (https://phabricator.wikimedia.org/T262480) (owner: 10Cicalese) [15:06:09] (03Merged) 10jenkins-bot: Allow public access to API Portal main page for private launch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626229 (https://phabricator.wikimedia.org/T262480) (owner: 10Cicalese) [15:06:45] RECOVERY - DPKG on aqs1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:06:56] (03PS2) 10Volans: sre.dns.netbox: fix --force argument [cookbooks] - 10https://gerrit.wikimedia.org/r/627290 [15:08:57] (03CR) 10Volans: [C: 03+2] "Ditto for addressing comment afterwards" [cookbooks] - 10https://gerrit.wikimedia.org/r/627290 (owner: 10Volans) [15:09:56] (03Merged) 10jenkins-bot: sre.dns.netbox: fix --force argument [cookbooks] - 10https://gerrit.wikimedia.org/r/627290 (owner: 10Volans) [15:11:27] !log completed pdu swap in eqiad racks d5/d6 [15:11:30] !log volans@cumin1001 START - Cookbook sre.dns.netbox [15:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:05] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:17:08] !log cicalese@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:626229|Allow public access to API Portal main page for private launch]] (duration: 00m 57s) [15:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:18] Deployment finished - thanks! [15:19:40] (03PS1) 10Volans: sre.dns.netbox: improve help message [cookbooks] - 10https://gerrit.wikimedia.org/r/627318 [15:22:38] 10Operations, 10DBA, 10observability: Prometheus/MariaDB counts a 'SELECT ... 
FOR UPDATE' query as an UPDATE query - https://phabricator.wikimedia.org/T262579 (10lmata) Is there any specific action you'd like us to take regarding the exporter? [15:23:24] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Enable CAS authentication for Grafana - https://phabricator.wikimedia.org/T262512 (10fgiunchedi) [15:23:51] !log enable stricter ferm rules on kafka-jumbo1007 and kafka-jumbo1005 [15:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:53] (03CR) 10Volans: [C: 03+2] "trivial improvement" [cookbooks] - 10https://gerrit.wikimedia.org/r/627318 (owner: 10Volans) [15:27:07] 10Operations, 10DBA, 10observability: Prometheus/MariaDB counts a 'SELECT ... FOR UPDATE' query as an UPDATE query - https://phabricator.wikimedia.org/T262579 (10jcrespo) I think the title no longer reflects reality- I am not sure anymore if there was an issue, but even if it was, I don't believe it is on th... [15:27:10] (03CR) 10Hashar: "> I think applying the cherrypicked patches via quilt/source format 3 is actually the easiest and cleanest solution? 
Simply apply those as" [debs/hue] - 10https://gerrit.wikimedia.org/r/618728 (https://phabricator.wikimedia.org/T233073) (owner: 10Elukey) [15:27:38] (03PS1) 10RobH: ps1-c[23]-eqiad update [puppet] - 10https://gerrit.wikimedia.org/r/627319 (https://phabricator.wikimedia.org/T261455) [15:27:58] (03Merged) 10jenkins-bot: sre.dns.netbox: improve help message [cookbooks] - 10https://gerrit.wikimedia.org/r/627318 (owner: 10Volans) [15:28:33] (03PS1) 10Hnowlan: api-gateway: disable connection reuse [deployment-charts] - 10https://gerrit.wikimedia.org/r/627322 (https://phabricator.wikimedia.org/T262490) [15:28:33] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [15:28:42] (03CR) 10RobH: [C: 03+2] ps1-c[23]-eqiad update [puppet] - 10https://gerrit.wikimedia.org/r/627319 (https://phabricator.wikimedia.org/T261455) (owner: 10RobH) [15:29:47] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops, 10Patch-For-Review: Mon, Sept 14 PDU Upgrade 12pm-4pm UTC- Racks C2 and C3 - https://phabricator.wikimedia.org/T261455 (10RobH) [15:35:32] hi all! [15:35:40] !log installing gnutls28 security updates on stretch [15:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:46] if i directly modify code on mwdebug2002, will I get shot? [15:36:50] or is it ok to just run git apply there, to test a patch? [15:37:27] if nobody says STOP, I'll do it :) [15:39:25] oh, right, no git repo. So it would be good old vim... [15:39:32] (03CR) 10Ppchelko: api-gateway: disable connection reuse (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/627322 (https://phabricator.wikimedia.org/T262490) (owner: 10Hnowlan) [15:39:41] ...and then scap pull to clean up? 
[15:40:49] 10Operations, 10CirrusSearch, 10Elasticsearch, 10Discovery-Search (Current work): Search is currently too busy - https://phabricator.wikimedia.org/T262694 (10CBogen) [15:41:06] 10Operations, 10CirrusSearch, 10Elasticsearch, 10Discovery-Search (Current work): Search is currently too busy - https://phabricator.wikimedia.org/T262694 (10RKemper) Moving forward we've decided to add the above proposed alert before closing this ticket. [15:41:21] (03PS1) 10Hashar: zuul: in spec, use compile.with_all_deps [puppet] - 10https://gerrit.wikimedia.org/r/627325 [15:41:45] <_joe_> duesen: yes that's ok [15:41:55] <_joe_> but don't tell this to anyone [15:41:59] _joe_: excellent, thank you [15:42:00] <_joe_> jouncebot: next [15:42:00] In 1 hour(s) and 17 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200914T1700) [15:44:31] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Use `wmgWikibaseClientItemAndPropertySourceName` instead of `wmgWikibaseClientLocalEntitySourceName` in Wikibase.php (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622993 (https://phabricator.wikimedia.org/T258060) (owner: 10Itamar Givon) [15:44:43] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The assumption that talking to LVS removes the advantage of using persistent connections is erroneous: we use LVS-DR so the connection, on" [deployment-charts] - 10https://gerrit.wikimedia.org/r/627322 (https://phabricator.wikimedia.org/T262490) (owner: 10Hnowlan) [15:45:17] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Remove `wmgWikibaseClientLocalEntitySourceName` from InitialiseSettings.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622994 (https://phabricator.wikimedia.org/T258060) (owner: 10Itamar Givon) [15:45:23] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add `wmgWikibaseClientItemAndPropertySourceName` to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622612 
(https://phabricator.wikimedia.org/T258060) (owner: 10Itamar Givon) [15:45:38] !log restarting apache/FPM on mw2271/m2272 (codfw canaries) to pick up GNU TLS update [15:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:54] (03CR) 10Hnowlan: api-gateway: disable connection reuse (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/627322 (https://phabricator.wikimedia.org/T262490) (owner: 10Hnowlan) [15:47:09] (03PS1) 10RobH: Revert "ps1-c[23]-eqiad update" [puppet] - 10https://gerrit.wikimedia.org/r/627216 [15:51:40] (03CR) 10Dzahn: [C: 03+2] add new parsoid servers to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/626721 (https://phabricator.wikimedia.org/T247441) (owner: 10Dzahn) [15:52:03] jouncebot: next [15:52:03] In 1 hour(s) and 7 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200914T1700) [15:52:35] (03PS1) 10Muehlenhoff: Extend Cumin alias for mw-canary to also include the canary API servers [puppet] - 10https://gerrit.wikimedia.org/r/627327 [15:53:29] 10Operations, 10Platform Engineering, 10Product-Infrastructure-Team-Backlog, 10RESTBase, and 4 others: High numbers of HTTP 429 errors - https://phabricator.wikimedia.org/T262691 (10ema) p:05Unbreak!→03High Users should not get 429s anymore, lowering priority to high while waiting for confirmation. For... 
[15:53:36] (03PS1) 10RLazarus: Revert "trafficserver: Cache-ban pages with localhost links from page content service" [puppet] - 10https://gerrit.wikimedia.org/r/627328 (https://phabricator.wikimedia.org/T262437) [15:54:16] !log restarting apache on webperf* to pick up GNU TLS security update [15:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:32] (03CR) 10CDanis: [C: 03+1] Revert "trafficserver: Cache-ban pages with localhost links from page content service" [puppet] - 10https://gerrit.wikimedia.org/r/627328 (https://phabricator.wikimedia.org/T262437) (owner: 10RLazarus) [15:56:51] (03CR) 10Effie Mouzeli: "Should we then add another alias for appserver only canaries?" [puppet] - 10https://gerrit.wikimedia.org/r/627327 (owner: 10Muehlenhoff) [15:58:15] (03CR) 10Ema: [C: 03+1] Revert "trafficserver: Cache-ban pages with localhost links from page content service" [puppet] - 10https://gerrit.wikimedia.org/r/627328 (https://phabricator.wikimedia.org/T262437) (owner: 10RLazarus) [15:58:31] !log dzahn@cumin1001 conftool action : set/weight=10; selector: dc=codfw,cluster=parsoid,name=parse20[1-2][0-9].codfw.wmnet [15:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:58:41] (03CR) 10RobH: [C: 03+2] Revert "ps1-c[23]-eqiad update" [puppet] - 10https://gerrit.wikimedia.org/r/627216 (owner: 10RobH) [15:58:58] 10Operations, 10CirrusSearch, 10Elasticsearch, 10Discovery-Search (Current work): Search is currently too busy - https://phabricator.wikimedia.org/T262694 (10CBogen) a:03RKemper [15:59:11] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=parsoid,name=parse2001.codfw.wmnet [15:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:40] !log dzahn@cumin1001 conftool action : set/weight=10; selector: dc=codfw,cluster=parsoid,name=parse20[0-2][0-9].codfw.wmnet [16:01:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:27] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=parsoid,name=parse200[0-9].codfw.wmnet [16:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:30] !log completed the rollout of restrictive kafka ferm rules on the Kafka jumbo cluster [16:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:41] (03PS1) 10RobH: ps1-d[56] update [puppet] - 10https://gerrit.wikimedia.org/r/627330 (https://phabricator.wikimedia.org/T261453) [16:06:12] (03PS2) 10RobH: ps1-d[56] update [puppet] - 10https://gerrit.wikimedia.org/r/627330 (https://phabricator.wikimedia.org/T261453) [16:06:15] (03PS2) 10Ema: varnish: make Accept-Language lowercase [puppet] - 10https://gerrit.wikimedia.org/r/627295 (https://phabricator.wikimedia.org/T262428) [16:06:28] (03CR) 10JMeybohm: [C: 03+1] add dns-disc for releases servers [dns] - 10https://gerrit.wikimedia.org/r/623465 (owner: 10Dzahn) [16:06:53] (03CR) 10RobH: [C: 03+2] ps1-d[56] update [puppet] - 10https://gerrit.wikimedia.org/r/627330 (https://phabricator.wikimedia.org/T261453) (owner: 10RobH) [16:07:20] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=parsoid,name=parse20[1-2][0-9].codfw.wmnet [16:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:02] RECOVERY - mediawiki-installation DSH group on parse2002 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:08:02] RECOVERY - mediawiki-installation DSH group on parse2006 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:09:19] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: enable rsyslog queues in ulsfo/eqsin [puppet] - 10https://gerrit.wikimedia.org/r/627232 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [16:10:04] (03CR) 10RLazarus: [C: 03+2] Revert "trafficserver: Cache-ban pages 
with localhost links from page content service" [puppet] - 10https://gerrit.wikimedia.org/r/627328 (https://phabricator.wikimedia.org/T262437) (owner: 10RLazarus) [16:11:16] RECOVERY - mediawiki-installation DSH group on parse2007 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:12:36] ^ me, adding new parse servers that are not yet pooled (but being added to pybal config) [16:13:32] RECOVERY - mediawiki-installation DSH group on parse2013 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:13:37] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (10RobH) a:05Jclark-ctr→03Cmjohnson It appears all the steps by onsites were done, but its unclear. If there are any pending steps for these, please do so and t... [16:14:08] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: New Date - Tue, Sept 15: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (10RobH) [16:16:24] RECOVERY - Host ps1-d6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.36 ms [16:16:36] RECOVERY - Host ps1-d5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.84 ms [16:17:26] PROBLEM - ps1-d6-eqiad-infeed-load-tower-A-phase-X on ps1-d6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:17:27] PROBLEM - ps1-d5-eqiad-infeed-load-tower-B-phase-Z on ps1-d5-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:17:27] PROBLEM - ps1-d5-eqiad-infeed-load-tower-A-phase-Y on ps1-d5-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:17:30] PROBLEM - ps1-d6-eqiad-infeed-load-tower-A-phase-Z on ps1-d6-eqiad is CRITICAL: 
CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:17:30] PROBLEM - ps1-d6-eqiad-infeed-load-tower-A-phase-Y on ps1-d6-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:18:50] RECOVERY - mediawiki-installation DSH group on parse2008 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:20:44] RECOVERY - mediawiki-installation DSH group on parse2004 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:21:30] RECOVERY - ps1-d6-eqiad-infeed-load-tower-A-phase-X on ps1-d6-eqiad is OK: SNMP OK - ps1-d6-eqiad-infeed-load-tower-A-phase-X 298 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:21:30] RECOVERY - ps1-d5-eqiad-infeed-load-tower-B-phase-Z on ps1-d5-eqiad is OK: SNMP OK - ps1-d5-eqiad-infeed-load-tower-B-phase-Z 283 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:21:30] RECOVERY - ps1-d5-eqiad-infeed-load-tower-A-phase-Y on ps1-d5-eqiad is OK: SNMP OK - ps1-d5-eqiad-infeed-load-tower-A-phase-Y 251 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:21:38] RECOVERY - ps1-d6-eqiad-infeed-load-tower-A-phase-Z on ps1-d6-eqiad is OK: SNMP OK - ps1-d6-eqiad-infeed-load-tower-A-phase-Z 286 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:21:38] RECOVERY - ps1-d6-eqiad-infeed-load-tower-A-phase-Y on ps1-d6-eqiad is OK: SNMP OK - ps1-d6-eqiad-infeed-load-tower-A-phase-Y 255 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:22:22] RECOVERY - mediawiki-installation DSH group on parse2001 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:22:22] RECOVERY - mediawiki-installation DSH group on parse2003 is OK: OK 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:22:22] RECOVERY - mediawiki-installation DSH group on parse2005 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:22:22] RECOVERY - mediawiki-installation DSH group on parse2009 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:22:22] RECOVERY - mediawiki-installation DSH group on parse2010 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:22:22] RECOVERY - mediawiki-installation DSH group on parse2011 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:22:22] RECOVERY - mediawiki-installation DSH group on parse2012 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:22:23] RECOVERY - mediawiki-installation DSH group on parse2014 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:22:23] RECOVERY - mediawiki-installation DSH group on parse2015 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:22:24] RECOVERY - mediawiki-installation DSH group on parse2016 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:22:24] RECOVERY - mediawiki-installation DSH group on parse2018 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:22:25] RECOVERY - mediawiki-installation DSH group on parse2017 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:22:25] RECOVERY - mediawiki-installation DSH group on parse2019 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:22:26] RECOVERY - mediawiki-installation DSH group on parse2020 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:23:14] those ps1 -d[56] are expected [16:23:36] and are now online, no further icinga errors for pdus are expected for d[56] [16:23:40] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on 
elastic2029 is CRITICAL: 113.9 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [16:23:42] thanks robh, ACK [16:24:52] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: New Date - Tue, Sept 15: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (10RobH) [16:25:05] (03PS1) 10Jcrespo: mariadb: Add arbcom_ruwiki to the list of private wikis [puppet] - 10https://gerrit.wikimedia.org/r/627331 (https://phabricator.wikimedia.org/T262832) [16:25:58] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: New Date - Tue, Sept 15: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (10RobH) [16:26:12] (03CR) 10Jcrespo: "I figured I could help you a bit by creating the patch myself, but will wait for your ok to +2 and for you to restart the servers (unless " [puppet] - 10https://gerrit.wikimedia.org/r/627331 (https://phabricator.wikimedia.org/T262832) (owner: 10Jcrespo) [16:29:17] (03CR) 10Dzahn: [C: 03+2] Add arbcom-ru.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/627256 (https://phabricator.wikimedia.org/T262812) (owner: 10Majavah) [16:31:27] (03PS1) 10Bartosz Dziewoński: flaggedrevs: Move setting of wgFlaggedRevsAutopromote and wgFlaggedRevsAutoconfirm out of wgExtensionFunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627332 (https://phabricator.wikimedia.org/T237191) [16:32:40] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: New Date - Tue, Sept 15: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (10RobH) Please note that ps1-d6-eqiad does not see ps2-d6-eqiad, I suspect it is not linked correctly via cable. The new netbox entries for these two... 
[16:33:40] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=parsoid,name=parse2001.codfw.wmnet [16:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:48] (03PS2) 10Bartosz Dziewoński: flaggedrevs: Move setting of wgFlaggedRevsAutopromote and wgFlaggedRevsAutoconfirm out of wgExtensionFunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627332 (https://phabricator.wikimedia.org/T237191) [16:36:05] !log pooled the first of the new parsoid servers - parse2001 (T247441) [16:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:11] T247441: move 20 new codfw parsoid servers (parse2*) into production - https://phabricator.wikimedia.org/T247441 [16:40:47] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) hardware diagnostics error below {F32350724} [16:44:32] (03PS2) 10Majavah: Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627278 (https://phabricator.wikimedia.org/T262812) [16:48:02] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=parsoid,name=parse2001.codfw.wmnet [16:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:56] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=parsoid,name=parse2001.codfw.wmnet [16:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:36] PROBLEM - Check systemd state on ms-be2053 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:59:36] (03PS3) 10Majavah: Initial configuration for arbcom_ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627278 (https://phabricator.wikimedia.org/T262812) [17:00:04] ryankemper: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200914T1700). [17:00:14] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: New Date - Tue, Sept 15: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (10RobH) PDUs show correctly in icinga, so the errors for them are legit: ps1-d6-eqiad doesn't see ps2, so it has errors. [17:00:20] Urbanecm: any idea why the bot hasn't touched T262812? [17:00:21] T262812: Create private arbcom-ru wiki - https://phabricator.wikimedia.org/T262812 [17:01:06] Majavah: not sure, perhaps I made the form look wrong :). Check the syntax, I can look soon [17:02:54] (03CR) 10CRusnov: "This change is ready for review." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627334 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [17:06:44] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627334 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [17:07:00] (03CR) 10CRusnov: [C: 03+2] interface_automation: Blacklist all interfaces that start with 'lo' [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627334 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [17:08:04] 10Operations, 10ops-codfw: ps1-b3-codfw AB feed current > 12A - https://phabricator.wikimedia.org/T262809 (10Papaul) 05Open→03Resolved a:03Papaul Done [17:13:46] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2053 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:16:50] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=parsoid,name=parse2002.codfw.wmnet [17:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:57] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=parsoid,name=parse200[0-9].codfw.wmnet [17:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:30] RECOVERY - Check systemd state on ms-be2053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:41:45] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=parsoid,name=parse20[1-2][0-9].codfw.wmnet [17:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:46] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 78.31 
https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [17:44:40] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2053 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:45:33] (03PS2) 10Jbond: pdu.rotate-snmp disable urllib warnings and add ability to coninue on errors [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 [17:45:35] (03CR) 10Bartosz Dziewoński: "Huh, the diffConfig job looks cool, so that checks whether this commit actually made any changes to the configuration? Does it works with " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627332 (https://phabricator.wikimedia.org/T237191) (owner: 10Bartosz Dziewoński) [17:46:17] (03PS3) 10Jbond: WIP pdu.rotate-snmp disable urllib warnings and add ability to coninue on errors [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 [17:46:54] is anyone around who could review a config change before the deployment window? 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/627332 [17:47:01] it is a little… silly [17:47:22] (03CR) 10jerkins-bot: [V: 04-1] WIP pdu.rotate-snmp disable urllib warnings and add ability to coninue on errors [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (owner: 10Jbond) [17:47:43] (03CR) 10Jbond: WIP pdu.rotate-snmp disable urllib warnings and add ability to coninue on errors (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (owner: 10Jbond) [17:49:13] sure I can take a look [17:50:11] ugh, flaggedrevs [17:50:16] maybe not [17:50:17] i know, right [17:51:01] !log all new parse* parsoid hardware pooled now and set to active in netbox, deploy in 10 min will add to $wgLinterSubmitterWhitelist (T247441) [17:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:07] T247441: move 20 new codfw parsoid servers (parse2*) into production - https://phabricator.wikimedia.org/T247441 [17:51:43] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627336 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [17:54:03] (03CR) 10Dzahn: [C: 03+1] "+1 and also +1 to Effie, for consistency we should have one for canary-app, one for canary-api and one that combines both (this one)" [puppet] - 10https://gerrit.wikimedia.org/r/627327 (owner: 10Muehlenhoff) [17:59:24] jouncebot: bring it [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200914T1800). [18:00:04] Tchanders, Mutante, Ashot1997, ottomata, and MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
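Note on the conftool selectors logged above (name=parse200[0-9].codfw.wmnet and name=parse20[1-2][0-9].codfw.wmnet): the sketch below checks which hosts each pattern would cover. It assumes the selector values are evaluated as anchored regular expressions, which the log itself does not confirm, and it takes the host list parse2001 to parse2020 from the DSH group recoveries and T247441. The helper name matching_hosts is hypothetical and is not part of conftool.

```python
import re

# Selector values copied from the !log lines above.
SELECTORS = [
    r"parse200[0-9].codfw.wmnet",
    r"parse20[1-2][0-9].codfw.wmnet",
]

# Assumption: the 20 new servers from T247441 are parse2001..parse2020,
# matching the hosts named in the DSH group recoveries.
HOSTS = [f"parse{n}.codfw.wmnet" for n in range(2001, 2021)]


def matching_hosts(selector, hosts):
    """Return the hosts whose full name matches the selector pattern."""
    pattern = re.compile(selector)
    return [h for h in hosts if pattern.fullmatch(h)]


by_selector = {s: matching_hosts(s, HOSTS) for s in SELECTORS}
covered = sorted(h for matched in by_selector.values() for h in matched)

for selector, matched in by_selector.items():
    print(f"{selector}: {len(matched)} hosts ({matched[0]} .. {matched[-1]})")
print("all 20 hosts covered:", covered == HOSTS)
```

Under those assumptions the first pattern covers parse2001 to parse2009 and the second covers parse2010 to parse2020, so the two pooled=yes actions together would account for every new host.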
[18:00:18] ahoy [18:00:22] yo [18:00:23] Hi [18:00:25] I can deploy today [18:00:28] o/ [18:00:41] Tchanders: are you around? [18:01:54] (03CR) 10Bartosz Dziewoński: "To confirm that I didn't mess anything up in this large diff, a command like this might be helpful:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627332 (https://phabricator.wikimedia.org/T237191) (owner: 10Bartosz Dziewoński) [18:02:03] mutante: is it possible to test your patch? [18:02:31] (03PS3) 10Urbanecm: Add logo Wordmark and Tagline for hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626740 (https://phabricator.wikimedia.org/T259985) (owner: 10Ashot1997) [18:02:38] (03CR) 10Urbanecm: [C: 03+2] Add logo Wordmark and Tagline for hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626740 (https://phabricator.wikimedia.org/T259985) (owner: 10Ashot1997) [18:02:45] Ashot1997: I'll start with your patch [18:02:55] (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "Default to using API json formatversion=2"" [extensions/EventStreamConfig] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/627215 (owner: 10Ottomata) [18:03:22] Urbanecm cool, thanks ^_^ [18:03:25] (03Merged) 10jenkins-bot: Add logo Wordmark and Tagline for hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626740 (https://phabricator.wikimedia.org/T259985) (owner: 10Ashot1997) [18:03:40] Urbanecm: i don't think so. besides "errors identified by parsoid will be exposed". it is for https://www.mediawiki.org/wiki/Extension:Linter [18:04:05] mutante: okay. Would you mind syncing that yourself at the end? [18:04:19] Ashot1997: your patch is at mwdebug2001, could you test please? [18:04:20] Urbanecm: yes, i would [18:04:33] (03PS1) 10Volans: wmf-auto-reimage: update Netbox interfaces [puppet] - 10https://gerrit.wikimedia.org/r/627337 (https://phabricator.wikimedia.org/T244153) [18:04:59] mutante: I'm sorry, does that mean "I would mind"? 
[18:05:03] Urbanecm: yes [18:05:33] (03CR) 10jerkins-bot: [V: 04-1] wmf-auto-reimage: update Netbox interfaces [puppet] - 10https://gerrit.wikimedia.org/r/627337 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [18:05:53] mutante: ack, I'll do that at the end [18:06:20] Urbanecm: thank you! [18:07:02] Urbanecm: It looks okay [18:07:03] Ashot1997: ping? :-) [18:07:08] ah, you were on it, thanks [18:07:09] syncing it then [18:07:27] (03PS2) 10Volans: wmf-auto-reimage: update Netbox interfaces [puppet] - 10https://gerrit.wikimedia.org/r/627337 (https://phabricator.wikimedia.org/T244153) [18:08:39] !log urbanecm@deploy1001 Synchronized static/images/mobile/copyright/: 699f5e8c2a50f35e98850ea32f7847d183600351: Add logo Wordmark and Tagline for hywiki (T259985) (duration: 00m 56s) [18:08:41] Urbanecm: Thank you ^_^ [18:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:45] T259985: Add new mobile watermark for Armenian Wikipedia - https://phabricator.wikimedia.org/T259985 [18:09:45] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 699f5e8c2a50f35e98850ea32f7847d183600351: Add logo Wordmark and Tagline for hywiki (T259985) (duration: 00m 55s) [18:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:53] Ashot1997: happy to help :-) [18:09:55] Urbanecm: when it comes time to it feel free to just sync mine, skipping mwdebug; we tested that last week :) [18:10:03] sure ottomata [18:10:42] (03CR) 10Urbanecm: [C: 03+2] flaggedrevs: Move setting of wgFlaggedRevsAutopromote and wgFlaggedRevsAutoconfirm out of wgExtensionFunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627332 (https://phabricator.wikimedia.org/T237191) (owner: 10Bartosz Dziewoński) [18:11:25] (03Merged) 10jenkins-bot: flaggedrevs: Move setting of wgFlaggedRevsAutopromote and wgFlaggedRevsAutoconfirm out of wgExtensionFunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627332 
(https://phabricator.wikimedia.org/T237191) (owner: 10Bartosz Dziewoński) [18:12:21] MatmaRex: pulled onto mwdebug2001, can you test, please? [18:13:11] Urbanecm: i can't really test the functionality [18:13:24] since the autopromote rules only kick in when an eligible user makes an edit [18:13:32] but the site looks still up [18:13:51] MatmaRex: okay, for some reason I thought you have a way [18:13:56] syncing it then [18:14:25] Urbanecm: i'm planning to just watch the log on plwiki [18:14:30] MatmaRex: ack [18:15:02] 10Operations, 10LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (10Dzahn) a:05Dzahn→03MNovotny_WMF [18:15:04] MatmaRex: I have 275 edits at plwiki, is that enough to expect an autopromotion? [18:15:12] 10Operations, 10LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (10Dzahn) p:05Triage→03Medium [18:15:17] !log urbanecm@deploy1001 Synchronized wmf-config/flaggedrevs.php: 720e6cbfe1800fe32dc65c209240ba08706dbb17: flaggedrevs: Move setting of wgFlaggedRevsAutopromote and wgFlaggedRevsAutoconfirm out of wgExtensionFunctions (T237191) (duration: 00m 56s) [18:15:23] done [18:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:25] T237191: FlaggedRevs: Automatic user promotion stopped working on some wikis on June 24, 2019 - https://phabricator.wikimedia.org/T237191 [18:15:42] Tchanders: ping? are you around? [18:15:42] Urbanecm: we have it at 500 apparently [18:15:47] but some other wiki might work [18:16:03] okay, thanks, I can try later [18:16:05] Urbanecm: Sorry, was in a meeting that overran! [18:16:17] I am around now if that's still good? 
[18:16:33] Tchanders: sure, I'm doing some other patches now, I'll ping you soon :) [18:16:36] (03PS3) 10Urbanecm: add new parse* servers to $wgLinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626719 (https://phabricator.wikimedia.org/T247441) (owner: 10Dzahn) [18:16:43] Thanks! [18:17:28] (03CR) 10Urbanecm: [C: 03+2] "B&C" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626719 (https://phabricator.wikimedia.org/T247441) (owner: 10Dzahn) [18:17:34] Urbanecm: closest i have is: i can confirm all these servers have parsoid, all icinga alerts are green, they are pooled, the comments say to add them here (anytime), https://www.mediawiki.org/w/api.php?action=query&list=linterrors should still work [18:17:58] code comments just say to not forget this and it's just for finding lint issues [18:18:14] (03Merged) 10jenkins-bot: add new parse* servers to $wgLinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626719 (https://phabricator.wikimedia.org/T247441) (owner: 10Dzahn) [18:18:57] thanks mutante [18:19:11] thanks a lot Urbanecm [18:19:34] I pulled it onto mwdebug2001, https://www.mediawiki.org/w/api.php?action=query&list=linterrors works, and so does https://www.mediawiki.org/w/api.php?action=query&list=linterrors [18:19:38] * https://www.mediawiki.org/wiki/Special:LintErrors [18:19:59] Urbanecm: confirmed, looks good to me [18:20:03] okay, syncing [18:21:14] it's like a whitelist, which server is allowed to send lint issues [18:21:21] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 27ba5a1da1fb00e721cfa82dd4cd1fbac2541053: add new parse* servers to $wgLinterSubmitterWhitelist (T247441) (duration: 00m 56s) [18:21:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:27] T247441: move 20 new codfw parsoid servers (parse2*) into production - https://phabricator.wikimedia.org/T247441 [18:21:31] gotcha mutante [18:21:34] should be live [18:21:38] thanks :) 
[18:21:54] (03PS2) 10Urbanecm: Remove the 'investigate' right from testwiki and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620091 (https://phabricator.wikimedia.org/T260175) (owner: 10Tchanders) [18:21:58] api.php for linterrors still looking ok [18:22:00] (03CR) 10Urbanecm: [C: 03+2] Remove the 'investigate' right from testwiki and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620091 (https://phabricator.wikimedia.org/T260175) (owner: 10Tchanders) [18:22:06] ack mutante [18:22:21] Tchanders: your patches are going next :) [18:22:45] (03Merged) 10jenkins-bot: Remove the 'investigate' right from testwiki and frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620091 (https://phabricator.wikimedia.org/T260175) (owner: 10Tchanders) [18:23:25] Tchanders: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/620091/ is pulled to mwdebug2001 now, can you test, please? [18:23:36] oh, investigate rolling [18:23:40] Urbanecm: Thanks, taking a look [18:26:02] (03Merged) 10jenkins-bot: Revert "Revert "Default to using API json formatversion=2"" [extensions/EventStreamConfig] (wmf/1.36.0-wmf.8) - 10https://gerrit.wikimedia.org/r/627215 (owner: 10Ottomata) [18:26:08] (03PS4) 10Jbond: WIP pdu.rotate-snmp disable urllib warnings and add ability to coninue on errors [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 [18:26:25] ottomata: fyi, syncing your patch w/o i18n build [18:26:57] 10Operations, 10SRE-Access-Requests, 10cloud-services-team (Kanban): wikitech-static access for Sam Reed - https://phabricator.wikimedia.org/T262468 (10RLazarus) p:05Triage→03Medium [18:27:01] (03CR) 10jerkins-bot: [V: 04-1] WIP pdu.rotate-snmp disable urllib warnings and add ability to coninue on errors [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (owner: 10Jbond) [18:27:09] 10Operations, 10SRE-Access-Requests, 10cloud-services-team (Kanban): wikitech-static access for Sam Reed - https://phabricator.wikimedia.org/T262468 (10RLazarus) 
Added to the agenda for the next SRE meeting, 2020-09-21. [18:28:14] Urbanecm: Looks good [18:28:24] thanks, syncing that too :-) [18:28:25] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627339 (owner: 10CRusnov) [18:28:29] Urbanecm: thank you [18:29:01] (03CR) 10CRusnov: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/627337 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [18:29:34] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627339 (owner: 10CRusnov) [18:29:59] (03CR) 10CRusnov: [C: 03+2] reports/cables.py: Exclude servers from interface name report [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627339 (owner: 10CRusnov) [18:30:08] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.8/extensions/EventStreamConfig/includes/: a4c86089371319ae5a3bb6053c4a9b3e83130286: Default to using API json formatversion=2 (T251609) (duration: 00m 57s) [18:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:14] T251609: Automate ingestion and refinement into Hive of event data from Kafka using stream configs and canary/heartbeat events - https://phabricator.wikimedia.org/T251609 [18:30:19] ottomata: done [18:30:28] (03PS2) 10Urbanecm: Remove 'investigate' from $wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620092 (https://phabricator.wikimedia.org/T260175) (owner: 10Tchanders) [18:30:33] (03CR) 10Urbanecm: [C: 03+2] Remove 'investigate' from $wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620092 (https://phabricator.wikimedia.org/T260175) (owner: 10Tchanders) [18:30:35] (03CR) 10Volans: [C: 03+2] wmf-auto-reimage: update Netbox interfaces [puppet] - 10https://gerrit.wikimedia.org/r/627337 (https://phabricator.wikimedia.org/T244153) (owner: 10Volans) [18:31:00] looks good Urbanecm ty [18:31:13] (03Merged) 10jenkins-bot: Remove 
'investigate' from $wgAvailableRights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/620092 (https://phabricator.wikimedia.org/T260175) (owner: 10Tchanders) [18:31:38] (03PS2) 10Urbanecm: Enable Special:Investigate on itwiki, eswiki and svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626715 (https://phabricator.wikimedia.org/T262436) (owner: 10Tchanders) [18:32:09] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: d2fa6533a594c8544342954eae19a4a0f7baeff0: Remove the investigate right from testwiki and frwiki (T260175) (duration: 00m 56s) [18:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:15] T260175: Remove investigate right in favor of checkuser right for Special:Investigate - https://phabricator.wikimedia.org/T260175 [18:32:39] Tchanders: first patch is tested [18:32:41] *synced [18:33:03] Urbanecm: Great, still looks fine [18:36:58] I'm straight syncing the available rights, it works from the global group management interface [18:37:11] (03CR) 10Urbanecm: [C: 03+2] Enable Special:Investigate on itwiki, eswiki and svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626715 (https://phabricator.wikimedia.org/T262436) (owner: 10Tchanders) [18:37:16] Urbanecm: fyi my patch seems to be working as expected: https://phabricator.wikimedia.org/T237191#6459947 [18:37:32] thanks MatmaRex [18:37:55] (03Merged) 10jenkins-bot: Enable Special:Investigate on itwiki, eswiki and svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/626715 (https://phabricator.wikimedia.org/T262436) (owner: 10Tchanders) [18:38:26] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 7d1939323cc3ea5dacf67a43d4d359c114203a66: Remove investigate from $wgAvailableRights (T260175) (duration: 00m 56s) [18:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:33] T260175: Remove investigate right in favor of checkuser right for Special:Investigate - 
https://phabricator.wikimedia.org/T260175 [18:39:52] Tchanders: Enable Special:Investigate on itwiki, eswiki and svwiki is fetched to mwdebug2001, can you test it, please? [18:41:01] Urbanecm: Thanks - looks great except for some missing Italian translations - let me check if that's OK... [18:41:09] sure, waiting [18:42:34] (03PS5) 10Jbond: WIP cookbook sre.pdu: Fix reboot logic [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 [18:43:31] (03CR) 10jerkins-bot: [V: 04-1] WIP cookbook sre.pdu: Fix reboot logic [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (owner: 10Jbond) [18:44:40] Urbanecm: Sorry, I think we need those translations. Would it be OK if I amended that patch just to enable on Spanish now? [18:44:57] Tchanders: sure, upload a follow-up patch, and I'll deploy them both. [18:45:19] Urbanecm: Thanks - just doing that [18:46:26] * hauskatze did most of the 'es' Investigate translations [18:46:38] thanks hauskatze :-) [18:46:48] I think I missed the aliases but I may add them later [18:46:53] (03PS6) 10Jbond: WIP cookbook sre.pdu: Fix reboot logic [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 [18:46:57] or not, depends on my mood [18:47:38] !log volans@cumin1001 START - Cookbook sre.hosts.downtime [18:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:45] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:49:58] (03PS1) 10Tchanders: Don't enable Special:Investigate on itwiki and svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627341 (https://phabricator.wikimedia.org/T262436) [18:51:00] Urbanecm: That's the follow-up [18:51:05] okay, thanks [18:51:07] looking [18:51:19] (03CR) 10Urbanecm: [C: 03+2] Don't enable Special:Investigate on itwiki and svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627341 (https://phabricator.wikimedia.org/T262436) (owner: 10Tchanders) 
[18:51:21] (03PS7) 10Jbond: cookbook sre.pdu: Fix reboot logic and other minor fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 [18:52:04] (03Merged) 10jenkins-bot: Don't enable Special:Investigate on itwiki and svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/627341 (https://phabricator.wikimedia.org/T262436) (owner: 10Tchanders) [18:52:46] (03CR) 10Jbond: "Ready for review but see inline note re urllib3.disable_warnings" (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 (owner: 10Jbond) [18:53:33] Tchanders: pulled onto mwdebug2001, can you make sure it works please? [18:53:34] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627342 (owner: 10CRusnov) [18:53:40] (03PS8) 10Jbond: cookbook sre.pdu: Fix reboot logic and other minor fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 [18:54:03] Urbanecm: Looks great - thank you! [18:54:06] (03PS9) 10Jbond: cookbook sre.pdu: Fix reboot logic and other minor fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/627272 [18:54:20] thanks, syncing [18:54:53] (03PS2) 10CRusnov: reports/cables.py: Exclude virtual machines from name check [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/627342 [18:55:36] Urbanecm: Awesome, thanks [18:57:28] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: a5d56edc7460ac43492f9c04cff86c1b03e56fa4: e2f47980c371b52b1b66957f2bff2266745ab00a: Enable Special:Investigate on eswiki (T262436) (duration: 00m 56s) [18:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:36] T262436: Deploy Special:Investigate to Spanish, Swedish and Italian wikipedias - https://phabricator.wikimedia.org/T262436 [18:57:42] Tchanders: should be live [18:58:20] anything else? [18:59:51] Urbanecm: Oh yes, I see it. I think everything's OK now - I'll schedule enabling the others for next week. Thanks! [18:59:59] cool! 
[19:00:33] !log Morning B&C done [19:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:08] (03PS7) 10Ahmon Dancy: Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 [19:03:29] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/627201 (owner: 10Muehlenhoff) [19:03:39] (03CR) 10jerkins-bot: [V: 04-1] Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [19:08:53] (03PS8) 10Ahmon Dancy: Factor out datacenters lists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 [19:09:34] (03PS1) 10RLazarus: admin: Add Sudhanshu Gautam to wmf group [puppet] - 10https://gerrit.wikimedia.org/r/627348 (https://phabricator.wikimedia.org/T262785) [19:12:40] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: New Date - Tue, Sept 15 PDU Upgrade 12pm-4pm UTC- Racks C2 and C3 - https://phabricator.wikimedia.org/T261455 (10wiki_willy) [19:12:43] (03CR) 10Ahmon Dancy: Factor out datacenters lists (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/622193 (owner: 10Ahmon Dancy) [19:14:39] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: New Date - Mon, Sept 14: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (10wiki_willy) [19:14:59] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By:TBD) rack/setup/install rows C and D new PDUs - https://phabricator.wikimedia.org/T253694 (10wiki_willy) [19:16:35] (03PS2) 10CDanis: admin: Add Sudhanshu Gautam to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/627348 (https://phabricator.wikimedia.org/T262785) (owner: 10RLazarus) [19:16:57] (03CR) 10CDanis: [C: 03+1] admin: Add Sudhanshu Gautam to ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/627348 (https://phabricator.wikimedia.org/T262785) (owner: 10RLazarus) [19:22:46] (03CR) 10Jbond: admin: Add Sudhanshu Gautam to ldap_only_users (031 comment) [puppet] - 
https://gerrit.wikimedia.org/r/627348 (https://phabricator.wikimedia.org/T262785) (owner: RLazarus) [19:24:56] (PS3) RLazarus: admin: Add Sudhanshu Gautam to wmf group [puppet] - https://gerrit.wikimedia.org/r/627348 (https://phabricator.wikimedia.org/T262785) [19:25:49] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/627348 (https://phabricator.wikimedia.org/T262785) (owner: RLazarus) [19:26:17] (CR) RLazarus: [C: +2] admin: Add Sudhanshu Gautam to wmf group (1 comment) [puppet] - https://gerrit.wikimedia.org/r/627348 (https://phabricator.wikimedia.org/T262785) (owner: RLazarus) [19:31:19] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [19:32:02] Operations, ops-eqiad, DBA, DC-Ops: New Date - Mon, Sept 14: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (RobH) [19:32:34] Operations, ops-eqiad, DBA, DC-Ops: New Date - Mon, Sept 14: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (RobH) [19:32:50] Operations, ops-eqiad, DBA, DC-Ops: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (RobH) [19:34:16] Operations, ops-eqiad, DBA, DC-Ops: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (RobH) I've removed the due/work date as the majority of the onsite work was completed today. All that is pending onsite work is: [] - onsite to update this task with the as... [19:36:15] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [19:37:44] Operations, LDAP-Access-Requests, Patch-For-Review: LDAP access to the wmf group for Sudhanshu Gautam - https://phabricator.wikimedia.org/T262785 (RLazarus) Open→Resolved p:Triage→Medium a:RLazarus ` rzl@mwmaint2001:~$ ldapsearch -x cn=wmf | grep sudhanshugautam member: uid=sudhan... [19:38:20] (PS1) Volans: wmf-auto-reimage: fix Netbox update [puppet] - https://gerrit.wikimedia.org/r/627352 (https://phabricator.wikimedia.org/T244153) [19:38:59] Operations, ops-eqiad, DBA, DC-Ops: PDU Upgrade 12pm-4pm UTC- Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (RobH) [19:39:27] Operations, ops-eqiad, DBA, DC-Ops: PDU Upgrade Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (RobH) [19:40:47] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/627336 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov) [19:41:17] PROBLEM - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [2000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1 [19:42:30] (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/627342 (owner: CRusnov) [19:42:57] (PS1) CDanis: modify wgEventStreams to reference NEL schema [mediawiki-config] - https://gerrit.wikimedia.org/r/627353 (https://phabricator.wikimedia.org/T262087) [19:52:10] Operations, ops-eqiad, DBA, DC-Ops: PDU Upgrade Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (RobH) [19:56:23] Operations, ops-eqiad, DBA, DC-Ops: PDU Upgrade Racks D5 and D6 - 
https://phabricator.wikimedia.org/T261453 (RobH) [19:56:42] Operations, ops-eqiad, DBA, DC-Ops: PDU Upgrade Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (RobH) So the only pending item is ps1-d6-eqiad cannot see ps2-d6-eqiad. I suspect the silver link cable is not seated. [19:59:20] Operations, ops-eqiad, DC-Ops: PDU Upgrade Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (RobH) [20:00:04] halfak and accraze: Time to snap out of that daydream and deploy Services – Graphoid / ORES. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200914T2000). [20:00:23] Operations, ops-eqiad, DC-Ops: Mon, Sept 14th - PDU Upgrade Racks D5 and D6 - https://phabricator.wikimedia.org/T261453 (RobH) [20:01:21] (CR) Krinkle: Factor out datacenters lists (5 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/622193 (owner: Ahmon Dancy) [20:01:35] (CR) Cwhite: [C: +2] raid, smart: bypass facter timeout by calling facter script directly [puppet] - https://gerrit.wikimedia.org/r/626405 (https://phabricator.wikimedia.org/T251293) (owner: Cwhite) [20:03:11] (CR) CRusnov: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/627352 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [20:10:49] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [20:13:23] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [20:16:37] (CR) Ahmon Dancy: Factor out datacenters lists (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/622193 (owner: 
Ahmon Dancy) [20:16:39] ACKNOWLEDGEMENT - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [2000.0] Ryan Kemper https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=codfw&var-smoothing=1&viewPanel=39&from=1599207346029&to=1600416946031 Appears to just be related to the dc switchover - eqiad has been offline (intentionally) https:// [20:16:39] a.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1 [20:19:19] (PS9) Ahmon Dancy: Factor out datacenters lists [mediawiki-config] - https://gerrit.wikimedia.org/r/622193 [20:19:58] (PS2) CDanis: modify wgEventStreams to reference NEL schema [mediawiki-config] - https://gerrit.wikimedia.org/r/627353 (https://phabricator.wikimedia.org/T262087) [20:21:33] (CR) Volans: [C: +2] wmf-auto-reimage: fix Netbox update [puppet] - https://gerrit.wikimedia.org/r/627352 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [20:22:03] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [20:22:06] (CR) Ottomata: [C: +1] modify wgEventStreams to reference NEL schema [mediawiki-config] - https://gerrit.wikimedia.org/r/627353 (https://phabricator.wikimedia.org/T262087) (owner: CDanis) [20:22:27] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [20:22:52] (CR) CDanis: [C: +2] modify wgEventStreams to reference NEL schema [mediawiki-config] - https://gerrit.wikimedia.org/r/627353 (https://phabricator.wikimedia.org/T262087) (owner: CDanis) [20:23:39] (Merged) jenkins-bot: modify wgEventStreams to reference NEL schema [mediawiki-config] - https://gerrit.wikimedia.org/r/627353 (https://phabricator.wikimedia.org/T262087) (owner: CDanis) [20:24:04] jouncebot: now [20:24:04] For the next 0 hour(s) and 35 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200914T2000) [20:24:33] I don't see any activity from accraze so I'm going ahead [20:26:00] !log cdanis@deploy1001 Synchronized wmf-config/InitialiseSettings.php: a588eb0c6 T262087 modify wgEventStreams to reference NEL schema (duration: 00m 56s) [20:26:05] (CR) Ahmon Dancy: [C: -1] "Holding" [mediawiki-config] - https://gerrit.wikimedia.org/r/622193 (owner: Ahmon Dancy) [20:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:08] T262087: Deploy an updated eventgate-logging-external with NEL patches - https://phabricator.wikimedia.org/T262087 [20:36:37] !log volans@cumin1001 START - Cookbook sre.hosts.downtime [20:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:46] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:33] (CR) Mholloway: [C: +2] Mobileapps: Use backtick to simplify templates [deployment-charts] - https://gerrit.wikimedia.org/r/626196 (owner: Ppchelko) [20:39:42] (PS1) CDanis: logstash collectors: accept Network Error Logging reports [puppet] - https://gerrit.wikimedia.org/r/627364 (https://phabricator.wikimedia.org/T257527) 
[20:41:02] (Merged) jenkins-bot: Mobileapps: Use backtick to simplify templates [deployment-charts] - https://gerrit.wikimedia.org/r/626196 (owner: Ppchelko) [20:44:54] (PS1) Volans: wmf-auto-reimage: fix Netbox update (take 2) [puppet] - https://gerrit.wikimedia.org/r/627367 (https://phabricator.wikimedia.org/T244153) [20:48:04] (CR) CRusnov: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/627367 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [20:48:21] (CR) Volans: [C: +2] wmf-auto-reimage: fix Netbox update (take 2) [puppet] - https://gerrit.wikimedia.org/r/627367 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [20:50:02] jouncebot: now [20:50:02] For the next 0 hour(s) and 9 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200914T2000) [20:50:57] Operations, serviceops: move 20 new codfw parsoid servers (parse2*) into production - https://phabricator.wikimedia.org/T247441 (Dzahn) [20:51:40] (CR) Ottomata: logstash collectors: accept Network Error Logging reports (1 comment) [puppet] - https://gerrit.wikimedia.org/r/627364 (https://phabricator.wikimedia.org/T257527) (owner: CDanis) [20:53:39] (PS1) Mholloway: Update mobileapps to 2020-09-14-204752-production [deployment-charts] - https://gerrit.wikimedia.org/r/627369 [20:55:11] (CR) Mholloway: [C: +2] Update mobileapps to 2020-09-14-204752-production [deployment-charts] - https://gerrit.wikimedia.org/r/627369 (owner: Mholloway) [20:55:37] jouncebot: next [20:55:37] In 0 hour(s) and 4 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200914T2100) [20:55:56] (CR) CDanis: logstash collectors: accept Network Error Logging reports (1 comment) [puppet] - https://gerrit.wikimedia.org/r/627364 (https://phabricator.wikimedia.org/T257527) (owner: CDanis) [20:56:24] (PS2) CDanis: logstash 
collectors: accept Network Error Logging reports [puppet] - https://gerrit.wikimedia.org/r/627364 (https://phabricator.wikimedia.org/T257527) [20:56:28] (Merged) jenkins-bot: Update mobileapps to 2020-09-14-204752-production [deployment-charts] - https://gerrit.wikimedia.org/r/627369 (owner: Mholloway) [20:56:45] i was going to try to squeeze in a mobileapps deployment, but it can wait until the security deployment is done [20:58:57] Operations, serviceops: move 20 new codfw parsoid servers (parse2*) into production - https://phabricator.wikimedia.org/T247441 (Dzahn) Open→Resolved - servers had OS installed - servers had puppet role applied - icinga checks confirmed all green - added to conftool data - set weight for all to 1... [21:00:04] Reedy and sbassett: I, the Bot under the Fountain, allow thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200914T2100). [21:03:51] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/627364 (https://phabricator.wikimedia.org/T257527) (owner: CDanis) [21:03:54] !log volans@cumin1001 START - Cookbook sre.hosts.downtime [21:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:59] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:08] RECOVERY - CirrusSearch codfw 95th percentile latency - more_like on graphite1004 is OK: OK: Less than 20.00% above the threshold [1200.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=39&fullscreen&orgId=1&var-cluster=codfw&var-smoothing=1 [21:10:08] (CR) CRusnov: [C: +2] reports/cables.py: Exclude virtual machines from name check [software/netbox-extras] - https://gerrit.wikimedia.org/r/627342 (owner: 
CRusnov) [21:10:15] (CR) CRusnov: [C: +2] interface_automation: Fix the messages when setting IP as primary [software/netbox-extras] - https://gerrit.wikimedia.org/r/627336 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov) [21:12:00] (PS1) Ssingh: wikidough: update landing page text [puppet] - https://gerrit.wikimedia.org/r/627372 [21:14:43] (CR) Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/25063/malmok.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/627372 (owner: Ssingh) [21:14:58] (CR) Ssingh: [C: +2] wikidough: update landing page text [puppet] - https://gerrit.wikimedia.org/r/627372 (owner: Ssingh) [21:17:17] (PS1) Volans: wmf-auto-reimage: move Netbox update later on [puppet] - https://gerrit.wikimedia.org/r/627374 (https://phabricator.wikimedia.org/T244153) [21:18:00] (CR) CRusnov: [C: +1] "LGTM hopefully it works" [puppet] - https://gerrit.wikimedia.org/r/627374 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [21:18:29] (CR) Volans: [C: +2] wmf-auto-reimage: move Netbox update later on [puppet] - https://gerrit.wikimedia.org/r/627374 (https://phabricator.wikimedia.org/T244153) (owner: Volans) [21:23:28] (CR) CDanis: [C: +2] "thanks for the review! doing a staged rollout as discussed" [puppet] - https://gerrit.wikimedia.org/r/627364 (https://phabricator.wikimedia.org/T257527) (owner: CDanis) [21:24:01] !log T257527 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕠🍺 sudo cumin 'R:Class ~ "(?i)profile::logstash::collector7"' 'disable-puppet "cdanis rolling out Ifa3c68e4"' [21:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:07] T257527: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 [21:27:55] cdanis: out of curiosity, 'C:profile::logstash::collector7' was missing something? 
[21:30:17] volans: you're suggesting I did something other than just C-r for 'profile' [21:30:36] !log T257527 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕠🍺 sudo cumin 'R:Class ~ "(?i)profile::logstash::collector7"' 'enable-puppet "cdanis rolling out Ifa3c68e4"' [21:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:30:42] lol [21:30:42] T257527: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 [21:32:43] !log volans@cumin1001 START - Cookbook sre.hosts.downtime [21:32:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:55] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:41] (PS3) Dzahn: puppetdb: (re)move hiera lookup for db pass to profile [puppet] - https://gerrit.wikimedia.org/r/624340 [21:40:01] (CR) Dzahn: [C: -1] "https://puppet-compiler.wmflabs.org/compiler1002/25064/puppetdb2002.codfw.wmnet/change.puppetdb2002.codfw.wmnet.err" [puppet] - https://gerrit.wikimedia.org/r/624340 (owner: Dzahn) [21:43:27] (PS3) DannyS712: [Beta cluster] add a fake 'UselessRightForTesting' to available rights [mediawiki-config] - https://gerrit.wikimedia.org/r/583745 (https://phabricator.wikimedia.org/T241503) [21:47:15] (PS1) Bstorm: wikireplicas: Proposal for a proxy setup on multi-instance replicas [puppet] - https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) [21:48:16] (CR) Bstorm: [C: -2] "This is submitted for discussion purposes, primarily until we have some multi-instance replica hosts to test it against. 
Setting -2 to dis" [puppet] - https://gerrit.wikimedia.org/r/627379 (https://phabricator.wikimedia.org/T260389) (owner: Bstorm) [21:55:26] (CR) Bstorm: "Iee32e82031518b1e9b90e more or less completes the server-side setup (if it looks sane to everyone). I am more confident in the work on thi" [puppet] - https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: Bstorm) [21:55:35] Operations, Traffic, netops: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (Dzahn) [21:56:46] Operations, Traffic, netops: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (Dzahn) [21:59:43] Operations, Traffic, netops: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (CDanis) An update on my last known disposition of the issue: * It appears to be an intermittent problem; individual outages are not long in duration, but there hav... [22:10:23] Operations, Data-Services, SRE-Access-Requests, Patch-For-Review, cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (Bstorm) We are currently trying to make it more straightforward to manage th... [22:12:03] Operations, Data-Services, SRE-Access-Requests, Patch-For-Review, cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (Bstorm) >>! In T261145#6435719, @jbond wrote: > Further it sounds like it mi... [22:31:47] Operations, Wikimedia-Mailing-lists: Create arbcom-ru@wikimedia.org - https://phabricator.wikimedia.org/T262525 (eross) >>! In T262525#6455525, @Adamant.pwn wrote: > Welp, there's one more question. You said we should contact ITS team, which is former #WMF-Office-IT, I guess? 
Project's description says w... [22:39:52] (PS5) Dzahn: scap: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/624343 (https://phabricator.wikimedia.org/T209953) [22:40:37] (PS4) Catrope: Enable and configure GrowthExperiments on plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/625963 (https://phabricator.wikimedia.org/T254239) [22:45:48] !log mholloway-shell@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [22:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:26] (PS1) Dzahn: profile::scap::dsh: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/627391 [22:49:53] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [22:49:53] !log mholloway-shell@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [22:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:36] (CR) Dzahn: [V: +1 C: +1] "https://puppet-compiler.wmflabs.org/compiler1003/25067/" [puppet] - https://gerrit.wikimedia.org/r/624343 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn) [22:51:11] Operations, ops-eqiad, DC-Ops: Netbox report accounting icinga alert - https://phabricator.wikimedia.org/T250053 (wiki_willy) To your first question, I was hoping there could be some type of autogenerated task that assigns each dc engineer a Phabricator task by data center site. The idea is that it... 
[22:51:43] Operations, Traffic, netops: Wikimedia projects not reachable for some Telecom Italia users - https://phabricator.wikimedia.org/T262869 (CDanis) Today we had reports of an issue from @Andyrom75 that was happening all the time on their Wind (AS1267) mobile connection, and was happening under some circ... [22:57:19] Operations, Analytics, Patch-For-Review: Deploy an updated eventgate-logging-external with NEL patches - https://phabricator.wikimedia.org/T262087 (CDanis) I believe the only thing left to do is to perform a rolling restart of the eventgate-logging-external pods (or the container within them). I'd l... [22:59:29] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [22:59:29] !log mholloway-shell@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [22:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:36] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki protected page) is CRITICAL: Test Fetch enwiki protected page returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX [22:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:05] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Evening backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200914T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. 
[23:01:32] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [23:04:34] PROBLEM - kubelet operational latencies on kubernetes1014 is CRITICAL: instance=kubernetes1014.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:08:26] PROBLEM - kubelet operational latencies on kubernetes1014 is CRITICAL: instance=kubernetes1014.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:12:18] RECOVERY - kubelet operational latencies on kubernetes1014 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [23:24:00] Operations, Android-app-Bugs, Page Content Service, Product-Infrastructure-Team-Backlog, and 4 others: Incorrect language variant returned for PCS endpoints - https://phabricator.wikimedia.org/T249284 (JoeWalsh) Open→Resolved a:JoeWalsh @holger.knust this bug is separate from T256491.... [23:26:54] (PS1) Catrope: Enable and configure GrowthExperiments on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/627393 (https://phabricator.wikimedia.org/T255027) [23:28:58] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) Papaul, That error just points to the existence of errors. Were there any other errors following that? I'm sorry if I didn't mak... [23:29:59] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) Hey Michael, No need to be sorry I knew I had to continue diagnostics which I did and it was taking too long and was waiting for it... 
[23:49:32] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [23:51:33] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul) [23:53:24] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29