[00:00:04] RoanKattouw, Niharika, and Urbanecm: How many deployers does it take to do Evening backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210218T0000).
[00:00:04] AndyRussG: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:02:38] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@8ca6884]: cirrus_namespace_map: Use retries when fetching
[00:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:09] SRE, GitLab, vm-requests, User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (brennen)
[00:03:32] o/
[00:03:59] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@8ca6884]: cirrus_namespace_map: Use retries when fetching (duration: 01m 21s)
[00:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:31] RoanKattouw Niharika Urbanecm hiiiiiii............ ;)
[00:10:58] AndyRussG: hi
[00:11:15] I guess no one claimed the window yet
[00:11:24] I can deploy today, if AndyRussG is still here :)
[00:11:54] Urbanecm: hey yeah that'd be fantastic!
[00:12:04] Urbanecm: thanks so much!!!!
[00:12:29] It's just two teen changes to stop an error on prod (that actually doesn't affect users, but should be stopped)
[00:13:39] okay
[00:14:03] AndyRussG: do the backport and config changes depend on each other?
[00:14:08] I didn't know if I should make the branch commits, wasn't sure what the policy currently was
[00:14:15] Urbanecm: no no dependency
[00:14:19] okay
[00:14:41] The CentralNotice commit is on CN's wmf_deploy branch, which is what we deploy
[00:14:47] (PS1) Urbanecm: Remove optedOutCampaigns property from impression data [extensions/CentralNotice] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/664665 (https://phabricator.wikimedia.org/T275054)
[00:16:56] AndyRussG: oh, does CN work in a different way than other exts?
[00:17:04] (ie. via wmf.XX branches?)
[00:17:27] the config has a -1 by you AndyRussG, will +2 once you remove it
[00:17:49] Urbanecm: in the way the wmf.XX branches are deployed, it's the same
[00:17:56] got it
[00:18:05] (CR) Urbanecm: [C: +2] Remove optedOutCampaigns property from impression data [extensions/CentralNotice] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/664665 (https://phabricator.wikimedia.org/T275054) (owner: Urbanecm)
[00:18:15] just the difference is the commits for those branches are taken from wmf_deploy rather than master
[00:18:22] (PS1) Urbanecm: Remove optedOutCampaigns property from impression data [extensions/CentralNotice] (wmf/1.36.0-wmf.31) - https://gerrit.wikimedia.org/r/664966 (https://phabricator.wikimedia.org/T275054)
[00:18:26] got it
[00:18:29] (CR) Urbanecm: [C: +2] Remove optedOutCampaigns property from impression data [extensions/CentralNotice] (wmf/1.36.0-wmf.31) - https://gerrit.wikimedia.org/r/664966 (https://phabricator.wikimedia.org/T275054) (owner: Urbanecm)
[00:19:00] (CR) AndyRussG: [C: +1] Remove wgCentralNoticeImpressionEventSampleRate (will default to 0) [mediawiki-config] - https://gerrit.wikimedia.org/r/664889 (https://phabricator.wikimedia.org/T275054) (owner: AndyRussG)
[00:19:23] (PS3) Urbanecm: Remove wgCentralNoticeImpressionEventSampleRate (will default to 0) [mediawiki-config] - https://gerrit.wikimedia.org/r/664889 (https://phabricator.wikimedia.org/T275054) (owner: AndyRussG)
[00:19:27] Urbanecm: k done... yeah I just put the -1 because sometimes folks forget it shouldn't merge until deploy time
[00:19:33] (CR) Urbanecm: [C: +2] Remove wgCentralNoticeImpressionEventSampleRate (will default to 0) [mediawiki-config] - https://gerrit.wikimedia.org/r/664889 (https://phabricator.wikimedia.org/T275054) (owner: AndyRussG)
[00:19:35] hehe
[00:19:45] people who do that shouldn't have +2 on config, IMO
[00:20:38] (Merged) jenkins-bot: Remove wgCentralNoticeImpressionEventSampleRate (will default to 0) [mediawiki-config] - https://gerrit.wikimedia.org/r/664889 (https://phabricator.wikimedia.org/T275054) (owner: AndyRussG)
[00:20:57] oh well.. it's just that fr stack works differently
[00:21:20] i see
[00:21:33] AndyRussG: pulled to mwdebug1001 if you wish to test it there
[00:22:02] Urbanecm: yes one sec.... ahh forgot I haven't yet installed the mwdebug extension on this computer, one sec
[00:22:02] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[00:23:39] (Merged) jenkins-bot: Remove optedOutCampaigns property from impression data [extensions/CentralNotice] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/664665 (https://phabricator.wikimedia.org/T275054) (owner: Urbanecm)
[00:23:46] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[00:23:56] (Merged) jenkins-bot: Remove optedOutCampaigns property from impression data [extensions/CentralNotice] (wmf/1.36.0-wmf.31) - https://gerrit.wikimedia.org/r/664966 (https://phabricator.wikimedia.org/T275054) (owner: Urbanecm)
[00:24:28] Urbanecm: I'm not seeing the change live, but it takes a bit for the RL cache to roll over still, no?
[00:24:35] lemme try debug=true
[00:24:43] yes, debug=true should make it work
[00:25:26] also pulled the backports to mwdebug1001
[00:27:42] Urbanecm: ok I see it now, looks fine :)
[00:27:52] great, syncing
[00:28:37] !log urbanecm@deploy1001 sync-file aborted: 08b32c453a1e879e6321ebec39122d0e06e14714: Remove wgCentralNoticeImpressionEventSampleRate; will default to 0 (T275054) (duration: 00m 00s)
[00:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:44] T275054: Many invalid CentralNoticeImpression events - https://phabricator.wikimedia.org/T275054
[00:29:23] Urbanecm: thanks! the config change is not something I can see easily in a browser I think
[00:30:23] I think the smoke test there is just that the site isn't burning down (very very unlikely, we just eliminated some config lines)
[00:30:23] yup, I expect this to just work
[00:31:06] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 08b32c453a1e879e6321ebec39122d0e06e14714: Remove wgCentralNoticeImpressionEventSampleRate; will default to 0 (T275054) (duration: 02m 17s)
[00:31:10] config should be live
[00:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:54] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.31/extensions/CentralNotice/resources/ext.centralNotice.display/state.js: ff444c28eacbac45476b8fbaed82bc3d8fc4dc66: Remove optedOutCampaigns property from impression data (T275054) (duration: 01m 09s)
[00:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:57] Urbanecm: cool yes I don't see anything bad :)
[00:34:06] cool :)
[00:34:30] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.30/extensions/CentralNotice/resources/ext.centralNotice.display/state.js: dd64e44886727871fa0d2e0e87960d7d8ffba451: Remove optedOutCampaigns property from impression data (T275054) (duration: 01m 08s)
[00:34:33] and done
[00:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:36] T275054: Many invalid CentralNoticeImpression events - https://phabricator.wikimedia.org/T275054
[00:34:36] AndyRussG: anything else?
[00:34:50] Urbanecm: no that's fantasmic thanks so so much :D
[00:34:55] np :)
[00:35:05] :)
[00:35:22] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 131 probes of 601 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:44:02] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[00:46:50] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 601 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:49:57] !log robh@cumin1001 START - Cookbook sre.dns.netbox
[00:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:16] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1439572160 and 132 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:57:02] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 16088 and 146 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:15] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:04] twentyafterfour: #bothumor I � Unicode. All rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210218T0100).
[01:11:56] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (Jclark-ctr)
[01:16:30] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (Jclark-ctr) a:Cmjohnson→RobH racked & cabled, bios configured, network configured. handing over to Rob for imaging
[01:32:28] !log robh@cumin1001 START - Cookbook sre.dns.netbox
[01:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:39] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (wiki_willy) Nice work, thanks @Jclark-ctr >>! In T260445#6839851, @Jclark-ctr wrote: > racked & cabled, bios configured, network configured. handing ove...
[01:37:56] SRE, ops-eqiad, DBA, Patch-For-Review: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (Jclark-ctr) @jcrespo Is host still down can maintenance be preformed on host?
[01:38:27] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:41:28] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[01:42:00] no you dont icinga i just fixed that
[01:47:00] SRE, GitLab, vm-requests, User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (Urbanecm) >>! In T274459#6839541, @thcipriani wrote: > This is not a testing service. We have the gitlab-test project in labs. This is our initial small production GitLab that folks...
[01:53:08] jouncebot: now
[01:53:08] For the next 0 hour(s) and 6 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210218T0100)
[01:53:18] (PS2) Urbanecm: hewikisource: Allow sysops to grant/revoke reviewer [mediawiki-config] - https://gerrit.wikimedia.org/r/664941 (https://phabricator.wikimedia.org/T274796)
[01:53:22] (CR) Urbanecm: [C: +2] hewikisource: Allow sysops to grant/revoke reviewer [mediawiki-config] - https://gerrit.wikimedia.org/r/664941 (https://phabricator.wikimedia.org/T274796) (owner: Urbanecm)
[01:54:36] (Merged) jenkins-bot: hewikisource: Allow sysops to grant/revoke reviewer [mediawiki-config] - https://gerrit.wikimedia.org/r/664941 (https://phabricator.wikimedia.org/T274796) (owner: Urbanecm)
[01:56:31] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: fe646957eb9b09377b07545ff194a726fd0cc6c7: hewikisource: Allow sysops to grant/revoke reviewer (T274796) (duration: 01m 07s)
[01:56:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:56:37] T274796: Several permission changes for he.wikisource - https://phabricator.wikimedia.org/T274796
[01:58:14] (Abandoned) Urbanecm: Add 'editcontentmodel' permission to 'massmessage-senders' on metawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/454762 (https://phabricator.wikimedia.org/T202597) (owner: Vogone)
[01:58:33] PROBLEM - Host cloudnet1004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[02:04:26] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[02:10:45] SRE, GitLab, vm-requests, User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (thcipriani) >>! In T274459#6839919, @Urbanecm wrote: >>>! In T274459#6839541, @thcipriani wrote: >> This is not a testing service. We have the gitlab-test project in labs. This is o...
[02:13:01] SRE, GitLab, vm-requests, User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (Reedy) >>! In T274459#6839951, @thcipriani wrote: >>>! In T274459#6839919, @Urbanecm wrote: >>>>! In T274459#6839541, @thcipriani wrote: >>> This is not a testing service. We have t...
[02:24:05] SRE, ops-eqiad, cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (Jclark-ctr) Unable to update using DVD Iso is 10gb. Dual layer dvd is only 8.5gb. I was able to update firmware using...
[02:25:34] RECOVERY - Host cloudnet1004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[02:30:23] SRE, GitLab, SRE-Access-Requests, Patch-For-Review: Access group for Gitlab contractors - https://phabricator.wikimedia.org/T274953 (thcipriani)
[02:32:01] (PS1) Milimetric: analytics/reportupdater: remove ee job [puppet] - https://gerrit.wikimedia.org/r/664988
[02:34:26] SRE, GitLab, vm-requests, User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (thcipriani) >>! In T274459#6839954, @Reedy wrote: >>>! In T274459#6839951, @thcipriani wrote: >> This will not be manually maintained. > > Might want to update T274953 then... > >...
[02:55:37] SRE, ops-eqiad, Analytics-Clusters, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (RobH) updated firmware for idrac for the remainder, will update bios and image tomorrow
[03:04:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:18:30] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:20:06] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.103 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:25:32] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:50:02] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.056 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:54:59] (CR) Razzi: [C: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/664988 (owner: Milimetric)
[04:57:59] SRE, ops-eqiad, DBA, Patch-For-Review: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (Marostegui) Yes, you can proceed anytime Thanks
[05:00:12] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 56.73 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:00:16] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 55.69 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:01:26] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[05:01:58] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 86 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:02:00] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:03:08] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/
[06:10:37] !log Reboot dbproxy1014 for kernel upgrade
[06:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:01] (PS1) Marostegui: mariadb: Remove db1075 from puppet [puppet] - https://gerrit.wikimedia.org/r/664999 (https://phabricator.wikimedia.org/T274235)
[06:14:40] (PS1) Marostegui: Revert "wmnet: Failover m1-master to dbproxy1012" [dns] - https://gerrit.wikimedia.org/r/664971
[06:15:24] (CR) Marostegui: [C: +2] Revert "wmnet: Failover m1-master to dbproxy1012" [dns] - https://gerrit.wikimedia.org/r/664971 (owner: Marostegui)
[06:25:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1075.eqiad.wmnet
[06:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:18] (CR) Marostegui: [C: +2] mariadb: Remove db1075 from puppet [puppet] - https://gerrit.wikimedia.org/r/664999 (https://phabricator.wikimedia.org/T274235) (owner: Marostegui)
[06:31:06] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui)
[06:31:08] ops-eqiad, DC-Ops, decommission-hardware: decommission db1075.eqiad.wmnet - https://phabricator.wikimedia.org/T274235 (Marostegui)
[06:37:22] (CR) Elukey: analytics/reportupdater: remove ee job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/664988 (owner: Milimetric)
[06:41:03] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:49:39] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host zookeeper-test1002.eqiad.wmnet
[06:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:11] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host zookeeper-test1002.eqiad.wmnet
[06:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:09] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host archiva1002.wikimedia.org
[06:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:33] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host archiva1002.wikimedia.org
[06:56:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:14] (CR) Elukey: [C: +1] hiera: install memcached 1.6 on mc1036 [puppet] - https://gerrit.wikimedia.org/r/664778 (https://phabricator.wikimedia.org/T270315) (owner: Effie Mouzeli)
[07:12:19] (CR) Elukey: [C: +2] "Thanks a lot for the feedback, really appreciated :)" [puppet] - https://gerrit.wikimedia.org/r/664788 (owner: Elukey)
[07:12:25] (PS6) Elukey: partman: add reuse recipes for Hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/664788
[07:29:24] SRE, Epic: Migrate all of production metal and VMs to Buster or later - https://phabricator.wikimedia.org/T247045 (elukey)
[07:29:47] (PS1) Elukey: hadoop: set Buster for all worker nodes [puppet] - https://gerrit.wikimedia.org/r/665005 (https://phabricator.wikimedia.org/T231067)
[07:32:13] !log Reboot es2026, es2027, es2028 for kernel upgrade
[07:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:19] (CR) Elukey: [C: +2] hadoop: set Buster for all worker nodes [puppet] - https://gerrit.wikimedia.org/r/665005 (https://phabricator.wikimedia.org/T231067) (owner: Elukey)
[07:59:16] !log Reboot es2029, es2030, es2031, es2032, es2033, es2034 for kernel upgrade
[07:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:49] PROBLEM - puppet last run on grafana2001 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:01:36] !log upgrade grafana* to 7.4.2 - T263747
[08:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:43] T263747: Upgrade Grafana to 7.4 - https://phabricator.wikimedia.org/T263747
[08:02:13] (CR) Elukey: [C: +2] install_server: add custom reuse recipes for Hadoop test [puppet] - https://gerrit.wikimedia.org/r/665049 (https://phabricator.wikimedia.org/T231067) (owner: Elukey)
[08:03:53] SRE, GitLab, vm-requests, User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (Joe) >>! In T274459#6839962, @thcipriani wrote: >>>! In T274459#6839954, @Reedy wrote: >>>>! In T274459#6839951, @thcipriani wrote: >>> This will not be manually maintained. >> >>...
[08:04:21] (PS1) KartikMistry: WIP: Enable Section Translation on Bengali Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397)
[08:05:12] (CR) jerkins-bot: [V: -1] WIP: Enable Section Translation on Bengali Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397) (owner: KartikMistry)
[08:05:33] RECOVERY - puppet last run on grafana2001 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:10:06] (PS1) Muehlenhoff: Update comment for snapshots access [puppet] - https://gerrit.wikimedia.org/r/665052
[08:10:47] (PS2) KartikMistry: WIP: Enable Section Translation on Bengali Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397)
[08:11:34] (PS1) Joal: Use parquet as Hive default file format [puppet] - https://gerrit.wikimedia.org/r/665054 (https://phabricator.wikimedia.org/T168554)
[08:11:40] (CR) Filippo Giunchedi: [C: +1] Update comment for snapshots access [puppet] - https://gerrit.wikimedia.org/r/665052 (owner: Muehlenhoff)
[08:13:07] (CR) jerkins-bot: [V: -1] Use parquet as Hive default file format [puppet] - https://gerrit.wikimedia.org/r/665054 (https://phabricator.wikimedia.org/T168554) (owner: Joal)
[08:15:26] (PS3) KartikMistry: WIP: Enable Section Translation on Bengali Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397)
[08:16:57] (CR) Muehlenhoff: [C: +2] Update comment for snapshots access [puppet] - https://gerrit.wikimedia.org/r/665052 (owner: Muehlenhoff)
[08:21:20] (PS1) Ayounsi: Netbox 2.10 compatibility [homer/public] - https://gerrit.wikimedia.org/r/665056 (https://phabricator.wikimedia.org/T265084)
[08:21:24] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: REIMAGE
[08:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:25] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: REIMAGE
[08:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:32] (CR) Elukey: Use parquet as Hive default file format (1 comment) [puppet] - https://gerrit.wikimedia.org/r/665054 (https://phabricator.wikimedia.org/T168554) (owner: Joal)
[08:26:36] SRE: sessionstore SSL cert CRIT in Icinga since > 6 days - https://phabricator.wikimedia.org/T275090 (akosiaris)
[08:28:13] SRE, GitLab, SRE-Access-Requests, Patch-For-Review: Access group for Gitlab contractors - https://phabricator.wikimedia.org/T274953 (akosiaris)
[08:31:08] !log Upgrade kernel on db1154 and db1155 (sanitarium running buster hosts)
[08:31:09] SRE, SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (MoritzMuehlenhoff) Open→Resolved Excellent! Closing the task, then.
[08:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:15] (PS1) Ryan Kemper: kibana: only render if explicitly set [puppet] - https://gerrit.wikimedia.org/r/665058 (https://phabricator.wikimedia.org/T262211)
[08:34:13] SRE: sessionstore SSL cert CRIT in Icinga since > 6 days - https://phabricator.wikimedia.org/T275090 (akosiaris)
[08:35:07] SRE: sessionstore SSL cert CRIT in Icinga since > 6 days - https://phabricator.wikimedia.org/T275090 (akosiaris)
[08:35:59] (CR) Joal: Use parquet as Hive default file format (1 comment) [puppet] - https://gerrit.wikimedia.org/r/665054 (https://phabricator.wikimedia.org/T168554) (owner: Joal)
[08:36:54] (PS2) Joal: Use parquet as Hive default file format [puppet] - https://gerrit.wikimedia.org/r/665054 (https://phabricator.wikimedia.org/T168554)
[08:37:21] (CR) jerkins-bot: [V: -1] Use parquet as Hive default file format [puppet] - https://gerrit.wikimedia.org/r/665054 (https://phabricator.wikimedia.org/T168554) (owner: Joal)
[08:37:47] (PS1) Marostegui: install_server: Do not format db1172 [puppet] - https://gerrit.wikimedia.org/r/665059
[08:38:54] (CR) Marostegui: [C: +2] install_server: Do not format db1172 [puppet] - https://gerrit.wikimedia.org/r/665059 (owner: Marostegui)
[08:39:30] (PS2) Ryan Kemper: kibana: only render if explicitly set [puppet] - https://gerrit.wikimedia.org/r/665058 (https://phabricator.wikimedia.org/T262211)
[08:42:52] (PS1) Hashar: admin: matrix.py now resolves path to data.yaml [puppet] - https://gerrit.wikimedia.org/r/665060
[08:43:22] (PS1) Marostegui: db-eqiad.php: Depool pc1009 [mediawiki-config] - https://gerrit.wikimedia.org/r/665061
[08:44:03] (CR) Hashar: "Previously one had change directory to execute the script" [puppet] - https://gerrit.wikimedia.org/r/665060 (owner: Hashar)
[08:47:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1090:3312, db1090:3317 T274333', diff saved to https://phabricator.wikimedia.org/P14408 and previous config saved to /var/cache/conftool/dbconfig/20210218-084758-marostegui.json
[08:48:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:07] T274333: decommission db1090.eqiad.wmnet - https://phabricator.wikimedia.org/T274333
[08:49:58] (CR) Awight: Remove wgCentralNoticeImpressionEventSampleRate (will default to 0) (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/664889 (https://phabricator.wikimedia.org/T275054) (owner: AndyRussG)
[08:52:16] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 81 probes of 684 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:53:40] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 69 probes of 601 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:57:22] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 11 probes of 684 (alerts on 35) - https://atlas.ripe.net/measurements/1791307/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:57:45] SRE, GitLab, vm-requests, User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (Joe) >>! In T274459#6839962, @thcipriani wrote: > Made a bit of an update there, but the text there did/does match my understanding. We're working with contractors on the initial se...
[08:58:00] (PS3) Joal: Use parquet as Hive default file format [puppet] - https://gerrit.wikimedia.org/r/665054 (https://phabricator.wikimedia.org/T168554)
[08:58:38] (PS1) Muehlenhoff: Add mattcleinman to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/665062 (https://phabricator.wikimedia.org/T274958)
[08:58:44] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 48 probes of 601 (alerts on 65) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:59:16] SRE, GitLab, vm-requests, User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (ayounsi) > will need to access to one another That's default for all our infra. > Networking Requirements: external IP For that traffic flow and overall network diagram would be us...
[09:01:30] (CR) Muehlenhoff: [C: +2] Add mattcleinman to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/665062 (https://phabricator.wikimedia.org/T274958) (owner: Muehlenhoff)
[09:07:06] (CR) Volans: [C: +1] "LGTM, adding Cas for visibility." [homer/public] - https://gerrit.wikimedia.org/r/665056 (https://phabricator.wikimedia.org/T265084) (owner: Ayounsi)
[09:08:35] SRE, SRE-tools: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689 (Volans) Yes the addition of a revert of netbox changes on failure in the makevm cookbook was already in the TODO, I didn't check if it has already a task but is something...
[09:09:58] SRE, Traffic, Wikimedia-General-or-Unknown, User-DannyS712, affects-Kiwix-and-openZIM: Pages whose title ends with semicolon (;) are intermittently inaccessible (likely due to ATS) - https://phabricator.wikimedia.org/T238285 (Kelson)
[09:10:31] (CR) Ryan Kemper: "One problem with this approach so far:" [puppet] - https://gerrit.wikimedia.org/r/665058 (https://phabricator.wikimedia.org/T262211) (owner: Ryan Kemper)
[09:14:03] SRE, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (Joe) In order to catch calls to mediawiki that are not monitoring and go to port 8...
[09:16:18] (CR) Elukey: Use parquet as Hive default file format (1 comment) [puppet] - https://gerrit.wikimedia.org/r/665054 (https://phabricator.wikimedia.org/T168554) (owner: Joal)
[09:17:23] SRE, GitLab, vm-requests, User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (jijiki) >>! In T274459#6839962, @thcipriani wrote: >>>! In T274459#6839954, @Reedy wrote: >>>>! In T274459#6839951, @thcipriani wrote: >>> This will not be manually maintained. >>...
[09:20:53] SRE, SRE-Access-Requests: Superset Access for Matt Cleinman - https://phabricator.wikimedia.org/T274958 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @MattCleinman : I've added you to the group, you should be able to access Superset now. I'm closing the task, please reopen if you run...
[09:23:03] (Abandoned) Filippo Giunchedi: Revert "logstash: add ulogd ecs filter + tests" [puppet] - https://gerrit.wikimedia.org/r/664661 (owner: Filippo Giunchedi)
[09:24:00] PROBLEM - Check systemd state on analytics1059 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:24:00] (PS1) Elukey: presto: remove hive.force-local-scheduling from config [puppet] - https://gerrit.wikimedia.org/r/665067 (https://phabricator.wikimedia.org/T266640)
[09:24:44] (CR) Elukey: [C: +2] presto: remove hive.force-local-scheduling from config [puppet] - https://gerrit.wikimedia.org/r/665067 (https://phabricator.wikimedia.org/T266640) (owner: Elukey)
[09:25:31] SRE, Wikimedia-Logstash, observability, Patch-For-Review: Standardize the logging format - https://phabricator.wikimedia.org/T234565 (fgiunchedi) Just finished moving ulogd logs and dashboard over to ECS -- LGTM https://logstash.wikimedia.org/app/dashboards#/view/AW5v7YTUarkxubcmAwPB
[09:26:50] (CR) Effie Mouzeli: [C: -2] "We seem to be a number of concerns about this effort discussed in T274459, I think we should hold off until we have thought this through." [puppet] - https://gerrit.wikimedia.org/r/664904 (https://phabricator.wikimedia.org/T274458) (owner: Dzahn)
[09:30:22] (PS1) Muehlenhoff: Add cbogen to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/665068 (https://phabricator.wikimedia.org/T258413)
[09:33:11] (PS1) Elukey: presto: add new specific settings for internal TLS conns [puppet] - https://gerrit.wikimedia.org/r/665069 (https://phabricator.wikimedia.org/T266640)
[09:33:43] (CR) Elukey: [C: +2] presto: add new specific settings for internal TLS conns [puppet] - https://gerrit.wikimedia.org/r/665069 (https://phabricator.wikimedia.org/T266640) (owner: Elukey)
[09:36:41] (CR) Muehlenhoff: [C: +2] Add cbogen to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/665068 (https://phabricator.wikimedia.org/T258413) (owner: Muehlenhoff)
[09:39:16] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff @CBogen : I've added you to the group, you should be able to access Superset...
[09:39:18] SRE, SRE-Access-Requests: Requesting access to gerrit-root and gerrit-admin for dancy - https://phabricator.wikimedia.org/T275050 (MoritzMuehlenhoff) p:Triage→Medium
[09:41:07] SRE, SRE-Access-Requests: Requesting access to gerrit-root and gerrit-admin for dancy - https://phabricator.wikimedia.org/T275050 (MoritzMuehlenhoff) Needs approval by @thcipriani
[09:46:44] !log upgrade presto to 0.246-wmf on an-coord1001, an-presto*, stat100x
[09:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:04] SRE, Traffic: provision more machines for eqsin caches - https://phabricator.wikimedia.org/T275046 (MoritzMuehlenhoff) p:Triage→High
[09:47:20] SRE, GitLab, vm-requests, User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (MoritzMuehlenhoff) p:Triage→Medium
[09:48:08] !log restarting backup* hosts T271913
[09:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:27] (CR) Joal: Use parquet as Hive default file format (1 comment) [puppet] - https://gerrit.wikimedia.org/r/665054 (https://phabricator.wikimedia.org/T168554) (owner: Joal)
[09:48:37] (PS2) Urbanecm: Enable GrowthExperiments on thwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/664849 (https://phabricator.wikimedia.org/T274646)
[09:56:22] SRE, GitLab, SRE-Access-Requests, Patch-For-Review: Access group for Gitlab contractors - https://phabricator.wikimedia.org/T274953 (akosiaris) As @MoritzMuehlenhoff says this will need to approval in the meeting, but let me add my counter arguments against this: * There is no precedent for thi...
[09:58:37] (03CR) 10Elukey: [C: 03+2] Use parquet as Hive default file format [puppet] - 10https://gerrit.wikimedia.org/r/665054 (https://phabricator.wikimedia.org/T168554) (owner: 10Joal) [09:58:55] 10SRE, 10GitLab, 10vm-requests, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10akosiaris) Hi, thanks for this request. I see a lot of people have commented already, but I have a number of questions as well. Some technical ones are inline but I got more gener... [10:01:50] (03CR) 10Alexandros Kosiaris: [C: 04-2] "I 've commented on the task but putting an explicit -2 here until the SRE meeting takes place." [puppet] - 10https://gerrit.wikimedia.org/r/664902 (https://phabricator.wikimedia.org/T274953) (owner: 10Dzahn) [10:02:15] <_joe_> jouncebot: next [10:02:16] In 0 hour(s) and 57 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210218T1100) [10:02:25] <_joe_> ok, I think I can sneak in my change now [10:02:42] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:05:04] (03PS1) 10Elukey: Move analytics-hive.eqiad.wmnet to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/665072 (https://phabricator.wikimedia.org/T168554) [10:05:53] (03CR) 10Elukey: [C: 03+2] Move analytics-hive.eqiad.wmnet to an-coord1002 [dns] - 10https://gerrit.wikimedia.org/r/665072 (https://phabricator.wikimedia.org/T168554) (owner: 10Elukey) [10:07:52] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:10:38] 10SRE, 10GitLab, 10SRE-Access-Requests, 10Patch-For-Review: Access group for Gitlab contractors - https://phabricator.wikimedia.org/T274953 (10jcrespo) > The contractors will work outside of the operations/puppet repo for the next 6 months, while we are hiring new SREs to take over and rework the implement... [10:11:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Revert "Revert "Switch restbase calls to be channeled via envoy"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664655 (https://phabricator.wikimedia.org/T266855) (owner: 10Giuseppe Lavagetto) [10:11:51] (03Merged) 10jenkins-bot: Revert "Revert "Switch restbase calls to be channeled via envoy"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664655 (https://phabricator.wikimedia.org/T266855) (owner: 10Giuseppe Lavagetto) [10:12:32] (03CR) 10Kormat: [C: 03+1] db-eqiad.php: Depool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665061 (owner: 10Marostegui) [10:13:08] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:28] (03PS1) 10Muehlenhoff: Update address for rmurthy, converted to staff [puppet] - 10https://gerrit.wikimedia.org/r/665077 [10:15:42] (03PS1) 10Urbanecm: Add https://seer.ufrgs.br to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665078 (https://phabricator.wikimedia.org/T270962) [10:18:17] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (10MoritzMuehlenhoff) @Pablo : I've added you to the cn=wmf LDAP group, can you please retry Jupyterlab? [10:20:25] <_joe_> marostegui / kormat I'm about to deploy my mediawiki-config change [10:20:35] <_joe_> do you want to group yours in the deploy too? [10:20:36] _joe_: i'm delighted for you [10:20:59] _joe_: No, I am not deploying yet. Thanks though :* [10:21:06] <_joe_> kormat: I should remember not to try to be helpful or polite with you [10:21:11] :D [10:21:22] _joe_: I can give tips if needed! 
[10:27:53] !log dcaro@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudmetrics1001.eqiad.wmnet [10:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:54] PROBLEM - Prometheus cloudmetrics1002/labs restarted: beware possible monitoring artifacts on cloudmetrics1002 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/labs [10:30:04] !log oblivian@deploy1001 Synchronized wmf-config/ProductionServices.php: Switch restbase calls to envoy (duration: 01m 15s) [10:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:46] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudmetrics1001.eqiad.wmnet [10:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:48] (03PS1) 10Kormat: integration_env: Pass through unknown args to deploy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/665080 [10:36:42] (03PS2) 10Muehlenhoff: Update address for rmurthy, converted to staff [puppet] - 10https://gerrit.wikimedia.org/r/665077 [10:38:10] (03PS1) 10Urbanecm: Revert "Temporarily add cswiki-black-ribbon.png as a static resource" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664977 [10:39:16] _joe_: still deploying, or can I sneak ^^ in? 
[10:41:28] (03PS3) 10Volans: debmonitor::client: replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [10:41:31] (03PS3) 10Volans: debmonitor: ensure installation order [puppet] - 10https://gerrit.wikimedia.org/r/662650 [10:41:37] (03CR) 10Kormat: [C: 03+2] integration_env: Pass through unknown args to deploy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/665080 (owner: 10Kormat) [10:41:44] <_joe_> Urbanecm: please go, I'm done :) [10:41:51] ok, thanks :) [10:41:58] (03CR) 10Urbanecm: [C: 03+2] Revert "Temporarily add cswiki-black-ribbon.png as a static resource" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664977 (owner: 10Urbanecm) [10:41:59] !log restarting dbprov* hosts T271913 [10:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:56] (03Merged) 10jenkins-bot: Revert "Temporarily add cswiki-black-ribbon.png as a static resource" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664977 (owner: 10Urbanecm) [10:42:59] (03PS2) 10Kormat: README.md: Update reqs for integration testing. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664829 [10:43:02] PROBLEM - Prometheus cloudmetrics1001/labs restarted: beware possible monitoring artifacts on cloudmetrics1001 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/labs [10:44:11] (03Merged) 10jenkins-bot: integration_env: Pass through unknown args to deploy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/665080 (owner: 10Kormat) [10:44:40] (03PS3) 10Kormat: README.md: Update reqs for integration testing. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664829 [10:44:53] (03PS2) 10Urbanecm: Add https://seer.ufrgs.br to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665078 (https://phabricator.wikimedia.org/T270962) [10:44:57] (03CR) 10Urbanecm: [C: 03+2] Add https://seer.ufrgs.br to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665078 (https://phabricator.wikimedia.org/T270962) (owner: 10Urbanecm) [10:45:24] !log urbanecm@deploy1001 Synchronized static/images: d1db3005144c1c6fc212bde49127ea13627857be: Revert "Temporarily add cswiki-black-ribbon.png as a static resource" (duration: 01m 09s) [10:45:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:49] (03Merged) 10jenkins-bot: Add https://seer.ufrgs.br to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665078 (https://phabricator.wikimedia.org/T270962) (owner: 10Urbanecm) [10:46:18] (03CR) 10David Caro: utils: add script to run docker ci tests locally (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro) [10:46:23] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [10:47:31] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 33ab68f3d54dcb411c47b03fa8e283fa3077ea85: Add https://seer.ufrgs.br to the wgCopyUploadsDomains allowlist of Wikimedia Commons (T270962) (duration: 01m 09s) [10:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:37] T270962: Add https://seer.ufrgs.br/debatesdoner to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T270962 [10:47:43] * Urbanecm done [10:48:52] (03CR) 10Volans: 
"Compiler looks happy now, merging:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [10:52:48] RECOVERY - Prometheus cloudmetrics1002/labs restarted: beware possible monitoring artifacts on cloudmetrics1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/labs [10:54:13] 10SRE, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10akosiaris) [10:56:10] (03CR) 10Volans: [C: 03+2] debmonitor::client: replace cron with systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [10:56:19] (03CR) 10Volans: [C: 03+2] debmonitor: ensure installation order [puppet] - 10https://gerrit.wikimedia.org/r/662650 (owner: 10Volans) [10:59:32] (03PS2) 10Marostegui: db-eqiad.php: Depool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665061 [11:00:05] mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210218T1100). [11:00:18] (03CR) 10Volans: "Thanks Daniel for the patch! All looks good so far, I'll keep an eye in case of any issue." 
[puppet] - 10https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [11:01:40] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665061 (owner: 10Marostegui) [11:02:19] (03CR) 10Alexandros Kosiaris: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/663812 (https://phabricator.wikimedia.org/T242855) (owner: 10Alexandros Kosiaris) [11:02:26] (03Merged) 10jenkins-bot: db-eqiad.php: Depool pc1009 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665061 (owner: 10Marostegui) [11:03:07] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool pc1009" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664978 [11:03:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool pc1009 (duration: 01m 08s) [11:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:06] RECOVERY - Prometheus cloudmetrics1001/labs restarted: beware possible monitoring artifacts on cloudmetrics1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/labs [11:04:11] !log Upgrade and reboot pc1009 [11:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:12] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:09:26] something going on with wikidata maybe? 
[11:11:00] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool pc1009" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664978 (owner: 10Marostegui) [11:12:10] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool pc1009" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664978 (owner: 10Marostegui) [11:12:56] (03CR) 10Kormat: [C: 03+2] README.md: Update reqs for integration testing. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664829 (owner: 10Kormat) [11:13:34] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool pc1009 (duration: 01m 09s) [11:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:10] (03CR) 10Gergő Tisza: [C: 03+1] "Looks good. Will the mentor list be added in a separate patch?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664849 (https://phabricator.wikimedia.org/T274646) (owner: 10Urbanecm) [11:15:00] (03CR) 10Urbanecm: "> Patch Set 2: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664849 (https://phabricator.wikimedia.org/T274646) (owner: 10Urbanecm) [11:16:10] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [11:16:38] (03Merged) 10jenkins-bot: README.md: Update reqs for integration testing. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/664829 (owner: 10Kormat) [11:17:32] (03PS3) 10Urbanecm: Enable GrowthExperiments on thwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664849 (https://phabricator.wikimedia.org/T274646) [11:17:35] <_joe_> uhm [11:17:52] (03CR) 10Urbanecm: "> Patch Set 2: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664849 (https://phabricator.wikimedia.org/T274646) (owner: 10Urbanecm) [11:18:22] <_joe_> jynus: I doubt it, it was localized on appservers [11:18:35] <_joe_> if it's the database, you usually see both clusters spike in latency [11:19:31] <_joe_> getMulti() /srv/mediawiki/php-1.36.0-wmf.30/includes/libs/objectcache/MemcachedPeclBagOStuff.php:342 [11:19:39] <_joe_> this is where most threads were stuck [11:19:56] <_joe_> and indeed https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=26&from=now-3h&orgId=1&to=now&var-cluster=appserver&var-datasource=eqiad%20prometheus%2Fops&var-method=GET [11:21:38] <_joe_> looks like sqlblobstore https://grafana.wikimedia.org/d/2Zx07tGZz/wanobjectcache?orgId=1&from=now-1h&to=now [11:22:39] someone excessively fetching old revisions? 
[11:23:54] (03CR) 10Volans: "reply inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro) [11:24:40] Or maybe cause I depooled one of the parsercache hosts and that means we lose 1/3 of that [11:24:55] or that [11:24:55] As the host that gets pooled in is empty for that particular shard [11:25:07] sounds more likely [11:25:14] and it matches the timeline [11:26:57] (03CR) 10Urbanecm: [C: 03+2] Silent deprecate ProtectionForm::buildForm [core] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/664265 (https://phabricator.wikimedia.org/T274889) (owner: 10Jdlrobson) [11:33:55] (03PS1) 10Elukey: Revert "Move analytics-hive.eqiad.wmnet to an-coord1002" [dns] - 10https://gerrit.wikimedia.org/r/664979 [11:37:49] (03PS1) 10Marostegui: Bug: T274333 [puppet] - 10https://gerrit.wikimedia.org/r/665087 (https://phabricator.wikimedia.org/T274333) [11:38:04] <_joe_> marostegui: heh not a huge hit overall [11:38:10] <_joe_> that's good news, actually :) [11:38:14] (03CR) 10jerkins-bot: [V: 04-1] Bug: T274333 [puppet] - 10https://gerrit.wikimedia.org/r/665087 (https://phabricator.wikimedia.org/T274333) (owner: 10Marostegui) [11:38:29] haha [11:38:40] looks like my vim did something crazy [11:39:15] (03PS2) 10Marostegui: db1090: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/665087 (https://phabricator.wikimedia.org/T274333) [11:40:12] (03CR) 10Marostegui: [C: 03+2] db1090: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/665087 (https://phabricator.wikimedia.org/T274333) (owner: 10Marostegui) [11:43:01] (03CR) 10Alexandros Kosiaris: [C: 03+1] k8s: Add docker-registry credentials to pull restricted images [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [11:43:19] (03CR) 10David Caro: utils: add script to run docker ci tests locally (031 comment) [software/spicerack] - 
10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro) [11:49:04] !log restart db1102 T271913 [11:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:08] (03Merged) 10jenkins-bot: Silent deprecate ProtectionForm::buildForm [core] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/664265 (https://phabricator.wikimedia.org/T274889) (owner: 10Jdlrobson) [11:54:11] (03PS1) 10Giuseppe Lavagetto: services: remove monitoring from http restbase endpoint [puppet] - 10https://gerrit.wikimedia.org/r/665088 [11:54:13] (03PS1) 10Giuseppe Lavagetto: services: remove restbase http LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/665089 (https://phabricator.wikimedia.org/T244843) [11:54:15] (03PS1) 10Giuseppe Lavagetto: restbase: remove references to the non-https LVS [puppet] - 10https://gerrit.wikimedia.org/r/665090 (https://phabricator.wikimedia.org/T244843) [11:54:26] 10SRE, 10Datasets-General-or-Unknown, 10Dumps-Generation, 10netops: Packets discarded on dumpsdata1001 - https://phabricator.wikimedia.org/T273713 (10ArielGlenn) Dumpsata1003 is now the primary x,ml/sql dumps NFS server with its shiny 10G NIC. Let's keep an eye on both hosts. The new run starts on the 20th... [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European mid-day backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210218T1200). [12:00:04] No GERRIT patches in the queue for this window AFAICS. 
[12:00:48] yay [12:01:15] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.31/includes/HookContainer/DeprecatedHooks.php: 28aa8718549b76c88e9757a273e0c602479b8d8b: Silent deprecate ProtectionForm::buildForm (T274889) (duration: 01m 14s) [12:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:21] T274889: Use of ProtectionForm::buildForm hook (used in FlaggedRevsUIHooks::onProtectionForm) was deprecated in MediaWiki 1.36 - https://phabricator.wikimedia.org/T274889 [12:08:56] 10SRE, 10cloud-services-team (Kanban): [ceph] Test and upgrade to kernel ~15 - https://phabricator.wikimedia.org/T274565 (10dcaro) [12:09:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cumin: aliases: introduce alias for cloudgw-codfw1dev (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/664805 (owner: 10Arturo Borrero Gonzalez) [12:16:36] 10SRE, 10cloud-services-team (Kanban): [ceph] Test and upgrade to kernel ~15 - https://phabricator.wikimedia.org/T274565 (10dcaro) [12:18:19] (03CR) 10Giuseppe Lavagetto: "https://grafana-rw.wikimedia.org/d/7mUxtYVGk/jayme-ipvs_backend_connections?orgId=1&var-datasource=thanos&var-port=7231&var-address=All&fr" [puppet] - 10https://gerrit.wikimedia.org/r/665088 (owner: 10Giuseppe Lavagetto) [12:20:21] !log restart db1140 T271913 [12:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:46] (03PS1) 10Jbond: P:idp: Add script to delete u2f rgistrations [puppet] - 10https://gerrit.wikimedia.org/r/665095 [12:29:48] (03PS1) 10Jbond: base::standard_packages: install python3-wmflib on all buster serveres [puppet] - 10https://gerrit.wikimedia.org/r/665096 [12:31:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28122/console" [puppet] - 10https://gerrit.wikimedia.org/r/665096 (owner: 10Jbond) [12:32:08] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28124/console" [puppet] - 10https://gerrit.wikimedia.org/r/665095 (owner: 10Jbond) [12:32:43] (03CR) 10Cparle: [C: 03+1] "> Patch Set 12:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [12:33:06] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665096 (owner: 10Jbond) [12:34:58] 10SRE, 10cloud-services-team (Kanban): cloudgw: linux kernel >= 5.6 highly convenient - https://phabricator.wikimedia.org/T275129 (10aborrero) [12:35:20] (03CR) 10Volans: [C: 03+1] "Makes sense to me. Thanks for the patch. Typo inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/665096 (owner: 10Jbond) [12:38:18] (03CR) 10Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [12:39:05] (03PS13) 10ArielGlenn: Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [12:39:41] (03CR) 10ArielGlenn: Generation of json dumps for wikimedia commons (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [12:39:46] (03PS2) 10Jbond: base::standard_packages: install python3-wmflib on all buster servers [puppet] - 10https://gerrit.wikimedia.org/r/665096 [12:39:54] (03CR) 10Kosta Harlan: api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [12:41:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] services: remove monitoring from http restbase endpoint [puppet] - 
10https://gerrit.wikimedia.org/r/665088 (owner: 10Giuseppe Lavagetto) [12:41:57] (03PS2) 10Jbond: P:idp: Add script to delete u2f rgistrations [puppet] - 10https://gerrit.wikimedia.org/r/665095 [12:42:27] (03PS3) 10Jbond: base::standard_packages: install python3-wmflib on all buster servers [puppet] - 10https://gerrit.wikimedia.org/r/665096 [12:47:29] (03CR) 10Jbond: [C: 03+2] base::standard_packages: install python3-wmflib on all buster servers (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/665096 (owner: 10Jbond) [12:52:23] (03CR) 10Cparle: [C: 03+1] Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [12:52:59] (03PS2) 10Giuseppe Lavagetto: services: remove restbase http LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/665089 (https://phabricator.wikimedia.org/T244843) [12:55:04] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28125/console" [puppet] - 10https://gerrit.wikimedia.org/r/665089 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [12:59:28] 10SRE, 10envoy, 10serviceops, 10Service-Architecture: Using envoy to connect from MediaWiki to restbase causes an explosion of live LVS connections. - https://phabricator.wikimedia.org/T266855 (10Joe) 05Open→03Resolved [12:59:54] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create a yaml structure for defining apache virtualhosts for mediawiki, that can be used both in puppet and in helm charts. - https://phabricator.wikimedia.org/T272305 (10Joe) 05Open→03Resolved [12:59:56] 10SRE, 10MW-on-K8s, 10serviceops: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (10Joe) [13:07:27] (03CR) 10Muehlenhoff: "Looks good to me, few comments/nits inline." 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/665095 (owner: 10Jbond) [13:11:39] (03CR) 10Elukey: [C: 03+2] Revert "Move analytics-hive.eqiad.wmnet to an-coord1002" [dns] - 10https://gerrit.wikimedia.org/r/664979 (owner: 10Elukey) [13:17:01] (03CR) 10Nikerabbit: WIP: Enable Section Translation on Bengali Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397) (owner: 10KartikMistry) [13:21:55] (03PS3) 10Jbond: P:idp: Add script to delete u2f rgistrations [puppet] - 10https://gerrit.wikimedia.org/r/665095 [13:22:13] (03CR) 10Jbond: "updated thanks" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/665095 (owner: 10Jbond) [13:22:19] (03PS4) 10Jbond: P:idp: Add script to delete u2f rgistrations [puppet] - 10https://gerrit.wikimedia.org/r/665095 [13:28:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/665095 (owner: 10Jbond) [13:35:11] (03PS5) 10Hnowlan: mtail: create separate metrics histogram based on endpoint [puppet] - 10https://gerrit.wikimedia.org/r/634207 (https://phabricator.wikimedia.org/T263727) [13:36:22] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (10Pablo) It worked! 
Thanks @MoritzMuehlenhoff @Ottomata :) {F34112337} [13:36:55] (03CR) 10jerkins-bot: [V: 04-1] mtail: create separate metrics histogram based on endpoint [puppet] - 10https://gerrit.wikimedia.org/r/634207 (https://phabricator.wikimedia.org/T263727) (owner: 10Hnowlan) [13:36:57] (03CR) 10Volans: "general question inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665095 (owner: 10Jbond) [13:37:17] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add sources to specialSiteLinkGroups Wikibase setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655428 (https://phabricator.wikimedia.org/T138332) (owner: 10Ladsgroup) [13:37:51] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:40:20] (03CR) 10Muehlenhoff: [C: 03+1] P:idp: Add script to delete u2f rgistrations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665095 (owner: 10Jbond) [13:43:03] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:44:31] (03PS1) 10Jbond: admin: update matrix.py [puppet] - 10https://gerrit.wikimedia.org/r/665104 [13:46:19] (03CR) 10Jbond: "thanks hashar this look fine, however i have just create another CR which uses pathlib and allows one to override the file from the argume" [puppet] - 10https://gerrit.wikimedia.org/r/665060 (owner: 10Hashar) [13:46:58] (03Abandoned) 10Hashar: admin: matrix.py now resolves path to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/665060 (owner: 10Hashar) [13:49:18] !log restart db1150 T271913 [13:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:49] (03CR) 10Jbond: P:idp: Add script to delete u2f rgistrations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665095 (owner: 10Jbond) [13:53:24] (03CR) 10Elukey: [C: 03+1] pylint: remove unnecessary disable comments [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664809 (owner: 10Volans) [13:56:15] (03CR) 10Volans: "reply inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665095 (owner: 10Jbond) [13:57:09] (03CR) 10Volans: [C: 03+2] pylint: remove unnecessary disable comments [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664809 (owner: 10Volans) [13:59:12] !log installing intel-microcode security updates on buster [13:59:15] (03Merged) 10jenkins-bot: pylint: remove unnecessary disable comments [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664809 (owner: 10Volans) [13:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:42] (03CR) 10Elukey: [C: 03+1] fileio: add new module (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664810 (owner: 10Volans) [14:00:10] (03PS2) 10Jbond: admin: update matrix.py [puppet] - 10https://gerrit.wikimedia.org/r/665104 [14:00:44] (03CR) 10Volans: "reply inline" (031 comment) [software/pywmflib] - 10https://gerrit.wikimedia.org/r/664810 (owner: 10Volans) 
[14:05:51] (03CR) 10Hashar: "Looks good. There is some formatting issue when using --help cause __doc__ is line wrapped :\" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/665104 (owner: 10Jbond) [14:07:29] (03CR) 10Volans: [C: 04-1] "Replies inline. It leaves the local checkout permissions modified." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro) [14:07:52] hashar: i just sent a new PS for the formating issues can you re-test [14:08:17] (03CR) 10Muehlenhoff: [C: 03+1] P:idp: Add script to delete u2f rgistrations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665095 (owner: 10Jbond) [14:08:48] (03CR) 10Hashar: [C: 03+1] "works now with python 3.7 :)" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/665104 (owner: 10Jbond) [14:12:59] (03PS3) 10Jbond: admin: update matrix.py [puppet] - 10https://gerrit.wikimedia.org/r/665104 [14:13:41] (03CR) 10Hashar: [C: 03+1] admin: update matrix.py [puppet] - 10https://gerrit.wikimedia.org/r/665104 (owner: 10Jbond) [14:14:08] (03CR) 10Jbond: [C: 03+2] admin: update matrix.py [puppet] - 10https://gerrit.wikimedia.org/r/665104 (owner: 10Jbond) [14:15:26] (03PS1) 10Kormat: integration: Fix master_heartbeat_period issues. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/665105 [14:17:11] (03PS2) 10Kormat: integration: Fix master_heartbeat_period issues. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/665105 [14:18:16] (03CR) 10Volans: [C: 04-1] "Small precisation to my last comment" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro) [14:18:18] (03PS3) 10Kormat: integration: Fix master_heartbeat_period issues. 
[software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/665105 [14:19:57] (CR) Volans: P:idp: Add script to delete u2f rgistrations (1 comment) [puppet] - https://gerrit.wikimedia.org/r/665095 (owner: Jbond) [14:20:31] PROBLEM - Host ml-serve1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:05] PROBLEM - Host ml-serve1003 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:06] (CR) Jbond: [C: +2] P:idp: Add script to delete u2f rgistrations [puppet] - https://gerrit.wikimedia.org/r/665095 (owner: Jbond) [14:21:15] PROBLEM - Host ml-serve2003 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:31] PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:57] PROBLEM - Host ml-serve1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:22:01] PROBLEM - Host ml-serve1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:22:23] klausman, elukey ^^^ [14:22:29] PROBLEM - Host ml-serve2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:22:31] My bad. [14:22:35] PROBLEM - Host ml-serve2004 is DOWN: PING CRITICAL - Packet loss = 100% [14:22:40] I rebooted for kernel updates and forgot downtime [14:22:45] RECOVERY - Host ml-serve1002 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [14:22:57] RECOVERY - Host ml-serve1001 is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms [14:23:15] RECOVERY - Host ml-serve2004 is UP: PING OK - Packet loss = 0%, RTA = 33.48 ms [14:23:28] ahhhh machine learning infra down!
[14:23:31] :D [14:23:39] RECOVERY - Host ml-serve1004 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:23:39] RECOVERY - Host ml-serve2002 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [14:23:53] RECOVERY - Host ml-serve1003 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [14:24:07] RECOVERY - Host ml-serve2003 is UP: PING OK - Packet loss = 0%, RTA = 33.13 ms [14:24:49] RECOVERY - Host ml-serve2001 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [14:25:09] (CR) Kormat: [C: +2] integration: Fix master_heartbeat_period issues. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/665105 (owner: Kormat) [14:25:34] klausman: ack [14:27:19] (PS2) Volans: fileio: add new module [software/pywmflib] - https://gerrit.wikimedia.org/r/664810 [14:27:43] (Merged) jenkins-bot: integration: Fix master_heartbeat_period issues. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/665105 (owner: Kormat) [14:32:00] (CR) Volans: [C: +2] fileio: add new module [software/pywmflib] - https://gerrit.wikimedia.org/r/664810 (owner: Volans) [14:34:11] (Merged) jenkins-bot: fileio: add new module [software/pywmflib] - https://gerrit.wikimedia.org/r/664810 (owner: Volans) [14:35:00] SRE, ops-eqiad, Analytics-Radar, Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (elukey) Resolved→Open @Cmjohnson sorry if it took me so long to answer but I noticed this updated only now. The two disks that I have now on an-coord1002 may not b...
[14:35:52] !log installing libzstd security updates on Buster [14:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:13] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.358 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [14:57:09] SRE, ops-eqiad, cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (aborrero) Rebooted the server for a clean start, I see the driver and firmware being loaded. I'll leave here the output for... [15:03:45] (CR) David Caro: utils: add script to run docker ci tests locally (2 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: David Caro) [15:04:02] SRE, SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (CBogen) Resolved→Open >>! In T258413#6840353, @MoritzMuehlenhoff wrote: > @CBogen : I've added you to the group, you should be able to access Superset now....
[15:04:52] 10SRE, 10ops-eqiad, 10Analytics-Radar, 10Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (10elukey) I am going to attempt to add the new disk to the existing md array, let's see how it goes :) [15:06:51] !log swift codfw-prod decrease HDD weight for ms-be20[16-27] - T272837 [15:06:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:58] T272837: Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837 [15:08:09] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:30] 10SRE, 10GitLab, 10vm-requests, 10User-brennen: Eqiad: 2 VM request for GitLab - https://phabricator.wikimedia.org/T274459 (10thcipriani) Whoa, catching up on scrollback overnight. My question is: is this the first anyone in SRE has heard about any of this? [15:13:21] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:51] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.45 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [15:20:05] (03CR) 10Ppchelko: [C: 03+1] "indeed we can remove this -1" [puppet] - 10https://gerrit.wikimedia.org/r/663812 (https://phabricator.wikimedia.org/T242855) (owner: 10Alexandros Kosiaris) [15:30:05] !log installing PHP 7.3 security updates on buster [15:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:08] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/663812 (https://phabricator.wikimedia.org/T242855) (owner: 10Alexandros Kosiaris) [15:33:52] <_joe_> !log rebuilding base images for stretch,buster [15:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:24] (03CR) 10Alexandros Kosiaris: [C: 03+1] api-gateway: generic discovery service config option, add linkrecommendation (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [16:03:59] 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 (10MoritzMuehlenhoff) [16:04:26] (03PS6) 10Hnowlan: mtail: create separate metrics histogram based on endpoint [puppet] - 10https://gerrit.wikimedia.org/r/634207 (https://phabricator.wikimedia.org/T263727) [16:05:40] !log installing libmaxminddb updates from buster 10.8 point release [16:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:50] (03PS1) 10Lucas Werkmeister (WMDE): maintain-meta_p: get case sensitivity from API [puppet] - 10https://gerrit.wikimedia.org/r/665115 [16:09:40] (03PS1) 10Lucas Werkmeister (WMDE): maintain-meta_p: stop reading VariantSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/665116 [16:13:44] (03CR) 10Lucas Werkmeister (WMDE): "I should point out that I haven’t tested this change, since I don’t know how to test it locally." 
[puppet] - 10https://gerrit.wikimedia.org/r/665115 (owner: 10Lucas Werkmeister (WMDE)) [16:18:06] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.017 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [16:21:03] (03CR) 10Cwhite: [C: 03+2] mw_rc_irc: add metrics endpoint to udpmxircecho [puppet] - 10https://gerrit.wikimedia.org/r/662124 (https://phabricator.wikimedia.org/T216611) (owner: 10Cwhite) [16:23:35] !log restart ircecho on kraz -- deploying new metrics endpoint T216611 [16:23:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:43] T216611: Icinga check for ircecho should check for actual activity - https://phabricator.wikimedia.org/T216611 [16:24:09] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.0.7 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/665118 [16:25:04] (03CR) 10Alexandros Kosiaris: pbuilder: create apt-cache directory before running pbuilder init (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/661777 (owner: 10Bstorm) [16:25:31] (03CR) 10Alexandros Kosiaris: [C: 03+1] "feel free to merge, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/661777 (owner: 10Bstorm) [16:26:10] 10SRE, 10SRE-tools: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689 (10Dzahn) for the record, yes, I am unblocked and this issue did NOT show up when I reimaged mwdebug1001. Simply because that VM came back from reboot as normally expected.... 
[16:27:39] (03PS3) 10Cwhite: profile: add prometheus job for udpmxircecho [puppet] - 10https://gerrit.wikimedia.org/r/662125 (https://phabricator.wikimedia.org/T216611) [16:29:15] (03PS4) 10Cwhite: profile: add prometheus job for udpmxircecho [puppet] - 10https://gerrit.wikimedia.org/r/662125 (https://phabricator.wikimedia.org/T216611) [16:31:01] (03CR) 10David Caro: "I'm giving up on this for now, there's no way I can make it work as it's expected to work with a reasonable investment of time and effort " [software/spicerack] - 10https://gerrit.wikimedia.org/r/663205 (https://phabricator.wikimedia.org/T274338) (owner: 10David Caro) [16:31:10] (03CR) 10Cwhite: profile: add prometheus job for udpmxircecho (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/662125 (https://phabricator.wikimedia.org/T216611) (owner: 10Cwhite) [16:32:07] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.233 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [16:33:08] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v0.0.7 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/665118 (owner: 10Volans) [16:35:40] (03CR) 10Cwhite: [C: 03+2] "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1003/28127/" [puppet] - 10https://gerrit.wikimedia.org/r/662125 (https://phabricator.wikimedia.org/T216611) (owner: 10Cwhite) [16:35:58] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.0.7 [software/pywmflib] - 10https://gerrit.wikimedia.org/r/665118 (owner: 10Volans) [16:37:00] (03PS1) 10RobH: install params for an-worker11(29|33|34|39|40|41) [puppet] - 10https://gerrit.wikimedia.org/r/665120 (https://phabricator.wikimedia.org/T260445) [16:37:11] (03PS2) 10RobH: install params for an-worker11(29|33|34|39|40|41) [puppet] - 10https://gerrit.wikimedia.org/r/665120 
(https://phabricator.wikimedia.org/T260445) [16:37:41] (03CR) 10RobH: [C: 03+2] install params for an-worker11(29|33|34|39|40|41) [puppet] - 10https://gerrit.wikimedia.org/r/665120 (https://phabricator.wikimedia.org/T260445) (owner: 10RobH) [16:38:59] (03PS1) 10Volans: Upstream release v0.0.7 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/665121 [16:41:21] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10RobH) an-worker11(29|33|34|39|40|41): [x] idrac firmware updated [x] bios firmware updated [x] idrac and bios settings & password upd... [16:45:48] (03CR) 10Volans: [C: 03+2] Upstream release v0.0.7 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/665121 (owner: 10Volans) [16:48:27] (03Merged) 10jenkins-bot: Upstream release v0.0.7 [software/pywmflib] (debian) - 10https://gerrit.wikimedia.org/r/665121 (owner: 10Volans) [16:51:20] !log uploaded python3-wmflib_0.0.7 to apt.wikimedia.org buster-wikimedia [16:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:12] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1129.eqiad.wmnet',... [17:00:04] jbond42 and cdanis: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210218T1700). 
[17:03:35] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.242 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [17:11:16] (03PS1) 10Ottomata: Add cbogen to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/665125 (https://phabricator.wikimedia.org/T258413) [17:12:04] (03CR) 10Ottomata: [C: 03+2] Add cbogen to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/665125 (https://phabricator.wikimedia.org/T258413) (owner: 10Ottomata) [17:14:01] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10Ottomata) The previous patch didn't add you to analytics-privatedata-users, https://gerrit.wikimedia.org/r/c/operations/puppet/+/665125 does,... [17:17:07] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:17:30] (03PS1) 10Cwhite: mw_rc_irc: add check_prometheus alert on no messages being relayed [puppet] - 10https://gerrit.wikimedia.org/r/665129 (https://phabricator.wikimedia.org/T216611) [17:18:27] (03CR) 10jerkins-bot: [V: 04-1] mw_rc_irc: add check_prometheus alert on no messages being relayed [puppet] - 10https://gerrit.wikimedia.org/r/665129 (https://phabricator.wikimedia.org/T216611) (owner: 10Cwhite) [17:18:51] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:19:42] (03PS2) 10Cwhite: mw_rc_irc: add check_prometheus alert on no messages being relayed [puppet] - 
10https://gerrit.wikimedia.org/r/665129 (https://phabricator.wikimedia.org/T216611) [17:22:46] (03PS3) 10Cwhite: profile: only set default partition if unset [puppet] - 10https://gerrit.wikimedia.org/r/659422 (https://phabricator.wikimedia.org/T234565) [17:24:42] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr @Jclark-ctr - can you check the connection on aqs1014, then shoot this back over to @RobH to finish off? Thanks, Willy >>! In T267414#6827299... [17:26:25] RECOVERY - MD RAID on an-coord1002 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [17:26:31] \o/ [17:27:02] (03CR) 10Cwhite: [C: 03+2] profile: only set default partition if unset [puppet] - 10https://gerrit.wikimedia.org/r/659422 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:29:10] 10SRE, 10ops-eqiad, 10Analytics-Radar, 10Patch-For-Review: Degraded RAID on an-coord1002 - https://phabricator.wikimedia.org/T270768 (10elukey) 05Open→03Resolved It seems to have worked, thanks! 
[17:29:47] (03PS1) 10Jbond: utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 [17:38:46] (03CR) 10BryanDavis: [C: 03+1] "untested, but I did confirm that https://jbo.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=general&format=json returns the exp" [puppet] - 10https://gerrit.wikimedia.org/r/665115 (owner: 10Lucas Werkmeister (WMDE)) [17:39:21] 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission frqueue1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T274671 (10wiki_willy) a:03Jclark-ctr [17:41:22] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10wiki_willy) Thanks @Jgreen, do you have a decom task for payments1004 as well? >>! In T266481#6827394, @Jgreen wrote: > @wiki_willy @jclark-ctr we're done with frqueue1002 and can... [17:45:08] (03CR) 10David Caro: utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (owner: 10Jbond) [17:49:15] (03PS36) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:52:00] (03PS37) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:52:04] jouncebot: next [17:52:04] In 0 hour(s) and 7 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210218T1800) [17:53:21] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28131/console" [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:53:25] (03CR) 10Jbond: "thanks for the 
quick review see inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (owner: 10Jbond) [17:55:28] (03PS38) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:56:50] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28132/console" [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [17:57:30] (03PS2) 10Jbond: utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 [17:58:26] (03CR) 10Jbond: utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (owner: 10Jbond) [17:58:30] (03PS4) 10KartikMistry: Enable Section Translation on Bengali Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397) [17:58:38] (03PS39) 10Hnowlan: start using imposm as OSM sync tool [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [18:00:04] chrisalbon and accraze: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210218T1800). 
[18:01:44] (03CR) 10KartikMistry: Enable Section Translation on Bengali Wikipedia (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665051 (https://phabricator.wikimedia.org/T271397) (owner: 10KartikMistry) [18:04:34] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1129.eqiad.wmnet', 'an-worker1133.eqiad.wmnet', 'an-worker1134.eqiad.wmnet', 'an-worker11... [18:06:45] (03CR) 10jerkins-bot: [V: 04-1] utils/run_ci_localy.sh: Add a script to allow users to run CI from there laptops [software/spicerack] - 10https://gerrit.wikimedia.org/r/665134 (owner: 10Jbond) [18:10:20] (03CR) 10Krinkle: "omit jobrunner_tls.pp change as this file was deleted in ff35845d29824d9. https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet" [puppet] - 10https://gerrit.wikimedia.org/r/575392 (https://phabricator.wikimedia.org/T243096) (owner: 10Aaron Schulz) [18:10:45] (03PS4) 10Krinkle: Remove references to obsolete rpc/RunJobs.php endpoint [puppet] - 10https://gerrit.wikimedia.org/r/575392 (https://phabricator.wikimedia.org/T243096) (owner: 10Aaron Schulz) [18:10:52] (03PS5) 10Krinkle: mediawikik: Remove references to obsolete rpc/RunJobs.php endpoint [puppet] - 10https://gerrit.wikimedia.org/r/575392 (https://phabricator.wikimedia.org/T243096) (owner: 10Aaron Schulz) [18:10:58] (03PS6) 10Krinkle: mediawikik: Remove references to obsolete rpc/RunJobs.php endpoint [puppet] - 10https://gerrit.wikimedia.org/r/575392 (https://phabricator.wikimedia.org/T243096) (owner: 10Aaron Schulz) [18:11:03] (03PS7) 10Krinkle: mediawiki: Remove references to obsolete rpc/RunJobs.php endpoint [puppet] - 10https://gerrit.wikimedia.org/r/575392 (https://phabricator.wikimedia.org/T243096) (owner: 10Aaron Schulz) [18:17:57] (03PS1) 10Elukey: install_server: set custom recipe for new worker nodes [puppet] - 
10https://gerrit.wikimedia.org/r/665143 (https://phabricator.wikimedia.org/T260445) [18:18:30] 10SRE, 10Desktop Improvements, 10Traffic: CDN cache revalidation on several wikis for desktop improvements deployment pt 2 - https://phabricator.wikimedia.org/T274784 (10BBlack) This seems pretty straight-forward operationally, I think we can replicate the techniques used in T256750 for more wikis in general... [18:19:22] (03CR) 10Elukey: [C: 03+2] install_server: set custom recipe for new worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/665143 (https://phabricator.wikimedia.org/T260445) (owner: 10Elukey) [18:26:23] 10SRE, 10serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) [18:26:36] 10SRE, 10serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10Dzahn) renaming this ticket to cover both mwmaint* servers and not be just for eqiad alone [18:27:01] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` an-worker1129.eqiad.wmnet `... [18:27:07] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1129.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1129.... 
[18:27:09] (03PS1) 10Dzahn: install_server: switch mwmaint2001 to use buster installer [puppet] - 10https://gerrit.wikimedia.org/r/665144 (https://phabricator.wikimedia.org/T267607) [18:27:21] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` an-worker1129.eqiad.wmnet `... [18:29:36] 10SRE, 10Desktop Improvements, 10Traffic: CDN cache revalidation on several wikis for desktop improvements deployment pt 2 - https://phabricator.wikimedia.org/T274784 (10ovasileva) >>! In T274784#6841519, @BBlack wrote: > This seems pretty straight-forward operationally, I think we can replicate the techniqu... [18:33:19] (03PS1) 10Bartosz Dziewoński: Fix highlight when new topic is posted without a title [extensions/DiscussionTools] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/665166 (https://phabricator.wikimedia.org/T272666) [18:33:35] (03PS1) 10Bartosz Dziewoński: Make new topic autosave specific to page title [extensions/DiscussionTools] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/665167 (https://phabricator.wikimedia.org/T274949) [18:34:00] (03PS1) 10Bartosz Dziewoński: Fix highlight when new topic is posted without a title [extensions/DiscussionTools] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/665168 (https://phabricator.wikimedia.org/T272666) [18:34:10] (03PS1) 10Bartosz Dziewoński: Make new topic autosave specific to page title [extensions/DiscussionTools] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/665169 (https://phabricator.wikimedia.org/T274949) [18:34:33] (03PS7) 10Dave Pifke: profiler: wall-clock excimer instance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/597654 (https://phabricator.wikimedia.org/T253160) (owner: 10Ori.livneh) [18:36:50] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools' 
beta feature for newtopictool on arwiki, cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664819 (https://phabricator.wikimedia.org/T273145) [18:36:52] 10SRE, 10Desktop Improvements, 10Traffic, 10Performance-Team (Radar): CDN cache revalidation on several wikis for desktop improvements deployment pt 2 - https://phabricator.wikimedia.org/T274784 (10Krinkle) [18:36:58] (03PS2) 10Bartosz Dziewoński: Make DiscussionTools' replytool available for everyone on gomwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664940 (https://phabricator.wikimedia.org/T258554) [18:37:31] (03PS2) 10Dave Pifke: arclamp: add excimer-real pipeline [puppet] - 10https://gerrit.wikimedia.org/r/664591 (https://phabricator.wikimedia.org/T253160) [18:40:05] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.15 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [18:42:42] (03CR) 10Dave Pifke: arclamp: add excimer-real pipeline (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/664591 (https://phabricator.wikimedia.org/T253160) (owner: 10Dave Pifke) [18:42:58] (03CR) 10Dzahn: [C: 03+2] install_server: switch mwmaint2001 to use buster installer [puppet] - 10https://gerrit.wikimedia.org/r/665144 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [18:58:21] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatadata-users group for Carly Bogen - https://phabricator.wikimedia.org/T258413 (10CBogen) 05Open→03Resolved All good now, thank you! [19:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210218T1900) [19:00:04] MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:18] hi [19:00:40] i filled the whole window today D: [19:00:45] I can deploy today [19:01:02] (03CR) 10Krinkle: [C: 03+1] arclamp: add excimer-real pipeline [puppet] - 10https://gerrit.wikimedia.org/r/664591 (https://phabricator.wikimedia.org/T253160) (owner: 10Dave Pifke) [19:01:13] MatmaRex: usual question: do the config patches depend on backports? [19:01:30] (03CR) 10Urbanecm: [C: 03+2] Fix highlight when new topic is posted without a title [extensions/DiscussionTools] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/665166 (https://phabricator.wikimedia.org/T272666) (owner: 10Bartosz Dziewoński) [19:01:31] yes [19:01:32] (03CR) 10Urbanecm: [C: 03+2] Make new topic autosave specific to page title [extensions/DiscussionTools] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/665167 (https://phabricator.wikimedia.org/T274949) (owner: 10Bartosz Dziewoński) [19:01:34] (03CR) 10Urbanecm: [C: 03+2] Fix highlight when new topic is posted without a title [extensions/DiscussionTools] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/665168 (https://phabricator.wikimedia.org/T272666) (owner: 10Bartosz Dziewoński) [19:01:36] (03CR) 10Urbanecm: [C: 03+2] Make new topic autosave specific to page title [extensions/DiscussionTools] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/665169 (https://phabricator.wikimedia.org/T274949) (owner: 10Bartosz Dziewoński) [19:01:46] (not technically, but we don't want to enable the code without those fixes) [19:01:53] thanks MatmaRex. 
Let's wait for CI then :) [19:04:10] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mwmaint2001.codfw.wmnet with reason: OS upgrade [19:04:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwmaint2001.codfw.wmnet with reason: OS upgrade [19:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:33] (03Merged) 10jenkins-bot: Fix highlight when new topic is posted without a title [extensions/DiscussionTools] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/665166 (https://phabricator.wikimedia.org/T272666) (owner: 10Bartosz Dziewoński) [19:08:35] (03Merged) 10jenkins-bot: Make new topic autosave specific to page title [extensions/DiscussionTools] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/665167 (https://phabricator.wikimedia.org/T274949) (owner: 10Bartosz Dziewoński) [19:08:37] (03Merged) 10jenkins-bot: Fix highlight when new topic is posted without a title [extensions/DiscussionTools] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/665168 (https://phabricator.wikimedia.org/T272666) (owner: 10Bartosz Dziewoński) [19:08:39] (03Merged) 10jenkins-bot: Make new topic autosave specific to page title [extensions/DiscussionTools] (wmf/1.36.0-wmf.31) - 10https://gerrit.wikimedia.org/r/665169 (https://phabricator.wikimedia.org/T274949) (owner: 10Bartosz Dziewoński) [19:12:12] MatmaRex: backports pulled to mwdebug1001. 
Not sure if they're standalone-testable, or if you need the config patches too [19:12:48] they are, looking [19:13:16] okay, thanks [19:14:51] Urbanecm: looks good on testwiki [19:14:59] thanks, syncing [19:15:27] the cloud again [19:16:52] :/ [19:17:21] i hope they're not paying for that [19:17:35] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.30/extensions/DiscussionTools/: 9c6cdf5: 97acef6: DiscussionTools backports (T272666; T274949) (duration: 01m 26s) [19:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:42] T272666: Comment highlight appears in an unexpected position - https://phabricator.wikimedia.org/T272666 [19:17:43] T274949: Immortal comment follows me across pages - https://phabricator.wikimedia.org/T274949 [19:18:25] MatmaRex: afraid they are [19:18:36] this is the worst time for irccloud to have an outage [19:18:50] MatmaRex: wmf.30 synced, going with wmf.31 [19:18:52] hi Urbanecm_ ;) [19:19:21] !log urbanecm@deploy1001 sync-file aborted: 1cc29df DiscussionTools backports (T272666; T274949) (duration: 00m 00s) [19:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:17] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools' beta feature for newtopictool on arwiki, cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664819 (https://phabricator.wikimedia.org/T273145) (owner: 10Bartosz Dziewoński) [19:20:49] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.31/extensions/DiscussionTools/: 1cc29df: 6b88aff: DiscussionTools backports (T272666; T274949) (duration: 01m 08s) [19:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:03] done .31 too [19:21:28] (03Merged) 10jenkins-bot: Enable DiscussionTools' beta feature for newtopictool on arwiki, cswiki, huwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664819 (https://phabricator.wikimedia.org/T273145) (owner: 10Bartosz Dziewoński) [19:22:04] MatmaRex: pulled 
onto mwdebug1001, can you test? [19:22:39] yeah, looking [19:23:17] Urbanecm_: looks good [19:23:23] thanks, syncing [19:23:34] (03CR) 10Urbanecm: [C: 03+2] Make DiscussionTools' replytool available for everyone on gomwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664940 (https://phabricator.wikimedia.org/T258554) (owner: 10Bartosz Dziewoński) [19:24:49] 10SRE, 10GitLab, 10SRE-Access-Requests, 10Patch-For-Review: Access group for Gitlab contractors - https://phabricator.wikimedia.org/T274953 (10Ladsgroup) I don't know if this is considered but singing {L3} is a requirement for access to production and that contract it's explicitly mentioned that doing non-... [19:24:52] (03Merged) 10jenkins-bot: Make DiscussionTools' replytool available for everyone on gomwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/664940 (https://phabricator.wikimedia.org/T258554) (owner: 10Bartosz Dziewoński) [19:25:05] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: da7b8123ecb373c1de1634ae867fb2f5fbee89ad: Enable DiscussionTools beta feature for newtopictool on arwiki, cswiki, huwiki (T273145) (duration: 01m 12s) [19:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:11] T273145: Deploy config change to make New Discussion Tool available as beta feature - https://phabricator.wikimedia.org/T273145 [19:25:22] MatmaRex: the other patch is on mwdebug1001 now, please test [19:25:49] !log mwmaint2001 - deleting 'home-terbium' from all home directories (yes, it's in Bacula if you really used that, hope you didn't, it's been years since terbium) [19:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:01] Urbanecm_: thanks, also looks good [19:26:04] syncinjg [19:27:36] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: f33f9f71b13d9b9276df88ef6384ec6028ee2e1d: Make DiscussionTools replytool available for everyone on gomwiktionary (T258554) 
(duration: 01m 05s) [19:27:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:42] T258554: Enable discussion tools on discussion pages on the Konkani Wiktionary - https://phabricator.wikimedia.org/T258554 [19:27:46] MatmaRex: and should be done [19:27:49] anything else? [19:27:52] 10SRE, 10LDAP-Access-Requests: LDAP access to the nda group for Uzoma Ozurumba - https://phabricator.wikimedia.org/T275139 (10Aklapper) > I am tagging you because I am required to do so. Thank you. Hmm, that surprises me. Could you elaborate why you think that you are required to do so? (Link to docs?) Thanks. [19:28:01] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1129.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1129.... [19:28:18] MatmaRex: hi, is it possible to use the new discussion tool somehow before it will be enabled on that wiki? like how the reply tool resourceloader module could just be loaded with user js [19:28:45] Urbanecm_: should be all, thank you! [19:28:50] no problem [19:29:28] Majavah: yes, in the same manner (?dtenable=1 URL parameter, or that "secret" cookie that is set by the user scripts) [19:30:02] which secret cookie? [19:30:05] * Urbanecm_ curious [19:30:25] 'discussiontools-tempenable' [19:30:32] thanks [19:30:50] (i had to look up the name) [19:31:16] thanks! [19:37:43] 10SRE, 10SRE-Access-Requests: Requesting access to gerrit-root and gerrit-admin for dancy - https://phabricator.wikimedia.org/T275050 (10thcipriani) >>! In T275050#6840356, @MoritzMuehlenhoff wrote: > Needs approval by @thcipriani Approved! 
[19:41:47] (03CR) 10Dzahn: [C: 03+2] arclamp: add excimer-real pipeline [puppet] - 10https://gerrit.wikimedia.org/r/664591 (https://phabricator.wikimedia.org/T253160) (owner: 10Dave Pifke) [19:45:41] PROBLEM - LVS labweb eqiad port 80/tcp - lvs for labweb services: horizon- striker- wikitech IPv4 on labweb.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:47:21] RECOVERY - LVS labweb eqiad port 80/tcp - lvs for labweb services: horizon- striker- wikitech IPv4 on labweb.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 453 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:00:04] marxarelli and twentyafterfour: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - American Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210218T2000). [20:02:44] (03CR) 10Ottomata: Alert if kafka max replica lag is steadily increasing (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662005 (https://phabricator.wikimedia.org/T273702) (owner: 10Ottomata) [20:02:59] (03PS2) 10Ottomata: Alert if kafka max replica lag is steadily increasing [puppet] - 10https://gerrit.wikimedia.org/r/662005 (https://phabricator.wikimedia.org/T273702) [20:05:47] 10SRE, 10Traffic: validate or revert the new large_objects_cutoff & nuke_limit settings on upload@eqsin - https://phabricator.wikimedia.org/T275028 (10CDanis) 05Open→03Resolved a:03CDanis Caches have filled, and there's no more fetch failures due to "LRU limited": https://grafana.wikimedia.org/d/00000035... [20:05:50] 10SRE, 10Traffic: cp_upload @ eqsin cascading failures, February 2021 - https://phabricator.wikimedia.org/T274888 (10CDanis) [20:06:24] oh is the train on for now? 
might watch it finally roll [20:06:42] (03PS1) 10Dduvall: all wikis to 1.36.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665162 [20:06:44] (03CR) 10Dduvall: [C: 03+2] all wikis to 1.36.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665162 (owner: 10Dduvall) [20:06:53] I'm here, not sure about marxarelli [20:07:06] here [20:07:49] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.31 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665162 (owner: 10Dduvall) [20:07:49] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:08:57] (03CR) 10Ottomata: Add alerts on eventgate_validation_errors_total rate for each eventgate service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/661999 (https://phabricator.wikimedia.org/T257237) (owner: 10Ottomata) [20:08:59] (03PS4) 10Ottomata: Add alerts on eventgate_validation_errors_total rate for each eventgate service [puppet] - 10https://gerrit.wikimedia.org/r/661999 (https://phabricator.wikimedia.org/T257237) [20:09:39] oooh finally a train done on schedule after the last few weeks :P [20:09:59] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.31 [20:10:03] we'll see :) [20:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:29] (03CR) 10Ottomata: [C: 03+2] Alert if kafka max replica lag is steadily increasing [puppet] - 10https://gerrit.wikimedia.org/r/662005 (https://phabricator.wikimedia.org/T273702) (owner: 10Ottomata) [20:12:31] twentyafterfour: looks ok so far [20:13:09] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:29] (03PS1) 10Andrew Bogott: Horizon logging: try to detect multiple-line log messages [puppet] - 10https://gerrit.wikimedia.org/r/665164 (https://phabricator.wikimedia.org/T268175) [20:17:39] (03PS5) 10Ottomata: Add alerts on eventgate_validation_errors_total rate for each eventgate service [puppet] - 10https://gerrit.wikimedia.org/r/661999 (https://phabricator.wikimedia.org/T257237) [20:18:53] (03PS2) 10Andrew Bogott: Horizon logging: try to detect multiple-line log messages [puppet] - 10https://gerrit.wikimedia.org/r/665164 (https://phabricator.wikimedia.org/T268175) [20:21:40] hmm, seeing a handful of "Invalid user parameter" errors coming from extensions/Echo/includes/model/Event.php [20:22:15] (03CR) 10Andrew Bogott: [C: 03+2] Horizon logging: try to detect multiple-line log messages [puppet] - 10https://gerrit.wikimedia.org/r/665164 (https://phabricator.wikimedia.org/T268175) (owner: 10Andrew Bogott) [20:22:15] 10SRE, 10Gerrit-Privilege-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (10bd808) I have blocked User:Pablo_Grass_(WMDE) on wikitech which also revokes that Developer account's access to Gerrit and Phabricator. https://wikitech.wikimedia.org/w/index.php?title=Specia... [20:22:19] 10SRE, 10Gerrit-Privilege-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (10bd808) [20:23:00] yeah all the same url [20:23:07] 10SRE, 10Gerrit-Privilege-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (10bd808) @WMDE-leszek anything else to do here? [20:26:22] (03CR) 10Dzahn: "thank you for amending and merging it. 
appreciate it:)" [puppet] - 10https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [20:27:47] (03PS6) 10Ottomata: Add alerts on eventgate_validation_errors_total rate for each eventgate service [puppet] - 10https://gerrit.wikimedia.org/r/661999 (https://phabricator.wikimedia.org/T257237) [20:28:00] (03CR) 10Tjones: [C: 03+2] "Do we need to manually submit this or do something else to trigger Jenkins?" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/662923 (https://phabricator.wikimedia.org/T274203) (owner: 10DCausse) [20:32:10] (03CR) 10Ottomata: [C: 03+2] Add alerts on eventgate_validation_errors_total rate for each eventgate service [puppet] - 10https://gerrit.wikimedia.org/r/661999 (https://phabricator.wikimedia.org/T257237) (owner: 10Ottomata) [20:32:20] Error 1062: Duplicate entry '1234003' for key 'ar_revid_uniq hm sounds like one for the dbas [20:32:47] two of those [20:36:02] apergos: where are you seeing that? [20:36:13] logstash [20:37:40] _id is avDWtncBoqQO90dB7RXR if that helps [20:38:07] it's in the last 10 errors or so [20:41:19] now we're back to 4 more of those echo event ones [20:43:44] i have no idea why that duplicate entry error isn't showing on the mediawiki-errors dashboard [20:43:53] but i can query it by _id [20:44:09] * marxarelli hates kibana with so much hate [20:44:43] PROBLEM - Kafka Broker Replica Max Lag is increasing on kafka-jumbo1008 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1008 [20:45:39] me ^ new alert [20:45:42] should be working [20:45:43] checking [20:46:42] why null why null... 
[20:46:45] the query is right [20:48:07] I *am* on the mdiawiki-errors dashboard [20:49:19] PROBLEM - Kafka Broker Replica Max Lag is increasing on kafka-test1010 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1010 [20:52:08] same user param error from fr wp now [20:53:08] maybe it's something legit wrong [20:53:22] PROBLEM - Kafka Broker Replica Max Lag is increasing on kafka-jumbo1006 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [20:54:58] 10SRE, 10Gerrit-Privilege-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (10WMDE-leszek) 05Open→03Resolved a:03WMDE-leszek I think we're done, thanks! @bd808 For the future offboarding of the WMDE staff - is there an official/preferred process for blocking pas... 
[20:57:26] PROBLEM - Kafka Broker Replica Max Lag is increasing on logstash2001 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=logstash2001 [20:57:34] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 8.2 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [21:00:14] PROBLEM - Kafka Broker Replica Max Lag is increasing on kafka-jumbo1007 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1007 [21:00:14] PROBLEM - Kafka Broker Replica Max Lag is increasing on kafka-main2003 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [21:00:14] PROBLEM - Kafka Broker Replica Max Lag is increasing on logstash1011 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1011 [21:00:37] 10SRE, 10Gerrit-Privilege-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (10bd808) >>! In T268946#6841942, @WMDE-leszek wrote: > I think we're done, thanks! > > @bd808 For the future offboarding of the WMDE staff - is there an official/preferred process for blocking... 
[21:07:04] 10SRE, 10Gerrit-Privilege-Requests: Offboard Pablo-WMDE from WMF systems - https://phabricator.wikimedia.org/T268946 (10Dzahn) >>! In T268946#6841945, @bd808 wrote: >>>we should not have "staff" Developer accounts at WMDE or WMF. strong agree to that. When the "(WMF)" (and based on that later WMDE) wiki accou... [21:07:56] PROBLEM - Kafka Broker Replica Max Lag is increasing on kafka-test1009 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1009 [21:08:52] PROBLEM - Kafka Broker Replica Max Lag is increasing on kafka-jumbo1002 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [21:08:52] PROBLEM - Kafka Broker Replica Max Lag is increasing on kafka-main1001 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka-main1001 [21:09:09] fix coming in ^ [21:09:25] (03PS1) 10Ottomata: Fix Kafka Broker Replica Max Lag is increasing alert [puppet] - 10https://gerrit.wikimedia.org/r/665191 (https://phabricator.wikimedia.org/T273702) [21:10:06] PROBLEM - Kafka Broker Replica Max Lag is increasing on kafka-main2001 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [21:10:36] (03PS1) 10Jdlrobson: Restore logos on Vector (classic version) and use cloud icon for labs [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/665192 (https://phabricator.wikimedia.org/T274210) [21:11:18] PROBLEM - Kafka Broker Replica Max Lag is increasing on logstash2003 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=logstash2003 [21:11:42] PROBLEM - Kafka Broker Replica Max Lag is increasing on kafka-main1002 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka-main1002 [21:11:55] (03CR) 10Ottomata: [C: 03+2] Fix Kafka Broker Replica Max Lag is increasing alert [puppet] - 10https://gerrit.wikimedia.org/r/665191 (https://phabricator.wikimedia.org/T273702) (owner: 10Ottomata) [21:12:05] (03PS1) 10RobH: fixing macs for an-workers [puppet] - 10https://gerrit.wikimedia.org/r/665193 (https://phabricator.wikimedia.org/T260445) [21:12:24] PROBLEM - Kafka Broker Replica Max Lag is increasing on kafka-jumbo1009 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009 [21:13:47] (03CR) 10RobH: [C: 03+2] fixing macs for an-workers [puppet] - 10https://gerrit.wikimedia.org/r/665193 (https://phabricator.wikimedia.org/T260445) (owner: 10RobH) [21:13:55] (03PS2) 10RobH: fixing macs for an-workers [puppet] - 10https://gerrit.wikimedia.org/r/665193 (https://phabricator.wikimedia.org/T260445) [21:15:50] PROBLEM - Kafka Broker Replica Max Lag is increasing on kafka-jumbo1004 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [21:16:32] RECOVERY - Kafka Broker Replica Max Lag is increasing on kafka-jumbo1002 is OK: (C)0.1 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [21:16:32] RECOVERY - Kafka Broker Replica Max Lag is increasing on kafka-main1001 is OK: (C)0.1 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka-main1001 [21:16:56] RECOVERY - Kafka Broker Replica Max Lag is increasing on kafka-test1010 is OK: (C)0.1 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1010 [21:17:04] RECOVERY - Kafka Broker Replica Max Lag is increasing on kafka-jumbo1009 is OK: (C)0.1 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1009 [21:17:16] RECOVERY - Kafka Broker Replica Max Lag is increasing on kafka-jumbo1004 is OK: (C)0.1 gt (W)0 gt -1.794e+04 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [21:17:16] RECOVERY - Kafka Broker Replica Max Lag is increasing on kafka-test1009 is 
OK: (C)0.1 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=test-eqiad&var-kafka_broker=kafka-test1009 [21:17:22] there wew go :) [21:18:27] apergos: ok, i'm seeing a fair number of things showing up on the mediawiki-new-errors dashboard [21:18:30] i'm going to roll back [21:18:31] marxarelli: since thre are 22 of those 'invalid user parameter' errors since deployment ad only like 2 before that in the last week, do we think it's the train? [21:18:44] ok well that answers my question :-) [21:18:50] bummer... [21:19:40] (03CR) 10Ppchelko: [C: 03+1] mediawiki: Remove references to obsolete rpc/RunJobs.php endpoint [puppet] - 10https://gerrit.wikimedia.org/r/575392 (https://phabricator.wikimedia.org/T243096) (owner: 10Aaron Schulz) [21:19:43] apergos: not totally sure about the invalid user param error [21:19:47] but i filed https://phabricator.wikimedia.org/T275161 [21:19:56] it needs some triage and assignment though [21:20:01] rght [21:20:58] RECOVERY - Kafka Broker Replica Max Lag is increasing on kafka-jumbo1008 is OK: (C)0.1 gt (W)0 gt -1.617e+04 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1008 [21:21:18] (03PS1) 10Dduvall: Revert "all wikis to 1.36.0-wmf.31" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665197 [21:21:21] (03CR) 10Dduvall: [C: 03+2] Revert "all wikis to 1.36.0-wmf.31" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665197 (owner: 10Dduvall) [21:22:14] (03Merged) 10jenkins-bot: Revert "all wikis to 1.36.0-wmf.31" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665197 (owner: 10Dduvall) [21:22:48] RECOVERY - Kafka Broker Replica Max Lag is increasing on logstash2003 is OK: (C)0.1 gt (W)0 gt 0 
https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=logstash2003 [21:25:08] RECOVERY - Kafka Broker Replica Max Lag is increasing on kafka-main2003 is OK: (C)0.1 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [21:25:32] 10SRE, 10SRE-Access-Requests: Add Angie Muigai to analytics-privatedata-users - https://phabricator.wikimedia.org/T275140 (10AMuigai) @nshahquinn-wmf I have signed. [21:26:08] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: Revert "all wikis to 1.36.0-wmf.31" [21:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:10] RECOVERY - Kafka Broker Replica Max Lag is increasing on logstash1011 is OK: (C)0.1 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1011 [21:29:37] !log 1.36.0-wmf.31 rolled back due to T275161 and new logspam (T271345) [21:29:38] PROBLEM - Kafka Broker Replica Max Lag is increasing on logstash1010 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1010 [21:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:44] T275161: InvalidArgumentException: Invalid user parameter in EchoEvent::create - https://phabricator.wikimedia.org/T275161 [21:29:44] T271345: 1.36.0-wmf.31 deployment blockers - 
https://phabricator.wikimedia.org/T271345 [21:33:46] RECOVERY - Kafka Broker Replica Max Lag is increasing on kafka-main2001 is OK: (C)0.1 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [21:34:52] PROBLEM - Kafka Broker Replica Max Lag is increasing on logstash2002 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=logstash2002 [21:35:27] null...? [21:37:54] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1133.eqiad.wmnet',... 
[21:38:14] RECOVERY - Kafka Broker Replica Max Lag is increasing on kafka-jumbo1006 is OK: (C)0.1 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1006 [21:40:24] RECOVERY - Kafka Broker Replica Max Lag is increasing on logstash2001 is OK: (C)0.1 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=logstash2001 [21:42:38] RECOVERY - Kafka Broker Replica Max Lag is increasing on kafka-main1002 is OK: (C)0.1 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=main-eqiad&var-kafka_broker=kafka-main1002 [21:47:00] RECOVERY - Kafka Broker Replica Max Lag is increasing on kafka-jumbo1007 is OK: (C)0.1 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1007 [21:48:10] PROBLEM - Kafka Broker Replica Max Lag is increasing on logstash1012 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1012 [21:51:31] RECOVERY - Kafka Broker Replica Max Lag is increasing on logstash1010 is OK: (C)0.1 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1010 [21:51:45] RECOVERY - Kafka Broker Replica Max Lag is increasing on logstash2002 is OK: (C)0.1 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=logstash2002 [22:09:34] (03PS2) 10Jdlrobson: Restore logos on Vector (classic version) and use cloud icon for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/665192 (https://phabricator.wikimedia.org/T274210) [22:11:05] RECOVERY - Kafka Broker Replica Max Lag is increasing on logstash1012 is OK: (C)0.1 gt (W)0 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1012 [22:25:56] (03CR) 10Dduvall: [C: 03+2] Add ServiceConfig->getDatacenters() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634550 (owner: 10Ahmon Dancy) [22:26:59] (03CR) 10jerkins-bot: [V: 04-1] Add ServiceConfig->getDatacenters() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634550 (owner: 10Ahmon Dancy) [22:28:38] (03PS3) 10Ahmon Dancy: Add ServiceConfig->getDatacenters() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634550 [22:28:40] (03PS3) 10Ahmon Dancy: Define $wmfDatacenters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634551 [22:28:42] (03PS3) 10Ahmon Dancy: Use new $wmfDatacenters global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634552 [22:30:21] !log mwmaint2001 - tar-gzipping a lot of old user home data I keep finding, partially museum worthy from several maintenance hosts ago, like places like /root/home-mwmaint1001/username/home-terbium/iron/ :p [22:30:27] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:58] (03CR) 10jerkins-bot: [V: 04-1] Add ServiceConfig->getDatacenters() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634550 (owner: 10Ahmon Dancy) [22:31:21] (03CR) 10jerkins-bot: [V: 04-1] Define $wmfDatacenters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634551 (owner: 10Ahmon Dancy) [22:32:16] (03CR) 10jerkins-bot: [V: 04-1] Use new $wmfDatacenters global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634552 (owner: 10Ahmon Dancy) [22:32:48] (03PS4) 10Ahmon Dancy: Add ServiceConfig->getDatacenters() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634550 [22:32:50] (03PS4) 10Ahmon Dancy: Define $wmfDatacenters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634551 [22:32:52] (03PS4) 10Ahmon Dancy: Use new $wmfDatacenters global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634552 [22:35:34] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1133.eqiad.wmnet',... [22:56:46] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1133.eqiad.wmnet', 'an-worker1134.eqiad.wmnet', 'an-worker1140.eqia... 
[23:04:27] !log mwmaint1002 - rsyncing data from mwmaint2001 [23:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:35] (03CR) 10Dduvall: [C: 03+2] Add ServiceConfig->getDatacenters() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634550 (owner: 10Ahmon Dancy) [23:06:33] (03Merged) 10jenkins-bot: Add ServiceConfig->getDatacenters() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634550 (owner: 10Ahmon Dancy) [23:10:28] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mwmaint2001.codfw.wmnet with reason: OS upgrade [23:10:29] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mwmaint2001.codfw.wmnet with reason: OS upgrade [23:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:16] !log mwmaint2001 - will be rebooted for OS upgrade - T267607 [23:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:22] T267607: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 [23:12:51] (oh no, it's a HP) [23:15:44] !log dancy@deploy1001 Synchronized src/ServiceConfig.php: (no justification provided) (duration: 03m 21s) [23:15:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:40] (03CR) 10Dduvall: [C: 03+2] Define $wmfDatacenters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634551 (owner: 10Ahmon Dancy) [23:16:40] 10SRE, 10serviceops: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mwmaint2001.codfw.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202102182316_dzahn_17848_mwm... 
[23:17:26] (03CR) 10Dduvall: [C: 03+2] Use new $wmfDatacenters global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634552 (owner: 10Ahmon Dancy) [23:17:43] (03Merged) 10jenkins-bot: Define $wmfDatacenters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634551 (owner: 10Ahmon Dancy) [23:21:50] (03Merged) 10jenkins-bot: Use new $wmfDatacenters global [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634552 (owner: 10Ahmon Dancy) [23:22:01] !log dancy@deploy1001 Synchronized wmf-config/CommonSettings.php: Syncing https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/634551 (duration: 01m 08s) [23:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:26] (03CR) 10Dzahn: [C: 04-1] "can't reassign variable https://puppet-compiler.wmflabs.org/compiler1002/28129/phab1001.eqiad.wmnet/change.phab1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [23:23:06] mutante: how long before known_hosts is updated following the mwmaint2001 reimage? [23:23:16] we're syncing out a small config change [23:25:18] it should've been depooled [23:26:14] marxarelli: it is in the middle of installing the base system right now [23:26:20] legoktm: it's not in conftool [23:26:24] oh [23:26:38] !log dancy@deploy1001 Synchronized wmf-config/: Syncing https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/634552 (duration: 01m 07s) [23:26:39] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10RobH) John, In reviewing the installations from the relocation of an-worker11(29|33|34|39|40|41), I ran into a couple issues: an-wo... 
[23:26:40] if it's in scap then that is "manual" entry in the list [23:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:26:46] and I should have thought of that if it is [23:26:54] gotcha [23:27:36] hieradata/common/scap/dsh.yaml: - mwmaint2001.codfw.wmnet [23:27:48] jouncebot: now [23:27:48] No deployments scheduled for the next 0 hour(s) and 32 minute(s) [23:28:40] marxarelli: I can remove it now if that still helps [23:28:46] mutante: looks like we're ok. sorry for the worry [23:28:51] sorry about that, but also didn't expect deploys [23:29:06] yeah, it was our fault for sneaking it in :) [23:29:06] should have checked for that list though, it's not like appservers which you depool in confctl [23:31:05] (03PS1) 10Dzahn: scap: remove mwmaint2001 from "dsh" groups [puppet] - 10https://gerrit.wikimedia.org/r/665225 (https://phabricator.wikimedia.org/T267607) [23:31:17] removing it, the first puppet run will be slower than the entire OS install [23:32:17] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:32:19] (03CR) 10Dzahn: [C: 03+2] scap: remove mwmaint2001 from "dsh" groups [puppet] - 10https://gerrit.wikimedia.org/r/665225 (https://phabricator.wikimedia.org/T267607) (owner: 10Dzahn) [23:33:30] running puppet on deploy1001 to deploy change in deployment groups [23:34:21] marxarelli: puppet removed it from /etc/dsh/group/mediawiki-installation on deploy1001, you should be unaffected now [23:35:37] Thanks for the help mutante [23:36:26] no problem at all [23:38:13] to the original question how long it takes to update the known hosts.. i think also one puppet run? 
but it should not matter now that it is out of the group [23:46:04] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mwmaint2001.codfw.wmnet with reason: REIMAGE [23:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:48:10] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwmaint2001.codfw.wmnet with reason: REIMAGE [23:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:08] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1