[00:02:57] 10Operations, 10SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (10RobH) [00:05:32] (03PS1) 10Papaul: DNS: Add mgmt and prodcution DNS for db209[7-9] db210[0-2] [dns] - 10https://gerrit.wikimedia.org/r/502651 [00:07:02] * Krinkle is staging on mwdebug1001 [00:08:14] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Papaul) [00:08:40] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Papaul) [00:20:42] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.25/resources/src/startup/: I3b9f1a13379a / Ie9db60e417cca (duration: 01m 01s) [00:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:00] * Krinkle releases deploy handle [00:37:33] !log last scap sync-file failed to mwdebug2002.codfw and mwdebug2001.codfw due to insufficient disk space [00:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:38:06] twentyafterfour: failed again. Surprised to see that the scap comment exits as success. [00:38:10] command* [00:38:26] is that because the debug ones are allowed to fail? [00:38:31] or because all are? [00:39:54] Krinkle: I'm not sure actually.. scap error handling is a bit weird because it's running a bunch of parallel jobs and I think it might not propogate the error back to the top? I'm actually not sure in this case really. [00:40:03] * twentyafterfour looks at the code in scap [00:40:42] there is a threshold of errors that are allowed, that's probably what happened here [00:40:59] it's not too uncommon for a single mediawiki host to fail for random reasons [00:42:08] Well, in theory that would mean immediate depool [00:42:16] should* [00:42:50] which most deployers aren't able to do afaik, and neither scap. 
You'd want some kind of way to allow that. And with an upper limit on how many can be depooled before we won't allow new deploys. [01:04:17] PROBLEM - Disk space on elastic1025 is CRITICAL: DISK CRITICAL - free space: /srv 27786 MB (5% inode=99%) [01:21:48] 10Operations, 10DBA, 10MediaWiki-Maintenance-scripts, 10MediaWiki-Page-deletion: Create maintenance script to delete pages with empty histories - https://phabricator.wikimedia.org/T220570 (10GeoffreyT2000) [01:25:49] 10Operations, 10DBA, 10MediaWiki-Maintenance-scripts, 10MediaWiki-Page-deletion: Create maintenance script to delete pages with empty histories - https://phabricator.wikimedia.org/T220570 (10GeoffreyT2000) [01:28:29] RECOVERY - Disk space on elastic1025 is OK: DISK OK [01:34:05] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [02:32:57] PROBLEM - puppet last run on db1106 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:47:01] (03PS1) 10Gergő Tisza: Revoke editmyuserjsredirect from all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502669 (https://phabricator.wikimedia.org/T207750) [03:04:33] RECOVERY - puppet last run on db1106 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [03:39:12] (03CR) 10Jforrester: "Do you think this should go alongside the other user JS-related right fiddles (in that case, for interface-admin) on line ~3771?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502669 (https://phabricator.wikimedia.org/T207750) (owner: 10Gergő Tisza) [04:16:58] (03CR) 10Gergő Tisza: "> Do you think this should go alongside the other user JS-related right fiddles (in that case, for interface-admin) on line ~3771?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/502669 (https://phabricator.wikimedia.org/T207750) (owner: 10Gergő Tisza) [04:17:49] (03PS2) 10Gergő Tisza: Revoke editmyuserjsredirect from all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502669 (https://phabricator.wikimedia.org/T207750) [04:44:45] 10Operations, 10MediaWiki-Database, 10MediaWiki-Maintenance-scripts, 10MediaWiki-Page-deletion, 10Wikimedia-Site-requests: Create maintenance script to delete pages with empty histories - https://phabricator.wikimedia.org/T220570 (10Marostegui) [04:49:36] (03PS1) 10Marostegui: db-eqiad.php: Depool all x1 slaves [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502673 (https://phabricator.wikimedia.org/T217453) [04:54:58] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Marostegui) [04:55:04] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Marostegui) [04:55:07] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10Marostegui) [05:02:21] 10Operations, 10Release-Engineering-Team (Backlog): mwdebug2001 and mwdebug2002 "/" almost full - https://phabricator.wikimedia.org/T219989 (10Marostegui) p:05Normal→03Unbreak! 
Both mwdebug2001 and 2002 are now full: T218783#5099940 [05:08:01] <_joe_> !log removing hhvm cache on mwdebug2002 [05:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:08] <_joe_> !log same on mwdebug2001 [05:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:25] <_joe_> marostegui: I think I'll remove old mw versions by hand on those servers [05:15:06] <_joe_> or you won't be able to work [05:17:37] thanks :) [05:19:07] <_joe_> actually I'm not sure what can be deleted and what can't [05:19:28] <_joe_> and I see an incomplete copy of wmf.25 on those servers [05:19:47] PROBLEM - puppet last run on an-worker1090 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:20:09] <_joe_> yes [05:23:07] 10Operations, 10Release-Engineering-Team (Backlog): mwdebug2001 and mwdebug2002 "/" almost full - https://phabricator.wikimedia.org/T219989 (10greg) Since these are vms can a quick fix be to expand their disk? Since they're the oddballs of the mw fleet in that way. I know Tyler wants to fix the scap clean issu... [05:23:30] 10Operations, 10Release-Engineering-Team (Backlog): mwdebug2001 and mwdebug2002 "/" almost full - https://phabricator.wikimedia.org/T219989 (10Joe) Also the last mediawiki train didn't deploy correctly to those servers and will never be able to unless we remove old versions. [05:24:00] I'm super tired, I'll stop talking (I type from bed) [05:25:03] <_joe_> greg-g: I am exploring options, and frankly wasting VM disk space is not my main option. I'm going to prune out the old branches that are not on deploy1001 anymore [05:25:08] <_joe_> manually I mean [05:36:09] RECOVERY - Check systemd state on notebook1003 is OK: OK - running: The system is fully operational [05:36:19] <_joe_> aand... 
that's not properly doable either [05:36:58] <_joe_> marostegui: I don't see an alternative to: 1 - remove the codfw mwdebug servers from rotation, enlarge the disk, reimage them [05:38:53] 10Operations, 10Release-Engineering-Team (Backlog): mwdebug2001 and mwdebug2002 "/" almost full - https://phabricator.wikimedia.org/T219989 (10Joe) >>! In T219989#5099962, @greg wrote: > Since these are vms can a quick fix be to expand their disk? Since they're the oddballs of the mw fleet in that way. I know... [05:44:58] (03PS1) 10Elukey: admin: add jupyterhub restart capabilities to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/502677 [05:46:11] RECOVERY - puppet last run on an-worker1090 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [05:52:11] <_joe_> !log setting both mwdebug200{1,2} to pooled = inactive to remove them from scap dsh list and allow deployments, T219989 [05:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:15] T219989: mwdebug2001 and mwdebug2002 "/" almost full - https://phabricator.wikimedia.org/T219989 [05:52:25] <_joe_> marostegui: you should be able to deploy shortly [05:52:56] 10Operations, 10Release-Engineering-Team (Backlog): mwdebug2001 and mwdebug2002 "/" almost full - https://phabricator.wikimedia.org/T219989 (10Joe) p:05Unbreak!→03High [05:55:19] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [05:57:05] <_joe_> WTF? 
[05:57:13] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:57:23] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:57:24] <_joe_> we're having 5xxs en masse since yesterday after a deploy [05:58:25] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [06:04:45] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [06:16:57] PROBLEM - mediawiki-installation DSH group on mwdebug2001 is CRITICAL: Host mwdebug2001 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist [06:17:21] <_joe_> that ^^ is known [06:17:24] <_joe_> and expected [06:20:23] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:20:31] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:23:45] 10Operations, 10Mobile-Content-Service, 10RESTBase, 10Reading-Infrastructure-Team-Backlog, 10Services: 
Continuous errors on several REST API resources (probably related to MCS release) - https://phabricator.wikimedia.org/T220574 (10Joe) [06:23:55] 10Operations, 10Mobile-Content-Service, 10RESTBase, 10Reading-Infrastructure-Team-Backlog, 10Services: Continuous errors on several REST API resources (probably related to MCS release) - https://phabricator.wikimedia.org/T220574 (10Joe) p:05Triage→03Unbreak! [06:29:19] 10Operations, 10Mobile-Content-Service, 10RESTBase, 10Reading-Infrastructure-Team-Backlog, 10Services: Continuous errors on several REST API resources (probably related to MCS release) - https://phabricator.wikimedia.org/T220574 (10Joe) The errors seem to come from restbase: https://logstash.wikimedia.or... [06:32:25] PROBLEM - puppet last run on cloudvirt1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/run-puppet-agent] [06:33:56] 10Operations, 10Mobile-Content-Service, 10RESTBase, 10Reading-Infrastructure-Team-Backlog, 10Services: Continuous errors on several REST API resources (probably related to MCS release) - https://phabricator.wikimedia.org/T220574 (10Joe) I've looked around a bit, and while the number of errors is in gener... [06:44:51] PROBLEM - mediawiki-installation DSH group on mwdebug2002 is CRITICAL: Host mwdebug2002 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist [06:47:40] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10jcrespo) Note that despite this host being separate than T219463 for setup because it will have a different puppet role, hardware wi... 
[06:51:16] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Marostegui) It is set as spare for now so the installation can go through: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/49... [06:51:27] _joe_: i'm technically on vacation but just saw T220574; patch incoming as soon as local tests pass. [06:51:28] T220574: Continuous errors on several REST API resources (probably related to MCS release) - https://phabricator.wikimedia.org/T220574 [06:52:01] <_joe_> mdholloway: oh <3 [06:52:13] <_joe_> and, you really shouldn't :) [06:52:19] <_joe_> but thanks anyways [06:52:31] no worries, it's an easy fix, just a revert! :) [06:53:42] <_joe_> still, you're on vacation :P [06:56:21] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool all x1 slaves [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502673 (https://phabricator.wikimedia.org/T217453) (owner: 10Marostegui) [06:57:29] (03Merged) 10jenkins-bot: db-eqiad.php: Depool all x1 slaves [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502673 (https://phabricator.wikimedia.org/T217453) (owner: 10Marostegui) [06:58:47] RECOVERY - puppet last run on cloudvirt1017 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [06:59:32] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool x1 slaves T217453 (duration: 01m 13s) [06:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:36] T217453: Remove etp_user from echo_target_page in production - https://phabricator.wikimedia.org/T217453 [06:59:58] !log Deploy schema change on x1 master, with replication, lag will happen on x1 T217453 [07:00:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:06] (03CR) 10jenkins-bot: db-eqiad.php: Depool all x1 slaves [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502673 
(https://phabricator.wikimedia.org/T217453) (owner: 10Marostegui) [07:03:15] 10Operations, 10Mobile-Content-Service, 10RESTBase, 10Services, and 2 others: Continuous errors on several REST API resources (probably related to MCS release) - https://phabricator.wikimedia.org/T220574 (10Mholloway) a:03Mholloway [07:08:45] !log Upgrading Thumbor servers to python-thumbor-wikimedia to 2.4-1+deb9u1 [07:08:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:36] !log Rolling restart thumbor service [07:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:22] !log depooling maps200[34] to increase cassandra replication factor - T198622 [07:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:25] T198622: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 [07:18:40] !log mholloway-shell@deploy1001 Started deploy [mobileapps/deploy@efd5bd5]: Revert "Bifurcate imageinfo queries to improve performance" (T220574) [07:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:18:44] T220574: Continuous errors on several REST API resources (probably related to MCS release) - https://phabricator.wikimedia.org/T220574 [07:22:45] !log mholloway-shell@deploy1001 Finished deploy [mobileapps/deploy@efd5bd5]: Revert "Bifurcate imageinfo queries to improve performance" (T220574) (duration: 04m 05s) [07:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:35] <_joe_> mdholloway: that page now renders! [07:23:37] <_joe_> <3 [07:23:43] \o/ [07:23:49] <_joe_> https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5&from=now-1h&to=now [07:23:54] thanks a lot mdholloway! 
[07:23:54] <_joe_> thanks a lot [07:24:11] you're welcome :) [07:24:15] <_joe_> mdholloway: I'm sorry, you'll get wikilove next time [07:24:21] <_joe_> :D [07:24:48] <_joe_> also, I hope you're not in your usual TZ [07:25:50] _joe_: nope, i'm in asia visiting in-laws! [07:26:15] just happened to open the laptop at the right time [07:26:25] it's just after noon here [07:26:48] Operations, Mobile-Content-Service, RESTBase, Services, and 2 others: Continuous errors on several REST API resources (probably related to MCS release) - https://phabricator.wikimedia.org/T220574 (Joe) Open→Resolved https://grafana.wikimedia.org/d/myRmf1Pik/varnish-aggregate-client-status... [07:26:57] <_joe_> ok I feel slightly less guilty then :D [07:27:28] huh. fixed just like that [07:27:43] goes to show I really don't understand restbase innards at all [07:27:56] thanks for the fix indeed [07:28:06] also, mea culpa, it was my patch that was responsible for the errors in the first place... [07:28:29] <_joe_> I guess we all owe you a t-shirt then [07:28:32] :-D [07:28:59] <_joe_> welcome to the club (you broke it, then you fixed it) [07:29:16] i knew it would happen one day :P [07:29:23] <_joe_> now go back to your vacation! [07:29:37] now you're officially a member of the sre team... (well, that's what we used to say :-P) [07:29:44] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [07:29:57] ah, good point, back to vacation mode...
[07:36:33] (PS1) Muehlenhoff: Update expiry contact for nathante [puppet] - https://gerrit.wikimedia.org/r/502749 [07:41:20] Operations, SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (MoritzMuehlenhoff) He wasn't removed from the cn=wmf LDAP group, I fixed that: ` jmm@mwmaint1002:~$ ldapsearch -x cn=wmf | grep tbayer member: uid=tbayer,ou=people,dc=wikimedia,dc=org ` [07:42:24] (CR) Vgutierrez: [C: +2] archiva: Offer an ECDSA certificate along with the RSA one [puppet] - https://gerrit.wikimedia.org/r/502503 (https://phabricator.wikimedia.org/T220359) (owner: Vgutierrez) [07:42:35] (PS2) Vgutierrez: archiva: Offer an ECDSA certificate along with the RSA one [puppet] - https://gerrit.wikimedia.org/r/502503 (https://phabricator.wikimedia.org/T220359) [07:42:50] (CR) Muehlenhoff: [C: +2] Update expiry contact for nathante [puppet] - https://gerrit.wikimedia.org/r/502749 (owner: Muehlenhoff) [07:44:21] (PS3) Vgutierrez: archiva: Offer an ECDSA certificate along with the RSA one [puppet] - https://gerrit.wikimedia.org/r/502503 (https://phabricator.wikimedia.org/T220359) [07:45:17] moritzm is a proper sniper /o\ [07:49:27] (CR) Vgutierrez: [C: +2] icinga: Offer an ECDSA certificate along with the RSA one [puppet] - https://gerrit.wikimedia.org/r/502505 (owner: Vgutierrez) [07:49:34] (PS2) Vgutierrez: icinga: Offer an ECDSA certificate along with the RSA one [puppet] - https://gerrit.wikimedia.org/r/502505 [08:01:58] (PS2) Elukey: aptrepo: update cloudera-jessie to 5.16.1 [puppet] - https://gerrit.wikimedia.org/r/500453 (https://phabricator.wikimedia.org/T218343) [08:03:06] (CR) Vgutierrez: [C: +2] gerrit: Offer an ECDSA certificate along with the RSA one [puppet] - https://gerrit.wikimedia.org/r/502506 (https://phabricator.wikimedia.org/T220359) (owner: Vgutierrez) [08:03:15] (PS2) Vgutierrez: gerrit: Offer an ECDSA certificate along with the RSA one 
[puppet] - https://gerrit.wikimedia.org/r/502506 (https://phabricator.wikimedia.org/T220359) [08:03:34] (CR) Elukey: [C: +2] aptrepo: update cloudera-jessie to 5.16.1 [puppet] - https://gerrit.wikimedia.org/r/500453 (https://phabricator.wikimedia.org/T218343) (owner: Elukey) [08:04:29] (PS3) Vgutierrez: gerrit: Offer an ECDSA certificate along with the RSA one [puppet] - https://gerrit.wikimedia.org/r/502506 (https://phabricator.wikimedia.org/T220359) [08:08:58] (CR) Vgutierrez: [C: +2] install_server: Offer an ECDSA certificate along with the RSA one [puppet] - https://gerrit.wikimedia.org/r/502509 (https://phabricator.wikimedia.org/T220359) (owner: Vgutierrez) [08:09:10] (PS2) Vgutierrez: install_server: Offer an ECDSA certificate along with the RSA one [puppet] - https://gerrit.wikimedia.org/r/502509 (https://phabricator.wikimedia.org/T220359) [08:10:21] Operations, ops-codfw, Patch-For-Review: rack/setup/deploy restbase2019 and restbase2020 - https://phabricator.wikimedia.org/T217368 (fgiunchedi) Open→Resolved Hosts are in service, resolving [08:11:13] (PS17) Fsero: Enabling docker registry swift cross dc replication [puppet] - https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [08:12:20] !log T220265 foreachwiki extensions/WikimediaMaintenance/filebackend/setZoneAccess.php --backend local-multiwrite [08:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:24] T220265: mw:thumbor swift user doesn't have access to wikipedia-commons-local-temp.* swift containers - https://phabricator.wikimedia.org/T220265 [08:14:32] (CR) Filippo Giunchedi: [C: +2] Enabling docker registry swift cross dc replication [puppet] - https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) (owner: Fsero) [08:15:51] (CR) Vgutierrez: [C: +2] dumps: Offer an ECDSA certificate along with the RSA one [puppet] - 
https://gerrit.wikimedia.org/r/502513 (https://phabricator.wikimedia.org/T220359) (owner: Vgutierrez) [08:16:00] (PS2) Vgutierrez: dumps: Offer an ECDSA certificate along with the RSA one [puppet] - https://gerrit.wikimedia.org/r/502513 (https://phabricator.wikimedia.org/T220359) [08:19:44] (CR) DCausse: [C: -1] elasticsearch: reset all indices to read/write (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/502218 (https://phabricator.wikimedia.org/T219799) (owner: Gehel) [08:20:09] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:24:02] Operations, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), User-Elukey: Consider removing the last traces of nutcracker in Mediawiki configs - https://phabricator.wikimedia.org/T214275 (elukey) While working on memcached on mc1022 I tried to check `stat conns` and observed a lot... 
[08:26:04] !log onimisionipe@deploy1001 Started deploy [kartotherian/deploy@f7518bb] (stretch): Insert maps2003 into stretch environment [08:26:05] (03PS3) 10Gehel: elasticsearch: reset all indices to read/write [software/spicerack] - 10https://gerrit.wikimedia.org/r/502218 (https://phabricator.wikimedia.org/T219799) [08:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:26] !log onimisionipe@deploy1001 Finished deploy [kartotherian/deploy@f7518bb] (stretch): Insert maps2003 into stretch environment (duration: 00m 22s) [08:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:20] 10Operations, 10Acme-chief, 10Traffic: Benefit from acme-chief features in acme-chief clients - https://phabricator.wikimedia.org/T220359 (10Vgutierrez) 05Open→03Stalled [08:33:49] 10Operations, 10Core Platform Team Backlog, 10MediaWiki-General-or-Unknown, 10serviceops, and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Joe) >>! In T219279#5098646, @Esanders wrote: >> @Esanders is it a hu... 
[08:34:34] gilles: FYI, we'll be roll-restarting swift frontends in ~10-15 mins, shouldn't cause any problem to setzoneaccess tho [08:34:41] ok [08:36:50] !log update thirdparty/cloudera packages to cdh 5.16.1 for jessie/stretch-wikimedia - T218343 [08:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:54] T218343: Upgrade analytics cluster to Cloudera CDH 5.16.1 - https://phabricator.wikimedia.org/T218343 [08:36:57] (03CR) 10DCausse: [C: 03+1] elasticsearch: reset all indices to read/write [software/spicerack] - 10https://gerrit.wikimedia.org/r/502218 (https://phabricator.wikimedia.org/T219799) (owner: 10Gehel) [08:42:23] (03PS3) 10Muehlenhoff: Initial Kerberos KDC/kadmin server profiles/roles [puppet] - 10https://gerrit.wikimedia.org/r/502511 [08:43:10] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Set up acmechief-test[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/502478 (https://phabricator.wikimedia.org/T220378) (owner: 10Vgutierrez) [08:43:19] (03PS7) 10Vgutierrez: acme_chief: Set up acmechief-test[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/502478 (https://phabricator.wikimedia.org/T220378) [08:46:28] !log roll-restart swift frontends - T214289 [08:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:35] T214289: Make swift containers for docker registry cross replicated. 
- https://phabricator.wikimedia.org/T214289 [08:48:35] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:48:52] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:48:55] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:49:21] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:49:26] (03PS2) 10Marostegui: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/501483 (https://phabricator.wikimedia.org/T219115) [08:49:29] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:49:33] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits 
for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:49:40] (03PS4) 10Marostegui: db-eqiad.php: Promote db1075 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) [08:49:54] (03PS2) 10Marostegui: db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115) [08:50:07] (03PS2) 10Marostegui: mariadb: Promote db1075 to master [puppet] - 10https://gerrit.wikimedia.org/r/501479 (https://phabricator.wikimedia.org/T219115) [08:50:11] checking aqs [08:50:50] seems to be druid not liking the way we drop segments, following up with my team [08:50:56] (03CR) 10Jcrespo: [C: 03+1] wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/501483 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [08:51:41] (03PS1) 10Ladsgroup: Enable UrlShortener in mediawikiwiki again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502760 [08:51:45] (03CR) 10Jcrespo: [C: 03+1] db-eqiad.php: Promote db1075 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [08:53:31] <_joe_> sigh aqs [08:53:39] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (10Bstorm) a:03Bstorm [08:56:35] !log restart druid-broker on druid100[4-6] - stuck after attempt datasource delete action [08:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:44] this seems to be a bug in Druid [08:57:01] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:57:07] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:57:13] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:57:29] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:57:45] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:57:49] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:58:23] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [08:58:50] 10Operations, 10Analytics, 10EventBus, 10vm-requests, and 3 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (10akosiaris) >>! In T219556#5098760, @Ottomata wrote: > @akosiaris, I'd love to do this sooner rather than later. It'd make some configu... 
[09:01:28] 10Operations, 10Acme-chief, 10Traffic: Provide an staging environment for acme-chief - https://phabricator.wikimedia.org/T220378 (10Vgutierrez) 05Open→03Resolved [09:05:03] (03PS2) 10Ladsgroup: Enable UrlShortener in mediawikiwiki again [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502760 [09:05:27] (03CR) 10Ema: [C: 03+1] zone_validator: catch parse errors [dns] - 10https://gerrit.wikimedia.org/r/481833 (owner: 10Volans) [09:05:57] !log upgrading job runners mw1299-mw1311 to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069) [09:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:01] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [09:14:25] PROBLEM - puppet last run on registry1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[create_swift_container_replication] [09:14:52] godog, fsero ^^^ I guess related to the WIP [09:15:03] yep is mine [09:15:05] ill mute it [09:15:33] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: Package envoy 1.9.X for stretch and use it as redis proxy on docker registry - https://phabricator.wikimedia.org/T215810 (10MoritzMuehlenhoff) >>! In T215810#5092693, @fsero wrote: > Building 1.9.1 due to CVE Adding CVE 2019-9900 a... 
[09:17:08] (03PS1) 10DCausse: Add a new extension point SshCommandPreExecutionFilter [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/502764 [09:21:45] (03CR) 10Gehel: [C: 03+1] "LGTM, minor comment inline" (031 comment) [software/gerrit] (wmf/stable-2.16) - 10https://gerrit.wikimedia.org/r/502764 (owner: 10DCausse) [09:23:15] (03PS1) 10Vgutierrez: acme_chief: Let the staging environment get the whole certificate list [puppet] - 10https://gerrit.wikimedia.org/r/502765 (https://phabricator.wikimedia.org/T219414) [09:26:50] (03PS1) 10Muehlenhoff: Update email adress for awight [puppet] - 10https://gerrit.wikimedia.org/r/502766 (https://phabricator.wikimedia.org/T216995) [09:27:14] (03CR) 10Vgutierrez: [C: 03+2] "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1002/15668/" [puppet] - 10https://gerrit.wikimedia.org/r/502765 (https://phabricator.wikimedia.org/T219414) (owner: 10Vgutierrez) [09:28:50] (03PS2) 10Muehlenhoff: Update email adress for awight [puppet] - 10https://gerrit.wikimedia.org/r/502766 (https://phabricator.wikimedia.org/T216995) [09:30:14] (03PS1) 10Marostegui: db-eqiad.php: Repool db1120 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502767 [09:30:29] (03PS3) 10Muehlenhoff: Update email adress for awight [puppet] - 10https://gerrit.wikimedia.org/r/502766 (https://phabricator.wikimedia.org/T216995) [09:30:32] (03CR) 10Marostegui: [C: 04-1] "Wait for the host to catch up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502767 (owner: 10Marostegui) [09:31:10] (03PS4) 10Muehlenhoff: Update email adress for awight [puppet] - 10https://gerrit.wikimedia.org/r/502766 (https://phabricator.wikimedia.org/T216995) [09:32:38] (03CR) 10Muehlenhoff: [C: 03+2] Update email adress for awight [puppet] - 10https://gerrit.wikimedia.org/r/502766 (https://phabricator.wikimedia.org/T216995) (owner: 10Muehlenhoff) [09:40:22] !log upgrade kubernetes-master on neon (staging cluster) to 1.12.7-1 [09:40:24] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:52] !log upgrade kubernetes-master on neon (staging cluster) to 1.12.7-1 T220405 [09:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:55] T220405: TEC3:05:05.1:Q4 Services and the deployment pipeline are hosted on production-level infrastructure - https://phabricator.wikimedia.org/T220405 [09:41:32] (03PS1) 10Mathew.onipe: maps migrate maps2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622) [09:42:16] (03CR) 10jerkins-bot: [V: 04-1] maps migrate maps2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [09:43:22] (03PS2) 10Mathew.onipe: maps migrate maps2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622) [09:50:11] !log upgrading snapshot hosts to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069) [09:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:15] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [09:51:49] !log upgrade kubernetes-node on kubestage1001 (staging cluster) to 1.12.7-1 T220405 [09:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:53] T220405: TEC3:05:05.1:Q4 Services and the deployment pipeline are hosted on production-level infrastructure - https://phabricator.wikimedia.org/T220405 [09:59:54] !log upgrading labweb hosts (wikitech) to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069) [09:59:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:58] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [10:08:39] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1120 T217453 (duration: 01m 03s) [10:08:42] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:43] T217453: Remove etp_user from echo_target_page in production - https://phabricator.wikimedia.org/T217453 [10:14:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1064 T217453 (duration: 00m 59s) [10:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:07] T217453: Remove etp_user from echo_target_page in production - https://phabricator.wikimedia.org/T217453 [10:17:44] !log upload kubernetes_1.12.7-1 to apt.wikimedia.org/stretch-wikimedia component main T220405 [10:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:48] T220405: TEC3:05:05.1:Q4 Services and the deployment pipeline are hosted on production-level infrastructure - https://phabricator.wikimedia.org/T220405 [10:25:26] !log resizing disk on mwdebug2001 T219989 [10:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:38] T219989: mwdebug2001 and mwdebug2002 "/" almost full - https://phabricator.wikimedia.org/T219989 [10:33:37] !log upgrading nodejs on aqs* to latest node 10 version from component/node10 [10:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:57] !log upgrade kubernetes-node on kubestage1002 (staging cluster) to 1.12.7-1 T220405 [10:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:01] T220405: TEC3:05:05.1:Q4 Services and the deployment pipeline are hosted on production-level infrastructure - https://phabricator.wikimedia.org/T220405 [10:46:12] !log T220265 setZoneAccess on all wikis finished [10:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:17] T220265: mw:thumbor swift user doesn't have access to wikipedia-commons-local-temp.* swift containers - https://phabricator.wikimedia.org/T220265 [10:47:14] PROBLEM - Host mwdebug2001 is DOWN: PING CRITICAL - 
Packet loss = 100% [10:47:32] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:49:18] fsero: missing downtime? [10:49:35] yeah.. [10:49:46] sorry, I'm always worried when seeing these things that we hit the icinga restart issue with the command file again ;) [10:50:31] ACKNOWLEDGEMENT - Host mwdebug2001 is DOWN: PING CRITICAL - Packet loss = 100% Fsero down due to T219989 [10:55:51] !log upgrading nodejs on analytics-tool1002 to latest node 10 version from component/node10 [10:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190410T1100). [11:00:04] Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:12] o/ [11:00:24] Amir1: I guess it's just you deploying your own patch then :) [11:00:45] \o/ [11:17:22] RECOVERY - Host mwdebug2001 is UP: PING OK - Packet loss = 0%, RTA = 36.37 ms [11:19:40] PROBLEM - HHVM rendering on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:19:44] PROBLEM - Apache HTTP on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:20:12] PROBLEM - HHVM processes on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:20:12] PROBLEM - php7.2-fpm service on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 port 5666: Connection refused [11:20:12] PROBLEM - mcrouter process on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Mcrouter [11:20:24] PROBLEM - Nginx local proxy to apache on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:20:26] ^^ thats me [11:20:28] PROBLEM - Check systemd state on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 port 5666: Connection refused [11:20:28] PROBLEM - DPKG on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 port 5666: Connection refused [11:20:31] downtiming the host [11:20:34] PROBLEM - Disk space on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 port 5666: Connection refused [11:20:40] PROBLEM - Check size of conntrack table on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 port 5666: Connection refused [11:20:40] PROBLEM - Check whether ferm is active by checking the default input chain on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 port 5666: Connection refused [11:20:42] PROBLEM - nutcracker 
port on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [11:20:46] PROBLEM - puppet last run on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 port 5666: Connection refused [11:20:50] PROBLEM - PHP7 rendering on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers [11:20:52] PROBLEM - dhclient process on mwdebug2001 is CRITICAL: connect to address 10.192.0.98 port 5666: Connection refused [11:44:42] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [11:45:22] jenkins... [11:46:47] !log elastisearch search cluster: reindexing zh-min-nan wikis (T219533) [11:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:51] T219533: Reindex space less languages wikis to use BM25 - https://phabricator.wikimedia.org/T219533 [11:50:55] zeljkof: https://integration.wikimedia.org/zuul/ jenkins is full with lots of patches at the same time now the deployment is stuck because of that [11:51:37] jouncebot: now [11:51:38] For the next 0 hour(s) and 8 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190410T1100) [11:52:34] Amir1: hm, hashar would know, but he's not around :/ [11:52:50] looks like there's only one job running? [11:54:36] it's just someone changed something on really long chain and everything got rebased at the same time [11:55:14] ouch [11:55:28] but still, why is only one job running? [11:55:38] and why don't swat patches go first? [11:56:22] IDK :( [11:56:30] Should I force merge it? 
[11:57:10] only if you are absolutely sure it will not break things, and if it can't wait until the next window [11:57:31] I really don't know how to fix CI :/ [11:58:02] I have undeployed patches in deploy1001, that would block any other deployment [11:58:27] oh mine is working now [12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190410T1200) [12:05:24] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: Prep work for deploying UrlShortener extension (T108557), part I (duration: 01m 00s) [12:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:28] T108557: Review and deploy UrlShortener extension to Wikimedia wikis - https://phabricator.wikimedia.org/T108557 [12:07:13] !log ladsgroup@deploy1001 Synchronized wmf-config/CommonSettings.php: SWAT: Prep work for deploying UrlShortener extension (T108557), part II (duration: 01m 00s) [12:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:22] !log EU swat is done [12:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:52] PROBLEM - puppet last run on wtp1032 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:17:50] !log rolling security update of systemd on stretch systems [12:17:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:22] RECOVERY - Apache HTTP on mwdebug2001 is OK: HTTP OK: HTTP/1.1 200 OK - 10975 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:21:06] RECOVERY - HHVM processes on mwdebug2001 is OK: PROCS OK: 1 process with command name hhvm https://wikitech.wikimedia.org/wiki/Application_servers [12:21:06] RECOVERY - mcrouter process on mwdebug2001 is OK: PROCS OK: 1 process with UID = 113 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [12:21:06] RECOVERY - php7.2-fpm service on mwdebug2001 is OK: OK - php7.2-fpm is active [12:21:24] RECOVERY - DPKG on mwdebug2001 is OK: All packages OK [12:21:30] RECOVERY - puppet last run on mwdebug2001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:21:30] RECOVERY - Disk space on mwdebug2001 is OK: DISK OK [12:21:34] RECOVERY - Check size of conntrack table on mwdebug2001 is OK: OK: nf_conntrack is 0 % full [12:21:38] RECOVERY - Check whether ferm is active by checking the default input chain on mwdebug2001 is OK: OK ferm input default policy is set [12:21:48] RECOVERY - dhclient process on mwdebug2001 is OK: PROCS OK: 0 processes with command name dhclient [12:30:04] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:33:40] RECOVERY - Nginx local proxy to apache on mwdebug2001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 624 bytes in 2.939 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:34:02] RECOVERY - PHP7 rendering on mwdebug2001 is OK: HTTP OK: HTTP/1.1 200 OK - 74003 bytes in 1.395 second 
response time https://wikitech.wikimedia.org/wiki/Application_servers [12:34:10] RECOVERY - HHVM rendering on mwdebug2001 is OK: HTTP OK: HTTP/1.1 200 OK - 73955 bytes in 2.267 second response time https://wikitech.wikimedia.org/wiki/Application_servers [12:40:39] !log contint2001: stopped puppet and zuul-merger for debugging [12:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:46] !log decommissioning cassandra-b, restbase2007 -- T208087 [12:41:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:49] T208087: Replace remaining Samsung SSDs - https://phabricator.wikimedia.org/T208087 [12:42:32] RECOVERY - puppet last run on wtp1032 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:43:32] PROBLEM - zuul_merger_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [12:45:06] ACKNOWLEDGEMENT - zuul_merger_service_running on contint2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger amusso maintenance ongoing for zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [12:45:36] PROBLEM - puppet last run on es2019 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [12:45:36] PROBLEM - puppet last run on rdb2004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [12:49:42] PROBLEM - puppet last run on ganeti1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [12:50:10] PROBLEM - puppet last run on mw1227 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[set debconf flag seen for wireshark-common/install-setuid] [12:51:04] PROBLEM - DPKG on analytics1065 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:51:06] PROBLEM - puppet last run on wtp1043 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [12:51:38] PROBLEM - puppet last run on graphite1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [12:52:02] PROBLEM - DPKG on es1013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:52:30] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [12:53:20] PROBLEM - puppet last run on ms-be1037 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools] [12:53:38] PROBLEM - puppet last run on snapshot1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[initramfs-tools] [12:54:08] ^^ this is likely related to the systemd update (which is now complete) [12:55:00] RECOVERY - puppet last run on ganeti1008 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:56:18] RECOVERY - DPKG on analytics1065 is OK: All packages OK [12:57:14] RECOVERY - DPKG on es1013 is OK: All packages OK [12:59:16] RECOVERY - zuul_merger_service_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190410T1300) [13:00:42] RECOVERY - puppet last run on mw1227 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [13:01:26] RECOVERY - puppet last run on es2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:01:26] RECOVERY - puppet last run on rdb2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:01:40] RECOVERY - puppet last run on wtp1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:02:14] RECOVERY - puppet last run on graphite1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:03:02] RECOVERY - puppet last run on mw1250 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:03:54] RECOVERY - puppet last run on ms-be1037 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:04:12] RECOVERY - puppet last run on snapshot1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:10:16] (03PS7) 10Ema: cache: add profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967) [13:12:48] (03CR) 10Ema: [C: 03+2] cache: 
add profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:13:06] (03PS3) 10Jbond: puppet_major_version4: remove old puppet_major_version variable. [puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) [13:14:45] (03PS2) 10Filippo Giunchedi: swift: don't show container-sync-realms diff [puppet] - 10https://gerrit.wikimedia.org/r/502774 [13:14:50] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: don't show container-sync-realms diff [puppet] - 10https://gerrit.wikimedia.org/r/502774 (owner: 10Filippo Giunchedi) [13:19:27] !log Deploy schema change on aawiki aawikibooks aawiktionary abwiki abwiktionary acewiki advisorswiki advisorywiki adywiki afwiki on x1 - T136427 [13:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:30] T136427: Remove event_page_namespace and event_page_title - https://phabricator.wikimedia.org/T136427 [13:22:27] (03PS3) 10Filippo Giunchedi: logging: move webrequest-5xx to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/493243 (https://phabricator.wikimedia.org/T213899) [13:24:08] PROBLEM - HP RAID on db2054 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11 - Failed: 1I:1:12 - Controller: OK - Battery/Capacitor: OK [13:24:10] ACKNOWLEDGEMENT - HP RAID on db2054 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11 - Failed: 1I:1:12 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T220607 [13:24:17] \o/ [13:25:15] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T220607 (10Marostegui) p:05Triage→03Normal a:03Papaul Can we replace this disk? Thanks! 
[13:28:51] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10fgiunchedi) [13:29:03] 10Operations: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (10Ottomata) a:03Ottomata Hm ya sounds right. What happens if you remove the .csr file too? I think puppet still might not accept it, since it already has a cert stored for this name. You'll probably have to... [13:29:12] (03CR) 10jenkins-bot: Disable UrlShortener in mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502780 (owner: 10Ladsgroup) [13:29:16] 10Operations, 10Analytics: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (10Ottomata) [13:29:48] (03CR) 10Ottomata: [C: 03+1] admin: add jupyterhub restart capabilities to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/502677 (owner: 10Elukey) [13:29:58] (03PS4) 10Jbond: puppet_major_version4: remove old puppet_major_version variable. [puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) [13:30:06] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10fgiunchedi) [13:31:01] 10Operations, 10Core Platform Team Backlog, 10MediaWiki-General-or-Unknown, 10serviceops, and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Esanders) Sounds good to me. 
[13:33:19] 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984 (10Scoopfinder) [13:34:38] 10Operations, 10Puppet, 10User-fgiunchedi: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (10fgiunchedi) [13:37:16] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2054 is CRITICAL: cluster=mysql device=cciss,11 instance=db2054:9100 job=node site=codfw Marostegui T220607 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2054&var-datasource=codfw+prometheus/ops [13:37:24] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [13:38:19] (03PS1) 10Anomie: Set ActorTableSchemaMigrationStage => write-both/read-new on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502794 (https://phabricator.wikimedia.org/T188327) [13:39:54] (03CR) 10Anomie: [C: 03+2] "Deploying planned config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502794 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:41:12] (03Merged) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-both/read-new on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502794 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:41:26] (03CR) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-both/read-new on group 0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502794 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:41:39] (03CR) 10Herron: "I like it! 
Added some thoughts inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [13:42:33] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-both/read-new on group0 (T188327) (duration: 01m 00s) [13:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:39] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [13:45:58] (03CR) 10Vgutierrez: [C: 04-2] lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [13:47:25] !log resizing disk on mwdebug2002 T219989 [13:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:28] T219989: mwdebug2001 and mwdebug2002 "/" almost full - https://phabricator.wikimedia.org/T219989 [13:52:51] (03CR) 10Alex Monk: "This doesn't look right, see the header of the file and Ief7d536c" [puppet] - 10https://gerrit.wikimedia.org/r/484024 (https://phabricator.wikimedia.org/T213540) (owner: 10Andrew Bogott) [13:56:55] Hi ops-team - quick note letting you know the analytics-team is deploying AQS :) [13:58:35] !log joal@deploy1001 Started deploy [analytics/aqs/deploy@fc1d232]: Deploying per-page limits for druid-endpoints [13:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:15] !log joal@deploy1001 Finished deploy [analytics/aqs/deploy@fc1d232]: Deploying per-page limits for druid-endpoints (duration: 14m 41s) [14:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:21] !log CI processing was a bit slower than usual over the past couple hours or so. 
It should be slightly faster now T220606 [14:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:28] T220606: zuul-merger takes a while to recreate repository branches - https://phabricator.wikimedia.org/T220606 [14:22:08] PROBLEM - Check systemd state on mwdebug2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:22:20] PROBLEM - nutcracker port on mwdebug2001 is CRITICAL: connect to address 127.0.0.1 and port 11212: Connection refused https://wikitech.wikimedia.org/wiki/Nutcracker [14:22:52] PROBLEM - nutcracker process on mwdebug2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [14:23:35] 10Operations, 10Analytics, 10Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (10Ottomata) Ah no, checked the code, and this is indeed because Puppet CA has already signed a cert for this common name. 
[14:23:45] 10Operations, 10Analytics, 10Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (10Ottomata) Added a note on https://wikitech.wikimedia.org/wiki/Cergen [14:24:22] (03CR) 10Alexandros Kosiaris: [C: 04-1] "premise looks good, some inline comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/499875 (https://phabricator.wikimedia.org/T218567) (owner: 10Ladsgroup) [14:31:21] looking into the 2001 alerts [14:34:42] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:37:02] 10Operations, 10Advanced Mobile Contributions, 10Traffic, 10User-Joe: AMC – Opt-in for logged out users - https://phabricator.wikimedia.org/T215624 (10ovasileva) p:05Triage→03Low [14:37:22] 10Operations, 10Advanced Mobile Contributions, 10Traffic, 10User-Joe: AMC – Opt-in for logged out users - https://phabricator.wikimedia.org/T215624 (10ovasileva) [14:37:38] (03PS1) 10Alexandros Kosiaris: Remove google api key from wmde_secrets [labs/private] - 10https://gerrit.wikimedia.org/r/502801 (https://phabricator.wikimedia.org/T217641) [14:37:54] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove google api key from wmde_secrets [labs/private] - 10https://gerrit.wikimedia.org/r/502801 (https://phabricator.wikimedia.org/T217641) (owner: 10Alexandros Kosiaris) [14:38:14] 10Operations, 10Advanced Mobile Contributions, 10Traffic, 10User-Joe: AMC – Opt-in for logged out users - https://phabricator.wikimedia.org/T215624 (10ovasileva) [14:43:34] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Milimetric) I would say so, @dr0ptp4kt, I'd maybe even go so far as to host a graph edit-a-thon to upgrad... 
[14:44:49] (03CR) 10Jbond: "puppet-compiler report" [puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [14:44:56] RECOVERY - Disk space on mwdebug2002 is OK: DISK OK [14:46:44] (03CR) 10Volans: puppet_major_version4: remove old puppet_major_version variable. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [14:50:08] RECOVERY - nutcracker process on mwdebug2001 is OK: PROCS OK: 1 process with UID = 112 (nutcracker), command name nutcracker https://wikitech.wikimedia.org/wiki/Nutcracker [14:50:42] RECOVERY - Check systemd state on mwdebug2001 is OK: OK - running: The system is fully operational [14:50:52] RECOVERY - nutcracker port on mwdebug2001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 https://wikitech.wikimedia.org/wiki/Nutcracker [14:53:59] (03PS5) 10Jbond: puppet_major_version4: remove old puppet_major_version variable. [puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) [14:54:24] (03CR) 10Jbond: puppet_major_version4: remove old puppet_major_version variable. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [14:55:06] (03PS11) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [14:55:27] (03PS1) 10Ema: cache: move varnish storage config to varnish-be profile [puppet] - 10https://gerrit.wikimedia.org/r/502806 (https://phabricator.wikimedia.org/T219967) [14:57:57] !log repooling mwdebug2001 [14:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:28] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T220607 (10Papaul) a:05Papaul→03Marostegui complete [14:59:18] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T220607 (10Marostegui) Thanks! ` physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS, 600 GB, Rebuilding) ` [15:00:01] !log Enable puppet on thumbor1001, switch back to nginx, pool thumbor1004 - T187765 [15:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:05] T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests - https://phabricator.wikimedia.org/T187765 [15:00:06] !log repooling mwdebug2002 [15:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:05] !log pooled back mwdebug200[1,2] T219989 [15:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:09] T219989: mwdebug2001 and mwdebug2002 "/" almost full - https://phabricator.wikimedia.org/T219989 [15:01:49] (03PS2) 10Ema: cache: move varnish storage config to varnish-be profile [puppet] - 10https://gerrit.wikimedia.org/r/502806 (https://phabricator.wikimedia.org/T219967) [15:02:34] 10Operations, 10Release-Engineering-Team (Backlog): mwdebug2001 and mwdebug2002 "/" almost full - https://phabricator.wikimedia.org/T219989 (10fsero) 
05Open→03Resolved mwdebug2001,2 disk has been increased and VMs reimaged and pooled back, so this should be good to go now. While doing this i faced the sam... [15:02:55] (03PS1) 10Volans: debmonitor: add self to the list of hosts [puppet] - 10https://gerrit.wikimedia.org/r/502810 [15:04:21] (03CR) 10Volans: "compiler output: https://puppet-compiler.wmflabs.org/compiler1002/15675/" [puppet] - 10https://gerrit.wikimedia.org/r/502810 (owner: 10Volans) [15:05:16] (03PS1) 10Giuseppe Lavagetto: Release 2.0.0 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/502812 [15:11:13] (03PS3) 10Ema: cache: move varnish storage config to varnish-be profile [puppet] - 10https://gerrit.wikimedia.org/r/502806 (https://phabricator.wikimedia.org/T219967) [15:12:22] PROBLEM - Host pc2008.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:12:40] papaul: ^ [15:13:58] (03PS1) 10Elukey: role::druid::public::worker: set stricter timeouts [puppet] - 10https://gerrit.wikimedia.org/r/502814 (https://phabricator.wikimedia.org/T219910) [15:14:25] papaul: maybe loose cable? 
[15:14:48] 10Operations, 10Thumbor, 10serviceops: Export useful metrics from haproxy logs for Thumbor - https://phabricator.wikimedia.org/T220499 (10Gilles) p:05Triage→03Normal a:03Gilles [15:15:14] PROBLEM - Host graphite2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:15:29] (03PS2) 10Giuseppe Lavagetto: Release 2.0.0 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/502812 [15:15:42] mmm, those two hosts are on different rows [15:15:44] (03PS1) 10Giuseppe Lavagetto: profile::docker::builder: fix shell script for docker-pkg 2.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/502815 [15:16:22] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Release 2.0.0 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/502812 (owner: 10Giuseppe Lavagetto) [15:17:00] (03CR) 10Ema: "pcc looks good: https://puppet-compiler.wmflabs.org/compiler1001/15679/" [puppet] - 10https://gerrit.wikimedia.org/r/502806 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [15:17:08] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/502810 (owner: 10Volans) [15:18:26] (03CR) 10Volans: [C: 03+2] debmonitor: add self to the list of hosts [puppet] - 10https://gerrit.wikimedia.org/r/502810 (owner: 10Volans) [15:19:12] RECOVERY - mediawiki-installation DSH group on mwdebug2001 is OK: OK https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist [15:19:25] So the IPMI works locally on both [15:19:26] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@605690c]: Upgrade to docker-pkg 2.0.0 [15:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:40] !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@605690c]: Upgrade to docker-pkg 2.0.0 (duration: 00m 13s) [15:19:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:28] PROBLEM - Host lvs2010.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:20:54] gilles: Are you 
deploying? I've got a UBN… [15:22:01] James_F: go ahead [15:22:04] Thanks. [15:22:12] you can include mine in the rebase, it's fine [15:22:16] i'll deploy it afterwards [15:22:21] Kk. [15:24:12] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.25/extensions/Score/: UBN Revert Score changes that broke VE T220465 (duration: 01m 01s) [15:24:15] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::docker::builder: fix shell script for docker-pkg 2.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/502815 (owner: 10Giuseppe Lavagetto) [15:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:16] T220465: [Regression wmf.25] VE is not loading for pages with music score, shows error " Uncaught Error: No class registered by that name: mwScore" - https://phabricator.wikimedia.org/T220465 [15:24:19] (03PS2) 10Giuseppe Lavagetto: profile::docker::builder: fix shell script for docker-pkg 2.0.0 [puppet] - 10https://gerrit.wikimedia.org/r/502815 [15:24:42] gilles: All done. Thanks! [15:24:51] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Yurik) @Milimetric I think the removal of Graphoid will be far more difficult than just keeping it. If y... [15:25:19] (03CR) 10Volans: [C: 03+1] "Change looks good to me, thanks for cleaning up this long-standing tech debt." 
[puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [15:26:20] RECOVERY - Host graphite2003.mgmt is UP: PING WARNING - Packet loss = 64%, RTA = 37.00 ms [15:26:22] !log oblivian@deploy1001 Started deploy [docker-pkg/deploy@605690c]: Upgrade to docker-pkg 2.0.0 everywhere [15:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:43] !log oblivian@deploy1001 Finished deploy [docker-pkg/deploy@605690c]: Upgrade to docker-pkg 2.0.0 everywhere (duration: 00m 21s) [15:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:11] !log gilles@deploy1001 Synchronized php-1.33.0-wmf.25/includes/media/ThumbnailImage.php: T216499 Only apply high priority hint half the time (duration: 00m 59s) [15:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:16] T216499: Priority Hints origin trial - https://phabricator.wikimedia.org/T216499 [15:28:52] RECOVERY - Host pc2008.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.95 ms [15:33:49] (03PS2) 10Elukey: role::druid::public::worker: set stricter timeouts [puppet] - 10https://gerrit.wikimedia.org/r/502814 (https://phabricator.wikimedia.org/T219910) [15:34:48] PROBLEM - Host bast2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:40:30] James_F: Can I take mwdebug1002? [15:41:00] RECOVERY - Host lvs2010.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.97 ms [15:41:51] Krinkle: Go for it, I'm not doing anything in prod. 
[15:42:15] * Krinkle takes mwdebug1002 [15:44:36] PROBLEM - Host pc2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:46:01] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.25/includes/parser/DateFormatter.php: Ib2b3fb315dc93b / T220563 (duration: 01m 00s) [15:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:05] T220563: Undefined index: june in DateFormatter.php (makeIsoMonth) - https://phabricator.wikimedia.org/T220563 [15:49:46] RECOVERY - mediawiki-installation DSH group on mwdebug2002 is OK: OK https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist [15:51:38] RECOVERY - Host bast2002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.98 ms [15:55:00] RECOVERY - Host pc2007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.97 ms [15:58:04] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (10Ottomata) [15:58:16] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (10Ottomata) Built 0.2.4 deb and installed on puppetmaster1001. 
[15:58:36] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: cergen: exceptions trying to add alt_name - https://phabricator.wikimedia.org/T220591 (10Ottomata) [15:59:18] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:59:28] this is again us sorry [15:59:32] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:59:39] 👍 [16:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190410T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:00:24] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [16:00:38] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [16:01:22] !log restart brokers on druid100[3-6] - locking after segments get deleted [16:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:40] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [16:01:56] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [16:02:34] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [16:02:37] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [16:02:42] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [16:02:50] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [16:03:19] (03PS2) 10C. 
Scott Ananian: Default to Preprocessor_Hash on both PHP7 and HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502567 (https://phabricator.wikimedia.org/T216664) [16:04:07] (03PS1) 10Papaul: DHCP: Add MAC address entries for db209[7-9] and db210[0-2] [puppet] - 10https://gerrit.wikimedia.org/r/502824 [16:04:53] (03CR) 10jerkins-bot: [V: 04-1] DHCP: Add MAC address entries for db209[7-9] and db210[0-2] [puppet] - 10https://gerrit.wikimedia.org/r/502824 (owner: 10Papaul) [16:25:39] (03PS7) 10Ladsgroup: ores: use hiera for statsd host [puppet] - 10https://gerrit.wikimedia.org/r/499875 (https://phabricator.wikimedia.org/T218567) [16:25:54] (03CR) 10Ladsgroup: ores: use hiera for statsd host (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/499875 (https://phabricator.wikimedia.org/T218567) (owner: 10Ladsgroup) [16:29:19] godog: soon I need to relocate tools-prometheus-01 and tools-prometheus-02 to a new network (which will entail changes of IPs). can you advise about what I should do for a graceful change over? And/or how to test that I didn't break it after? 
[16:29:30] (03PS1) 10Jcrespo: mariadb: Allow new option --stop-slave for xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/502828 (https://phabricator.wikimedia.org/T206203) [16:30:00] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Allow new option --stop-slave for xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/502828 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [16:32:30] (03PS1) 10Gehel: cloudelastic: allow jobrunners and mwmaint nodes to access cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/502829 (https://phabricator.wikimedia.org/T220625) [16:39:36] (03PS2) 10Jcrespo: mariadb: Allow new option --stop-slave for xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/502828 (https://phabricator.wikimedia.org/T206203) [16:40:08] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Allow new option --stop-slave for xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/502828 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [16:41:09] (03PS3) 10Jcrespo: mariadb: Allow new option --stop-slave for xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/502828 (https://phabricator.wikimedia.org/T206203) [16:41:45] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Allow new option --stop-slave for xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/502828 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [16:42:18] 10Operations, 10Mobile-Content-Service, 10RESTBase, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Continuous errors on several REST API resources (probably related to MCS release) - https://phabricator.wikimedia.org/T220574 (10mobrovac) [16:42:53] (03PS1) 10DCausse: [cirrus] add cloudelastic service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502832 (https://phabricator.wikimedia.org/T220625) [16:43:18] (03PS4) 
10Jcrespo: mariadb: Allow new option --stop-slave for xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/502828 (https://phabricator.wikimedia.org/T206203) [16:43:41] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Allow new option --stop-slave for xtrabackup transfers [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/502828 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [16:43:43] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] add cloudelastic service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502832 (https://phabricator.wikimedia.org/T220625) (owner: 10DCausse) [16:45:31] (03PS1) 10Ema: WIP: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 [16:48:47] RECOVERY - HP RAID on db2054 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Controller: OK - Battery/Capacitor: OK [16:48:56] (03PS2) 10DCausse: [cirrus] add cloudelastic service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502832 (https://phabricator.wikimedia.org/T220625) [16:48:58] (03PS10) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [16:49:41] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] add cloudelastic service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502832 (https://phabricator.wikimedia.org/T220625) (owner: 10DCausse) [16:51:16] (03PS11) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [16:51:48] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=druid1004.eqiad.wmnet [16:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:39] andrewbogott: if whichever hosts prometheus is polling keep being reachable then there's nothing to do essentially, you can check 
that "targets" can still be reached e.g. from https://tools-prometheus.wmflabs.org/tools/targets [16:56:45] PROBLEM - Check whether ferm is active by checking the default input chain on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [16:56:45] PROBLEM - Check systemd state on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [16:56:57] PROBLEM - DPKG on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [16:56:57] PROBLEM - Check size of conntrack table on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [16:58:03] RECOVERY - Check whether ferm is active by checking the default input chain on proton1001 is OK: OK ferm input default policy is set [16:58:03] RECOVERY - Check systemd state on proton1001 is OK: OK - running: The system is fully operational [16:58:13] RECOVERY - DPKG on proton1001 is OK: All packages OK [16:58:13] RECOVERY - Check size of conntrack table on proton1001 is OK: OK: nf_conntrack is 0 % full [16:58:22] !log restarted nagios-nrpe-server on proton1001 (it died due to OOM) [16:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:41] good times [17:03:09] (03PS2) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [17:07:19] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T220607 (10Marostegui) All good! Thanks! ` root@db2054:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380337FE1C0) Port Name: 1I Port Name: 2I Gen8 ServBP 12+... 
[17:08:53] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T220607 (10Marostegui) 05Open→03Resolved [17:09:07] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2054 - https://phabricator.wikimedia.org/T220607 (10Marostegui) a:05Marostegui→03Papaul [17:16:57] (03PS2) 10Marostegui: DNS: Add mgmt and prodcution DNS for db209[7-9] db210[0-2] [dns] - 10https://gerrit.wikimedia.org/r/502651 (owner: 10Papaul) [17:20:03] (03CR) 10Marostegui: [C: 03+2] DNS: Add mgmt and prodcution DNS for db209[7-9] db210[0-2] [dns] - 10https://gerrit.wikimedia.org/r/502651 (owner: 10Papaul) [17:20:49] (03CR) 10Jforrester: [C: 03+1] "Shall we deploy this so we can merge the related patch?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502669 (https://phabricator.wikimedia.org/T207750) (owner: 10Gergő Tisza) [17:22:29] (03PS2) 10Papaul: DHCP: Add MAC address entries for db209[7-9] and db210[0-2] [puppet] - 10https://gerrit.wikimedia.org/r/502824 (https://phabricator.wikimedia.org/T219463) [17:23:28] (03PS3) 10Marostegui: DHCP: Add MAC address entries for db209[7-9] and db210[0-2] [puppet] - 10https://gerrit.wikimedia.org/r/502824 (https://phabricator.wikimedia.org/T219463) (owner: 10Papaul) [17:24:15] (03CR) 10Marostegui: [C: 03+2] DHCP: Add MAC address entries for db209[7-9] and db210[0-2] [puppet] - 10https://gerrit.wikimedia.org/r/502824 (https://phabricator.wikimedia.org/T219463) (owner: 10Papaul) [17:30:33] (03PS1) 10Ottomata: Enable api-request EventGate logging for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502838 (https://phabricator.wikimedia.org/T214080) [17:32:34] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Papaul) ` papaul@asw-c-codfw> show interfaces ge-5/0/12 descriptions Interface Admin Link Description ge-5/0/12 up... 
[17:33:00] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Papaul) [17:34:24] (03CR) 10Ppchelko: [C: 03+1] Enable api-request EventGate logging for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502838 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [17:39:56] 10Operations, 10ORES, 10Scap, 10Scoring-platform-team, and 2 others: [Epic] ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619 (10Halfak) 05Open→03Resolved a:03Halfak Seems like this is done. [17:43:46] (03CR) 10Gergő Tisza: "I'd wait until someone from Security reviews the core patch and agrees this right should be available on non-Wikimedia wikis (which is som" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502669 (https://phabricator.wikimedia.org/T207750) (owner: 10Gergő Tisza) [17:44:54] !log twentyafterfour@deploy1001 Pruned MediaWiki: 1.33.0-wmf.18 [keeping static files] (duration: 02m 22s) [17:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:52] (03PS1) 10Jforrester: WBMI: Configure initial qualifiers for Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502845 [17:59:52] (03CR) 10EBernhardson: [C: 03+1] cloudelastic: allow jobrunners and mwmaint nodes to access cloudelastic [puppet] - 10https://gerrit.wikimedia.org/r/502829 (https://phabricator.wikimedia.org/T220625) (owner: 10Gehel) [18:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190410T1800) [18:03:28] (03CR) 10EBernhardson: [cirrus] add cloudelastic service (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502832 (https://phabricator.wikimedia.org/T220625) (owner: 10DCausse) [18:04:56] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and 
check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (10Bstorm) 05Open→03Resolved All set from WMCS end. Confirmed I can run queries from Toolforge. [18:08:59] (03CR) 10Jforrester: [C: 04-2] "Not yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502845 (owner: 10Jforrester) [18:09:30] 10Operations, 10Wikimedia-Mailing-lists: Change ownership of wikimania-program@lists.wikimedia.org - https://phabricator.wikimedia.org/T220641 (10ICueva) [18:30:07] (03PS2) 10Dzahn: get rid of wicipediacymraeg.org [dns] - 10https://gerrit.wikimedia.org/r/502193 (https://phabricator.wikimedia.org/T219856) (owner: 10Vgutierrez) [18:31:49] (03CR) 10Dzahn: [C: 03+2] get rid of wicipediacymraeg.org [dns] - 10https://gerrit.wikimedia.org/r/502193 (https://phabricator.wikimedia.org/T219856) (owner: 10Vgutierrez) [18:33:58] 10Operations, 10Fundraising-Backlog, 10Mail, 10fundraising-tech-ops, 10Patch-For-Review: Identify appropriate SPF record for domain wikimediafoundation.org - https://phabricator.wikimedia.org/T220412 (10Jgreen) >>! In T220412#5098902, @herron wrote: > > I noticed there is a DKIM record in the wikimediaf... 
[18:36:18] (03PS3) 10DCausse: [cirrus] add cloudelastic service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502832 (https://phabricator.wikimedia.org/T220625) [18:36:24] (03PS2) 10Dzahn: apache redirects: remove wicipediacymraeg.org [puppet] - 10https://gerrit.wikimedia.org/r/501202 (https://phabricator.wikimedia.org/T219856) [18:36:29] (03CR) 10DCausse: [cirrus] add cloudelastic service (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502832 (https://phabricator.wikimedia.org/T220625) (owner: 10DCausse) [18:37:22] (03CR) 10jerkins-bot: [V: 04-1] [cirrus] add cloudelastic service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502832 (https://phabricator.wikimedia.org/T220625) (owner: 10DCausse) [18:38:11] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10Krinkle) @Pchelolo I think it may be better to wait with actual switching of prod jobs until T219279 and T218005 ar... [18:38:36] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Backlog (Watching / External), and 3 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10Krinkle) [18:38:48] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [18:39:18] there is no more redirects.conf , just a redirects.dat for cluster Apache rewrites, right? it gets autocreated now? 
[18:40:08] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) @Krinkle yeah we will wait for sure, meanwhile, we are exploring:) [18:40:44] mutante: afaik there is a script in the puppet repo that generates conf from .dat, both should be in git still. [18:40:49] but maybe it changed in last few months [18:40:54] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) [18:41:16] Krinkle: i remember having to git commit 2 files, .conf and .dat but it changed i think [18:42:38] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10Pchelolo) > A job doesn't offer a way with retrying when they fail. Actually, it does. We do retry jobs unless it... [18:43:41] Krinkle: ah yea..the .conf was removed in https://gerrit.wikimedia.org/r/c/operations/puppet/+/357733 [18:44:16] mutante: oh nice [18:44:26] paravoid++ [18:44:28] :D [18:44:38] ;) [18:46:04] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10Pchelolo) However, I agree that enabling jobs in production might be premature, we can probably start experimenting... [18:46:13] the bulk work of this was ori's [18:46:15] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/138292/ [18:46:40] i.e. 
porting the PHP generator to Ruby, so that it can be used as a parser function [18:46:55] my subsequent commit was just plumbing [18:47:06] and lots of pinging people :P [18:47:08] aha, cool [18:47:45] (03CR) 10Dzahn: [C: 03+2] "removed from DNS and domain on clientHold" [puppet] - 10https://gerrit.wikimedia.org/r/501202 (https://phabricator.wikimedia.org/T219856) (owner: 10Dzahn) [18:52:33] 10Operations, 10Domains, 10Traffic: figure out if we can park wicipediacymraeg.org - https://phabricator.wikimedia.org/T128085 (10Dzahn) [18:52:40] 10Operations, 10Domains, 10Traffic, 10Patch-For-Review: wicipediacymraeg.org is on clientHold - https://phabricator.wikimedia.org/T219856 (10Dzahn) 05Open→03Resolved a:03Dzahn removed from DNS and Apache redirects [18:57:11] paravoid: has anything changed since https://phabricator.wikimedia.org/T198939#4413445 ? do you agree to remove servermon now? [18:57:53] has patches to remove it all but was told to get your ok first [18:58:30] 10Operations, 10Cloud-Services, 10cloud-services-team (Kanban): Cron test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily ) - https://phabricator.wikimedia.org/T211271 (10bd808) 05Open→03Resolved a:03bd808 No sign of this in recent journald logs on labweb100{1,2}.... [19:00:04] twentyafterfour: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190410T1900). 
[19:03:53] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) [19:04:23] (03PS1) 10Hashar: zuul: fix MAILTO for cron [puppet] - 10https://gerrit.wikimedia.org/r/502859 [19:05:08] (03Abandoned) 10Dzahn: cassandra: change superuser_password for testing [puppet] - 10https://gerrit.wikimedia.org/r/502221 (owner: 10Dzahn) [19:05:29] (03PS2) 10Hashar: zuul: fix MAILTO for cron [puppet] - 10https://gerrit.wikimedia.org/r/502859 [19:05:36] twentyafterfour: you are doing train ya? can you let me know when you are done? [19:05:47] ottomata: yep, doing it now [19:05:51] k danke [19:05:54] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/502859 (owner: 10Hashar) [19:07:46] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/143/contint1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/502859 (owner: 10Hashar) [19:08:33] !log twentyafterfour@deploy1001 Pruned MediaWiki: 1.33.0-wmf.19 [keeping static files] (duration: 02m 22s) [19:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:16] * apergos lurks [19:17:37] !log twentyafterfour@deploy1001 Pruned MediaWiki: 1.33.0-wmf.20 [keeping static files] (duration: 02m 18s) [19:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:29] (03PS1) 1020after4: group0 wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502865 [19:18:31] (03CR) 1020after4: [C: 03+2] group0 wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502865 (owner: 1020after4) [19:19:42] (03Merged) 10jenkins-bot: group0 wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502865 (owner: 1020after4) [19:20:18] (03CR) 10Hashar: "Notice: 
/Stage[main]/Zuul::Merger/Cron[zuul_repack]/environment: environment changed 'PATH=/usr/bin:/bin:/usr/sbin:/sbin' to 'PATH=/usr/bi" [puppet] - 10https://gerrit.wikimedia.org/r/502859 (owner: 10Hashar) [19:22:45] (03PS1) 10Ottomata: Add DNS entries for schema[12]00[12] ganeti VMs [dns] - 10https://gerrit.wikimedia.org/r/502866 (https://phabricator.wikimedia.org/T219556) [19:24:34] (03CR) 10Ottomata: [C: 03+2] Add DNS entries for schema[12]00[12] ganeti VMs [dns] - 10https://gerrit.wikimedia.org/r/502866 (https://phabricator.wikimedia.org/T219556) (owner: 10Ottomata) [19:24:37] (03PS2) 10Ottomata: Add DNS entries for schema[12]00[12] ganeti VMs [dns] - 10https://gerrit.wikimedia.org/r/502866 (https://phabricator.wikimedia.org/T219556) [19:26:17] (03CR) 10jenkins-bot: group0 wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502865 (owner: 1020after4) [19:26:37] !log enable sampling on cr2-eqiad external links, outbound [19:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:48] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.33.0-wmf.25 refs T206679 [19:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:51] T206679: 1.33.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T206679 [19:36:40] (03PS1) 1020after4: group1 wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502870 [19:36:43] (03CR) 1020after4: [C: 03+2] group1 wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502870 (owner: 1020after4) [19:37:44] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502870 (owner: 1020after4) [19:37:58] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502870 (owner: 1020after4) [19:38:33] 
10Operations, 10Analytics, 10EventBus, 10vm-requests, and 4 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (10Ottomata) [19:39:11] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team Kanban (Doing), 10Services (doing): Increase the CPU count for proton[12]00[12] - https://phabricator.wikimedia.org/T197862 (10Tgr) This is currently a subtask of {T210651} - does that mean it is seen as a blocker? If not... [19:40:47] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.25 refs T206679 [19:40:53] OresMetadata.php: Class undefined: ORES\ORESServices MediaWiki or an installed extension requires this class but it is not [19:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:08] T206679: 1.33.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T206679 [19:41:16] 10Operations, 10Proton, 10Reading-Infrastructure-Team-Backlog, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Increase the CPU count for proton[12]00[12] - https://phabricator.wikimedia.org/T197862 (10mobrovac) 05Open→03Resolved a:03mobrovac Given that o... [19:42:36] !log twentyafterfour@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.25 refs T206679 (duration: 01m 48s) [19:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:08] ottomata: wmf.25 is deployed to group1 wikis. 
[19:44:18] k danke [19:45:14] (03PS2) 10Ottomata: Enable api-request EventGate logging for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502838 (https://phabricator.wikimedia.org/T214080) [19:46:30] (03CR) 10Ottomata: [C: 03+2] Enable api-request EventGate logging for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502838 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [19:48:20] (03CR) 10jenkins-bot: Enable api-request EventGate logging for group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502838 (https://phabricator.wikimedia.org/T214080) (owner: 10Ottomata) [19:48:49] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enabling api-request logging via eventgate-analytics for group1 wikis - T214080 (duration: 00m 59s) [19:48:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:58] T214080: Rewrite Avro schemas (ApiAction, CirrusSearchRequestSet) as JSONSchema and produce to EventGate - https://phabricator.wikimedia.org/T214080 [19:51:29] ooooh group1, yay [19:52:35] 10Operations, 10Analytics, 10EventBus, 10vm-requests, and 4 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (10Ottomata) On ganeti1003: ` sudo gnt-instance add -t drbd -I hail --net 0:link=private --hypervisor-parameters=kvm:boot_order=network -o... [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190410T2000). 
[20:00:13] no parsoid deploy today [20:01:31] parsoid/js is officially Bug Free (tm) [20:01:38] we're putting all the bugs into parsoid/php now instead [20:01:51] aahahahahahaha [20:03:12] PROBLEM - eventgate-analytics LVS eqiad on eventgate-analytics.svc.eqiad.wmnet is CRITICAL: /robots.txt (robots.txt check) timed out before a response was received: / (root with wrong query param) timed out before a response was received https://wikitech.wikimedia.org/wiki/Event%2A%23EventGate_%28repository%29 [20:03:33] hm [20:04:22] RECOVERY - eventgate-analytics LVS eqiad on eventgate-analytics.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event%2A%23EventGate_%28repository%29 [20:05:36] (03CR) 10Nuria: [C: 03+1] role::druid::public::worker: set stricter timeouts [puppet] - 10https://gerrit.wikimedia.org/r/502814 (https://phabricator.wikimedia.org/T219910) (owner: 10Elukey) [20:06:26] (03PS2) 10Dzahn: cassandra: no default for super_user, super_password [puppet] - 10https://gerrit.wikimedia.org/r/502240 (https://phabricator.wikimedia.org/T219560) [20:07:12] (03CR) 10jerkins-bot: [V: 04-1] cassandra: no default for super_user, super_password [puppet] - 10https://gerrit.wikimedia.org/r/502240 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn) [20:08:20] PROBLEM - eventgate-analytics LVS eqiad on eventgate-analytics.svc.eqiad.wmnet is CRITICAL: / (root with wrong query param) timed out before a response was received https://wikitech.wikimedia.org/wiki/Event%2A%23EventGate_%28repository%29 [20:08:26] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:08:38] we are investigating in #services and will roll back shortly... [20:08:54] k8s operational latencies?! 
[20:09:32] RECOVERY - eventgate-analytics LVS eqiad on eventgate-analytics.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event%2A%23EventGate_%28repository%29 [20:10:43] (03CR) 10Nuria: [C: 03+1] admin: add jupyterhub restart capabilities to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/502677 (owner: 10Elukey) [20:12:56] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:14:50] PROBLEM - eventgate-analytics LVS eqiad on eventgate-analytics.svc.eqiad.wmnet is CRITICAL: /robots.txt (robots.txt check) timed out before a response was received: / (spec from root) timed out before a response was received: / (doc from root) timed out before a response was received https://wikitech.wikimedia.org/wiki/Event%2A%23EventGate_%28repository%29 [20:16:22] RECOVERY - eventgate-analytics LVS eqiad on eventgate-analytics.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event%2A%23EventGate_%28repository%29 [20:16:24] (03PS1) 10Ottomata: eventgate-analytics - add extra_service_runner_conf templating [deployment-charts] - 10https://gerrit.wikimedia.org/r/502877 [20:17:12] PROBLEM - HHVM jobrunner on mw1294 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:17:23] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - add extra_service_runner_conf templating [deployment-charts] - 10https://gerrit.wikimedia.org/r/502877 (owner: 10Ottomata) [20:18:28] RECOVERY - HHVM jobrunner on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [20:18:46] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:19:02] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-analytics_31192: Servers kubernetes1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:19:46] (03CR) 10Ppchelko: eventgate-analytics - add extra_service_runner_conf templating (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/502877 (owner: 10Ottomata) [20:20:22] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:20:32] (03PS3) 10Dzahn: cassandra: no default for super_user, super_password [puppet] - 10https://gerrit.wikimedia.org/r/502240 (https://phabricator.wikimedia.org/T219560) [20:21:14] (03CR) 10jerkins-bot: [V: 04-1] cassandra: no default for super_user, super_password [puppet] - 10https://gerrit.wikimedia.org/r/502240 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn) [20:22:42] PROBLEM - LVS HTTP IPv4 on eventgate-analytics.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:22:44] PROBLEM - eventgate-analytics LVS eqiad on eventgate-analytics.svc.eqiad.wmnet is CRITICAL: / (root with no query params) timed out before a response was received: / (spec from root) timed out before a response was received: / (doc from root) timed out before a response was received: / (root with wrong query param) timed out before a response was received https://wikitech.wikimedia.org/wiki/Event%2A%23EventGate_%28repository%29 [20:22:44] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:23:44] PROBLEM - PyBal backends health check on lvs1006 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-analytics_31192: Servers 
kubernetes1002.eqiad.wmnet, kubernetes1003.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:24:38] (03PS1) 10Ottomata: Revert "Enable api-request EventGate logging for group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502880 [20:25:07] (03CR) 10Ottomata: [C: 03+2] "Workers keep getting restarted due to missing heartbeat, and k8s operational latencies are up. Hm." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502880 (owner: 10Ottomata) [20:25:22] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:26:36] RECOVERY - LVS HTTP IPv4 on eventgate-analytics.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 802 bytes in 7.879 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:26:48] PROBLEM - eventgate-analytics LVS eqiad on eventgate-analytics.svc.eqiad.wmnet is CRITICAL: /robots.txt (robots.txt check) timed out before a response was received: / (spec from root) timed out before a response was received: / (doc from root) timed out before a response was received: / (root with wrong query param) timed out before a response was received https://wikitech.wikimedia.org/wiki/Event%2A%23EventGate_%28repository%29 [20:26:52] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - eventgate-analytics_31192: Servers kubernetes1003.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:27:43] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Revert - Enabling api-request logging via eventgate-analytics for group1 wikis - T214080 (duration: 01m 00s) [20:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:47] T214080: Rewrite Avro schemas (ApiAction, CirrusSearchRequestSet) as JSONSchema and produce to EventGate - 
https://phabricator.wikimedia.org/T214080 [20:27:52] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:28:41] (03PS1) 10Paladox: Initial implementation of a PolyGerrit plugin that allows mass branch creation for specified branches [software/gerrit/plugins/MassBranchCreation] - 10https://gerrit.wikimedia.org/r/502883 [20:28:56] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:29:08] RECOVERY - eventgate-analytics LVS eqiad on eventgate-analytics.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Event%2A%23EventGate_%28repository%29 [20:29:26] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:30:32] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:33:00] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:33:40] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:34:24] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:38:57] (03CR) 10jenkins-bot: Revert "Enable api-request EventGate logging for group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502880 (owner: 10Ottomata) [20:43:41] !log decommissioning cassandra-c, restbase2007 -- T208087 [20:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:45] T208087: Replace remaining Samsung SSDs - https://phabricator.wikimedia.org/T208087 [20:49:38] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/502859 (owner: 10Hashar) [20:51:19] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:52:00] (03PS1) 10Dzahn: cassandra: pass super_user,super_password to create_resource [puppet] - 10https://gerrit.wikimedia.org/r/502890 [20:52:46] (03CR) 10jerkins-bot: [V: 04-1] cassandra: pass super_user,super_password to create_resource [puppet] - 10https://gerrit.wikimedia.org/r/502890 (owner: 10Dzahn) [20:55:31] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:59:25] PROBLEM - zuul_merger_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [20:59:41] ^^^ that is me [21:03:09] (03PS2) 10Dzahn: cassandra: pass super_user,super_password to create_resource [puppet] - 10https://gerrit.wikimedia.org/r/502890 [21:03:20] hashar: ok.thx. 
also, the cron mail issue should be solved [21:06:01] (03CR) 10Dzahn: "@urandom finally this is it: https://puppet-compiler.wmflabs.org/compiler1002/15682/sessionstore1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/502890 (owner: 10Dzahn) [21:06:04] mutante: will see tomorrow whether I get some email as a result :] [21:06:17] hashar: ok! is that a mailman list? [21:06:26] eh, nevermind. it's not [21:06:31] I have no idea :/ [21:06:43] probably just an alias to some folks [21:06:45] ACKNOWLEDGEMENT - zuul_merger_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger amusso zuul-merger maintenance https://www.mediawiki.org/wiki/Continuous_integration/Zuul [21:07:28] hashar: wow, it's none of the options i knew or expected [21:07:40] it's not in Google and not in exim and not in mailman [21:07:56] mutante: hrmm, that seems...very obvious [21:08:01] I wonder how it ever worked [21:08:15] urandom: it took me way too long to find it though for that :) [21:08:29] because of the create_resources part [21:08:36] I mean, prior to the erroneous include in the role [21:08:41] create_resources? [21:08:59] create_resources('class', {'::cassandra' => $cassandra_real_settings}) [21:09:12] the cassandra class isn't declared or included the common way [21:09:57] that line 35 in https://gerrit.wikimedia.org/r/c/operations/puppet/+/502890/2/modules/profile/manifests/cassandra.pp is where it actually errored if you did not set any default [21:10:40] anyways..now just trying to make sure it fixes sessionstore but does NOT change other things [21:12:21] (03CR) 10Eevans: "Should this include application_username and application_password as well?" 
[puppet] - 10https://gerrit.wikimedia.org/r/502890 (owner: 10Dzahn) [21:12:41] RECOVERY - zuul_merger_service_running on contint1001 is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [21:12:47] :) [21:14:01] (03PS2) 10Paladox: Initial implementation of a PolyGerrit plugin that allows mass branch creation for specified branches [software/gerrit/plugins/MassBranchCreation] - 10https://gerrit.wikimedia.org/r/502883 [21:18:05] hashar: don't expect email in WMF inbox, do expect in free.fr inbox [21:18:37] mutante: yeah that is where all tech emails end up :D [21:19:03] eventually I should revisit the members of that email [21:19:07] or migrate to something else [21:19:19] hashar: there is just 1 member, you [21:19:37] which does not scale! [21:19:46] we would like to move those to Google [21:19:52] ideally OIT could make a group [21:24:43] (03CR) 10Dzahn: "found the difference. application_username and application_password appear in "profile::cassandra::settings" in Hiera. so it gets them fro" [puppet] - 10https://gerrit.wikimedia.org/r/502890 (owner: 10Dzahn) [21:27:02] mutante: but could we send crontab spam to an OIT group? [21:27:28] hashar: yes, the email address itself would not change [21:27:42] I don't mind deleting the address as well ;] [21:27:46] just that OIT handles removals/additions [21:27:47] (03CR) 10Eevans: "> found the difference. 
application_username and application_password" [puppet] - 10https://gerrit.wikimedia.org/r/502890 (owner: 10Dzahn) [21:27:52] sounds good [21:28:41] I filed a task to not forget about it [21:28:54] (03PS3) 10Paladox: Initial implementation of a PolyGerrit plugin that allows mass branch creation for specified branches [software/gerrit/plugins/MassBranchCreation] - 10https://gerrit.wikimedia.org/r/502883 [21:29:09] hashar: great:) technically subtask of https://phabricator.wikimedia.org/T122144 [21:29:21] (03PS3) 10Dzahn: cassandra: pass super_user,super_password to create_resource [puppet] - 10https://gerrit.wikimedia.org/r/502890 [21:31:07] 10Operations, 10Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144 (10hashar) [21:31:19] mutante: made it a subtask. Danke! [21:31:22] I am off for now! [21:32:13] hashar: de rien, bonne nuit [21:32:34] (03PS4) 10Dzahn: cassandra: pass super_user,super_password to Hiera sessionstore role [puppet] - 10https://gerrit.wikimedia.org/r/502890 (https://phabricator.wikimedia.org/T219560) [21:33:04] (03Abandoned) 10Dzahn: cassandra: no default for super_user, super_password [puppet] - 10https://gerrit.wikimedia.org/r/502240 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn) [21:37:26] (03PS4) 10Paladox: Initial implementation of a PolyGerrit plugin that allows mass branch creation for specified branches [software/gerrit/plugins/MassBranchCreation] - 10https://gerrit.wikimedia.org/r/502883 [21:37:30] (03PS5) 10Dzahn: cassandra: add super_user,super_password to Hiera sessionstore role [puppet] - 10https://gerrit.wikimedia.org/r/502890 (https://phabricator.wikimedia.org/T219560) [21:37:39] (03CR) 10Dzahn: "ehm.. 
https://puppet-compiler.wmflabs.org/compiler1002/15683/sessionstore1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/502890 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn) [21:46:42] (03CR) 10Volans: [C: 04-1] "This is probably not the right solution, but the right one might require more complex refactors, see inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502829 (https://phabricator.wikimedia.org/T220625) (owner: 10Gehel) [21:54:34] (03PS6) 10Dzahn: sessionstore: add super_username,super_password to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/502890 (https://phabricator.wikimedia.org/T219560) [21:58:27] (03CR) 10Dzahn: [C: 03+2] sessionstore: add super_username,super_password to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/502890 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn) [22:17:15] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1008 with 10G interfaces - https://phabricator.wikimedia.org/T216661 (10RobH) a:05RobH→03None [22:17:40] (03PS1) 10Smalyshev: Enable revisions support on internal clusters [puppet] - 10https://gerrit.wikimedia.org/r/502909 (https://phabricator.wikimedia.org/T217897) [22:19:57] (03PS1) 10Dzahn: sessionstore: debug super_password lookup issue [puppet] - 10https://gerrit.wikimedia.org/r/502912 [22:23:24] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/15685/sessionstore1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/502912 (owner: 10Dzahn) [22:32:29] (03CR) 10Cwhite: [C: 03+1] logging: move webrequest-5xx to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/493243 (https://phabricator.wikimedia.org/T213899) (owner: 10Filippo Giunchedi) [22:38:55] (03CR) 10Cwhite: [C: 03+1] puppet_major_version4: remove old puppet_major_version variable. 
[puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [22:49:08] 10Operations, 10SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (10Jdforrester-WMF) [22:49:17] 10Operations, 10SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (10Jdforrester-WMF) [22:55:17] 10Operations, 10ops-eqiad, 10decommission: Decommission iron - https://phabricator.wikimedia.org/T220505 (10Aklapper) [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190410T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:53] 10Operations, 10SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (10Dzahn) Did the Wikitech (SQL) part: on mwmaint1002 -> 'sql labswiki' and ` UPDATE user SET user_password = reverse( user_password ),user_email = reverse( user_email ) where user_name="HaeB"; ` [23:23:10] (03CR) 10Alex Monk: "Are you really sure this would actually work given wikitech's LDAP authentication?" [puppet] - 10https://gerrit.wikimedia.org/r/498429 (owner: 10Dzahn) [23:35:23] (03CR) 10Dzahn: "> Patch Set 9:" [puppet] - 10https://gerrit.wikimedia.org/r/498429 (owner: 10Dzahn) [23:35:26] (03CR) 10Alex Monk: [C: 04-1] "Also the reversing thing might work to break MediaWiki's password hash format but won't work against a palindrome address e.g. if someone " [puppet] - 10https://gerrit.wikimedia.org/r/498429 (owner: 10Dzahn) [23:43:39] 10Operations, 10PHP 7.0 support, 10Patch-For-Review: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Krinkle) Exported from mwdebug1001 in plain text and sorted. Full dumps at P8387 and P8386. 
### Differences * [ ] APC This seems worth looking... [23:45:02] 10Operations, 10PHP 7.0 support, 10Patch-For-Review: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Krinkle) [23:45:36] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Krinkle) [23:45:38] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Krinkle) [23:54:38] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Papaul) @Marostegui while installing db2102 I am getting [!!] Partition disks ├─────────────┐ │...