[00:02:10] 08Warning Alert for device cr2-esams.wikimedia.org - Memory over 85% [00:04:33] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [00:12:04] (03CR) 10Jeena Huneidi: "> Patch Set 6: Code-Review-1" [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) (owner: 10Jeena Huneidi) [00:31:53] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [00:59:52] (03PS7) 10Jeena Huneidi: Give scaffold template configuration options for dev purposes [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) [01:00:25] (03CR) 10Jeena Huneidi: "So for wmf.appbase_url I added a new value appbase_url_port in values.yaml. The other options I see are:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) (owner: 10Jeena Huneidi) [01:12:10] 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Memory over 85% [01:52:10] 08Warning Alert for device cr2-esams.wikimedia.org - Memory over 85% [02:38:47] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 95416128 and 7 seconds [02:40:37] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22208760 and 0 seconds [02:42:05] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 47 seconds [02:43:13] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 149752 and 76 seconds [03:08:22] (03PS1) 10BryanDavis: Remove unused files from lib [software/tendril] - 10https://gerrit.wikimedia.org/r/520357 [03:20:57] 
10Operations, 10Diffusion, 10Packaging, 10Patch-For-Review, and 4 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10Dzahn) The ssh.log is now owned by the $vcsuser:root and we did chmod 640, via puppet of course. [04:23:12] !log rebooting primary lvs servers for MDS security updates [04:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:24:32] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [04:24:33] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [04:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:34:37] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [04:34:38] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [04:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:47:52] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [04:47:52] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [04:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:46] 10Operations, 10ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T227107 (10Marostegui) a:03Papaul Let's replace the disk please! 
Thanks [04:54:55] 10Operations, 10ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T227107 (10Marostegui) p:05Triage→03Normal [04:55:04] (03PS2) 10BryanDavis: Remove unused files from lib [software/tendril] - 10https://gerrit.wikimedia.org/r/520357 [04:55:06] (03PS1) 10BryanDavis: activity: constrain info cell width and height [software/tendril] - 10https://gerrit.wikimedia.org/r/520359 [04:55:18] (03PS4) 10Marostegui: mariadb: Promote db1120 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/519185 (https://phabricator.wikimedia.org/T226358) [04:55:22] (03CR) 10Marostegui: mariadb: Promote db1120 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/519185 (https://phabricator.wikimedia.org/T226358) (owner: 10Marostegui) [04:55:31] (03CR) 10Marostegui: wmnet: Change x1-master to point to the new master [dns] - 10https://gerrit.wikimedia.org/r/519186 (https://phabricator.wikimedia.org/T226358) (owner: 10Marostegui) [04:55:35] (03PS5) 10Marostegui: wmnet: Change x1-master to point to the new master [dns] - 10https://gerrit.wikimedia.org/r/519186 (https://phabricator.wikimedia.org/T226358) [04:55:46] (03CR) 10Marostegui: db-eqiad.php: Promote db1120 to x1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519187 (https://phabricator.wikimedia.org/T226358) (owner: 10Marostegui) [04:55:51] (03PS4) 10Marostegui: db-eqiad.php: Promote db1120 to x1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519187 (https://phabricator.wikimedia.org/T226358) [04:58:27] PROBLEM - PyBal connections to etcd on lvs4006 is CRITICAL: CRITICAL: 7 connections established with conf2003.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [04:59:03] uh [04:59:04] checking [05:02:01] !log Start pre-failover steps for x1 - T226358 [05:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:06] T226358: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - 
https://phabricator.wikimedia.org/T226358 [05:03:39] !log restarting pybal on lvs4006 [05:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:07:09] RECOVERY - PyBal connections to etcd on lvs4006 is OK: OK: 8 connections established with conf2003.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [05:11:50] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db1120 to x1 master [puppet] - 10https://gerrit.wikimedia.org/r/519185 (https://phabricator.wikimedia.org/T226358) (owner: 10Marostegui) [05:21:49] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [05:21:50] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:22:44] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Joe) @thcipriani why do we even need to save the images? We don't really care about loc... 
[05:35:14] (03CR) 10Jcrespo: [C: 03+2] Remove unused files from lib [software/tendril] - 10https://gerrit.wikimedia.org/r/520357 (owner: 10BryanDavis) [05:35:27] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Remove unused files from lib [software/tendril] - 10https://gerrit.wikimedia.org/r/520357 (owner: 10BryanDavis) [05:35:42] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] activity: constrain info cell width and height [software/tendril] - 10https://gerrit.wikimedia.org/r/520359 (owner: 10BryanDavis) [05:39:07] (03CR) 10Jcrespo: "I cannot see how this is better :-/" [software/tendril] - 10https://gerrit.wikimedia.org/r/520359 (owner: 10BryanDavis) [05:41:12] 10Operations, 10ops-eqiad, 10User-Elukey: (OoW) Heating alerts and broken RAM on kafka1014 - https://phabricator.wikimedia.org/T204479 (10elukey) 05Open→03Declined The server will be decommed by https://phabricator.wikimedia.org/T226517, closing! [05:45:22] 10Operations, 10ops-eqiad: rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10elukey) a:05elukey→03Cmjohnson All right so: - hostnames an-conf100[123] - ip subnet info: analytics VLAN - OS: stretch - partitioning scheme: probably a variant of conf-lvm, will come up w... [05:45:53] (03CR) 10BryanDavis: "> I cannot see how this is better :-/" [software/tendril] - 10https://gerrit.wikimedia.org/r/520359 (owner: 10BryanDavis) [05:46:59] (03CR) 10Jcrespo: "> Patch Set 1:" [software/tendril] - 10https://gerrit.wikimedia.org/r/520359 (owner: 10BryanDavis) [05:51:12] We are going to take over the mediawiki-config repo for the x1 failover. Please coordinate with us before merging/Deploying something [05:52:40] 10Operations, 10Dumps-Generation, 10hardware-requests: Get a third dumpsdata server - https://phabricator.wikimedia.org/T219768 (10ArielGlenn) [05:52:44] marostegui: o/ - just to be sure, only mediawiki-config right? Or do you guys also need puppet? 
[05:52:55] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime [05:52:57] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:52:59] (was about to merge one thing) [05:53:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:53:27] elukey: We don't need puppet for now, we have already merged what we needed [05:53:28] (03CR) 10Elukey: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/520267 (owner: 10Elukey) [05:53:36] (03PS5) 10Elukey: Move the zookeeper submodule into the repository - part 1 [puppet] - 10https://gerrit.wikimedia.org/r/520267 [05:53:40] marostegui: ack thanks [05:54:16] (03CR) 10Elukey: [C: 03+2] Move the zookeeper submodule into the repository - part 1 [puppet] - 10https://gerrit.wikimedia.org/r/520267 (owner: 10Elukey) [05:55:12] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Promote db1120 to x1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519187 (https://phabricator.wikimedia.org/T226358) (owner: 10Marostegui) [05:55:57] (03CR) 10Jcrespo: "> Patch Set 1:" [software/tendril] - 10https://gerrit.wikimedia.org/r/520359 (owner: 10BryanDavis) [05:56:08] (03Merged) 10jenkins-bot: db-eqiad.php: Promote db1120 to x1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519187 (https://phabricator.wikimedia.org/T226358) (owner: 10Marostegui) [05:56:25] (03CR) 10jenkins-bot: db-eqiad.php: Promote db1120 to x1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519187 (https://phabricator.wikimedia.org/T226358) (owner: 10Marostegui) [05:56:36] (03PS2) 10Elukey: Move the zookeeper submodule into the repository - part 2 [puppet] - 10https://gerrit.wikimedia.org/r/520271 (https://phabricator.wikimedia.org/T226466) [05:57:34] (03CR) 10Elukey: [C: 03+2] Move the zookeeper submodule into the repository - part 2 [puppet] - 10https://gerrit.wikimedia.org/r/520271 
(https://phabricator.wikimedia.org/T226466) (owner: 10Elukey) [06:00:05] marostegui and jynus: #bothumor My software never has bugs. It just develops random features. Rise for x1 database master failover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190703T0600). [06:00:09] jynus: ready? [06:00:13] !log move the zookeeper puppet submodule into operations/puppet - T226466 [06:00:13] yes [06:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:18] T226466: Move the puppet cdh and zookeeper submodules into operations/puppet - https://phabricator.wikimedia.org/T226466 [06:00:20] ok, starting then [06:00:24] !log Starting x1 failover from db1069 to db1120 - T226358 [06:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:00:29] T226358: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 [06:00:44] done [06:01:01] replication looking good [06:01:43] some reading list errors [06:01:50] deploying config [06:01:56] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Switchover x1 master eqiad from db1069 to db1120 T226358 (duration: 00m 27s) [06:02:01] done [06:02:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:40] no errors [06:02:47] so far so good yeah [06:04:26] (03CR) 10Marostegui: [C: 03+2] wmnet: Change x1-master to point to the new master [dns] - 10https://gerrit.wikimedia.org/r/519186 (https://phabricator.wikimedia.org/T226358) (owner: 10Marostegui) [06:05:03] The failover is done, deployments and merges on mediawiki-config can resume as usual [06:08:07] 10Operations, 10Analytics, 10Analytics-Kanban, 10Cleanup: Archive zookeeper puppet submodule - https://phabricator.wikimedia.org/T227164 (10elukey) [06:09:36] (03PS1) 10Tulsi Bhagat: Add sju, sjd, and rmf to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520364 
(https://phabricator.wikimedia.org/T226701) [06:09:51] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 10 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Marostegui) This was done. Read only start: 06:00:36 UTC Read only stop: 06:01:56 UTC Total read only time: 01:20 min [06:14:28] (03CR) 10Tulsi Bhagat: [C: 03+1] Rename `Image-reviewer` to `image-reviewer` [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520283 (https://phabricator.wikimedia.org/T216406) (owner: 10Urbanecm) [06:14:50] (03PS1) 10Elukey: Deprecate repository [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/520365 (https://phabricator.wikimedia.org/T227164) [06:15:02] (03CR) 10jerkins-bot: [V: 04-1] Deprecate repository [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/520365 (https://phabricator.wikimedia.org/T227164) (owner: 10Elukey) [06:16:27] (03PS2) 10Elukey: Deprecate repository [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/520365 (https://phabricator.wikimedia.org/T227164) [06:16:38] (03CR) 10jerkins-bot: [V: 04-1] Deprecate repository [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/520365 (https://phabricator.wikimedia.org/T227164) (owner: 10Elukey) [06:16:48] (03CR) 10Elukey: [V: 03+2 C: 03+2] Deprecate repository [puppet/zookeeper] - 10https://gerrit.wikimedia.org/r/520365 (https://phabricator.wikimedia.org/T227164) (owner: 10Elukey) [06:17:23] (03CR) 10Marostegui: "Nice work! Some comments inline!" 
(034 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520280 (owner: 10Jcrespo) [06:18:09] 10Operations, 10Analytics, 10Analytics-Kanban, 10Cleanup, 10Patch-For-Review: Archive zookeeper puppet submodule - https://phabricator.wikimedia.org/T227164 (10elukey) [06:21:54] 10Operations, 10ops-codfw, 10DBA: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Marostegui) [06:24:29] 10Operations, 10Analytics, 10Analytics-Kanban, 10Cleanup, 10Patch-For-Review: Archive zookeeper puppet submodule - https://phabricator.wikimedia.org/T227164 (10elukey) There are some pull requests to close in https://github.com/wikimedia/puppet-zookeeper/pulls and also to set the mirror as read only, but... [06:26:03] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [06:27:08] https://en.wikipedia.org/wiki/File:SMS_Arcona_NH_65764.tiff what happened to this file? [06:27:16] 10Operations, 10Analytics, 10Analytics-Kanban, 10Cleanup: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10elukey) @hashar should I deactivate the repo in diffusion? Or do anything else? [06:28:49] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [06:29:17] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [06:32:15] (03PS1) 10Marostegui: db1069: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/520371 (https://phabricator.wikimedia.org/T227166) [06:32:48] (03CR) 10Marostegui: [C: 03+2] db1069: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/520371 (https://phabricator.wikimedia.org/T227166) (owner: 10Marostegui) [06:33:33] PROBLEM - puppet last run on theemin is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. 
[06:39:27] (03PS1) 10Marostegui: db-eqiad.php: Clarify db1069 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520373 [06:40:40] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Clarify db1069 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520373 (owner: 10Marostegui) [06:42:10] (03Merged) 10jenkins-bot: db-eqiad.php: Clarify db1069 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520373 (owner: 10Marostegui) [06:42:27] (03CR) 10jenkins-bot: db-eqiad.php: Clarify db1069 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520373 (owner: 10Marostegui) [06:42:54] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Clarify db1069 status (duration: 00m 28s) [06:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:05] !log depool and roll-restart swift proxy - T209182 [06:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:10] T209182: Setup Swift Storage for Netbox image (was: netbox won't allow me to upload photos of the rack) - https://phabricator.wikimedia.org/T209182 [07:00:51] RECOVERY - puppet last run on theemin is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:02:06] 10Operations, 10Analytics, 10Analytics-Kanban, 10Cleanup: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10hashar) Yes archive it in Diffusion and we will also just delete the Github mirror. Just a note, it is possible to merge the repository into operations/puppet.git while ke... 
[07:02:51] (03PS3) 10Jcrespo: switchover.py: Add some extra automations to the script [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520280 [07:03:14] (03CR) 10jerkins-bot: [V: 04-1] switchover.py: Add some extra automations to the script [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520280 (owner: 10Jcrespo) [07:04:38] 10Operations, 10Analytics, 10Analytics-Kanban, 10Cleanup: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10elukey) >>! In T226474#5302653, @hashar wrote: > Yes archive it in Diffusion and we will also just delete the Github mirror. IIUC Timo suggested to leave the github mirror... [07:05:12] !log add 150G to graphite hosts lv, was at 94% utilization [07:05:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:10] !log Drop secret and scratch_tokens from fishbowl wiki list T226826 [07:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:15] T226826: Drop old oathauth_users columns - https://phabricator.wikimedia.org/T226826 [07:10:06] (03PS2) 10Marostegui: mariadb: Promote db1132 to m2 master [puppet] - 10https://gerrit.wikimedia.org/r/519975 (https://phabricator.wikimedia.org/T226952) [07:10:44] !log Drop secret and scratch_tokens from labswiki (wikitech) and labstestwiki - T226826 [07:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:28] 10Operations, 10ops-eqiad: (Need By: August 31) rack/setup/install (3) new zookeeper nodes - https://phabricator.wikimedia.org/T227025 (10wiki_willy) [07:14:42] 10Operations, 10ops-codfw, 10DBA, 10Goal: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Marostegui) [07:20:53] (03CR) 10Vgutierrez: [C: 03+2] acme_chief: Enforce staging time validation (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/517605 (https://phabricator.wikimedia.org/T225945) (owner: 10Vgutierrez) [07:21:13] FYI I'll be 
stopping puppet fleetwide in a little bit, https://gerrit.wikimedia.org/r/c/operations/puppet/+/520012 [07:22:38] (03CR) 10Marostegui: "We can use this next week for the m2 failover (T226952) if you feel confident about it?" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520280 (owner: 10Jcrespo) [07:23:07] !log updated buster installer d-i image to RC3 [07:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:29] (03PS6) 10Filippo Giunchedi: rsyslog: use named actions for central syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/520012 (https://phabricator.wikimedia.org/T226703) [07:24:08] (03Merged) 10jenkins-bot: acme_chief: Enforce staging time validation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/517605 (https://phabricator.wikimedia.org/T225945) (owner: 10Vgutierrez) [07:24:36] !log temporarily disable puppet to test/apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/520012 [07:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:58] (03CR) 10Filippo Giunchedi: [C: 03+2] rsyslog: use named actions for central syslog hosts [puppet] - 10https://gerrit.wikimedia.org/r/520012 (https://phabricator.wikimedia.org/T226703) (owner: 10Filippo Giunchedi) [07:25:36] !log Upgrade db2100 (snapshots on that host are finished) [07:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:57] (03CR) 10jenkins-bot: acme_chief: Enforce staging time validation [software/acme-chief] - 10https://gerrit.wikimedia.org/r/517605 (https://phabricator.wikimedia.org/T225945) (owner: 10Vgutierrez) [07:33:20] !log Upgrade db2078 (s8 codfw master) [07:33:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:30] 10Operations, 10vm-requests, 10cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (10akosiaris) >>! 
In T227041#5302151, @Andrew wrote: > >> How is corosync/pacemaker going to work then with a single VIP? > > I may be miss... [07:34:42] !log reenable puppet fleetwide [07:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:50] 10Operations, 10procurement: Wildcard SSL Cert Renewal (*.corp.wikimedia.org) - https://phabricator.wikimedia.org/T227149 (10akosiaris) p:05Triage→03High [07:38:11] ACKNOWLEDGEMENT - Disk space on an-tool1006 is CRITICAL: DISK CRITICAL - /mnt/hdfs is not accessible: Input/output error Elukey Still testing Kerberos configs https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [07:45:14] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [07:45:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [07:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:27] (03CR) 10JJMC89: Update AbuseFilter config to keep the status quo (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 (owner: 10Daimona Eaytoy) [07:53:11] (03CR) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 (owner: 10Daimona Eaytoy) [07:54:39] (03PS21) 10Vgutierrez: ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) [07:55:15] (03PS27) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [07:55:17] (03CR) 10jerkins-bot: [V: 04-1] ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [07:55:26] (03CR) 10Vgutierrez: ncredir: Provide initial puppetization (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/519998 
(https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [07:56:53] (03PS22) 10Vgutierrez: ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) [07:57:12] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) [07:59:24] 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash and creating grafana boards / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is all complete, please let me know if there a... [07:59:35] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi: Deprecate all non-Kafka logstash inputs - https://phabricator.wikimedia.org/T227080 (10MoritzMuehlenhoff) p:05Triage→03Normal [08:01:03] (03CR) 10JJMC89: [C: 03+1] Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 (owner: 10Daimona Eaytoy) [08:03:05] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10hashar) >>! In T207707#5302475, @Joe wrote: > @thcipriani why do we even need to save t... [08:05:26] 10Operations, 10Release-Engineering-Team, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Request access to deployment cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223698 (10MoritzMuehlenhoff) @alaa_wmde : Please generate a separate SSH key for the access to the Wikimedia production... 
[08:08:48] (03PS1) 10Marostegui: install_server: Allow reimage db21[21-30].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/520379 (https://phabricator.wikimedia.org/T227113) [08:09:07] 10Operations, 10Analytics, 10Wikimedia-Incident: Move icinga alarm for the EventStreams external endpoint to SRE - https://phabricator.wikimedia.org/T227065 (10MoritzMuehlenhoff) p:05Triage→03Normal [08:09:20] (03CR) 10jerkins-bot: [V: 04-1] install_server: Allow reimage db21[21-30].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/520379 (https://phabricator.wikimedia.org/T227113) (owner: 10Marostegui) [08:09:46] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10hashar) That the package is broken? I am stalling this and filling another task for WMCS to rebuild the Jessie image. [08:09:54] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10hashar) 05Open→03Stalled [08:12:01] (03PS2) 10Marostegui: install_server: Allow reimage db21[21-30].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/520379 (https://phabricator.wikimedia.org/T227113) [08:12:34] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10MoritzMuehlenhoff) The package per se isn't broken, only the upgrade path from the old jessie version to 8.1901.0-1~bpo8+wmf1, given that... [08:14:41] 10Operations, 10ops-codfw, 10Patch-For-Review: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (10MoritzMuehlenhoff) This task covers many things, AFAICT the original hardware issues are fixed and https://gerrit.wikimedia.org/r/500684 is merged, so good to close? 
[08:18:16] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi: Port varnishlog consumers to log to syslog / logging infra - https://phabricator.wikimedia.org/T227108 (10MoritzMuehlenhoff) p:05Triage→03Normal [08:20:23] 10Operations, 10Continuous-Integration-Infrastructure: Jessie rsyslog_8.1901.0-1~bpo8+wmf1_amd64.deb package fails to upgrade - https://phabricator.wikimedia.org/T222166 (10hashar) >>! In T222166#5302794, @MoritzMuehlenhoff wrote: > The package per se isn't broken, only the upgrade path from the old jessie ver... [08:26:01] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) [08:28:20] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) [08:32:28] (03PS1) 10Jcrespo: switchover.py: Enable new option --replicating-master [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520382 [08:32:57] (03CR) 10jerkins-bot: [V: 04-1] switchover.py: Enable new option --replicating-master [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520382 (owner: 10Jcrespo) [08:33:09] (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage db21[21-30].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/520379 (https://phabricator.wikimedia.org/T227113) (owner: 10Marostegui) [08:34:17] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db21[21-30].codfw.wmnet - https://phabricator.wikimedia.org/T227113 (10Marostegui) @RobH @Papaul I have merged: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/520379/ The only changes pending from your side to be abl... 
[08:43:15] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [08:43:18] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:12] !log rolling reboot of kubernetes masters in codfw [08:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:30] !log rolling reboot of kubernetes masters in codfw to pick up MDS-enabled qemu [08:44:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:53] (03PS23) 10Vgutierrez: ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) [08:47:54] 10Operations, 10DBA, 10OTRS, 10Operations-Software-Development, and 2 others: Failover m2 master db1065 to db1132 - https://phabricator.wikimedia.org/T226952 (10Marostegui) [08:50:27] PROBLEM - puppet last run on deploy1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 38 seconds ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment-charts/helmfile.d/services/staging/sessionstore/private] [08:59:32] (03PS1) 10Fsero: helmfile,k8s: cannot apply helm secrets due to missing user [puppet] - 10https://gerrit.wikimedia.org/r/520387 (https://phabricator.wikimedia.org/T212130) [09:00:41] ACKNOWLEDGEMENT - puppet last run on deploy1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment-charts/helmfile.d/services/staging/sessionstore/private] Fsero something was missing on helmfile puppetization. 
Fixing it [09:01:26] !log rolling reboot of kubernetes masters in eqiad to pick up MDS-enabled qemu [09:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:33] (03CR) 10Fsero: "https://puppet-compiler.wmflabs.org/compiler1001/17202/deploy1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/520387 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [09:03:45] PROBLEM - puppet last run on deploy2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment-charts/helmfile.d/services/staging/sessionstore/private] [09:05:28] (03PS2) 10Jcrespo: switchover.py: Add new options --replicating-master & --read-only-master [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520382 [09:05:54] (03CR) 10jerkins-bot: [V: 04-1] switchover.py: Add new options --replicating-master & --read-only-master [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520382 (owner: 10Jcrespo) [09:08:37] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: factorice preflight checks to the common profile [puppet] - 10https://gerrit.wikimedia.org/r/520388 (https://phabricator.wikimedia.org/T215531) [09:11:08] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: factorice preflight checks to the common profile [puppet] - 10https://gerrit.wikimedia.org/r/520388 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [09:12:30] (03CR) 10Arturo Borrero Gonzalez: "I separated the preflight checks into a separate patch. 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/520388" [puppet] - 10https://gerrit.wikimedia.org/r/520319 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [09:17:21] (03PS1) 10Alexandros Kosiaris: helmfile: Set owner for secrets values [puppet] - 10https://gerrit.wikimedia.org/r/520390 (https://phabricator.wikimedia.org/T212130) [09:17:37] (03PS1) 10Elukey: profile::hadoop::master: add Yarn unhealthy workers check [puppet] - 10https://gerrit.wikimedia.org/r/520391 (https://phabricator.wikimedia.org/T226698) [09:18:30] (03PS1) 10Gehel: wdqs: remove "mdc" prefix from logs. [puppet] - 10https://gerrit.wikimedia.org/r/520392 [09:18:34] (03CR) 10Elukey: [C: 03+2] profile::hadoop::master: add Yarn unhealthy workers check [puppet] - 10https://gerrit.wikimedia.org/r/520391 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [09:19:07] (03PS2) 10Arturo Borrero Gonzalez: toolforge: refactor to join a node to the new cluster [puppet] - 10https://gerrit.wikimedia.org/r/520319 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [09:19:34] (03Abandoned) 10Arturo Borrero Gonzalez: toolforge: k8s: add calico yaml config [puppet] - 10https://gerrit.wikimedia.org/r/520238 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [09:20:01] (03PS3) 10Jcrespo: switchover.py: Add new options --replicating-master & --read-only-master [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520382 [09:20:31] (03CR) 10jerkins-bot: [V: 04-1] switchover.py: Add new options --replicating-master & --read-only-master [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520382 (owner: 10Jcrespo) [09:20:41] (03Abandoned) 10Fsero: helmfile,k8s: cannot apply helm secrets due to missing user [puppet] - 10https://gerrit.wikimedia.org/r/520387 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [09:20:57] (03PS3) 10Arturo Borrero Gonzalez: toolforge: refactor to join a node to the new cluster [puppet] - 
10https://gerrit.wikimedia.org/r/520319 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [09:21:15] (03CR) 10DCausse: [C: 03+1] wdqs: remove "mdc" prefix from logs. [puppet] - 10https://gerrit.wikimedia.org/r/520392 (owner: 10Gehel) [09:21:22] (03CR) 10Fsero: [C: 03+1] helmfile: Set owner for secrets values [puppet] - 10https://gerrit.wikimedia.org/r/520390 (https://phabricator.wikimedia.org/T212130) (owner: 10Alexandros Kosiaris) [09:21:35] (03CR) 10Gehel: [C: 03+2] wdqs: remove "mdc" prefix from logs. [puppet] - 10https://gerrit.wikimedia.org/r/520392 (owner: 10Gehel) [09:21:43] (03PS2) 10Gehel: wdqs: remove "mdc" prefix from logs. [puppet] - 10https://gerrit.wikimedia.org/r/520392 [09:23:43] (03PS4) 10Arturo Borrero Gonzalez: toolforge: refactor to join a node to the new cluster [puppet] - 10https://gerrit.wikimedia.org/r/520319 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [09:23:45] (03PS4) 10Jcrespo: switchover.py: Add new options --replicating-master & --read-only-master [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520382 [09:24:23] (03CR) 10jerkins-bot: [V: 04-1] switchover.py: Add new options --replicating-master & --read-only-master [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520382 (owner: 10Jcrespo) [09:24:33] (03CR) 10Fsero: [C: 03+2] helmfile: Set owner for secrets values [puppet] - 10https://gerrit.wikimedia.org/r/520390 (https://phabricator.wikimedia.org/T212130) (owner: 10Alexandros Kosiaris) [09:24:43] (03PS2) 10Fsero: helmfile: Set owner for secrets values [puppet] - 10https://gerrit.wikimedia.org/r/520390 (https://phabricator.wikimedia.org/T212130) (owner: 10Alexandros Kosiaris) [09:25:37] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:25:39] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:13] (03PS5) 10Jcrespo: switchover.py: Add new options --replicating-master & --read-only-master [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520382 [09:27:32] !log rebooting failoid nodes to pick up MDS-enabled qemu [09:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:40] (03CR) 10jerkins-bot: [V: 04-1] switchover.py: Add new options --replicating-master & --read-only-master [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520382 (owner: 10Jcrespo) [09:28:59] 10Operations, 10procurement: Wildcard SSL Cert Renewal (*.corp.wikimedia.org) - https://phabricator.wikimedia.org/T227149 (10Peachey88) This task is in S1 space so its visible, procurement tasks should be S4. [09:30:09] (03PS5) 10Arturo Borrero Gonzalez: toolforge: refactor to join a node to the new cluster [puppet] - 10https://gerrit.wikimedia.org/r/520319 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [09:30:15] RECOVERY - puppet last run on deploy2001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [09:31:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: refactor to join a node to the new cluster [puppet] - 10https://gerrit.wikimedia.org/r/520319 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [09:32:12] (03PS6) 10Jcrespo: switchover.py: Add new options --replicating-master & --read-only-master [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520382 [09:32:40] (03CR) 10jerkins-bot: [V: 04-1] switchover.py: Add new options --replicating-master & --read-only-master [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520382 (owner: 10Jcrespo) [09:33:03] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:34:35] (03CR) 10Marostegui: "Question: I case you want to switchover a master in codfw (while eqiad is active)" 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520382 (owner: 10Jcrespo) [09:46:56] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:46:57] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:16] !log rebooting debmonitor nodes to pick up MDS-enabled qemu [09:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:47:45] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: fix half-done code factorization for kubadm_join [puppet] - 10https://gerrit.wikimedia.org/r/520397 (https://phabricator.wikimedia.org/T215531) [09:48:38] (03CR) 10Elukey: Add rpkicounter (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520337 (owner: 10Ayounsi) [09:49:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: fix half-done code factorization for kubadm_join [puppet] - 10https://gerrit.wikimedia.org/r/520397 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [09:52:01] (03PS1) 10Elukey: profile::hadoop: remove unused script [puppet] - 10https://gerrit.wikimedia.org/r/520398 (https://phabricator.wikimedia.org/T226698) [09:52:20] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:52:21] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:55] !log rebooting netmon2001 for kernel security update [09:54:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:32] (03CR) 10Elukey: [C: 03+2] profile::hadoop: remove unused script [puppet] - 10https://gerrit.wikimedia.org/r/520398 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [09:58:49] !log 
jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:58:50] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:04] !log Drop secret and stratch_tokens columns from the private wiki list T226826 [10:00:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:13] T226826: Drop old oathauth_users columns - https://phabricator.wikimedia.org/T226826 [10:08:48] (03PS1) 10Jcrespo: Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 [10:09:13] (03CR) 10jerkins-bot: [V: 04-1] Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) [10:11:07] (03PS2) 10Jcrespo: Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 [10:11:33] (03CR) 10jerkins-bot: [V: 04-1] Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) [10:11:43] !log rolling reboot of eventschema service hosts to pick up MDS-enabled qemu [10:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:53] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [10:11:56] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:50] (03PS3) 10Jcrespo: Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 [10:14:53] 10Operations, 10ops-eqiad, 10DC-Ops: Hardware Request: puppet master eqiad - 
https://phabricator.wikimedia.org/T226382 (10jbond) [10:14:56] 10Operations, 10ops-eqiad: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10jbond) [10:15:00] (03CR) 10jerkins-bot: [V: 04-1] Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) [10:15:04] (03PS24) 10Vgutierrez: ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) [10:15:37] (03CR) 10Volans: [C: 04-1] "Replies to questions/comments inline" (033 comments) [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (https://phabricator.wikimedia.org/T205900) (owner: 10CRusnov) [10:15:59] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "LGTM, merging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/519485 (https://phabricator.wikimedia.org/T226660) (owner: 10Jeena Huneidi) [10:24:04] (03PS1) 10Filippo Giunchedi: hieradata: fix netbox swift auth url [puppet] - 10https://gerrit.wikimedia.org/r/520405 [10:24:16] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: fix netbox swift auth url [puppet] - 10https://gerrit.wikimedia.org/r/520405 (owner: 10Filippo Giunchedi) [10:24:46] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: apilb: specify path for systemctl call when reloading config [puppet] - 10https://gerrit.wikimedia.org/r/520406 (https://phabricator.wikimedia.org/T215531) [10:25:10] (03PS4) 10Jcrespo: Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 [10:25:43] (03CR) 10jerkins-bot: [V: 04-1] Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) [10:25:48] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: apilb: specify path for systemctl call when reloading config [puppet] - 10https://gerrit.wikimedia.org/r/520406 
(https://phabricator.wikimedia.org/T215531) [10:26:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: apilb: specify path for systemctl call when reloading config [puppet] - 10https://gerrit.wikimedia.org/r/520406 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [10:29:05] (03PS5) 10Jcrespo: Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 [10:29:32] (03CR) 10jerkins-bot: [V: 04-1] Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) [10:30:07] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [10:31:25] (03PS9) 10Jbond: monitoring: add notes url for memory errors [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) [10:36:22] (03PS1) 10Urbanecm: Add new throttle rule for enwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520408 (https://phabricator.wikimedia.org/T227059) [10:36:34] !log start of ladsgroup@mwmaint1002:~$ foreachwikiindblist wiktionary extensions/Cognate/maintenance/populateCognatePages.php (T226358) [10:36:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:39] T226358: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 [10:36:51] (03CR) 10Volans: "If the new value in config is just ignore yes, it seems safe to merge. Looks good, just a couple of minor nitpicks inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [10:37:06] jouncebot, refresh [10:37:07] I refreshed my knowledge about deployments. 
[10:37:19] (03CR) 10jerkins-bot: [V: 04-1] Add new throttle rule for enwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520408 (https://phabricator.wikimedia.org/T227059) (owner: 10Urbanecm) [10:37:25] (03CR) 10Jbond: "Thanks for the updates Chris. I will merge this now and further correction can be updated as we go" [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [10:37:32] (03CR) 10Jbond: [C: 03+2] monitoring: add notes url for memory errors [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) (owner: 10Jbond) [10:37:42] (03PS10) 10Jbond: monitoring: add notes url for memory errors [puppet] - 10https://gerrit.wikimedia.org/r/509365 (https://phabricator.wikimedia.org/T183177) [10:38:04] jouncebot: next [10:38:04] In 0 hour(s) and 21 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190703T1100) [10:38:49] (03PS2) 10Urbanecm: Add new throttle rule for enwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520408 (https://phabricator.wikimedia.org/T227059) [10:39:37] (03PS11) 10Jbond: icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) [10:40:21] (03CR) 10Jbond: [C: 03+2] icinga user agent: add custom user agent to icing checks [puppet] - 10https://gerrit.wikimedia.org/r/519218 (https://phabricator.wikimedia.org/T226508) (owner: 10Jbond) [10:43:07] (03PS1) 10Filippo Giunchedi: logstash: set retention to 90 days [puppet] - 10https://gerrit.wikimedia.org/r/520409 (https://phabricator.wikimedia.org/T220103) [10:43:33] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: set retention to 90 days [puppet] - 10https://gerrit.wikimedia.org/r/520409 (https://phabricator.wikimedia.org/T220103) (owner: 10Filippo Giunchedi) [10:51:31] PROBLEM - puppet last run on icinga1001 is CRITICAL: 
CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [10:55:24] 10Operations, 10DNS, 10Domains, 10Traffic, 10WMF-Legal: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508 (10Aklapper) 05Open→03Declined Unfortunately closing this report as no further information has been provided. @Naveenpf: After you have pro... [11:00:05] Amir1, Lucas_WMDE, and Urbanecm: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190703T1100). [11:00:05] awight and Urbanecm: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:24] Nice! [11:01:27] starting with my patches, awight around? [11:01:34] (03PS3) 10Urbanecm: Configuring Namespaces at pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520174 (https://phabricator.wikimedia.org/T226959) (owner: 10Tulsi Bhagat) [11:01:43] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520174 (https://phabricator.wikimedia.org/T226959) (owner: 10Tulsi Bhagat) [11:02:32] Urbanecm: hi, yes and it would be great if you would go ahead and deploy mine cos I'm stuck in a meeting. [11:02:49] (03Merged) 10jenkins-bot: Configuring Namespaces at pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520174 (https://phabricator.wikimedia.org/T226959) (owner: 10Tulsi Bhagat) [11:02:52] Sure awight, can do [11:02:57] are you able to test it? 
[11:03:04] yes [11:03:09] (03CR) 10jenkins-bot: Configuring Namespaces at pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520174 (https://phabricator.wikimedia.org/T226959) (owner: 10Tulsi Bhagat) [11:03:16] ah, it's labs-only [11:03:24] (03CR) 10Urbanecm: [C: 03+2] "Noop for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520234 (https://phabricator.wikimedia.org/T225617) (owner: 10Awight) [11:03:47] awight, done. For next labs-only patches, you can just C+2 them and just fetch at deploy1001 at any time [11:04:00] labs-only patches don't need to be deployed in a SWAT [11:04:16] ah! For some reason, I had thought that they needed to be deployed so that production was in sync with the source tree. [11:04:20] (03Merged) 10jenkins-bot: Enable experimental FileImporter features on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520234 (https://phabricator.wikimedia.org/T225617) (owner: 10Awight) [11:04:45] awight, just deployment host needs to be in sync, to not surprise the next deployer [11:04:51] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [11:04:52] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:00] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [11:05:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:05:01] cool, thank you for the tip [11:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:14] !log rebooting krypton nodes to pick up MDS-enabled qemu [11:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:31] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518298 
(https://phabricator.wikimedia.org/T204583) (owner: 10Urbanecm) [11:05:55] yw awight [11:06:23] (03Merged) 10jenkins-bot: [throttle-analyze] Grant autoconfirmed permission to user when throttle rule is applied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518298 (https://phabricator.wikimedia.org/T204583) (owner: 10Urbanecm) [11:06:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:520174|Configuring Namespaces at pawikisource]] (T226959) (duration: 00m 52s) [11:06:39] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520408 (https://phabricator.wikimedia.org/T227059) (owner: 10Urbanecm) [11:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:06:45] T226959: Adding new Namespaces and renaming some in Punjabi language at Punjabi Wikisource. - https://phabricator.wikimedia.org/T226959 [11:06:46] (03CR) 10jenkins-bot: Enable experimental FileImporter features on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520234 (https://phabricator.wikimedia.org/T225617) (owner: 10Awight) [11:07:30] (03Merged) 10jenkins-bot: Add new throttle rule for enwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520408 (https://phabricator.wikimedia.org/T227059) (owner: 10Urbanecm) [11:10:07] (03CR) 10Ema: [C: 03+1] ncredir: Provide initial puppetization [puppet] - 10https://gerrit.wikimedia.org/r/519998 (https://phabricator.wikimedia.org/T133548) (owner: 10Vgutierrez) [11:11:21] !log rebooting people1001 (people.wikimedia.org) to pick up MDS-enabled qemu [11:11:23] !log urbanecm@deploy1001 Synchronized wmf-config/throttle-analyze.php: SWAT: [[:gerrit:518298|[throttle-analyze] Grant autoconfirmed permission to user when throttle rule is applied]] (T204583) (duration: 00m 49s) [11:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:30] T204583: Create a process for raising editing rate limits similar to the six account exceptions (throttle.php) - https://phabricator.wikimedia.org/T204583 [11:12:50] 10Operations, 10DNS, 10Domains, 10Traffic, 10WMF-Legal: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508 (10Naveenpf) It is required. What is the further information required? [11:12:57] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[:gerrit:520408|Add new throttle rule for enwiki event]] (T227059) (duration: 00m 48s) [11:13:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:03] T227059: Disable edit throttling for event at 131.175.147.29 on 2019-07-05 - https://phabricator.wikimedia.org/T227059 [11:14:13] !log Ran mwscript namespaceDupes.php --wiki=pawikisource --fix for T226959 [11:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:18] T226959: Adding new Namespaces and renaming some in Punjabi language at Punjabi Wikisource. 
- https://phabricator.wikimedia.org/T226959 [11:14:38] !log EU SWAT done [11:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:11] 10Operations, 10DNS, 10Domains, 10Traffic, 10WMF-Legal: Point wikipedia.in to 205.147.101.160 instead of URL forward - https://phabricator.wikimedia.org/T144508 (10Aklapper) 05Declined→03Open [11:18:31] (03PS2) 10Muehlenhoff: Decommission cp3037 [puppet] - 10https://gerrit.wikimedia.org/r/520218 (https://phabricator.wikimedia.org/T227077) [11:19:35] (03CR) 10Muehlenhoff: [C: 03+2] Decommission cp3037 [puppet] - 10https://gerrit.wikimedia.org/r/520218 (https://phabricator.wikimedia.org/T227077) (owner: 10Muehlenhoff) [11:26:53] (03PS1) 10Ema: vcl: fix template compilation failure with no backends defined [puppet] - 10https://gerrit.wikimedia.org/r/520411 (https://phabricator.wikimedia.org/T226637) [11:26:55] (03PS1) 10Ema: Revert "Revert "cache: reimage cp2026 as upload_ats"" [puppet] - 10https://gerrit.wikimedia.org/r/520412 (https://phabricator.wikimedia.org/T226637) [11:28:24] (03PS4) 10Jbond: hiera: update search order [puppet] - 10https://gerrit.wikimedia.org/r/511686 [11:29:09] !log ran puppet clean/deactivate and debdeploy removal for cp3037 (host is broken for a long time and triggering failing Cumin/debdeploy runs) T227077 [11:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:14] T227077: decommission cp3037 - https://phabricator.wikimedia.org/T227077 [11:30:08] (03CR) 10Jbond: [C: 03+2] hiera: update search order [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond) [11:38:02] jouncebot: now [11:38:02] For the next 0 hour(s) and 21 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190703T1100) [11:38:04] jouncebot: next [11:38:04] In 4 hour(s) and 21 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190703T1600) 
[11:41:01] (03PS2) 10Ema: vcl: fix template compilation failure with no backends defined [puppet] - 10https://gerrit.wikimedia.org/r/520411 (https://phabricator.wikimedia.org/T226637) [11:41:11] (03PS6) 10Jcrespo: Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 [11:41:41] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [11:41:47] (03CR) 10jerkins-bot: [V: 04-1] Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) [11:42:08] (03CR) 10Ema: "pcc seems alright https://puppet-compiler.wmflabs.org/compiler1002/17208/" [puppet] - 10https://gerrit.wikimedia.org/r/520411 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [11:52:47] PROBLEM - Keyholder SSH agent on netmon2001 is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it. 
https://wikitech.wikimedia.org/wiki/Keyholder [11:54:11] (03PS1) 10Alexandros Kosiaris: Update admin/README.md [deployment-charts] - 10https://gerrit.wikimedia.org/r/520415 [11:54:13] (03PS1) 10Alexandros Kosiaris: If guard releases stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/520416 [11:55:14] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10WMDE-leszek) [11:55:49] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/TimedMediaHandler/: T226840 (duration: 00m 50s) [11:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:02] T226840: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 [11:56:08] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10WMDE-leszek) I am an engineering manager at WMDE, and Jakob's line manager. By submitting this request I approve it at WMDE's end. [12:02:06] Reedy: deploying something? I need to deploy cxserver if you're done. [12:02:11] kart_: I'm done :) [12:02:16] OK. cool. [12:02:34] (03CR) 10Jcrespo: "root@cumin2001:~/wmfmariadbpy/wmfmariadbpy$ ./switchover.py es2002 es2001 --read-only-master --timeout=1.0" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520280 (owner: 10Jcrespo) [12:04:48] (03CR) 10Ema: [C: 03+2] vcl: fix template compilation failure with no backends defined [puppet] - 10https://gerrit.wikimedia.org/r/520411 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [12:05:51] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10MoritzMuehlenhoff) Adding @greg for approval. 
[12:06:53] (03PS2) 10Ema: Revert "Revert "cache: reimage cp2026 as upload_ats"" [puppet] - 10https://gerrit.wikimedia.org/r/520412 (https://phabricator.wikimedia.org/T226637) [12:07:48] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging] [12:07:49] !log kartik@deploy1001 scap-helm cxserver cluster staging completed [12:07:49] !log kartik@deploy1001 scap-helm cxserver finished [12:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:23] (03CR) 10Ema: [C: 03+2] Revert "Revert "cache: reimage cp2026 as upload_ats"" [puppet] - 10https://gerrit.wikimedia.org/r/520412 (https://phabricator.wikimedia.org/T226637) (owner: 10Ema) [12:09:26] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw] [12:09:28] !log kartik@deploy1001 scap-helm cxserver cluster codfw completed [12:09:28] !log kartik@deploy1001 scap-helm cxserver finished [12:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:09] (03CR) 10Jcrespo: [C: 04-1] "But:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520280 (owner: 10Jcrespo) [12:10:10] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad] [12:10:11] !log kartik@deploy1001 scap-helm cxserver cluster eqiad completed [12:10:11] !log kartik@deploy1001 scap-helm cxserver finished [12:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:59] !log Updated cxserver to b447674 (T226611) [12:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:09] T226611: Issue with MT on a single section - https://phabricator.wikimedia.org/T226611 [12:13:40] ACKNOWLEDGEMENT - IPsec on cp1076 is CRITICAL: Strongswan CRITICAL - No connections configured: check ipsec.conf Ema Thanks god! No more IPsec for cache upload \o/ https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [12:14:22] !log reimage cp2026 as upload_ats T226637 [12:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:27] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [12:15:00] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp2026.codfw.wmnet'] ` The log can be found in `... 
[12:19:17] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:19:19] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:19:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:13] !log rebooting pybal-test hosts to pick up MDS-enabled qemu [12:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:56] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10Pchelolo) > @Krinkle @Pchelolo according to dashboard versions you have changed the dashboard, would it be problematic if we drop "REST API Varnish hi... [12:24:09] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:24:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:36] !log rebooting bromine to pick up MDS-enabled qemu [12:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:03] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Traffic, 10MW-1.34-notes (1.34.0-wmf.11; 2019-06-26), and 3 others: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10Reedy) [12:29:44] (03CR) 10CDanis: [C: 03+1] Release 1.1.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/519752 (owner: 10Volans) [12:30:42] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:30:42] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:50] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:10] !log rebooting neon (kubernetes staging master) to pick up MDS-enabled qemu [12:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:54] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:34:56] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:05] !log rebooting dubnium/pollux (corp LDAP replicas) to pick up MDS-enabled qemu [12:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:05] really need to fix that volans :) [12:36:35] (the two log messages per downtime, but with no actual info of what was actually downtimed :) [12:38:05] the new class API is pending review, hopefully the latest PS I've sent few days ago gets an agreement paravoid ;) [12:38:41] Urbanecm: Are you running the scripts for T226784 ? [12:38:41] T226784: Refresh metadata on very old midi uploads using the audio/mid mime type - https://phabricator.wikimedia.org/T226784 [12:39:11] also I was thinking for very trivial cookbooks like this one to log once only [12:39:16] Reedy, not yet, but plan to. 
[12:39:22] ok :) [12:39:44] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [12:39:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:14] !log rebooting mendelevium (ticket.wikimedia.org) to pick up MDS-enabled qemu [12:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:58] !log Started foreachwiki refreshImageMetadata.php --mediatype=AUDIO --mime=audio/mid --force for T226784 on mwmaint1002 in a tmux [12:41:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:07] (03PS1) 10Ema: varnishlog: request/response headers to send to logstash [puppet] - 10https://gerrit.wikimedia.org/r/520425 (https://phabricator.wikimedia.org/T189333) [12:48:03] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Request access to deployment cluster for Jakob_WMDE - https://phabricator.wikimedia.org/T227193 (10MoritzMuehlenhoff) p:05Triage→03Normal [12:50:21] !log foreachwiki refreshImageMetadata.php --mediatype=AUDIO --mime=audio/mid --force completed (T226784) [12:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:27] T226784: Refresh metadata on very old midi uploads using the audio/mid mime type - https://phabricator.wikimedia.org/T226784 [12:51:12] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [12:51:19] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 9 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Marostegui) 05Open→03Resolved [12:51:28] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, and 4 others: Changing Kibana filters is ridiculously slow - https://phabricator.wikimedia.org/T189333 (10ema) [12:51:53] 
10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp2026.codfw.wmnet'] ` and were **ALL** successful. [12:51:59] (03CR) 10Ayounsi: "Thanks! Reply inline." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520337 (owner: 10Ayounsi) [12:52:41] (03PS2) 10Ayounsi: Add rpkicounter [puppet] - 10https://gerrit.wikimedia.org/r/520337 [12:53:33] !log pool cp2026 w/ ATS backend T226637 [12:53:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:38] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [12:54:14] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes - https://phabricator.wikimedia.org/T226589 (10ema) [12:54:17] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 (10ema) 05Open→03Resolved a:03ema Done! [12:55:48] !log Drop secret and scratch_tokens columns from centralauth (s7) T226826 [12:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:54] T226826: Drop old oathauth_users columns - https://phabricator.wikimedia.org/T226826 [12:57:27] Urbanecm: Do you know, if there's any caching of configuration in labs? I'm getting different results from the REPL, echoing $wg and dumping config->get->() [12:57:52] awight, no idea, I'm sorry [12:58:34] Can you tell I'm desperately searching for excuses to explain my code not working? [12:59:51] awight, are we talking about https://gerrit.wikimedia.org/r/c/520234/? [13:00:05] or the code behind the conf variables? 
[13:01:05] (03PS1) 10Bstorm: toolforge: correct the values for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) [13:01:11] The actual feature... There's something basic I don't understand. I'm seeing that $wgFileImporterSourceWikiTemplating = true, but $GLOBALS['wgFileImporterSourceWikiTemplating'] = false [13:02:16] (03PS6) 10Matthias Geisler: Enable DataBridge on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) [13:02:50] (03PS1) 10Ema: cache: reimage cp1076 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520430 (https://phabricator.wikimedia.org/T226638) [13:03:19] (03CR) 10jerkins-bot: [V: 04-1] Enable DataBridge on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) (owner: 10Matthias Geisler) [13:05:12] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Traffic, 10MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), and 3 others: Consistent HTTP 503 Varnish Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10Tgr) @Anomie tracked this down: a Tim... [13:06:44] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Traffic, 10MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), and 3 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10ema) [13:07:03] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Traffic, 10MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), and 3 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10Tgr) I wonder if [[https://codesearch.wmflabs... [13:07:27] Urbanecm: I'm narrowing it down now, it seems like the labs config just isn't there yet. 
[13:07:30] deployment-mediawiki-07:/srv/mediawiki$ vi wmf-config/CommonSettings-labs.php [13:07:35] just did the same thing [13:07:49] but why? [13:07:54] !log depool cp1076 and reimage as upload_ats T226637 [13:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:00] T226637: Replace Varnish backends with ATS on cache upload nodes in codfw - https://phabricator.wikimedia.org/T226637 [13:08:12] wrong ticket! [13:08:21] !log depool cp1076 and reimage as upload_ats T226638 [13:08:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:26] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [13:08:29] better [13:09:18] (03CR) 10Marostegui: "What's the meaning of:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520280 (owner: 10Jcrespo) [13:09:20] (moving to -releng) [13:09:42] (03CR) 10Ema: [C: 03+2] cache: reimage cp1076 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520430 (https://phabricator.wikimedia.org/T226638) (owner: 10Ema) [13:10:20] (03CR) 10Jcrespo: "> Patch Set 6:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520382 (owner: 10Jcrespo) [13:11:09] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp1076.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reim... 
[13:11:17] (03PS3) 10Fdans: ReportUpdater: change repo of all queries to reportupdater-queries [puppet] - 10https://gerrit.wikimedia.org/r/517085 (https://phabricator.wikimedia.org/T222739) [13:12:06] (03PS1) 10Aklapper: Phab: Allow Greg and Andre to roll back specific user actions [puppet] - 10https://gerrit.wikimedia.org/r/520433 [13:12:34] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Traffic, 10MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), and 3 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10BBlack) Thanks for chasing this down! After... [13:13:23] 10Operations, 10ops-codfw, 10DBA: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (10jcrespo) ` [6498437.928368] mce: [Hardware Error]: Machine check events logged [6498437.928393] EDAC skx MC1: HANDLING MCE MEMORY ERROR [64... [13:14:26] (03PS2) 10Bstorm: toolforge: correct the values for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) [13:16:26] 10Operations, 10MediaWiki-extensions-CentralAuth, 10Traffic, 10MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), and 3 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10Tgr) Sessions are created all the time, and t... [13:16:47] (03CR) 10Jcrespo: [C: 04-1] "> Patch Set 3:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520280 (owner: 10Jcrespo) [13:17:23] (03CR) 10Fsero: [C: 03+1] "added a nit though" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/520415 (owner: 10Alexandros Kosiaris) [13:19:08] (03CR) 10Fsero: [C: 03+1] "LGTM however i kinda prefer to show errors since environment is kind of required." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/520416 (owner: 10Alexandros Kosiaris) [13:19:15] (03PS1) 10Ema: cache_upload: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/520436 (https://phabricator.wikimedia.org/T226638) [13:21:22] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) >>! In T184942#5303464, @Pchelolo wrote: >> @Krinkle @Pchelolo according to dashboard versions you have changed the dashboard, would it be... [13:22:07] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (10fgiunchedi) [13:22:25] (03PS2) 10BBlack: anycast recdns: test via resolv.conf on 5 hosts [puppet] - 10https://gerrit.wikimedia.org/r/520284 (https://phabricator.wikimedia.org/T186550) [13:22:41] (03CR) 10BBlack: [C: 03+2] anycast recdns: test via resolv.conf on 5 hosts [puppet] - 10https://gerrit.wikimedia.org/r/520284 (https://phabricator.wikimedia.org/T186550) (owner: 10BBlack) [13:24:29] !log upgrade and restart db2097 T225378 [13:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:34] T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 [13:24:34] (03CR) 10Marostegui: Add 2 simple scripts: move_replica.py and stop_in_sync.py (032 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) [13:25:35] (03CR) 10Ema: "pcc seems sane: https://puppet-compiler.wmflabs.org/compiler1001/17212/" [puppet] - 10https://gerrit.wikimedia.org/r/520436 (https://phabricator.wikimedia.org/T226638) (owner: 10Ema) [13:29:50] (03CR) 10Jcrespo: Add 2 simple scripts: move_replica.py and stop_in_sync.py (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) 
[13:30:48] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:30:49] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:52] (03PS1) 10BBlack: anycast recdns: use for LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/520441 (https://phabricator.wikimedia.org/T186550) [13:33:11] !log rebooting doc1001 to pick up MDS-enabled qemu [13:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:51] 10Operations, 10ops-codfw, 10DBA: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (10jcrespo) ` 462 - Uncorrectable Memory Error Threshold Exceeded (Processor 1, DIMM 3). The DIMM is mapped out and is currently not availabl... [13:33:59] (03PS2) 10BBlack: anycast recdns: use for LVS balancers [puppet] - 10https://gerrit.wikimedia.org/r/520441 (https://phabricator.wikimedia.org/T186550) [13:34:33] (03PS2) 10Ema: cache_upload: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/520436 (https://phabricator.wikimedia.org/T226638) [13:35:25] (03CR) 10Ema: [C: 03+2] cache_upload: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/520436 (https://phabricator.wikimedia.org/T226638) (owner: 10Ema) [13:36:09] (03PS1) 10Elukey: profile::hadoop::master: allow nagios to authenticate as hdfs [puppet] - 10https://gerrit.wikimedia.org/r/520442 (https://phabricator.wikimedia.org/T226698) [13:38:04] (03PS7) 10Matthias Geisler: Enable DataBridge on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) [13:39:04] (03CR) 10jerkins-bot: [V: 04-1] Enable DataBridge on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) (owner: 
10Matthias Geisler) [13:41:45] (03PS8) 10Matthias Geisler: Enable DataBridge on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) [13:42:05] (03CR) 10Marostegui: "Nice work! I have made two comments." (032 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/517794 (owner: 10Jcrespo) [13:43:26] (03CR) 10Marostegui: ">" (031 comment) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) [13:44:32] 10Operations, 10vm-requests, 10cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (10JHedden) >>! In T227041#5302703, @akosiaris wrote: > Per T223907, the chosen approach for providing HA over the 3 haproxy nodes is pacemak... [13:44:46] 10Operations, 10ops-codfw, 10DBA: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 - https://phabricator.wikimedia.org/T225378 (10jcrespo) a:05jcrespo→03Papaul a memory stick of db2097 is literally broken: ` root@db2097:~$ free -m total used... [13:44:51] (03CR) 10Marostegui: "> > Patch Set 3:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520280 (owner: 10Jcrespo) [13:46:02] (03CR) 10Ayounsi: "Adding Filippo especially for the Prometheus part." [puppet] - 10https://gerrit.wikimedia.org/r/520337 (owner: 10Ayounsi) [13:47:26] 10Operations, 10vm-requests, 10cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (10akosiaris) 05Open→03Stalled >>! In T227041#5303652, @JHedden wrote: > Let's hold off on these VMs until we have a better understandin... 
[13:47:59] (03PS2) 10Elukey: profile::hadoop::master: allow nagios to authenticate as hdfs [puppet] - 10https://gerrit.wikimedia.org/r/520442 (https://phabricator.wikimedia.org/T226698) [13:49:59] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp1076.eqiad.wmnet'] ` and were **ALL** successful. [13:51:05] (03PS1) 10Fsero: sync current staging values with stored values on repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/520445 [13:51:20] (03CR) 10Fsero: [V: 03+2 C: 03+2] sync current staging values with stored values on repo [deployment-charts] - 10https://gerrit.wikimedia.org/r/520445 (owner: 10Fsero) [13:51:38] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [13:51:40] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:12] !log remove all mentions of sampling (currently disabled) on cr2-esams to try to reduce memory usage [13:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:20] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [13:54:22] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:11] (03PS1) 10DCausse: [cirrus] Increase elastic master timeout to 5m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520446 (https://phabricator.wikimedia.org/T227136) [13:58:18] (03PS7) 10Jcrespo: Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 [13:58:57] (03CR) 
10jerkins-bot: [V: 04-1] Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) [14:00:11] (03PS8) 10Jcrespo: Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 [14:00:16] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:00:17] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:42] (03CR) 10jerkins-bot: [V: 04-1] Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) [14:05:25] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:05:26] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:48] (03CR) 10Andrew Bogott: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/511686 (owner: 10Jbond) [14:06:23] !log power off msw1-codfw - T224250 [14:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:28] T224250: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 [14:07:25] (03PS1) 10Ema: cache: cp1076 storage settings for ATS [puppet] - 10https://gerrit.wikimedia.org/r/520450 (https://phabricator.wikimedia.org/T226638) [14:08:09] (03CR) 10Elukey: Add rpkicounter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520337 (owner: 10Ayounsi) [14:08:27] (03CR) 10Filippo Giunchedi: netbox: Add parameters and settings for storing things in Swift (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) 
(owner: 10CRusnov) [14:09:25] (03CR) 10Ema: [C: 03+2] cache: cp1076 storage settings for ATS [puppet] - 10https://gerrit.wikimedia.org/r/520450 (https://phabricator.wikimedia.org/T226638) (owner: 10Ema) [14:09:50] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:09:52] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:59] PROBLEM - Host ps1-b2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:09:59] PROBLEM - Host ps1-b1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:09:59] PROBLEM - Host ps1-c5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:10:07] (03PS9) 10Jcrespo: Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 [14:10:09] PROBLEM - Host ps1-b4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:10:09] PROBLEM - Host ps1-c8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:10:09] PROBLEM - Host ps1-b8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:10:11] PROBLEM - Host asw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:10:17] PROBLEM - Host fasw-c-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:10:17] PROBLEM - Host asw-a-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:10:23] PROBLEM - Host mr1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:10:30] XioNoX: JFYI ^ [14:10:39] (03CR) 10jenkins-bot: [throttle-analyze] Grant autoconfirmed permission to user when throttle rule is applied [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518298 (https://phabricator.wikimedia.org/T204583) (owner: 10Urbanecm) [14:10:42] (03CR) 10jenkins-bot: Add new throttle rule for enwiki event [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520408 (https://phabricator.wikimedia.org/T227059) (owner: 10Urbanecm) [14:10:42] as in, 
expected [14:10:43] PROBLEM - Host asw-d-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:10:47] are they not a dependency in icinga or the check on msw1 was delayed? [14:10:49] (03CR) 10jerkins-bot: [V: 04-1] Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) [14:10:53] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:10:55] PROBLEM - Host ps1-c4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:10:55] PROBLEM - Host ps1-b6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:10:55] PROBLEM - Host ps1-a3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:10:55] PROBLEM - Host ps1-d3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:10:59] PROBLEM - Host ps1-c2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:01] PROBLEM - Host ps1-d2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:01] PROBLEM - Host ps1-a4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:01] PROBLEM - Host ps1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:01] PROBLEM - Host ps1-c7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:01] PROBLEM - Host ps1-a5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:01] PROBLEM - Host asw-b-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:03] PROBLEM - Host ps1-b5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:03] PROBLEM - Host ps1-a2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:03] PROBLEM - Host ps1-d4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:05] PROBLEM - Host ps1-a7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:06] volans: seems like not a dependency [14:11:09] PROBLEM - Host ps1-d1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:09] PROBLEM - Host ps1-a8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:09] PROBLEM - Host ps1-d6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:13] 
PROBLEM - Host ps1-c1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:13] PROBLEM - Host ps1-d8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:17] PROBLEM - Host ps1-a1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:17] PROBLEM - Host ps1-b3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:23] PROBLEM - Host ps1-d7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:23] PROBLEM - Host ps1-c6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:27] PROBLEM - Host ps1-d5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:30] I mean they don't show up in the hostgroup=mgmt [14:11:31] PROBLEM - Host ps1-a6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [14:11:39] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 124, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:12:15] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:12:48] (03PS1) 10Fsero: helmfile: added current eqiad cluster values [deployment-charts] - 10https://gerrit.wikimedia.org/r/520451 [14:13:07] (03CR) 10Fsero: [V: 03+2 C: 03+2] helmfile: added current eqiad cluster values [deployment-charts] - 10https://gerrit.wikimedia.org/r/520451 (owner: 10Fsero) [14:13:09] !log pool cp1076 w/ ATS backend T226638 [14:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:15] T226638: Replace Varnish backends with ATS on cache upload nodes in eqiad - https://phabricator.wikimedia.org/T226638 [14:13:28] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:13:30] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:13:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:47] PROBLEM - Host mr1-codfw IPv6 is DOWN: CRITICAL - Destination Unreachable (2620:0:860:ffff::6) [14:15:41] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/17213/" [puppet] - 10https://gerrit.wikimedia.org/r/520442 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [14:16:29] (03CR) 10Jcrespo: "root@cumin2001:~/wmfmariadbpy/wmfmariadbpy$ ./stop_in_sync.py es2002 es2003 --timeout=1.0" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) [14:17:23] (03PS2) 10Urbanecm: Remove expired throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520311 [14:17:53] (03CR) 10Filippo Giunchedi: "Prometheus part LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/520337 (owner: 10Ayounsi) [14:18:24] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:18:25] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:14] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:22:15] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:24] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:33:26] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:38] (03PS1) 10Ema: cache: reimage cp1078 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520455 
(https://phabricator.wikimedia.org/T226638) [14:34:04] (03CR) 10jerkins-bot: [V: 04-1] cache: reimage cp1078 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520455 (https://phabricator.wikimedia.org/T226638) (owner: 10Ema) [14:35:25] (03PS2) 10Ema: cache: reimage cp1078 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520455 (https://phabricator.wikimedia.org/T226638) [14:38:16] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:38:18] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:59] !log rebooting cloudnet2002-dev.codfw T224228 [14:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:16] (03CR) 10Filippo Giunchedi: prometheus-mysqld-exporter: Automate targets based on zarcillo db (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/519203 (https://phabricator.wikimedia.org/T143896) (owner: 10Jcrespo) [14:42:40] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:42:42] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:48] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:47:50] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:57] jouncebot, next [14:48:57] In 1 hour(s) and 11 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190703T1600) [14:49:20] (03PS3) 10Arturo Borrero Gonzalez: toolforge: correct the 
values for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [14:49:25] (03CR) 10Ottomata: profile::hadoop::master: allow nagios to authenticate as hdfs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520442 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [14:51:56] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:51:58] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:01] (03CR) 10Elukey: profile::hadoop::master: allow nagios to authenticate as hdfs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520442 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [14:55:39] !log rebooting cloudnet2003-dev.codfw T224228 [14:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:31] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [14:57:32] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:57:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:53] (03PS3) 10Elukey: profile::hadoop::master: allow nagios to authenticate as hdfs [puppet] - 10https://gerrit.wikimedia.org/r/520442 (https://phabricator.wikimedia.org/T226698) [15:01:31] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:01:32] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:01:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:19] (03PS3) 10CRusnov: netbox: Add parameters and settings for storing things in Swift [puppet] - 
10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) [15:04:30] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/17215/" [puppet] - 10https://gerrit.wikimedia.org/r/520442 (https://phabricator.wikimedia.org/T226698) (owner: 10Elukey) [15:04:41] !log rebooting cloudservices2002-dev.codfw T224228 [15:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:46] (03CR) 10jerkins-bot: [V: 04-1] netbox: Add parameters and settings for storing things in Swift [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [15:04:52] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:04:54] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:21] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:05:23] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:40] !log rolling reboot of Kubernetes etcd nodes in codfw [15:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:11] (03PS4) 10CRusnov: netbox: Add parameters and settings for storing things in Swift [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) [15:07:12] RECOVERY - Host mr1-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.78 ms [15:07:24] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 126, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:07:28] PROBLEM - Host re0.cr2-codfw is 
DOWN: PING CRITICAL - Packet loss = 100% [15:07:28] PROBLEM - Host re0.cr1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [15:07:39] (03CR) 10CRusnov: "> Patch Set 2:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520296 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [15:08:00] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:08:56] RECOVERY - Host mr1-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 41.74 ms [15:09:26] 10Operations, 10ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T227107 (10Papaul) a:05Papaul→03Marostegui Disk replaced [15:10:02] (03CR) 10Ppchelko: [C: 03+1] Produce centralnotice.campaign-* streams to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520019 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [15:10:12] RECOVERY - Host ps1-a1-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.62 ms [15:10:14] RECOVERY - Host ps1-a2-codfw is UP: PING OK - Packet loss = 0%, RTA = 39.92 ms [15:10:18] RECOVERY - Host ps1-a3-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.10 ms [15:10:21] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:10:23] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:10:24] RECOVERY - Host ps1-a4-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.39 ms [15:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:38] RECOVERY - Host ps1-a5-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.34 ms [15:10:56] RECOVERY - Host ps1-a6-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.12 ms [15:11:10] RECOVERY - Host ps1-a7-codfw is UP: PING OK - Packet loss = 0%, RTA = 39.57 ms [15:11:28] RECOVERY - Host asw-a-codfw is 
UP: PING OK - Packet loss = 0%, RTA = 36.75 ms [15:11:28] RECOVERY - Host ps1-a8-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.05 ms [15:11:59] (03CR) 10Ppchelko: "> Patch Set 1:" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [15:12:12] RECOVERY - Host ps1-b2-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.42 ms [15:12:16] RECOVERY - Host asw-b-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms [15:12:58] RECOVERY - Host re0.cr2-codfw is UP: PING OK - Packet loss = 0%, RTA = 41.11 ms [15:12:58] RECOVERY - Host re0.cr1-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.79 ms [15:13:05] (03PS10) 10Jcrespo: Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 [15:13:24] RECOVERY - Host ps1-b3-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.21 ms [15:13:32] (03CR) 10jerkins-bot: [V: 04-1] Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) [15:13:52] RECOVERY - Host ps1-b4-codfw is UP: PING OK - Packet loss = 0%, RTA = 39.34 ms [15:14:06] RECOVERY - Host ps1-b5-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.36 ms [15:14:26] RECOVERY - Host ps1-b6-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.56 ms [15:14:47] (03PS28) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. 
[puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [15:14:57] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:14:58] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:30] RECOVERY - Host ps1-b8-codfw is UP: PING WARNING - Packet loss = 50%, RTA = 37.77 ms [15:15:41] (03CR) 10jerkins-bot: [V: 04-1] profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) (owner: 10CRusnov) [15:15:42] RECOVERY - Host ps1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.12 ms [15:16:03] (03PS11) 10Jcrespo: Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 [15:16:16] RECOVERY - Host ps1-c1-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.40 ms [15:16:22] RECOVERY - Host ps1-c2-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.72 ms [15:16:30] (03CR) 10jerkins-bot: [V: 04-1] Add 2 simple scripts: move_replica.py and stop_in_sync.py [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/520401 (owner: 10Jcrespo) [15:16:59] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DLynch - https://phabricator.wikimedia.org/T227200 (10DLynch) [15:17:14] RECOVERY - Host ps1-c3-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.71 ms [15:17:32] RECOVERY - Host asw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.74 ms [15:17:36] RECOVERY - Host ps1-c4-codfw is UP: PING OK - Packet loss = 0%, RTA = 39.46 ms [15:17:40] PROBLEM - etcd request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:17:47] (03PS2) 10Ppchelko: RESTRouter: Add 
initial Helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [15:18:22] (03PS29) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [15:18:31] !log rebooting clouddb2001-dev.codfw T224228 [15:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:36] RECOVERY - Host ps1-c5-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.98 ms [15:18:38] RECOVERY - Host ps1-c6-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.05 ms [15:18:55] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:18:57] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:10] RECOVERY - etcd request latencies on acrux is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:20:20] RECOVERY - Host ps1-c7-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.15 ms [15:20:25] (03CR) 10Jcrespo: WMFReplication: Make move work for a limited number of cases (032 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/517794 (owner: 10Jcrespo) [15:20:26] RECOVERY - Host ps1-c8-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.27 ms [15:20:28] RECOVERY - Host fasw-c-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.83 ms [15:21:20] RECOVERY - Host ps1-d1-codfw is UP: PING OK - Packet loss = 0%, RTA = 39.16 ms [15:21:34] RECOVERY - Host ps1-d2-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.96 ms [15:21:56] RECOVERY - Host ps1-d3-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.74 ms [15:22:30] RECOVERY - Host ps1-d4-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.80 ms [15:22:40] RECOVERY - Host asw-d-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.75 ms [15:23:22] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:23:24] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:04] RECOVERY - Host ps1-d5-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.41 ms [15:24:18] RECOVERY - Host ps1-d6-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.28 ms [15:24:44] RECOVERY - Host ps1-d7-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.51 ms [15:24:50] RECOVERY - Host ps1-d8-codfw is UP: PING OK - Packet loss = 0%, RTA = 38.11 ms [15:25:39] (03CR) 10Smalyshev: [C: 03+1] [cirrus] Increase elastic master timeout to 5m [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520446 (https://phabricator.wikimedia.org/T227136) (owner: 10DCausse) [15:25:46] (03CR) 10Cwhite: [C: 03+1] logstash: set retention to 90 days [puppet] - 10https://gerrit.wikimedia.org/r/520409 
(https://phabricator.wikimedia.org/T220103) (owner: 10Filippo Giunchedi) [15:26:12] (03PS3) 10Smalyshev: Enable RDF output for MediaInfo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520078 (https://phabricator.wikimedia.org/T221916) [15:26:58] !log rebooting cloudweb2001-dev.codfw T224228 [15:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:48] (03PS30) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [15:28:05] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:28:07] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:28] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:28:30] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:33] !log rolling reboot of Kubernetes etcd nodes in eqiad [15:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:14] RECOVERY - Host ps1-b1-codfw is UP: PING OK - Packet loss = 0%, RTA = 37.33 ms [15:32:03] (03PS4) 10Bstorm: toolforge: correct the values for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) [15:33:07] (03CR) 10jerkins-bot: [V: 04-1] toolforge: correct the values for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:33:35] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:33:36] !log jbond@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) [15:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:44] (03PS5) 10Bstorm: toolforge: correct the values for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) [15:36:09] (03CR) 10jerkins-bot: [V: 04-1] toolforge: correct the values for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:36:40] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={LIST,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:37:48] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:37:50] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:56] (03PS6) 10Bstorm: toolforge: correct the values for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) [15:39:05] (03PS31) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [15:39:17] (03PS7) 10Bstorm: toolforge: correct the values for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) [15:40:58] paravoid: if you are still around, I was wondering if you thought we cleared up the issues with current numbers on T224188 ? 
[15:40:58] T224188: rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 [15:41:38] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:41:42] I'm in meetings bstorm_, sorry :( [15:41:54] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={LIST,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:42:01] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:42:02] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:42:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:10] No worries :) I can check your calendar for the next good time to bug you [15:42:29] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team Backlog (Watching / External), and 2 others: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 (10herron) [15:42:36] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:43:08] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:43:24] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:43:26] (03PS1) 10Herron: kafka-main: add kafka-main200[45] to the codfw cluster [puppet] - 10https://gerrit.wikimedia.org/r/520465 (https://phabricator.wikimedia.org/T225005) [15:45:31] (03CR) 10Arturo Borrero Gonzalez: toolforge: correct the values for bootstrapping a cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:46:44] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:46:45] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:46:46] 10Operations, 10MediaWiki-extensions-CentralAuth, 10TimedMediaHandler, 10Traffic, and 4 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10Jdforrester-WMF) >>! In T226840#5303566, @Tgr wrote: > I wonder if [[... [15:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:49] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:52] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:46:52] (03CR) 10Bstorm: toolforge: correct the values for bootstrapping a cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:03] (03PS1) 10Herron: eventstreams: add admins contact to eventstreams check [puppet] - 10https://gerrit.wikimedia.org/r/520475 (https://phabricator.wikimedia.org/T227065) [15:48:43] (03PS8) 10Bstorm: toolforge: correct the values for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/520429 
(https://phabricator.wikimedia.org/T215531) [15:49:38] RECOVERY - Device not healthy -SMART- on db2049 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2049&var-datasource=codfw+prometheus/ops [15:50:13] (03PS9) 10Bstorm: toolforge: correct the values for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) [15:50:40] 10Operations, 10vm-requests, 10cloud-services-team (Kanban): Three small ganeti VMs to host haproxy for OpenStack endpoints - https://phabricator.wikimedia.org/T227041 (10aborrero) Some questions I have. Do we have a single ganeti hypervisor in each row? Could you set affiniting/pinning for VMs/hypervisor ru... [15:51:46] (03CR) 10BPirkle: [C: 04-1] "We need to consider effects on CentralAuth before this can be deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519432 (https://phabricator.wikimedia.org/T222099) (owner: 10BPirkle) [15:51:49] (03PS10) 10Bstorm: toolforge: correct the values for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) [15:52:18] (03CR) 10Bstorm: "Removed unused params as well in this version." 
[puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:54:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge: correct the values for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [15:55:00] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime [15:55:01] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:21] (03PS32) 10CRusnov: profile::netbox: Reorganize for splitting front and back-end. [puppet] - 10https://gerrit.wikimedia.org/r/514395 (https://phabricator.wikimedia.org/T223291) [15:55:39] !log jmm@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw2250.codfw.wmnet [15:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:59] 10Operations, 10MediaWiki-extensions-CentralAuth, 10TimedMediaHandler, 10Traffic, and 4 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10BBlack) >>! In T226840#5303935, @Jdforrester-WMF wrote: >>>! In T2268... [16:00:04] MaxSem, RoanKattouw, and Niharika: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Morning SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190703T1600). [16:00:04] isaacj, Urbanecm, and tgr: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:14] I can SWAT today! 
[16:00:20] here [16:00:38] Fyi this is my first patch so don’t hesitate to overexplain things to me :) I read the wikitech documentation and worked with baha last week though when the surveys in my patch were being deployed so I’m familiar at least with the testing with x-wikimedia-debug etc. [16:00:44] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519216 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [16:00:46] PROBLEM - Host mw2250 is DOWN: PING CRITICAL - Packet loss = 100% [16:00:52] isaacj: good luck :) [16:01:07] (03PS3) 10Ema: cache: reimage cp1078 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/520455 (https://phabricator.wikimedia.org/T226638) [16:01:09] thanks! [16:01:26] isaacj, thanks for the info, feel free to ask if you don't understand something :) [16:01:42] Urbanecm: great - will do! [16:02:26] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Papaul) [16:03:14] (03PS4) 10Urbanecm: Undeploy reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519216 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [16:03:39] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "SWAT: gate-and-submit succeeded, V+2 to save time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519216 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [16:03:41] (03PS1) 10Fsero: helmfile: added codfw current values [deployment-charts] - 10https://gerrit.wikimedia.org/r/520485 [16:03:56] (03CR) 10Fsero: [V: 03+2 C: 03+2] helmfile: added codfw current values [deployment-charts] - 10https://gerrit.wikimedia.org/r/520485 (owner: 10Fsero) [16:03:58] (03CR) 10jenkins-bot: Undeploy reader demographics surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519216 (https://phabricator.wikimedia.org/T226273) (owner: 10Bmansurov) [16:04:08] isaacj, 
your patch should be on mwdebug1002 now, please test and let me know if I can deploy it site-wide [16:04:24] sounds good -- will likely take a few minutes while i check each language [16:04:40] sure isaacj [16:04:52] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520311 (owner: 10Urbanecm) [16:05:45] (03Merged) 10jenkins-bot: Remove expired throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520311 (owner: 10Urbanecm) [16:05:57] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Enable DataBridge on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) (owner: 10Matthias Geisler) [16:06:38] tgr, +2'ed your patch to give time for CI [16:06:49] (03CR) 10jenkins-bot: Remove expired throttle rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520311 (owner: 10Urbanecm) [16:07:04] I just added another config change to the deployment calendar [16:07:10] so please don’t close the SWAT after tgr’s change :) [16:07:18] Urbanecm: all looks good to me (surveys no longer being displayed on mwdebug1002) [16:07:27] isaacj, good, deploying sitewide [16:07:45] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: SWAT: Clean expired throttle rules (duration: 00m 49s) [16:07:48] ok Lucas_WMDE, should be beta-only, right? [16:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:54] Urbanecm: great -- i'll check that they're gone there too [16:08:08] isaacj, note it's not yet deployed [16:08:17] ahh...yes, gotcha [16:08:38] isaacj, was syncing another change while you were testing :) [16:08:59] Urbanecm: yes, but I’d like to double-check on mwdebug1002 all the same :) [16:09:14] Lucas_WMDE, how can be a beta-only change tested on mwdebug1002? 
[16:09:15] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[:gerrit:519216|Undeploy reader demographics surveys]] (T226273) (duration: 00m 49s) [16:09:15] !log rearmed keyholder on netmon2001 (was rebooted earlier) [16:09:23] test that it breaks nothing ^^ [16:09:33] isaacj, should be deployed. [16:09:33] it still needs to be synced anyways as far as I can tell [16:09:46] so that the deployment host isn’t stuck with changes that aren’t on the other servers [16:09:51] even if they should only affect beta [16:10:01] Urbanecm: sounds good -- i'll let you know if anything seems off [16:10:08] thanks isaacj [16:10:20] Lucas_WMDE, I was told that a beta-only change doesn't need to be synced [16:10:24] RECOVERY - Keyholder SSH agent on netmon2001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder [16:10:31] i think it was James_F who told me that? not sure... [16:10:57] urbanecm@deploy1001: Failed to log message to wiki. Somebody should check the error logs. [16:10:57] T226273: Demographic Surveys Configurations - https://phabricator.wikimedia.org/T226273 [16:11:04] You still need to at least pull them on the deployment host [16:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:11] But it's useful to deploy them anyway so they're on noc [16:11:32] ok Reedy [16:11:40] tgr, around? [16:13:06] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) (owner: 10Matthias Geisler) [16:13:07] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . [16:13:08] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'citoid' for release 'staging' . [16:13:08] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'cxserver' for release 'staging' . 
[16:13:09] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-analytics' for release 'analytics' . [16:13:10] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-main' for release 'main' . [16:13:11] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'mathoid' for release 'staging' . [16:13:11] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'sessionstore' for release 'staging' . [16:13:12] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'staging' . [16:13:13] !log root@ helmfile [STAGING] Ran 'apply' command on namespace 'zotero' for release 'staging' . [16:13:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:59] (03Merged) 10jenkins-bot: Enable DataBridge on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) (owner: 10Matthias Geisler) [16:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:15] (03CR) 10jenkins-bot: Enable DataBridge on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519987 (https://phabricator.wikimedia.org/T226816) (owner: 10Matthias Geisler) [16:14:26] Lucas_WMDE, your patch is on mwdebug1002 as you wanted [16:14:34] testing… [16:14:36] Urbanecm: present [16:15:49] thanks tgr. 
your patch is on mwdebug1002, if it's testable there [16:16:06] 10Operations, 10decommission, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10RobH) asw2-d-eqiad: ge-1/0/8 down down ms-be1013 ge-1/0/9 up up ms-be1014 ge-1/0/10 up up ms-be1015 [16:16:35] Urbanecm: i'm still seeing surveys on wikis where no other surveys are deployed but i suspect that's because some caches need to flush so i'll give them a bit longer [16:16:45] !log deleting zotero namespace and recreating it with helmfile on staging cluster [16:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:22] Urbanecm: my patch works as expected [16:17:27] thanks Lucas_WMDE [16:17:28] do you want to sync it or should I do it? [16:17:55] Lucas_WMDE, up to you, not sure what file should be synced first [16:17:58] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [16:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:04] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [16:18:08] 10Operations, 10decommission, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `ms-be1013.eqiad.wmnet` - ms-be1013.eqiad.wmnet - Removed from Puppet master and... 
[16:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:21] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [16:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:27] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [16:18:30] 10Operations, 10decommission, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `ms-be1014.eqiad.wmnet` - ms-be1014.eqiad.wmnet - Removed from Puppet master and... [16:18:30] ok I’ll do it [16:18:32] !log robh@cumin1001 START - Cookbook sre.hosts.decommission [16:18:36] I think syncing both files at once should be fine here [16:18:37] Urbanecm: works [16:18:38] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [16:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:42] 10Operations, 10decommission, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by robh@cumin1001 for hosts: `ms-be1015.eqiad.wmnet` - ms-be1015.eqiad.wmnet - Removed from Puppet master and... [16:18:46] isaacj, well, the config change should be deployed already, not aware of any cache... [16:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:48] thanks tgr [16:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:19] Lucas_WMDE, ah, they're in same dir.... 
makes sense [16:20:49] if IS-labs.php gets synced first it just defines unused variables, if Wikibase.php is synced first the isset check will just fail [16:20:56] (which it will anyways in production) [16:21:09] 10Operations, 10decommission, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10RobH) [16:21:19] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/: SWAT: [[gerrit:519987|Enable DataBridge on Beta (T226816)]] (production no-op) (duration: 00m 54s) [16:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:25] T226816: Enable data bridge on Wikidata Beta - https://phabricator.wikimedia.org/T226816 [16:22:11] 10Operations, 10Elasticsearch, 10Icinga, 10Discovery-Search (Current work): Create Icinga plugin to check number of eligible masters - https://phabricator.wikimedia.org/T224073 (10debt) 05Open→03Resolved [16:22:17] (03PS1) 10RobH: decom ms-be101[345] prod dns [dns] - 10https://gerrit.wikimedia.org/r/520486 (https://phabricator.wikimedia.org/T220590) [16:22:28] Urbanecm: I’m done, you can sync tgr’s change now I think [16:22:35] doing, thanks Lucas_WMDE [16:22:44] Urbanecm: well whatever it was seems to be fixed now. 
i'll do one last pass [16:22:52] great, thanks isaacj [16:22:57] 10Operations, 10Operations-Software-Development, 10User-Joe, 10User-jijiki: Spicerack cookbooks TODO list - https://phabricator.wikimedia.org/T203943 (10debt) [16:22:58] 10Operations, 10Operations-Software-Development, 10Discovery-Search (Current work), 10User-Joe, 10User-jijiki: Create WDQS reboot cookbook - https://phabricator.wikimedia.org/T224385 (10debt) 05Open→03Resolved [16:23:36] !log urbanecm@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/ReadingLists/: SWAT: [[:gerrit:520480|Fix API continuation]] (T226640) (duration: 00m 49s) [16:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:41] T226640: ReadingLists CI broken - https://phabricator.wikimedia.org/T226640 [16:23:55] (03CR) 10RobH: [C: 03+2] decom ms-be101[345] prod dns [dns] - 10https://gerrit.wikimedia.org/r/520486 (https://phabricator.wikimedia.org/T220590) (owner: 10RobH) [16:24:06] Well, I think that should be all in SWAT [16:24:06] (03PS1) 10RobH: decom ms-be101[345] puppet repo entries [puppet] - 10https://gerrit.wikimedia.org/r/520487 (https://phabricator.wikimedia.org/T220590) [16:24:09] !log Morning SWAT done [16:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:39] 10Operations, 10MediaWiki-extensions-CentralAuth, 10TimedMediaHandler, 10Traffic, and 4 others: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm) - https://phabricator.wikimedia.org/T226840 (10Tgr) At a glance: * The edit API uses it because the way it works is... [16:25:35] (03CR) 10RobH: [C: 03+2] decom ms-be101[345] puppet repo entries [puppet] - 10https://gerrit.wikimedia.org/r/520487 (https://phabricator.wikimedia.org/T220590) (owner: 10RobH) [16:25:48] Urbanecm: yep, all looks good. i believe it was my browser that was causing the lag [16:25:51] thanks Urbanecm! 
[16:26:23] operations-puppet-tests-stretch-docker queued [16:26:23] isaacj, might be, I always do those kind of verifications with Ctrl+Shift+R to bypass browser cache [16:26:25] =P [16:26:26] yw tgr [16:26:30] thanks!! [16:26:55] yw isaacj, hoping SWAT was easy for you! [16:27:16] yep Urbanecm : just the right amount of suspense that everything would work out :) [16:28:01] :) [16:30:30] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp2026 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined https://wikitech.wikimedia.org/wiki/Confd [16:30:30] PROBLEM - Varnish traffic logger - varnishstatsd on cp2026 is CRITICAL: NRPE: Command check_varnishstatsd not defined https://wikitech.wikimedia.org/wiki/Varnish [16:31:43] it’s not called the Putting Wikis To Sleep Team after all :P [16:31:54] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp2026 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 connect failed - 1986 bytes in 0.220 second response time https://wikitech.wikimedia.org/wiki/Varnish [16:33:14] PROBLEM - Check Varnish expiry mailbox lag on cp2026 is CRITICAL: NRPE: Command check_check_varnish_expiry_mailbox_lag not defined https://wikitech.wikimedia.org/wiki/Varnish [16:35:36] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'blubberoid' for release 'staging' . [16:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:51] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'zotero' for release 'staging' . [16:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:24] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'zotero' for release 'staging' . 
[16:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:18] 10Operations, 10decommission, 10media-storage, 10Patch-For-Review, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10RobH) a:05RobH→03None [16:38:27] 10Operations, 10ops-eqiad, 10decommission, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10RobH) [16:43:58] !log fsero@ helmfile [STAGING] Ran 'apply' command on namespace 'zotero' for release 'staging' . [16:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:20] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Papaul) [16:46:31] (03PS1) 10Cmjohnson: Adding mgmt dns for kafka-main1001-5 [dns] - 10https://gerrit.wikimedia.org/r/520491 (https://phabricator.wikimedia.org/T226274) [16:47:06] (03CR) 10Cmjohnson: [C: 03+2] Adding mgmt dns for kafka-main1001-5 [dns] - 10https://gerrit.wikimedia.org/r/520491 (https://phabricator.wikimedia.org/T226274) (owner: 10Cmjohnson) [16:47:40] (03PS2) 10Cmjohnson: Adding mgmt dns for kafka-main1001-5 [dns] - 10https://gerrit.wikimedia.org/r/520491 (https://phabricator.wikimedia.org/T226274) [16:47:42] (03CR) 10Cmjohnson: [V: 03+2 C: 03+2] Adding mgmt dns for kafka-main1001-5 [dns] - 10https://gerrit.wikimedia.org/r/520491 (https://phabricator.wikimedia.org/T226274) (owner: 10Cmjohnson) [16:49:34] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Papaul) a:05Papaul→03Andrew @Bstorm @Andrew OS install and puppet run done on cloudbakcup2001. It is all yours. Do all your tests if ha... 
[16:50:30] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2039 - https://phabricator.wikimedia.org/T225988 (10Papaul) [16:55:10] RECOVERY - Host mw2250 is UP: PING OK - Packet loss = 0%, RTA = 36.21 ms [16:55:37] (03PS1) 10Cmjohnson: fixing kafka-main1001.mgmt dns entry [dns] - 10https://gerrit.wikimedia.org/r/520497 (https://phabricator.wikimedia.org/T226274) [16:55:48] (03PS2) 10Cmjohnson: fixing kafka-main1001.mgmt dns entry [dns] - 10https://gerrit.wikimedia.org/r/520497 (https://phabricator.wikimedia.org/T226274) [16:56:33] (03CR) 10Cmjohnson: [C: 03+2] fixing kafka-main1001.mgmt dns entry [dns] - 10https://gerrit.wikimedia.org/r/520497 (https://phabricator.wikimedia.org/T226274) (owner: 10Cmjohnson) [17:02:10] 10Operations, 10Continuous-Integration-Infrastructure, 10serviceops, 10Release-Engineering-Team-TODO (201907): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) >>! In T207707#5302139, @thcipriani wrote: > The other option would be to move t... [17:04:52] 10Operations, 10ops-codfw, 10netops: Setup new msw1-codfw - https://phabricator.wikimedia.org/T224250 (10ayounsi) We had to rollback to the old switch. codfw is different from all the other sites, as mr1-codfw is connect to the cr1/2 through msw1 using 10G links to each routers. While the new msw1 doesn't ha... 
[17:13:33] (03PS1) 10Giuseppe Lavagetto: mediawiki::webserver: add mtail to gather latency, error rate metrics [puppet] - 10https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) [17:14:37] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::webserver: add mtail to gather latency, error rate metrics [puppet] - 10https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: 10Giuseppe Lavagetto) [17:15:13] (03PS1) 10Fsero: adding a buster docker base image [puppet] - 10https://gerrit.wikimedia.org/r/520503 [17:16:21] (03PS2) 10Fsero: adding a buster docker base image [puppet] - 10https://gerrit.wikimedia.org/r/520503 [17:25:04] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp1076 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined https://wikitech.wikimedia.org/wiki/Confd [17:25:06] PROBLEM - Varnish traffic logger - varnishstatsd on cp1076 is CRITICAL: NRPE: Command check_varnishstatsd not defined https://wikitech.wikimedia.org/wiki/Varnish [17:26:10] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp1076 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 connect failed - 1986 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Varnish [17:28:42] PROBLEM - Check Varnish expiry mailbox lag on cp1076 is CRITICAL: NRPE: Command check_check_varnish_expiry_mailbox_lag not defined https://wikitech.wikimedia.org/wiki/Varnish [17:36:14] 10Operations, 10procurement: Wildcard SSL Cert Renewal (*.corp.wikimedia.org) - https://phabricator.wikimedia.org/T227149 (10MarcoAurelio) >>! In T227149#5302968, @Peachey88 wrote: > This task is in S1 space so its visible, procurement tasks should be S4. Probably. According to #procurement description: > An... 
[17:41:39] 10Operations, 10Continuous-Integration-Infrastructure, 10Release Pipeline, 10Release-Engineering-Team-TODO (201907): Switch CI Docker Storage Driver to its own partition and to use devicemapper - https://phabricator.wikimedia.org/T178663 (10Dzahn) See progress on T207707 . The new disks are mounted now an... [17:42:48] PROBLEM - mediawiki-installation DSH group on mw2250 is CRITICAL: Host mw2250 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist [17:44:59] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team-TODO (201907): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10Dzahn) Can be closed together with T207707 once the docker images have moved to the new logica... [17:46:27] (03CR) 10Dzahn: "you have a file name of stretch.yaml but buster content?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/520503 (owner: 10Fsero) [17:49:26] (03PS3) 10Dzahn: nagios_common: update members of the gerrit contact group [puppet] - 10https://gerrit.wikimedia.org/r/512292 [17:51:01] (03CR) 10Dzahn: [C: 03+2] nagios_common: update members of the gerrit contact group [puppet] - 10https://gerrit.wikimedia.org/r/512292 (owner: 10Dzahn) [18:08:25] (03PS1) 10DannyS712: Create "autopatroller" user group on az.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520507 (https://phabricator.wikimedia.org/T227208) [18:14:00] (03PS2) 10DannyS712: Create "autopatroller" user group on az.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520507 (https://phabricator.wikimedia.org/T227208) [18:14:16] 10Operations, 10ops-eqiad: (Need By: June 30) rack/setup/install kafka-main100[1-5] - https://phabricator.wikimedia.org/T226274 (10Cmjohnson) [18:18:16] (03PS3) 10DannyS712: Create "autopatrolled" user group on az.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520507 
(https://phabricator.wikimedia.org/T227208) [18:21:02] 10Operations, 10ops-eqsin, 10Traffic: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (10RobH) [18:21:05] 10Operations, 10ops-eqsin, 10Traffic: cp5007 correctable mem errors - https://phabricator.wikimedia.org/T216716 (10RobH) 05Stalled→03Resolved ` /admin1-> racadm getsel Record: 1 Date/Time: 02/21/2019 18:11:12 Source: system Severity: Ok Description: Log cleared. ---------------------------... [18:22:11] 10Operations, 10ops-eqsin, 10Traffic: amber light on cp5006/5007 - https://phabricator.wikimedia.org/T216691 (10RobH) [18:22:14] 10Operations, 10ops-eqsin, 10Traffic: cp5006 correctable mem errors - https://phabricator.wikimedia.org/T216717 (10RobH) 05Stalled→03Resolved ` /admin1-> racadm getsel Record: 1 Date/Time: 02/21/2019 17:41:01 Source: system Severity: Ok Description: Log cleared. ---------------------------... [18:27:33] 10Operations, 10ops-eqsin: update PDUs for eqsin (asset tag and other info) - https://phabricator.wikimedia.org/T211368 (10wiki_willy) @RobH - I'll add collecting the asset tag and S/N for the PDUs as an action item for our 3rd party vendor to do, when he's out there installing the Geneti servers. 
Thanks, Willy [18:36:31] (03PS11) 10Bstorm: toolforge: correct the values for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) [18:37:52] (03CR) 10Bstorm: [C: 03+2] toolforge: correct the values for bootstrapping a cluster [puppet] - 10https://gerrit.wikimedia.org/r/520429 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [18:46:31] !log rebooting cloudelastic1001 T224228 [18:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:00] (03Abandoned) 10Bstorm: puppetdb: adapt the module so it works on Cloud VPS again [puppet] - 10https://gerrit.wikimedia.org/r/504966 (owner: 10Bstorm) [18:54:26] !log rebooting cloudelastic1002 T224228 [18:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:40] I'm deploying an UBN for Wikibase [18:58:30] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/Wikibase/data-access/src/GenericServices.php: T227207 Fix missing qualifier hashes in JSON output (duration: 00m 50s) [18:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:35] T227207: Wikibase JSON output (dumps, Special:EntityData) lacks qualifier hashes - https://phabricator.wikimedia.org/T227207 [18:58:55] hoo, apergos: Deployed, seems to have not broken the world. [18:59:06] Fix confirmed [18:59:09] thanks! [18:59:12] it's everywhere? great [18:59:15] 10Operations, 10MediaWiki-Cache, 10serviceops-radar, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 5 others: Use a multi-dc aware store for ObjectCache's MainStash if needed. - https://phabricator.wikimedia.org/T212129 (10EvanProdromou) >>! In T212129#5199079, @Krinkle... [18:59:52] I forget how much faster scapping around one file is [19:01:58] apergos: Yeah, it's so depressing scapping all of MW. [19:02:07] sigh [19:02:40] Prod fixes should feel fast. 
[19:02:48] !log rebooting cloudelastic1003 T224228 [19:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:42] ACKNOWLEDGEMENT - MD RAID on cloudelastic1003 is CRITICAL: connect to address 208.80.154.83 port 5666: No route to host nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T227222 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [19:03:45] 10Operations, 10ops-eqiad: Degraded RAID on cloudelastic1003 - https://phabricator.wikimedia.org/T227222 (10ops-monitoring-bot) [19:04:58] 10Operations, 10ops-eqiad: Degraded RAID on cloudelastic1003 - https://phabricator.wikimedia.org/T227222 (10JHedden) 05Open→03Resolved a:03JHedden This host was rebooted for T224228. I downtimed the host and services in icinga but it looks like this slipped through. [19:10:42] !log rebooting cloudelastic1004 T224228 [19:10:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:47] (03PS3) 10Ppchelko: RESTRouter: Add initial Helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/512923 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [19:22:46] !log rebooting labpuppetmaster1002 T224228 [19:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:25] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Cmjohnson) @Ottomata Please decommission the current servers to spare role Please provide the new hostnames... 
[19:44:13] (03PS1) 10Bstorm: toolforge: set up cert distribution for additional control plane nodes [puppet] - 10https://gerrit.wikimedia.org/r/520516 (https://phabricator.wikimedia.org/T215531) [19:44:33] !log rebooting labpuppetmaster1001 T224228 [19:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:35] 10Operations, 10ops-codfw, 10serviceops, 10User-jijiki: Degraded RAID on mw2250 - https://phabricator.wikimedia.org/T226948 (10Papaul) I talked to @MoritzMuehlenhoff on irc about this system. We have no 500GB 2"5 SATA disks on site for replacement. Option 1: Open a procurement task to request spare disks... [19:55:01] (03PS2) 10Bstorm: toolforge: set up cert distribution for additional control plane nodes [puppet] - 10https://gerrit.wikimedia.org/r/520516 (https://phabricator.wikimedia.org/T215531) [19:58:08] !log rebooting labmon1002 T224228 [19:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:05] cscott, arlolra, subbu, bearND, and halfak: It is that lovely time of the day again! You are hereby commanded to deploy Services – Parsoid / Citoid / Mobileapps / ORES / …. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190703T2000). 
[20:00:17] no parsoid deploy today [20:00:43] (03CR) 10BryanDavis: "> I got that, I am just saying that in my browser it doesn't get" [software/tendril] - 10https://gerrit.wikimedia.org/r/520359 (owner: 10BryanDavis) [20:02:12] (03CR) 10Bstorm: [C: 03+2] toolforge: set up cert distribution for additional control plane nodes [puppet] - 10https://gerrit.wikimedia.org/r/520516 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [20:12:24] !log rebooting labmon1001 T224228 [20:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:36] (03PS1) 10Bstorm: toolforge: create etcd dir for the certs [puppet] - 10https://gerrit.wikimedia.org/r/520519 (https://phabricator.wikimedia.org/T215531) [20:13:26] (03CR) 10Bstorm: "Just fixing a stupid mistake of forgetting to create the subdir :)" [puppet] - 10https://gerrit.wikimedia.org/r/520519 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [20:14:10] (03PS2) 10Bstorm: toolforge: create etcd dir for the certs [puppet] - 10https://gerrit.wikimedia.org/r/520519 (https://phabricator.wikimedia.org/T215531) [20:14:38] (03PS5) 10Dzahn: Gerrit: Rename error_log to gerrit.log [puppet] - 10https://gerrit.wikimedia.org/r/510625 (owner: 10Paladox) [20:18:31] (03PS6) 10Dzahn: Gerrit: Rename error_log to gerrit.log [puppet] - 10https://gerrit.wikimedia.org/r/510625 (owner: 10Paladox) [20:19:51] (03CR) 10Bstorm: [C: 03+2] toolforge: create etcd dir for the certs [puppet] - 10https://gerrit.wikimedia.org/r/520519 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [20:24:25] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@cf64319]: Update mobileapps to fdb0108 (T205550) [20:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:31] T205550: Localize the strings used by page library transforms - https://phabricator.wikimedia.org/T205550 [20:25:27] (03PS7) 10Dzahn: Gerrit: Rename error_log to gerrit.log [puppet] - 
10https://gerrit.wikimedia.org/r/510625 (owner: 10Paladox) [20:25:50] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@cf64319]: Update mobileapps to fdb0108 (T205550) (duration: 01m 25s) [20:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:12] (03CR) 10Dzahn: [C: 03+2] Gerrit: Rename error_log to gerrit.log [puppet] - 10https://gerrit.wikimedia.org/r/510625 (owner: 10Paladox) [20:27:10] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@cf64319]: Update mobileapps to fdb0108 (T205550) [20:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:19] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@cf64319]: Update mobileapps to fdb0108 (T205550) (duration: 01m 10s) [20:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:40] (03PS6) 10Dzahn: gerrit: only ship gerrit.json to logstash, not *_log [puppet] - 10https://gerrit.wikimedia.org/r/509172 (https://phabricator.wikimedia.org/T141324) [20:38:59] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Kanban, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10Ottomata) > Please decommission the current servers to spare role Ok will do. I'll downtime the the hostnam... 
[20:39:57] (03PS1) 10Ayounsi: Reserve 10.3.0.0/30 for recdns anycast (backup static route) [dns] - 10https://gerrit.wikimedia.org/r/520528 (https://phabricator.wikimedia.org/T186550) [20:40:20] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@350e74b]: Update mobileapps to 94d0233 (T205550) [20:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:40:26] T205550: Localize the strings used by page library transforms - https://phabricator.wikimedia.org/T205550 [20:45:31] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@350e74b]: Update mobileapps to 94d0233 (T205550) (duration: 05m 11s) [20:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:37] T205550: Localize the strings used by page library transforms - https://phabricator.wikimedia.org/T205550 [20:46:18] 10Operations, 10Security: Network isolation for production and semi-production services - https://phabricator.wikimedia.org/T121240 (10kchapman) [20:48:01] (03CR) 10Dzahn: [C: 03+2] gerrit: only ship gerrit.json to logstash, not *_log [puppet] - 10https://gerrit.wikimedia.org/r/509172 (https://phabricator.wikimedia.org/T141324) (owner: 10Dzahn) [20:51:17] (03PS2) 10Ayounsi: Reserve 10.3.0.0/30 for recdns anycast (backup static route) [dns] - 10https://gerrit.wikimedia.org/r/520528 (https://phabricator.wikimedia.org/T186550) [20:55:10] (03CR) 10CDanis: "mostly nits" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520502 (https://phabricator.wikimedia.org/T226815) (owner: 10Giuseppe Lavagetto) [20:55:35] (03CR) 10Ayounsi: [C: 03+2] Reserve 10.3.0.0/30 for recdns anycast (backup static route) [dns] - 10https://gerrit.wikimedia.org/r/520528 (https://phabricator.wikimedia.org/T186550) (owner: 10Ayounsi) [20:58:58] !log add static backup routes for anycast recdns on cr1/2-codfw/eqiad - T186550 [20:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:04] T186550: Anycast recdns - 
https://phabricator.wikimedia.org/T186550 [21:12:17] (03PS1) 10Andrew Bogott: nova-fullstack: rename statsd metrics to cloudvps.novafullstack [puppet] - 10https://gerrit.wikimedia.org/r/520636 (https://phabricator.wikimedia.org/T210850) [21:12:54] (03CR) 10Andrew Bogott: [C: 03+2] nova-fullstack: rename statsd metrics to cloudvps.novafullstack [puppet] - 10https://gerrit.wikimedia.org/r/520636 (https://phabricator.wikimedia.org/T210850) (owner: 10Andrew Bogott) [21:25:22] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/520507 (https://phabricator.wikimedia.org/T227208) (owner: 10DannyS712) [21:31:10] PROBLEM - SSH mw2269.mgmt on mw2269.mgmt is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:44:30] cmjohnson1: still around? [21:45:12] 10Operations, 10ops-eqiad, 10DC-Ops: a5-eqiad pdu refresh - https://phabricator.wikimedia.org/T227141 (10RobH) [21:46:47] I random [21:46:59] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (10Andrew) I've made a new dashboard, https://grafana.wikimedia.org/d/ebJoA6VWz/nova-fullstack -- once I'm convinced that it's doing what I expect I'll d... [21:47:26] urandom ? [21:47:30] what's up? [21:48:09] cmjohnson1: did you have time to talk about T222960? [21:48:09] T222960: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 [21:49:09] Sure [21:50:41] out of curiosity, is this the usual process? that we reach out to you directly? 
I had always thought we'd coordinate with whoever in SRE would be tapped for the rest of the work [21:51:16] I am only doing the physical move...everything else will need to be coordinated [21:51:34] in this case, once it's moved, we need new IPs, DNS updated, and a Puppet changeset that reflects the new IPs, and then finally a reimage [21:51:54] ok, I do that as well [21:52:04] we'd been blocking on someone who could do that, and presumably coordinate the move with you, and then we were told to get with you [21:52:09] oh! cool, all of that! [21:52:13] ? [21:52:34] yep! Is it something that can be done anytime? [21:52:44] 10Operations, 10ops-eqiad, 10DC-Ops: a6-eqiad pdu refresh - https://phabricator.wikimedia.org/T227142 (10RobH) [21:52:47] swweeet [21:52:53] we need to decommission first [21:53:18] okay, Tuesday would be the first available day I can do it...will that work for you? [21:53:19] it takes a day, two max [21:53:36] that would work, we can do that [21:53:57] I would like to do it 10/11am my time (eastern) [21:54:19] cmjohnson1: wfm, is that like...official? shall I update the ticket? [21:54:54] Yes, let's make that official [21:55:19] I am blocking the time in my calendar [21:56:22] cmjohnson1: awesome, thanks [21:58:31] 10Operations, 10ops-eqiad, 10Cassandra, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 3 others: Fix restbase1017's physical rack - https://phabricator.wikimedia.org/T222960 (10Eevans) From IRC, 2019-07-03T16:57:04-05:00: `lang=irc-log 4:51 PM in this case, o...
[21:59:06] (03PS3) 10Paladox: Gerrit: Remove additivity from it's log4j file [puppet] - 10https://gerrit.wikimedia.org/r/508657 [22:02:03] (03PS1) 10Ayounsi: Bird anycast, add monitoring for anycast-healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/520643 (https://phabricator.wikimedia.org/T186550) [22:05:10] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1001/17220/dns1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/520643 (https://phabricator.wikimedia.org/T186550) (owner: 10Ayounsi) [22:06:58] 10Operations, 10ops-eqiad, 10DC-Ops: a7-eqiad pdu refresh - https://phabricator.wikimedia.org/T227143 (10RobH) [22:08:56] (03PS1) 10Paladox: Gerrit: Remove httpd_log log [puppet] - 10https://gerrit.wikimedia.org/r/520644 [22:09:42] (03PS2) 10Paladox: Gerrit: Remove httpd_log log [puppet] - 10https://gerrit.wikimedia.org/r/520644 [22:15:15] (03CR) 10CDanis: Bird anycast, add monitoring for anycast-healthchecker (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/520643 (https://phabricator.wikimedia.org/T186550) (owner: 10Ayounsi) [22:15:56] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [22:18:10] (03PS3) 10Dzahn: Gerrit: Remove httpd_log log [puppet] - 10https://gerrit.wikimedia.org/r/520644 (owner: 10Paladox) [22:18:26] PROBLEM - Free Blazegraph allocators wdqs-blazegraph on wdqs1006 is CRITICAL: cluster=wdqs instance=wdqs1006:9193 job=blazegraph site=eqiad https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=32&fullscreen [22:18:31] ! 
[remote rejected] HEAD -> refs/for/production (internal server error: Error inserting change/patchset) [22:18:39] (03PS4) 10Paladox: Gerrit: Remove httpd_log log [puppet] - 10https://gerrit.wikimedia.org/r/520644 [22:18:41] (03PS4) 10Paladox: Gerrit: Remove additivity from it's log4j file [puppet] - 10https://gerrit.wikimedia.org/r/508657 [22:19:06] Well that's some exception :P [22:19:11] (03CR) 10Dzahn: [C: 03+2] "confirmed httpd_log exists but is entirely empty and not in use" [puppet] - 10https://gerrit.wikimedia.org/r/520644 (owner: 10Paladox) [22:20:00] (03PS5) 10Paladox: Gerrit: Remove additivity from it's log4j file [puppet] - 10https://gerrit.wikimedia.org/r/508657 [22:20:30] PROBLEM - puppet last run on lvs5001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [22:21:10] (03PS6) 10Paladox: Gerrit: Remove additivity from log4j.xml file [puppet] - 10https://gerrit.wikimedia.org/r/508657 [22:21:51] 10Puppet, 10puppet-compiler, 10wikitech.wikimedia.org, 10Documentation: Cannot run puppet-compiler: access denied - https://phabricator.wikimedia.org/T227231 (10MarcoAurelio) [22:22:55] 10Puppet, 10puppet-compiler, 10wikitech.wikimedia.org, 10Documentation: Cannot run puppet-compiler: access denied - https://phabricator.wikimedia.org/T227231 (10Paladox) You can add Hosts: in the commit msg and run "check experimental". Running the puppet compiler through jenkins UI, is limited to... [22:24:02] 10Operations, 10Analytics, 10Analytics-Kanban: Reduce memory allocation for kafkamon instances - https://phabricator.wikimedia.org/T224988 (10Nuria) 05Open→03Resolved [22:26:30] 10Puppet, 10puppet-compiler, 10wikitech.wikimedia.org, 10Documentation: Cannot run puppet-compiler: access denied - https://phabricator.wikimedia.org/T227231 (10MarcoAurelio) >>! In T227231#5305138, @Paladox wrote: > You can add Hosts: in the commit msg and run "check experimental". I am aware of t... 
[22:47:46] RECOVERY - puppet last run on lvs5001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:48:54] PROBLEM - dhclient process on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [22:49:16] PROBLEM - puppet last run on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [22:49:44] PROBLEM - configured eth on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [22:49:50] PROBLEM - Disk space on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [22:49:58] PROBLEM - SSH on stat1007 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:50:00] PROBLEM - MD RAID on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:50:02] PROBLEM - Check systemd state on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [22:50:20] PROBLEM - DPKG on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [22:50:23] Someone working on stat1007? [22:51:12] PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused [22:51:17] James_F: not me..but looking at the mgmt console [22:51:27] which shows ..nothing [22:51:33] Good sign. [22:51:36] or actually it does.. it's just super slow [22:52:02] i guess you might be right about someone working on it.. 
just that work is some manual task involving huge data sets and that slows it down [22:52:13] oom [22:52:20] some pythng process [22:52:23] python [22:52:54] RECOVERY - SSH on stat1007 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:52:58] Did someone write a fun script that doesn't nice? [22:53:04] there we go.. the options were to reboot or to wait [22:53:11] i did nothing so far [22:53:37] not the first time this happens on servers where people are running manual scripts etc [22:55:50] yea, so i can login normally again and it seems to be relatively normal. expecting more recoveries. come on Icinga [22:56:34] restarting nrpe server which always gets killed by oom [22:57:04] RECOVERY - configured eth on stat1007 is OK: OK - interfaces up [22:57:08] RECOVERY - Disk space on stat1007 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [22:57:11] !log stat1007 - systemctl restart nagios-nrpe-server after OOM from some python process [22:57:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:20] RECOVERY - MD RAID on stat1007 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:57:38] RECOVERY - DPKG on stat1007 is OK: All packages OK [22:57:40] RECOVERY - dhclient process on stat1007 is OK: PROCS OK: 0 processes with command name dhclient [22:59:46] !log stat1007 - jbd2/md0-8 invoked oom-killer [22:59:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:00] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/MobileFrontend/resources/dist/: T221197 schemaEditAttemptStep: only set bucket and anonymous-user-token on defaults if non-null (duration: 00m 51s) [23:00:04] MaxSem, RoanKattouw, and Niharika: That opportune time is upon us again.
Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190703T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:05] T221197: VE mobile default: post-deployment QA A/B test - https://phabricator.wikimedia.org/T221197 [23:00:12] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 43 minutes ago with 0 failures [23:01:40] RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks [23:16:22] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational [23:26:54] !log reset email for "Uwe Martens" [23:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:24] Doing a last minute swat deploy. [23:54:38] (PS1) Niharika29: Deploy to all wiktionary, wikivoyage and wikisource wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/520653 (https://phabricator.wikimedia.org/T218626) [23:55:23] (PS2) Niharika29: Deploy Partial blocks to all wiktionary, wikivoyage and wikisource wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/520653 (https://phabricator.wikimedia.org/T218626) [23:56:04] omg partial blocks [23:56:07] :) [23:56:59] (PS1) Dzahn: icinga/elasticsearch: fix notes_link->notes_url parameter name [puppet] - https://gerrit.wikimedia.org/r/520654 [23:57:09] (CR) Niharika29: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/520653 (https://phabricator.wikimedia.org/T218626) (owner: Niharika29) [23:57:57] hauskatze: YES! :D [23:58:13] Nice to see you by the way!
[23:58:20] (Merged) jenkins-bot: Deploy Partial blocks to all wiktionary, wikivoyage and wikisource wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/520653 (https://phabricator.wikimedia.org/T218626) (owner: Niharika29) [23:58:38] (CR) jenkins-bot: Deploy Partial blocks to all wiktionary, wikivoyage and wikisource wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/520653 (https://phabricator.wikimedia.org/T218626) (owner: Niharika29) [23:59:41] in a little bit cscott will be deploying a parsoid branch with a cherry-pick to fix T227216 [23:59:42] T227216: Adding or editing citations using VisualEditor causes major formatting issues involving pipes, equals signs and nowiki tags - https://phabricator.wikimedia.org/T227216