[00:01:39] Btw, thx Reedy for the shift, seems to be working all fine :-) [00:01:49] * CFIsch_WMDE says gn8 [00:07:11] (03CR) 10Reedy: [C: 03+2] "I've got a feeling while this should stop unwanted autopromotes.. It might break wanted autopromotes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518842 (owner: 10Reedy) [00:08:04] (03Merged) 10jenkins-bot: Rework setup of FR autopromote config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518842 (owner: 10Reedy) [00:08:19] (03CR) 10jenkins-bot: Rework setup of FR autopromote config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518842 (owner: 10Reedy) [00:10:10] !log reedy@deploy1001 Synchronized wmf-config/flaggedrevs.php: poke at autopromote config (duration: 00m 54s) [00:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:20] (03PS1) 10Reedy: Remove $wgFlaggedRevsLowProfile already moved to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518853 [00:52:09] PROBLEM - MariaDB Slave Lag: s3 on db2098 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 916.61 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [01:02:37] (03PS1) 10CDanis: dbctl: handle instance pooled for section but depooled in group [software/conftool] - 10https://gerrit.wikimedia.org/r/518855 [02:11:45] 10Operations, 10Diffusion, 10Packaging, 10Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10mmodell) So I finally got a chance to test this, I can confirm that my patched sshd bi... 
[02:13:42] 10Operations, 10Diffusion, 10Packaging, 10Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10mmodell) a:05mmodell→03None [02:16:07] RECOVERY - MariaDB Slave Lag: s3 on db2098 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [04:30:58] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging] [04:30:59] !log kartik@deploy1001 scap-helm cxserver cluster staging completed [04:30:59] !log kartik@deploy1001 scap-helm cxserver finished [04:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:11] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad] [04:37:12] !log kartik@deploy1001 scap-helm cxserver cluster eqiad completed [04:37:12] !log kartik@deploy1001 scap-helm cxserver finished [04:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:38:40] holding breath.. 
[04:44:40] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw] [04:44:41] !log kartik@deploy1001 scap-helm cxserver cluster codfw completed [04:44:41] !log kartik@deploy1001 scap-helm cxserver finished [04:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:46:57] !log Updated cxserver to use nodejs10 (T226074) [04:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:47:02] T226074: Update cxserver to nodejs10 - https://phabricator.wikimedia.org/T226074 [04:47:37] PROBLEM - puppet last run on dns4001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [04:50:11] 10Operations, 10serviceops, 10Core Platform Team Backlog (Later), 10Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10KartikMistry) >>! In T210704#5277211, @Joe wrote: > @KartikMistry if we trigger a rebuild of the production container, it s... [04:50:51] (03PS6) 10Marostegui: db-eqiad,db-codfw.php: Change last parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517807 (https://phabricator.wikimedia.org/T210725) [04:53:01] (03CR) 10Santhosh: [C: 03+1] Don't show cannot publish error to 'sysop' users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518260 (https://phabricator.wikimedia.org/T225398) (owner: 10Petar.petkovic) [04:58:07] (03CR) 10Marostegui: [C: 03+2] db-eqiad,db-codfw.php: Change last parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517807 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [04:58:27] <_joe_> kart_: oh it works? 
[04:58:32] <_joe_> \o/ [04:58:39] <_joe_> thanks for working on it [04:58:42] I am going to hold a lock for mw-config to deploy the parsercache key change, please coordinate with me in case you need to deploy mw config [04:58:44] _joe_: Thanks! [04:58:58] (03Merged) 10jenkins-bot: db-eqiad,db-codfw.php: Change last parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517807 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [04:59:13] (03CR) 10jenkins-bot: db-eqiad,db-codfw.php: Change last parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/517807 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [05:00:04] marostegui: My dear minions, it's time we take the moon! Just kidding. Time for Change parsercache key deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190625T0500). [05:01:13] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Change parsercache key T210725 (duration: 00m 58s) [05:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:01:19] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [05:01:20] !log Change parsercache key on the canaries T210725 [05:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:45] (03PS1) 10Marostegui: db1135: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/518875 (https://phabricator.wikimedia.org/T222682) [05:11:44] (03CR) 10Marostegui: [C: 03+2] db1135: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/518875 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [05:12:23] !log Change parsercache key on 20 more hosts [05:12:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:49] RECOVERY - puppet last run on dns4001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [05:22:28] 
(03PS1) 10Marostegui: db1133: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/518876 [05:23:56] (03CR) 10Marostegui: [C: 03+2] db1133: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/518876 (owner: 10Marostegui) [05:24:27] !log Change parsercachekey on 20 more hosts [05:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:32:15] 10Operations, 10ops-eqiad, 10DBA, 10Goal, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [05:32:17] 10Operations, 10ops-eqiad: Storage problems with new host db1133 - https://phabricator.wikimedia.org/T222731 (10Marostegui) 05Open→03Resolved I have re-imaged the host after Chris did it yesterday and everything looks good: RAID, memory, CPUS... ` root@db1133:~# megacli -LdPdInfo -a0 Adapter #0 Number of... [05:33:00] !log Change parsercachekey on 20 more hosts [05:33:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:39:02] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:39:10] PROBLEM - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [05:40:12] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:40:22] (03PS1) 10Marostegui: mariadb: Provision db1133 in m5 [puppet] - 10https://gerrit.wikimedia.org/r/518877 (https://phabricator.wikimedia.org/T222682) [05:41:01] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Provision db1133 in m5 [puppet] - 10https://gerrit.wikimedia.org/r/518877 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [05:43:41] !log Change parsercachekey on 10 more hosts [05:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:54] 10Operations, 10ops-ulsfo: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10ayounsi) `asw-ulsfo.mgmt.ulsfo.wmnet` was the old stack, `asw2-ulsfo.mgmt.ulsfo.wmnet` is the way to go. [05:52:52] !log Change parsercachekey on 10 more hosts [05:52:58] !log Change parsercachekey on 10 more hosts [05:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:12] (03CR) 10Elukey: "All no-ops: https://puppet-compiler.wmflabs.org/compiler1002/17072/" [puppet] - 10https://gerrit.wikimedia.org/r/518764 (owner: 10Elukey) [06:10:17] 10Operations, 10netops: Telia IC-307235 reported down from the eqiad side - https://phabricator.wikimedia.org/T226394 (10ayounsi) 05Open→03Resolved > Tha faulty card replaced and at 2019-06-25 05:41 UTC the circuit recovered and running at the moment , please check and let us know if you have any issue . >... 
[06:14:40] (03PS2) 10Elukey: Move the cdh submodule into environments/production [puppet] - 10https://gerrit.wikimedia.org/r/518764 (https://phabricator.wikimedia.org/T226466) [06:23:56] 10Operations, 10ops-eqiad, 10Analytics: Broken disk on analytics1072 - https://phabricator.wikimedia.org/T226467 (10elukey) [06:36:45] (03CR) 10Muehlenhoff: [C: 03+1] "Finally :-)" [puppet] - 10https://gerrit.wikimedia.org/r/518764 (https://phabricator.wikimedia.org/T226466) (owner: 10Elukey) [06:40:08] 10Operations, 10ops-ulsfo: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10akosiaris) >>! In T226444#5280715, @RobH wrote: > I set these to internal IP/vlan since other ganeti hosts are that way. Yup, that's correct. Thanks! [06:44:03] (03PS1) 10Muehlenhoff: Extend access date for sukhe [puppet] - 10https://gerrit.wikimedia.org/r/518888 [06:45:00] (03CR) 10Muehlenhoff: [C: 03+2] Extend access date for sukhe [puppet] - 10https://gerrit.wikimedia.org/r/518888 (owner: 10Muehlenhoff) [06:51:35] !log Change parsercachekey on 20 more hosts [06:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:35] (03Abandoned) 10Elukey: Add cdh::systemd_timer [puppet/cdh] - 10https://gerrit.wikimedia.org/r/518097 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [06:53:48] (03PS3) 10Elukey: Move the cdh submodule into environments/production [puppet] - 10https://gerrit.wikimedia.org/r/518764 (https://phabricator.wikimedia.org/T226466) [07:01:42] akosiaris: if you have a minute, can you review --^ ? 
Just to be sure :) [07:03:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/518764 (https://phabricator.wikimedia.org/T226466) (owner: 10Elukey) [07:03:22] :-) [07:03:47] thankssss [07:04:08] (03CR) 10Elukey: [C: 03+2] Move the cdh submodule into environments/production [puppet] - 10https://gerrit.wikimedia.org/r/518764 (https://phabricator.wikimedia.org/T226466) (owner: 10Elukey) [07:04:59] 10Operations, 10ops-codfw, 10netops: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 (10ayounsi) [07:09:04] !log Change parsercachekey on 10 more hosts [07:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:16] PROBLEM - puppet last run on analytics1059 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [07:09:30] !log depooled wdqs1004 due to lag [07:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:07] (03PS2) 10Marostegui: mariadb: Provision db1133 in m5 [puppet] - 10https://gerrit.wikimedia.org/r/518877 (https://phabricator.wikimedia.org/T222682) [07:13:42] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [07:13:55] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Provision db1133 in m5 [puppet] - 10https://gerrit.wikimedia.org/r/518877 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [07:14:33] (03PS1) 10Elukey: profile::hadoop::common: remove '::' in front of cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/518901 (https://phabricator.wikimedia.org/T226466) [07:15:02] (03CR) 10Marostegui: [V: 03+2 C: 03+2] "the -1 is a known issue that will be tackled once the refactoring is done." 
[puppet] - 10https://gerrit.wikimedia.org/r/518877 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [07:15:31] (03CR) 10Elukey: [C: 03+2] profile::hadoop::common: remove '::' in front of cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/518901 (https://phabricator.wikimedia.org/T226466) (owner: 10Elukey) [07:15:38] (03PS2) 10Elukey: profile::hadoop::common: remove '::' in front of cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/518901 (https://phabricator.wikimedia.org/T226466) [07:15:40] (03CR) 10Elukey: [V: 03+2 C: 03+2] profile::hadoop::common: remove '::' in front of cdh::hadoop [puppet] - 10https://gerrit.wikimedia.org/r/518901 (https://phabricator.wikimedia.org/T226466) (owner: 10Elukey) [07:16:08] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [07:16:34] PROBLEM - puppet last run on druid1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [07:16:56] PROBLEM - puppet last run on druid1006 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [07:17:20] ah yes this is me [07:18:03] disabled puppet on those node too [07:20:48] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [07:21:34] !log Change parsercachekey on 10 more hosts [07:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:23:24] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [07:23:43] (03CR) 10Volans: [C: 03+1] "LGTM, one nit inline." 
(031 comment) [software/netbox] - 10https://gerrit.wikimedia.org/r/518785 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [07:25:41] (03PS1) 10Elukey: Add missing /modules/ to cdh in environments/production [puppet] - 10https://gerrit.wikimedia.org/r/518907 (https://phabricator.wikimedia.org/T226466) [07:25:50] akosiaris,moritzm --^ :( [07:27:30] looking [07:27:41] ah, /modules/ [07:28:01] puppet environments... never seen them work ok [07:28:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add missing /modules/ to cdh in environments/production [puppet] - 10https://gerrit.wikimedia.org/r/518907 (https://phabricator.wikimedia.org/T226466) (owner: 10Elukey) [07:28:09] my bad, ENOCOFFEE [07:28:16] (03CR) 10Elukey: [C: 03+2] Add missing /modules/ to cdh in environments/production [puppet] - 10https://gerrit.wikimedia.org/r/518907 (https://phabricator.wikimedia.org/T226466) (owner: 10Elukey) [07:28:54] (03CR) 10Muehlenhoff: [C: 03+1] Add missing /modules/ to cdh in environments/production [puppet] - 10https://gerrit.wikimedia.org/r/518907 (https://phabricator.wikimedia.org/T226466) (owner: 10Elukey) [07:31:22] all right it works now :) [07:33:50] (03PS1) 10Elukey: Move cdh module from env/prod to modules [puppet] - 10https://gerrit.wikimedia.org/r/518909 (https://phabricator.wikimedia.org/T226466) [07:35:38] !log Change parsercachekey on 20 more hosts [07:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:02] RECOVERY - puppet last run on analytics1059 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:36:15] (03CR) 10Elukey: [C: 03+2] Move cdh module from env/prod to modules [puppet] - 10https://gerrit.wikimedia.org/r/518909 (https://phabricator.wikimedia.org/T226466) (owner: 10Elukey) [07:38:56] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [07:39:01] 10Operations, 10ops-eqiad, 10DBA, 10Goal, 10User-Marostegui: 
rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) 05Stalled→03Open Finally db1133 has been installed correctly! Thanks @Cmjohnson for getting it fixed! ` root@db1133:~# megacli -LdPdInfo -... [07:40:05] 10Operations, 10DBA: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396 (10Marostegui) [07:40:08] 10Operations, 10ops-eqiad, 10DBA, 10Goal, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) 05Open→03Resolved [07:43:18] (03CR) 10Volans: [C: 03+2] dbctl: remove never-implemented 'section get --mediawiki' flag [software/conftool] - 10https://gerrit.wikimedia.org/r/518802 (owner: 10CDanis) [07:44:07] !log Change parsercachekey on 20 more hosts [07:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:23] (03CR) 10Volans: [C: 03+2] dbctl: s/slave/replica/ everywhere [software/conftool] - 10https://gerrit.wikimedia.org/r/518809 (owner: 10CDanis) [07:45:20] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [07:46:06] (03CR) 10Volans: [C: 03+2] dbctl: s/reason/ro_reason/ in the schema, so section edit is clearer [software/conftool] - 10https://gerrit.wikimedia.org/r/518812 (owner: 10CDanis) [07:46:12] (03Merged) 10jenkins-bot: dbctl: remove never-implemented 'section get --mediawiki' flag [software/conftool] - 10https://gerrit.wikimedia.org/r/518802 (owner: 10CDanis) [07:46:56] (03CR) 10Volans: [C: 03+2] dbctl: handle instance pooled for section but depooled in group [software/conftool] - 10https://gerrit.wikimedia.org/r/518855 (owner: 10CDanis) [07:46:58] (03Merged) 10jenkins-bot: dbctl: s/slave/replica/ everywhere [software/conftool] - 10https://gerrit.wikimedia.org/r/518809 (owner: 10CDanis) [07:47:47] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures 
[07:47:49] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:48:01] RECOVERY - puppet last run on druid1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:48:07] (03Merged) 10jenkins-bot: dbctl: s/reason/ro_reason/ in the schema, so section edit is clearer [software/conftool] - 10https://gerrit.wikimedia.org/r/518812 (owner: 10CDanis) [07:48:07] RECOVERY - puppet last run on druid1006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:49:11] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:49:25] (03Merged) 10jenkins-bot: dbctl: handle instance pooled for section but depooled in group [software/conftool] - 10https://gerrit.wikimedia.org/r/518855 (owner: 10CDanis) [07:49:50] !log Change parsercachekey on 20 more hosts [07:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:12] 10Operations, 10Analytics, 10Cleanup: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10elukey) [07:58:23] 10Operations, 10Analytics, 10Cleanup: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10elukey) [07:58:36] 10Operations, 10Analytics, 10Cleanup: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10elukey) [07:58:40] !log Change parsercachekey on 20 more hosts [07:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:56] 10Operations, 10Cognate, 10DBA, 10Growth-Team, and 2 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Ladsgroup) A new thing that also gets affected is url shortener. FYI. 
[08:07:07] 10Operations, 10Cognate, 10DBA, 10Growth-Team, and 2 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Marostegui) Thanks @Ladsgroup! We have always talked about documenting who and which teams to tag when planning x1 switchovers, so I have cr... [08:07:47] 10Operations, 10Community-Relations, 10Traffic, 10Performance, and 2 others: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10ema) [08:08:09] 10Operations, 10Diffusion, 10Packaging, 10Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10MoritzMuehlenhoff) I'll file a bug against the Debian OpenSSH package, this seems like... [08:08:21] RECOVERY - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [08:08:52] !log Change parsercachekey on 10 more hosts [08:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:51] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@bd3df8c]: modify agents for T226471 [08:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:56] T226471: WDQS bans its own monitoring due to bad user agent - https://phabricator.wikimedia.org/T226471 [08:14:24] 10Operations, 10ops-eqiad, 10netops: update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (10ayounsi) [08:16:46] 10Operations, 10Diffusion, 10Packaging, 10Release-Engineering-Team (Kanban), and 2 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (10mmodell) This is causing significant inconvenience as we have some repositories which... 
[08:18:42] 10Operations, 10Performance-Team, 10Traffic, 10Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (10MoritzMuehlenhoff) Breakdown of servers and their config eqsin: Enabled: cp5001-cp5003, cp5007-cp5009 Disabled: cp5004-cp5006, cp50... [08:18:56] !log Change parsercachekey on 10 more hosts [08:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:22:41] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:22:53] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@bd3df8c]: modify agents for T226471 (duration: 11m 02s) [08:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:59] T226471: WDQS bans its own monitoring due to bad user agent - https://phabricator.wikimedia.org/T226471 [08:24:09] (03CR) 10Gehel: [C: 04-1] "It looks like this cookbook make sense to keep. It might need to be augmented at some point, but that's a good start!" 
(031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/517377 (https://phabricator.wikimedia.org/T225694) (owner: 10Mathew.onipe) [08:25:31] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:25:41] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [08:26:20] 502 spike affecting query.wikidata.org/sparql ^ [08:26:35] 10Operations, 10Performance-Team, 10Traffic, 10Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (10Gilles) The results are in, looking at loadEventEnd. | status | median | p90 | p95 | sample size | | SACK enabled | 1282 | 4079 | 6... [08:26:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:26:57] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [08:28:11] 10Operations, 10Performance-Team, 10Traffic, 10Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (10Gilles) Hive queries used, for reference: P8650 [08:28:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [08:30:56] !log Change parsercachekey on 20 more hosts [08:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:36] (03PS1) 10Gehel: wdqs: Set a custom U-A for prometheus blazegraph exporter [puppet] - 10https://gerrit.wikimedia.org/r/518914 (https://phabricator.wikimedia.org/T226471) [08:34:21] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [08:35:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:35:58] (03PS1) 10Elukey: Introduce the kerberos module [puppet] - 10https://gerrit.wikimedia.org/r/518915 (https://phabricator.wikimedia.org/T212259) [08:40:43] (03CR) 10ArielGlenn: [C: 03+1] "agent looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/517032 (owner: 10Vgutierrez) [08:41:16] (03CR) 10ArielGlenn: [C: 03+1] "agent looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/518914 (https://phabricator.wikimedia.org/T226471) (owner: 10Gehel) [08:41:42] * apergos goes to get caffeine [08:44:51] 10Operations, 10Cognate, 10DBA, 10Growth-Team, and 3 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Ladsgroup) >>! In T226358#5281302, @Marostegui wrote: > Thanks @Ladsgroup! > We have always talked about documenting who and which teams to... 
[08:45:44] 10Operations, 10Cognate, 10DBA, 10Growth-Team, and 3 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Marostegui) Thank you! :) [08:46:24] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (10ema) [08:46:29] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (10ema) p:05Triage→03Normal [08:46:43] (03PS1) 10Elukey: Add HTTP keytabs for the Hadoop testing cluster [labs/private] - 10https://gerrit.wikimedia.org/r/518920 [08:46:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/518915 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [08:47:23] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add HTTP keytabs for the Hadoop testing cluster [labs/private] - 10https://gerrit.wikimedia.org/r/518920 (owner: 10Elukey) [08:50:45] !log Disable puppet on thumbor* - T225284 [08:50:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Change parsercache key everywhere after deploying it in small batches for a few hours T210725 (duration: 00m 57s) [08:50:54] (03PS1) 10Elukey: Add HTTP-oozie.keytab for analytics1030 [labs/private] - 10https://gerrit.wikimedia.org/r/518921 [08:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:01] T225284: thumbor haproxy trying to send syslog on wrong port - https://phabricator.wikimedia.org/T225284 [08:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:06] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [08:51:13] (03CR) 10Elukey: [V: 03+2 C: 03+2] Add HTTP-oozie.keytab for analytics1030 [labs/private] - 10https://gerrit.wikimedia.org/r/518921 (owner: 10Elukey) [08:53:41] (03PS1) 10Ema: cache: reimage 
cp5001 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/518923 (https://phabricator.wikimedia.org/T226477) [08:54:49] (03PS2) 10Gehel: wdqs: Set a custom U-A for prometheus blazegraph exporter [puppet] - 10https://gerrit.wikimedia.org/r/518914 (https://phabricator.wikimedia.org/T226471) [08:54:57] (03PS2) 10Elukey: Introduce the kerberos module [puppet] - 10https://gerrit.wikimedia.org/r/518915 (https://phabricator.wikimedia.org/T212259) [08:56:18] (03CR) 10Gehel: [C: 03+2] wdqs: Set a custom U-A for prometheus blazegraph exporter [puppet] - 10https://gerrit.wikimedia.org/r/518914 (https://phabricator.wikimedia.org/T226471) (owner: 10Gehel) [08:57:54] (03PS1) 10Ema: cache_upload eqsin: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/518925 (https://phabricator.wikimedia.org/T226477) [08:59:25] (03PS3) 10Elukey: Introduce the kerberos module [puppet] - 10https://gerrit.wikimedia.org/r/518915 (https://phabricator.wikimedia.org/T212259) [09:00:58] 10Operations, 10DBA, 10Traffic, 10Patch-For-Review: Framework to transfer files over the LAN - https://phabricator.wikimedia.org/T156462 (10jcrespo) transfer.py was modified to add hot mysql backup taking and compression/decompression handling for provisioning. It is still a bit of a clunky mess, and it w... 
[09:02:19] !log Disable puppet on dbproxy* - T225284 [09:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:24] T225284: thumbor haproxy trying to send syslog on wrong port - https://phabricator.wikimedia.org/T225284 [09:02:54] (03CR) 10Effie Mouzeli: [V: 03+1 C: 03+2] haproxy: Disable global logging to syslog [puppet] - 10https://gerrit.wikimedia.org/r/517755 (https://phabricator.wikimedia.org/T225284) (owner: 10Effie Mouzeli) [09:03:08] (03PS3) 10Effie Mouzeli: haproxy: Disable global logging to syslog [puppet] - 10https://gerrit.wikimedia.org/r/517755 (https://phabricator.wikimedia.org/T225284) [09:08:50] !log Rolling haproxy restarts on thumbor* - T225284 [09:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:55] T225284: thumbor haproxy trying to send syslog on wrong port - https://phabricator.wikimedia.org/T225284 [09:09:26] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/518923 (https://phabricator.wikimedia.org/T226477) (owner: 10Ema) [09:09:52] !log depool cp5001 and reimage as upload_ats T226477 [09:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:09:57] T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 [09:10:36] (03PS2) 10Ema: cache: reimage cp5001 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/518923 (https://phabricator.wikimedia.org/T226477) [09:11:35] (03CR) 10Ema: [C: 03+2] cache: reimage cp5001 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/518923 (https://phabricator.wikimedia.org/T226477) (owner: 10Ema) [09:13:16] (03PS4) 10Elukey: Introduce the kerberos module [puppet] - 10https://gerrit.wikimedia.org/r/518915 (https://phabricator.wikimedia.org/T212259) [09:14:26] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqsin - 
https://phabricator.wikimedia.org/T226477 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5001.eqsin.wmnet'] ` The log can be found in `... [09:16:26] (03CR) 10Elukey: [C: 03+2] "Fixed some inconsistencies, seems good now https://puppet-compiler.wmflabs.org/compiler1002/17079/" [puppet] - 10https://gerrit.wikimedia.org/r/518915 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [09:16:34] (03PS5) 10Elukey: Introduce the kerberos module [puppet] - 10https://gerrit.wikimedia.org/r/518915 (https://phabricator.wikimedia.org/T212259) [09:22:29] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [09:22:29] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:23:55] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp5001_v4, cp5001_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [09:23:55] PROBLEM - IPsec on cp2018 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp5001_v4, cp5001_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [09:23:59] PROBLEM - IPsec on cp1084 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp5001_v4, cp5001_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [09:24:15] PROBLEM - IPsec on cp2025 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp5001_v4, cp5001_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [09:24:19] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 26 not-conn: cp5001_v4, cp5001_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [09:24:21] PROBLEM - IPsec on cp1090 is CRITICAL: Strongswan CRITICAL - ok: 34 not-conn: cp5001_v4, cp5001_v6 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [09:24:35] <_joe_> !log 
restarting gerrit on cobalt [09:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:40] that's cp5001 being reimaged ^ [09:25:03] ema: thanks, your tip discards it as unrelated [09:25:10] (to the other issues) [09:25:13] <_joe_> it will take time [09:25:21] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 26 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [09:25:25] RECOVERY - IPsec on cp1084 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [09:25:49] RECOVERY - IPsec on cp1090 is OK: Strongswan OK - 34 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [09:26:49] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26840 bytes in 6.755 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [09:27:13] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 26 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [09:28:11] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 865 bytes in 0.052 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [09:29:06] (03PS1) 10Ema: ATS: mask varnish.service [puppet] - 10https://gerrit.wikimedia.org/r/518946 [09:29:32] (03CR) 10jerkins-bot: [V: 04-1] ATS: mask varnish.service [puppet] - 10https://gerrit.wikimedia.org/r/518946 (owner: 10Ema) [09:30:19] !log enable puppet on dbproxy* [09:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:07] PROBLEM - puppet last run on an-coord1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_operations/mediawiki-config] [09:31:28] (03PS2) 10Ema: ATS: mask varnish.service [puppet] - 10https://gerrit.wikimedia.org/r/518946 [09:33:09] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [09:33:41] PROBLEM - puppet last run on db2094 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [09:34:15] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_All-Avatars] [09:34:23] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [09:35:27] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/518946 (owner: 10Ema) [09:36:31] RECOVERY - puppet last run on an-coord1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:38:04] 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10WMDE-leszek) @ArielGlenn have you maybe had a chance to discuss this topic in the SRE round? 
[09:38:27] RECOVERY - IPsec on cp2018 is OK: Strongswan OK - 26 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [09:39:09] RECOVERY - puppet last run on db2094 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:40:11] (03CR) 10Ema: [C: 03+2] ATS: mask varnish.service [puppet] - 10https://gerrit.wikimedia.org/r/518946 (owner: 10Ema) [09:42:53] 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10ArielGlenn) That week we did not meet after all! :-( And this week I was missing, so I do not know if it was discussed. It should not be delayed... [09:43:09] RECOVERY - IPsec on cp2025 is OK: Strongswan OK - 26 ESP OK https://wikitech.wikimedia.org/wiki/Monitoring/strongswan [09:43:39] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [09:43:39] marostegui: can you please change the on duty person to john ? [09:44:51] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2001 - https://phabricator.wikimedia.org/T200209 (10fgiunchedi) >>! In T200209#5264664, @MoritzMuehlenhoff wrote: >>>! In T200209#4888109, @fgiunchedi wrote: >> I'm taking graphite2001 now to do some tests for prometheus v2 upgrade i... [09:51:54] 10Operations, 10LDAP-Access-Requests: Grant WMDE engineers access to logstash / Add WMDE engineers to 'nda' LDAP group - https://phabricator.wikimedia.org/T225004 (10ArielGlenn) It was indeed discussed this week and someone at that discussion should be weighing in here these next few days. 
[09:54:12] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:55:26] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:56:49] (03PS2) 10Giuseppe Lavagetto: conftool: add safe_service_restart define [puppet] - 10https://gerrit.wikimedia.org/r/518665 [09:59:42] jijiki: done [10:00:05] :D [10:00:30] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:02:16] RECOVERY - Docker registry HTTPS interface on registry1001 is OK: HTTP OK: HTTP/1.1 200 OK - 2545 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Docker [10:04:22] (03CR) 10Effie Mouzeli: [C: 03+1] "I will wait for filippo to have another look and merge" [puppet] - 10https://gerrit.wikimedia.org/r/518658 (https://phabricator.wikimedia.org/T226373) (owner: 10Gilles) [10:04:36] 10Operations, 10User-fgiunchedi: CPU scaling governor audit - https://phabricator.wikimedia.org/T225713 (10fgiunchedi) a:03fgiunchedi [10:04:38] 10Operations, 10Mail: Add Eric to cpt-leads@wikimedia.org and remove Marko to cpt-leads@wikimedia.org - https://phabricator.wikimedia.org/T226443 (10jbond) p:05Triage→03Normal [10:05:17] 10Operations, 10Cognate, 10DBA, 10Growth-Team, and 3 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Tgr) ` tgr@stat1006:~$ analytics-mysql enwiki --use-x1 mysql:research@dbstore1005.eqiad.wmnet [enwiki]> select distinct table_name from inf... 
[10:05:34] RECOVERY - Docker registry HTTPS interface on registry1002 is OK: HTTP OK: HTTP/1.1 200 OK - 2545 bytes in 0.215 second response time https://wikitech.wikimedia.org/wiki/Docker [10:06:15] 10Operations, 10Release-Engineering-Team, 10Release-Engineering-Team-TODO, 10SRE-Access-Requests: Request access to deployment cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223698 (10jbond) @greg are you able to approve this request? [10:08:02] 10Operations, 10Cognate, 10ContentTranslation, 10DBA, and 9 others: Failover x1 master: db1069 to db1120 3rd July at 06:00 UTC - https://phabricator.wikimedia.org/T226358 (10Marostegui) Thanks a lot @Tgr I will tag those (better to tag them and they can remove themselves if it no longer applies) and update... [10:08:08] (03PS2) 10Ema: cache_upload eqsin: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/518925 (https://phabricator.wikimedia.org/T226477) [10:09:42] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [10:10:15] (03CR) 10Ema: [C: 03+2] cache_upload eqsin: read ats-be etcd keys [puppet] - 10https://gerrit.wikimedia.org/r/518925 (https://phabricator.wikimedia.org/T226477) (owner: 10Ema) [10:14:50] jouncebot: now [10:14:50] No deployments scheduled for the next 0 hour(s) and 45 minute(s) [10:14:55] jouncebot: next [10:14:55] In 0 hour(s) and 45 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190625T1100) [10:15:26] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2001 - https://phabricator.wikimedia.org/T200209 (10MoritzMuehlenhoff) [10:15:52] 10Operations, 10ops-codfw, 10decommission, 10observability: Decom graphite2001 - https://phabricator.wikimedia.org/T200209 (10MoritzMuehlenhoff) a:05fgiunchedi→03RobH [10:17:25] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5001.eqsin.wmnet'] ` and were **ALL** successful. 
[10:18:07] (03PS4) 10Giuseppe Lavagetto: profile::lvs::realserver: introduce ability to use safe-service-restart [puppet] - 10https://gerrit.wikimedia.org/r/518666 [10:18:52] (03CR) 10jerkins-bot: [V: 04-1] profile::lvs::realserver: introduce ability to use safe-service-restart [puppet] - 10https://gerrit.wikimedia.org/r/518666 (owner: 10Giuseppe Lavagetto) [10:22:18] !log pool cp5001 w/ ATS backend T226477 [10:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:25] T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 [10:25:51] (03PS1) 10Ladsgroup: Set EntityUsageTable addUsage batch size to 100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518952 (https://phabricator.wikimedia.org/T225500) [10:28:43] (03PS5) 10Giuseppe Lavagetto: profile::lvs::realserver: introduce ability to use safe-service-restart [puppet] - 10https://gerrit.wikimedia.org/r/518666 [10:29:29] (03CR) 10jerkins-bot: [V: 04-1] profile::lvs::realserver: introduce ability to use safe-service-restart [puppet] - 10https://gerrit.wikimedia.org/r/518666 (owner: 10Giuseppe Lavagetto) [10:31:08] (03PS1) 10Jbond: Production shell: create shell account for accraze [puppet] - 10https://gerrit.wikimedia.org/r/518953 (https://phabricator.wikimedia.org/T226204) [10:36:49] (03PS1) 10Elukey: Replace profile::analytics::systemd_timer with kerberos::systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/518954 (https://phabricator.wikimedia.org/T212259) [10:46:26] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/17086/" [puppet] - 10https://gerrit.wikimedia.org/r/518954 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [10:48:18] (03CR) 10Volans: [C: 03+2] Add cable names report [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/513003 (https://phabricator.wikimedia.org/T216469) (owner: 10CRusnov) [10:53:16] (03PS1) 10Elukey: camus: add support for kerberos 
[puppet] - 10https://gerrit.wikimedia.org/r/518958 (https://phabricator.wikimedia.org/T212259) [10:54:32] (03PS2) 10Elukey: camus: add support for kerberos [puppet] - 10https://gerrit.wikimedia.org/r/518958 (https://phabricator.wikimedia.org/T212259) [10:57:04] (03PS6) 10Giuseppe Lavagetto: profile::lvs::realserver: introduce ability to use safe-service-restart [puppet] - 10https://gerrit.wikimedia.org/r/518666 [10:58:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/17084/" [puppet] - 10https://gerrit.wikimedia.org/r/518666 (owner: 10Giuseppe Lavagetto) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190625T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:15] (03PS3) 10Elukey: camus: add support for kerberos [puppet] - 10https://gerrit.wikimedia.org/r/518958 (https://phabricator.wikimedia.org/T212259) [11:00:33] o/ [11:00:44] I’m probably leaving soon, but it looks like there’s nothing to SWAT anyways [11:00:56] should we still !log EU SWAT done if nothing happens? [11:03:03] 10Operations, 10Performance-Team, 10Traffic: Send peering requests to AS with the worst TTFB - https://phabricator.wikimedia.org/T219486 (10Gilles) Now that the AS report is collecting more data, I've manually compiled a list of AS we could directly peer with (and don't yet), having checked that we have at l... 
[11:03:21] Lucas_WMDE: Don't need to log anything really :) [11:03:43] ok d) [11:03:45] * :) [11:05:01] (03PS4) 10Elukey: camus: add support for kerberos [puppet] - 10https://gerrit.wikimedia.org/r/518958 (https://phabricator.wikimedia.org/T212259) [11:07:36] (03PS5) 10Elukey: camus: add support for kerberos [puppet] - 10https://gerrit.wikimedia.org/r/518958 (https://phabricator.wikimedia.org/T212259) [11:10:23] Hi people, are our servers currently under maintenance or experiencing a technical problem? [11:10:37] At least that's what upload.wikimedia.org is telling me right now [11:10:40] not that I know of. what's up? [11:10:59] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::webserver: count all nodes once [puppet] - 10https://gerrit.wikimedia.org/r/518959 [11:11:00] Can't see a thumbnail of a page of a book I'm working on on Wikisource [11:11:14] Error: 429, Too Many Requests at Tue, 25 Jun 2019 11:11:08 GMT [11:11:29] Varnish XID 471431549 // Upstream caches: cp3045 int [11:11:40] browser? [11:11:54] Firefox [11:12:04] 3 = esams (note for us) [11:12:20] link, if you don't mind? 
[11:12:30] https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Dezydery_Ch%C5%82apowski_-_O_rolnictwie.pdf/page257-836px-Dezydery_Ch%C5%82apowski_-_O_rolnictwie.pdf.jpg [11:13:30] apergos: I get Error: 429, Too Many Requests [11:13:36] yep, got the same instantly (no wait for a timeout, just an immediate rejection) [11:13:48] actually no [11:13:57] (03PS6) 10Elukey: camus: add support for kerberos [puppet] - 10https://gerrit.wikimedia.org/r/518958 (https://phabricator.wikimedia.org/T212259) [11:14:29] I got via cp3047 but yes the same message basically [11:16:55] this sounds similar to https://phabricator.wikimedia.org/T226318 [11:19:48] what's weird is that the reply is instantaneous (logged-in user too) [11:20:01] no delay of 'I'm trying for 5 seconds' or anything [11:21:00] (03CR) 10Elukey: "Looks finally good: https://puppet-compiler.wmflabs.org/compiler1001/17091/" [puppet] - 10https://gerrit.wikimedia.org/r/518958 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [11:22:57] (03PS1) 10Jbond: etcdmirror: Ensure etcdmirror is disabled on the inactive hosts [puppet] - 10https://gerrit.wikimedia.org/r/518960 [11:23:21] (03CR) 10jerkins-bot: [V: 04-1] etcdmirror: Ensure etcdmirror is disabled on the inactive hosts [puppet] - 10https://gerrit.wikimedia.org/r/518960 (owner: 10Jbond) [11:23:23] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/518960 (owner: 10Jbond) [11:23:55] (03CR) 10jerkins-bot: [V: 04-1] etcdmirror: Ensure etcdmirror is disabled on the inactive hosts [puppet] - 10https://gerrit.wikimedia.org/r/518960 (owner: 10Jbond) [11:24:23] (03PS2) 10Jbond: etcdmirror: Ensure etcdmirror is disabled on the inactive hosts [puppet] - 10https://gerrit.wikimedia.org/r/518960 [11:27:17] AILURE_THROTTLING_MAX and FAILURE_THROTTLING_DURATION I suppose those are responsible for the automatic rejection now that it's failed once [11:27:35] *FAILURE_THROTTLING_MAX [11:28:38] apergos: Sorry about that, had a phone call and 
got disconnected [11:28:50] apergos: Do you want me to provide any further information re: that error? [11:28:56] don't worry, I'm just slowly looking for anything that could shed extra light on it [11:29:03] (03PS3) 10Jbond: etcdmirror: Ensure etcdmirror is disabled on the inactive hosts [puppet] - 10https://gerrit.wikimedia.org/r/518960 [11:29:15] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/518960 (owner: 10Jbond) [11:29:17] no, that's probably all, and it might be that as ema says, you are another instance of https://phabricator.wikimedia.org/T226318 [11:29:34] and worth adding a description of the event (and the approximate time UTC) there [11:29:52] I'll keep looking but if I don't see something obvious, it will have to go to thumbor folks to really dig in [11:30:26] yeah I'm gonna add the findings to T226318 [11:30:27] T226318: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 [11:30:34] great [11:30:35] Hm, looks like the whole book is failing now [11:30:49] https://pl.wikisource.org/wiki/Strona:Dezydery_Ch%C5%82apowski_-_O_rolnictwie.pdf/226 [11:30:51] I'm looking at thumbor logs now to see if I can find any mention of the url [11:30:56] (in kibana) [11:31:07] Not sure if you can reproduce from where you're connecting [11:36:03] 10Operations, 10SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (10alaa_wmde) Hello .. apologies for not replying sooner and thanks @hashar for pinging me directly about it :) I will follow the links, indeed I should have access to turnilo a... 
[11:37:49] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, 10Thumbor: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10ema) @Gilles we've just had an interesting report on #wikimedia-operations that seems... [11:38:18] odder_: thanks for your report! I've added more information to the task: https://phabricator.wikimedia.org/T226318#5281822 [11:39:23] Thanks very much :-) It's a bummer this has occurred right now because I intended to work on the book for a few minutes but I'll try to find a workaround :) [11:42:33] https://logstash.wikimedia.org/goto/0d42f0eb059ac6b16bc4b1e2c2a1fa86 ok, I have some relevant log entries at last, though my kibana-fu is not good enough to get rid of all the cruft [11:43:23] https://logstash.wikimedia.org/goto/5a9e4e8d9e9b72a753fd01873a73503d these might be just the one url [11:43:52] ExiftoolRunner] error: 'Warning: Skipped unknown 99 byte header - /srv/thumbor/tmp/thumbor@8802/tmpt2WcOM\n' no idea what that issue is [11:44:12] guess I'll add these things to the ticket [11:44:35] 10Operations, 10Analytics, 10Cleanup, 10Patch-For-Review: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10hashar) [11:46:49] (03PS1) 10Elukey: Deprecate the cdh module [puppet/cdh] - 10https://gerrit.wikimedia.org/r/518972 [11:46:55] (03CR) 10jerkins-bot: [V: 04-1] Deprecate the cdh module [puppet/cdh] - 10https://gerrit.wikimedia.org/r/518972 (owner: 10Elukey) [11:47:05] thanks apergos [11:47:34] (03CR) 10Elukey: [V: 03+2 C: 03+2] Deprecate the cdh module [puppet/cdh] - 10https://gerrit.wikimedia.org/r/518972 (owner: 10Elukey) [11:48:12] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10ema) [11:48:51] 10Operations, 10Commons, 
10MediaWiki-File-management, 10Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10ArielGlenn) kibana entries for the https://upload.wikimedia.org/wikipedia/commons/thu... [11:49:19] 10Operations, 10Analytics, 10Cleanup, 10Patch-For-Review: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10elukey) [11:49:27] sure. I guess this is a thumbor issue in the end [11:49:48] it would be nice if the right status result were reported out of the caches though [12:02:29] 10Operations, 10Puppet, 10Release-Engineering-Team, 10Release-Engineering-Team-TODO, and 2 others: Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (10jbond) I hit trailing comma issue reported by ayounsi above and have attempted a fix. CR which trigger... [12:05:20] (03PS4) 10Jbond: etcdmirror: Ensure etcdmirror is disabled on the inactive hosts [puppet] - 10https://gerrit.wikimedia.org/r/518960 [12:05:32] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/518960 (owner: 10Jbond) [12:05:42] (03PS5) 10Jbond: etcdmirror: Ensure etcdmirror is disabled on the inactive hosts [puppet] - 10https://gerrit.wikimedia.org/r/518960 [12:08:25] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/518960 (owner: 10Jbond) [12:09:44] (03PS6) 10Jbond: etcdmirror: Ensure etcdmirror is disabled on the inactive hosts [puppet] - 10https://gerrit.wikimedia.org/r/518960 [12:10:46] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/518960 (owner: 10Jbond) [12:13:32] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/210/" [puppet] - 10https://gerrit.wikimedia.org/r/518960 (owner: 10Jbond) [12:18:40] (03CR) 10Giuseppe Lavagetto: [C: 03+1] etcdmirror: Ensure etcdmirror is disabled on the inactive hosts [puppet] - 
10https://gerrit.wikimedia.org/r/518960 (owner: 10Jbond) [12:18:45] jouncebot: now [12:18:46] No deployments scheduled for the next 3 hour(s) and 41 minute(s) [12:20:50] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::mediawiki::webserver: count all nodes once [puppet] - 10https://gerrit.wikimedia.org/r/518959 (owner: 10Giuseppe Lavagetto) [12:22:03] (03PS7) 10Jbond: etcdmirror: Ensure etcdmirror is disabled on the inactive hosts [puppet] - 10https://gerrit.wikimedia.org/r/518960 [12:22:32] (03CR) 10Jbond: [C: 03+2] etcdmirror: Ensure etcdmirror is disabled on the inactive hosts [puppet] - 10https://gerrit.wikimedia.org/r/518960 (owner: 10Jbond) [12:22:33] !log Upgrade to scap 3.10.0-1 on mw* codfw [12:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:13] !log Upgrade to scap 3.10.0-1 on mw-api-canary as well - T224915 [12:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:19] T224915: Deploy scap 3.10.0-1 - https://phabricator.wikimedia.org/T224915 [12:26:43] !log akosiaris@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=kubernetes2001.* [12:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:12] !log fully depool kubernetes2001 T226237 [12:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:17] T226237: Investigate outgoing discarded packets in the codfw kubernetes cluster - https://phabricator.wikimedia.org/T226237 [12:30:06] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10ema) Note that thumbor is occasionally returning 500 for that object. Hitting ATS to... 
[12:32:19] !log Upgrade scap to codfw - T224915 [12:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:25] T224915: Deploy scap 3.10.0-1 - https://phabricator.wikimedia.org/T224915 [12:34:58] !log Upgrade scap to eqiad - T224915 [12:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:33] (03PS1) 10Ema: cache: reimage cp5002 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/518979 (https://phabricator.wikimedia.org/T226477) [12:44:33] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/518979 (https://phabricator.wikimedia.org/T226477) (owner: 10Ema) [12:46:14] !log depool cp5002 and reimage as upload_ats T226477 [12:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:19] T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 [12:46:45] (03CR) 10Ema: [C: 03+2] cache: reimage cp5002 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/518979 (https://phabricator.wikimedia.org/T226477) (owner: 10Ema) [12:48:40] !log swift eqiad-prod: put back ms-be1033 - T223518 [12:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:46] T223518: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518 [12:49:41] !log Stop MySQL on db1117:m5 (checked dumps, they are done) to clone db1133 - T222682 [12:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:46] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 [12:50:36] (03PS1) 10Giuseppe Lavagetto: profile::lvs::realserver: fixes to safe restarts [puppet] - 10https://gerrit.wikimedia.org/r/519002 [12:51:20] (03CR) 10jerkins-bot: [V: 04-1] profile::lvs::realserver: fixes to safe restarts [puppet] - 10https://gerrit.wikimedia.org/r/519002 (owner: 10Giuseppe Lavagetto) [12:53:07] 10Operations, 10Traffic, 
10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts: ` ['cp5002.eqsin.wmnet'] ` The log can be found in `... [12:54:38] Sorry if not right place, but im not recieving a confirmation email from lists@wikimedia saying my email was recieved to wikitech-l however my emails show in the archive... is this just a delay in the mailsever? [12:54:44] Mailserver* [12:56:47] (03PS2) 10Giuseppe Lavagetto: profile::lvs::realserver: fixes to safe restarts [puppet] - 10https://gerrit.wikimedia.org/r/519002 [12:57:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "Let's try this for sure and see if it helps, I'm generally skeptical of bumping timeouts but no harm in trying. Thanks for looking into th" [puppet] - 10https://gerrit.wikimedia.org/r/518658 (https://phabricator.wikimedia.org/T226373) (owner: 10Gilles) [12:57:11] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1001/17093/mw1261.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/519002 (owner: 10Giuseppe Lavagetto) [12:57:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] profile::lvs::realserver: fixes to safe restarts [puppet] - 10https://gerrit.wikimedia.org/r/519002 (owner: 10Giuseppe Lavagetto) [13:01:22] (03PS3) 10Filippo Giunchedi: dsa-check-hpssacli: import latest version from DSA [puppet] - 10https://gerrit.wikimedia.org/r/516724 (https://phabricator.wikimedia.org/T220787) (owner: 10Faidon Liambotis) [13:05:27] (03PS2) 10Giuseppe Lavagetto: mediawiki: use safe-restart scripts on all appserver, apis [puppet] - 10https://gerrit.wikimedia.org/r/518667 [13:07:06] 10Operations, 10Discovery-Search: Create cookbook to reindex into elasticsearch / cirrus - https://phabricator.wikimedia.org/T219507 (10Mathew.onipe) [13:08:14] (03CR) 10Giuseppe Lavagetto: [C: 03+2] 
"https://puppet-compiler.wmflabs.org/compiler1001/17094/mw1275.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/518667 (owner: 10Giuseppe Lavagetto) [13:08:20] (03CR) 10Fsero: mediawiki: use safe-restart scripts on all appserver, apis (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/518667 (owner: 10Giuseppe Lavagetto) [13:08:55] <_joe_> argg [13:09:12] <_joe_> how did I manage to get a copy/paste wrong [13:09:44] <3 _joe_ [13:10:21] <_joe_> oh no I did get it wrong the first time [13:11:12] (03PS1) 10Giuseppe Lavagetto: mediawiki::webservers: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/519008 [13:11:43] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::webservers: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/519008 (owner: 10Giuseppe Lavagetto) [13:12:00] <_joe_> thanks fsero :P [13:12:35] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:12:46] (03PS2) 10Filippo Giunchedi: profile: fix swift symlink for WMCS LVs [puppet] - 10https://gerrit.wikimedia.org/r/516791 [13:13:22] James_F: the echo icons from the top toolbar have vanished for me - is that happening for more people? [13:13:30] 10Operations, 10Analytics, 10Cleanup: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10Ottomata) Why do we want to delete everything? Can't we just leave it up with a note in the README me that active development has moved into ops/puppet? [13:16:11] 10Operations, 10Analytics, 10Cleanup: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10elukey) >>! In T226474#5282138, @Ottomata wrote: > Why do we want to delete everything? Can't we just leave it up with a note in the README me that active development has moved into ops/puppet?... 
[13:16:26] (03CR) 10Filippo Giunchedi: [C: 03+2] dsa-check-hpssacli: import latest version from DSA [puppet] - 10https://gerrit.wikimedia.org/r/516724 (https://phabricator.wikimedia.org/T220787) (owner: 10Faidon Liambotis) [13:16:35] (03PS4) 10Filippo Giunchedi: dsa-check-hpssacli: import latest version from DSA [puppet] - 10https://gerrit.wikimedia.org/r/516724 (https://phabricator.wikimedia.org/T220787) (owner: 10Faidon Liambotis) [13:18:07] 10Operations, 10decommission, 10media-storage, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10jijiki) @RobH I had a similar issue with cumin and ms-be1013.eqiad.wmnet, is it possible we move forward with removing it from the fleet? [13:18:55] (03PS2) 10Giuseppe Lavagetto: docker-registry: use safe restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/518668 [13:19:44] 10Operations, 10ops-eqiad, 10Cassandra, 10DC-Ops, and 4 others: restbase-dev1006 has a broken disk - https://phabricator.wikimedia.org/T224260 (10jijiki) @RobH @Cmjohnson I had an issue today again running cumin to the fleet because restbase-dev1006 was stalling due to disk errors :( [13:21:05] 10Operations, 10DBA, 10Patch-For-Review, 10codfw-rollout: [RFC] improve parsercache replication and sharding handling - https://phabricator.wikimedia.org/T133523 (10Marostegui) [13:32:41] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (10Gilles) A lot of files fail to render for various reasons, and end up as 429s because... 
[13:33:44] (03PS3) 10Giuseppe Lavagetto: docker-registry: use safe restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/518668 [13:33:46] (03PS1) 10Giuseppe Lavagetto: safe-service-restart: require conftool::scripts [puppet] - 10https://gerrit.wikimedia.org/r/519013 [13:33:48] 10Operations, 10Analytics, 10Cleanup: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10Ottomata) I guess, just kind of a shame since now google will stop indexing and returning it in search results. It'd be nice if people could actually find it. [13:33:53] (03CR) 10Gilles: "That's clearly not the solution, but I want to see if it decreases the frequency of errors or not." [puppet] - 10https://gerrit.wikimedia.org/r/518658 (https://phabricator.wikimedia.org/T226373) (owner: 10Gilles) [13:35:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] safe-service-restart: require conftool::scripts [puppet] - 10https://gerrit.wikimedia.org/r/519013 (owner: 10Giuseppe Lavagetto) [13:35:21] (03PS2) 10Giuseppe Lavagetto: safe-service-restart: require conftool::scripts [puppet] - 10https://gerrit.wikimedia.org/r/519013 [13:37:12] 10Operations, 10Analytics, 10Cleanup: Archive cdh puppet submodule - https://phabricator.wikimedia.org/T226474 (10elukey) I don't find the documentation that states the procedure in the description, the last time @Krinkle showed it to me, maybe there are some caveats to get around it. Indexing will be done... [13:37:59] (03PS4) 10Giuseppe Lavagetto: docker-registry: use safe restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/518668 [13:38:19] RECOVERY - Check the Netbox report-s- puppetdb for fail status. 
on netmon1002 is OK: puppetdb.PuppetDB OK https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:39:33] (03CR) 10Mathew.onipe: Add maps reboot cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: 10Mathew.onipe) [13:39:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/17096/registry1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/518668 (owner: 10Giuseppe Lavagetto) [13:41:02] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10fgiunchedi) >>! In T210723#5255625, @faidon wrote: > So, the timeout patch above bumped the timeouts to 100s I t... [13:42:35] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: introduce basic k8s node puppet code [puppet] - 10https://gerrit.wikimedia.org/r/519015 (https://phabricator.wikimedia.org/T215531) [13:44:44] (03PS2) 10Herron: kafka-main: replace kafka2003 hardware with kafka-main2003 [puppet] - 10https://gerrit.wikimedia.org/r/514361 (https://phabricator.wikimedia.org/T225005) [13:54:00] (03PS5) 10Filippo Giunchedi: dsa-check-hpssacli: refactor for speed/efficiency [puppet] - 10https://gerrit.wikimedia.org/r/516725 (https://phabricator.wikimedia.org/T210723) (owner: 10Faidon Liambotis) [13:54:02] (03PS5) 10Filippo Giunchedi: dsa-check-hpssacli: make compatible with ssacli [puppet] - 10https://gerrit.wikimedia.org/r/516726 (https://phabricator.wikimedia.org/T220787) (owner: 10Faidon Liambotis) [13:54:38] !log changing replication factor of v4 keyspace for maps codfw cluster - T226161 [13:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:45] T226161: Change maps codfw replication factor for v4 keyspace - https://phabricator.wikimedia.org/T226161 [13:54:57] 10Operations, 10Analytics, 10Cleanup: Archive cdh puppet submodule 
- https://phabricator.wikimedia.org/T226474 (10Ottomata) Yeah, but it will be lost to non WMFers, since it is buried deep in a huge repo. Oh well I guess :( [13:56:12] (03CR) 10Filippo Giunchedi: "I've changed the iteration over %drives to be sorted for consistent output, PTAL! Looks good to be merged to me." [puppet] - 10https://gerrit.wikimedia.org/r/516725 (https://phabricator.wikimedia.org/T210723) (owner: 10Faidon Liambotis) [13:56:34] 10Operations, 10ops-codfw: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10Papaul) @akosiaris this is where it stops This is an overview of your currently configured partitions and mount │ │ points. Select a partition to modify its settings (file system, m... [13:56:38] (03PS3) 10Ottomata: Remove monitoring and alerts for kafka analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/518814 (https://phabricator.wikimedia.org/T183303) [13:57:34] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp5002.eqsin.wmnet'] ` and were **ALL** successful. 
[13:58:34] (03PS1) 10Fsero: registry: improving swift replication [puppet] - 10https://gerrit.wikimedia.org/r/519018 [13:59:21] (03CR) 10Fsero: "This should help with swift replication being slow on registries" [puppet] - 10https://gerrit.wikimedia.org/r/519018 (owner: 10Fsero) [14:00:18] 10Operations, 10observability: Icinga custom checks should follow our HTTP User-Agent policy - https://phabricator.wikimedia.org/T226508 (10CDanis) [14:02:39] !log pool cp5002 w/ ATS backend T226477 [14:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:43] T226477: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 [14:05:12] (03CR) 10Fdans: "Removing -1 since referenced change has been merged" [puppet] - 10https://gerrit.wikimedia.org/r/517085 (https://phabricator.wikimedia.org/T221064) (owner: 10Fdans) [14:06:44] (03PS1) 10MarcoAurelio: Initial commit [debs/file-read-backwards] - 10https://gerrit.wikimedia.org/r/519020 [14:08:36] (03CR) 10MarcoAurelio: [V: 03+2 C: 03+2] Initial commit [debs/file-read-backwards] - 10https://gerrit.wikimedia.org/r/519020 (owner: 10MarcoAurelio) [14:09:09] (03CR) 10Ottomata: "This will work since reportupdater::job will only clone a given repository once. We should make this the default $repository in reportupd" [puppet] - 10https://gerrit.wikimedia.org/r/517085 (https://phabricator.wikimedia.org/T221064) (owner: 10Fdans) [14:09:41] (03CR) 10Ottomata: [C: 03+2] Remove monitoring and alerts for kafka analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/518814 (https://phabricator.wikimedia.org/T183303) (owner: 10Ottomata) [14:11:57] (03PS1) 10Herron: kafka-main2001: disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/519021 [14:12:56] (03CR) 10Giuseppe Lavagetto: "AIUI this is only impacting the registry container, right? if so, LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/519018 (owner: 10Fsero) [14:15:22] (03PS1) 10Ottomata: Set role spare::system for old analytics Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/519023 (https://phabricator.wikimedia.org/T183303) [14:15:36] PROBLEM - jmxtrans on kafka1012 is CRITICAL: NRPE: Command check_jmxtrans not defined https://wikitech.wikimedia.org/wiki/Kafka [14:15:41] PROBLEM - Kafka Broker Server on kafka1012 is CRITICAL: NRPE: Command check_kafka not defined https://wikitech.wikimedia.org/wiki/Kafka [14:15:56] oh! hmmm indeed icinga lag here. [14:15:58] (03CR) 10Fsero: "gently reminder" [puppet] - 10https://gerrit.wikimedia.org/r/517888 (https://phabricator.wikimedia.org/T212130) (owner: 10Fsero) [14:16:05] heyo [14:16:11] <_joe_> wut [14:16:15] expected I take it, ottomata ? [14:16:17] that... must be a puppetization error? [14:16:19] yes [14:16:19] no [14:16:23] that is [14:16:23] oh look. I got a page, and I got it on time for once [14:16:23] https://phabricator.wikimedia.org/T183303 [14:16:32] <_joe_> ottomata: we're changing the decom procedure btw [14:16:36] oh? [14:16:36] <_joe_> finally [14:16:52] sorry icinga puppet just lagged there, the checks all just got removed from icinga [14:16:52] <_joe_> there was a discussion at the SRE summit [14:17:45] any info I can look at? [14:17:50] i'm about to apply spare::system [14:17:52] to these old nodes [14:19:23] <_joe_> ottomata: in the SRE team drive there is the final document [14:19:29] "Quick server decom step: `dd` the bootloader to kill it, update Netbox state - thus the server is dead. Ideally the port mapping information from Netbox would be displayed to aid in removing the host from the switch. This decouples the “uninterruptible” steps into interruptible steps."
[14:19:32] <_joe_> which is almost ready for publication [14:19:34] (03PS1) 10Herron: kafka-main2002 disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/519025 [14:19:36] (03PS1) 10Herron: kafka-main2003 disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/519026 [14:19:53] (03CR) 10Urbanecm: "> Patch Set 1: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518391 (https://phabricator.wikimedia.org/T226315) (owner: 10Zoranzoki21) [14:20:15] (03CR) 10Urbanecm: "> Patch Set 1: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518390 (https://phabricator.wikimedia.org/T226315) (owner: 10Zoranzoki21) [14:20:23] _joe_: I don't think that drive is shared with me [14:20:26] (03CR) 10Jbond: [C: 03+1] "LGMT" [puppet] - 10https://gerrit.wikimedia.org/r/516725 (https://phabricator.wikimedia.org/T210723) (owner: 10Faidon Liambotis) [14:20:46] (03CR) 10Herron: [C: 03+2] kafka-main2003 disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/519026 (owner: 10Herron) [14:21:00] (03PS2) 10Herron: kafka-main2003 disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/519026 [14:21:27] !log rebooting cloudvirt1014, 1018, 1024 [14:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:39] _joe_: can/should I continue with this decom? [14:21:40] (03CR) 10Urbanecm: [C: 03+1] "LGTM, Zoran!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518390 (https://phabricator.wikimedia.org/T226315) (owner: 10Zoranzoki21) [14:22:13] (03CR) 10Urbanecm: [C: 03+1] "LGTM, Zoran!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/518391 (https://phabricator.wikimedia.org/T226315) (owner: 10Zoranzoki21) [14:23:25] ottomata: yeah, until the new process is officially in place and the tooling/docs are around, simply switch them to role(spare) [14:23:53] ok thanks [14:23:59] (03CR) 10Ottomata: [C: 03+2] Set role spare::system for old analytics Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/519023 (https://phabricator.wikimedia.org/T183303) (owner: 10Ottomata) [14:25:01] but please create a proper decommission ticket which lists all the steps for proper tracking: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_to_Spares_OR_Decommission [14:25:05] I will for sure! [14:25:07] am following that. [14:25:52] (03PS3) 10Herron: kafka-main2003 disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/519026 [14:26:10] !log shutting down Kafka on old analytics brokers - T183303 [14:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:15] T183303: Decomission old analytics kafka cluster - https://phabricator.wikimedia.org/T183303 [14:27:00] wow this is a great result, finally! [14:27:55] yeah long time coming! [14:32:39] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10fgiunchedi) Also a spot-check for hosts with current warnings/errors from hpssacli, also results match between o...
[14:33:25] (03PS1) 10Ottomata: Remove now unused role::kafka::analytics* [puppet] - 10https://gerrit.wikimedia.org/r/519033 (https://phabricator.wikimedia.org/T183303) [14:34:01] (03CR) 10Ottomata: [C: 03+2] Remove now unused role::kafka::analytics* [puppet] - 10https://gerrit.wikimedia.org/r/519033 (https://phabricator.wikimedia.org/T183303) (owner: 10Ottomata) [14:34:32] (03PS6) 10Filippo Giunchedi: dsa-check-hpssacli: refactor for speed/efficiency [puppet] - 10https://gerrit.wikimedia.org/r/516725 (https://phabricator.wikimedia.org/T210723) (owner: 10Faidon Liambotis) [14:35:17] (03CR) 10Muehlenhoff: "Please also remove the relevant aliases in modules/profile/templates/cumin/aliases.yaml.erb if you drop a role." [puppet] - 10https://gerrit.wikimedia.org/r/519033 (https://phabricator.wikimedia.org/T183303) (owner: 10Ottomata) [14:35:39] (03CR) 10Filippo Giunchedi: [C: 03+2] dsa-check-hpssacli: refactor for speed/efficiency [puppet] - 10https://gerrit.wikimedia.org/r/516725 (https://phabricator.wikimedia.org/T210723) (owner: 10Faidon Liambotis) [14:38:14] (03PS6) 10Filippo Giunchedi: dsa-check-hpssacli: make compatible with ssacli [puppet] - 10https://gerrit.wikimedia.org/r/516726 (https://phabricator.wikimedia.org/T220787) (owner: 10Faidon Liambotis) [14:38:25] moritzm: should I remove the linux-host-entries for these hosts? 
[14:38:49] oh i see that is part of dc ops steps [14:38:49] nm [14:39:07] yeah, but also doesn't hurt to drop them already if the hosts are old and up for decomission [14:39:13] oh ok will do [14:40:14] (03PS1) 10Ottomata: Remove various kafka analytics cluster related configs [puppet] - 10https://gerrit.wikimedia.org/r/519034 (https://phabricator.wikimedia.org/T183303) [14:41:12] (03PS2) 10Ottomata: Remove various kafka analytics cluster related configs [puppet] - 10https://gerrit.wikimedia.org/r/519034 (https://phabricator.wikimedia.org/T183303) [14:41:42] (03PS3) 10Ottomata: Remove various kafka analytics cluster related configs [puppet] - 10https://gerrit.wikimedia.org/r/519034 (https://phabricator.wikimedia.org/T183303) [14:41:45] (03PS1) 10Bstorm: cloud: add pin for jessie python3 broken package [puppet] - 10https://gerrit.wikimedia.org/r/519035 (https://phabricator.wikimedia.org/T226480) [14:42:17] (03CR) 10jerkins-bot: [V: 04-1] cloud: add pin for jessie python3 broken package [puppet] - 10https://gerrit.wikimedia.org/r/519035 (https://phabricator.wikimedia.org/T226480) (owner: 10Bstorm) [14:42:38] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [14:43:07] !log beginning replacement of kafka2003 with kafka-main2003 T225005 [14:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:13] T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 [14:43:54] (03CR) 10Urbanecm: [C: 04-1] Enable Page Previews as default on hewikivoyage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508088 (https://phabricator.wikimedia.org/T222017) (owner: 10Ammarpad) [14:43:58] (03CR) 10Ottomata: [C: 03+2] "Also removes burrow monitoring of analytics kafka cluster:" [puppet] - 10https://gerrit.wikimedia.org/r/519034 (https://phabricator.wikimedia.org/T183303) (owner: 
10Ottomata) [14:44:06] (03PS4) 10Ottomata: Remove various kafka analytics cluster related configs [puppet] - 10https://gerrit.wikimedia.org/r/519034 (https://phabricator.wikimedia.org/T183303) [14:44:10] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Remove various kafka analytics cluster related configs [puppet] - 10https://gerrit.wikimedia.org/r/519034 (https://phabricator.wikimedia.org/T183303) (owner: 10Ottomata) [14:44:51] (03CR) 10Muehlenhoff: "FWIW; there's a revised version incoming:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/519035 (https://phabricator.wikimedia.org/T226480) (owner: 10Bstorm) [14:45:11] (03Abandoned) 10Bstorm: cloud: add pin for jessie python3 broken package [puppet] - 10https://gerrit.wikimedia.org/r/519035 (https://phabricator.wikimedia.org/T226480) (owner: 10Bstorm) [14:45:29] (03CR) 10Urbanecm: [C: 04-1] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508088 (https://phabricator.wikimedia.org/T222017) (owner: 10Ammarpad) [14:45:35] (03CR) 10Jhedden: "The package has been fixed upstream in version 3.4.2-1+deb8u4. I've confirmed that it's working as expected." [puppet] - 10https://gerrit.wikimedia.org/r/519035 (https://phabricator.wikimedia.org/T226480) (owner: 10Bstorm) [14:46:18] 10Operations, 10SRE-Access-Requests: Request access to analytics cluster for Alaa Sarhan - https://phabricator.wikimedia.org/T223697 (10Nuria) >I will probably still need to have hadoop access at some point, as some inquiries might not be fulfilled by turnilo with the sampling and 1-week lifetime of data limit... 
[14:49:32] (03PS6) 10Urbanecm: Set default aliases for Project_talk namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493385 (https://phabricator.wikimedia.org/T173070) (owner: 10Ammarpad) [14:49:45] 10Operations, 10Analytics, 10decommission: Reclaim/Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10Ottomata) [14:49:55] 10Operations, 10Analytics, 10decommission: Reclaim/Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10Ottomata) [14:51:34] the UNKNOWNs for hp raid in icinga is me, known [14:53:25] (03PS1) 10Filippo Giunchedi: raid: allow 'controller all show detail' for nagios [puppet] - 10https://gerrit.wikimedia.org/r/519038 (https://phabricator.wikimedia.org/T210723) [14:53:30] known unknowns 🤔 [14:53:34] PROBLEM - puppet last run on graphite1004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [14:55:47] (03CR) 10Urbanecm: "Should this be deployed?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/430627 (https://phabricator.wikimedia.org/T176553) (owner: 10Huji) [14:57:39] (03PS2) 10Filippo Giunchedi: raid: adjust sudo permissions for new HP RAID checks [puppet] - 10https://gerrit.wikimedia.org/r/519038 (https://phabricator.wikimedia.org/T210723) [14:58:38] (03CR) 10Filippo Giunchedi: [C: 03+2] raid: adjust sudo permissions for new HP RAID checks [puppet] - 10https://gerrit.wikimedia.org/r/519038 (https://phabricator.wikimedia.org/T210723) (owner: 10Filippo Giunchedi) [14:59:07] 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 2 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10jijiki) p:05High→03Normal @thiemowmde @WMDE-Fisch I have installed php-wikidiff2_1.8.2-1~wmf1_amd64 on deployment-mediawi... [14:59:57] (03PS3) 10Herron: kafka-main: replace kafka2003 hardware with kafka-main2003 [puppet] - 10https://gerrit.wikimedia.org/r/514361 (https://phabricator.wikimedia.org/T225005) [15:01:22] 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 2 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10jijiki) [15:01:43] (03PS4) 10Herron: kafka-main: replace kafka2003 hardware with kafka-main2003 [puppet] - 10https://gerrit.wikimedia.org/r/514361 (https://phabricator.wikimedia.org/T225005) [15:01:53] 10Operations, 10Performance-Team, 10Traffic: Send peering requests to AS with the worst TTFB - https://phabricator.wikimedia.org/T219486 (10ayounsi) Current process is to lookup their email contact in PeeringDB and manually send them an email, CCing our peering@ email. 
[15:03:11] (03CR) 10Herron: [C: 03+2] kafka-main: replace kafka2003 hardware with kafka-main2003 [puppet] - 10https://gerrit.wikimedia.org/r/514361 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [15:05:00] mhhh the hp raid checks that went unknown but were critical before might trigger the raid handler to open a task, I'll take a look and cleanup as needed [15:06:22] PROBLEM - Kafka Broker Under Replicated Partitions on kafka2002 is CRITICAL: 263 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka2002 [15:06:52] PROBLEM - Kafka Broker Under Replicated Partitions on kafka2001 is CRITICAL: 346 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka2001 [15:07:33] ^ that’s from me depooling kafka2003 [15:07:57] (03PS1) 10Alexandros Kosiaris: kubernetes: Incoming BGP should work over IPv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/519041 [15:08:59] (03CR) 10Mforns: analytics::refinery::job::data_purge add deletion for data_quality_hourly (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/518069 (https://phabricator.wikimedia.org/T215863) (owner: 10Mforns) [15:09:18] (03PS1) 10Ottomata: Produce page-properties-change stream to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519043 (https://phabricator.wikimedia.org/T211248) [15:09:23] (03CR) 10Fsero: [C: 03+2] kubernetes: Incoming BGP should work over IPv6 as well [puppet] - 10https://gerrit.wikimedia.org/r/519041 (owner: 10Alexandros Kosiaris) [15:11:41] (03CR) 10Ottomata: [C: 03+2] Produce page-properties-change stream to eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519043 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [15:11:59] (03CR) 10jenkins-bot: Produce page-properties-change stream to 
eventgate-main [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519043 (https://phabricator.wikimedia.org/T211248) (owner: 10Ottomata) [15:12:14] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [15:12:22] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:12:42] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:12:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:13:13] 10Operations, 10ops-eqiad: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T226519 (10ops-monitoring-bot) [15:13:32] (03CR) 10Mforns: [C: 04-1] "Wait elukey, I forgot the --skip-trash param :[ adding." 
[puppet] - 10https://gerrit.wikimedia.org/r/518069 (https://phabricator.wikimedia.org/T215863) (owner: 10Mforns) [15:13:37] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Produce page-properties-change stream to eventgate-main - T211248 (duration: 00m 58s) [15:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:43] T211248: Modern Event Platform: Stream Intake Service: Migrate Mediawiki Eventbus events to eventgate-main - https://phabricator.wikimedia.org/T211248 [15:14:54] ACKNOWLEDGEMENT - HP RAID on ms-be2018 is CRITICAL: CRITICAL: Slot 3: Failed: 1I:1:1 - OK: 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T226520 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:14:57] 10Operations, 10ops-codfw: Degraded RAID on ms-be2018 - https://phabricator.wikimedia.org/T226520 (10ops-monitoring-bot) [15:15:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [15:16:14] Krinkle: You deploying now, or can I sling out https://gerrit.wikimedia.org/r/c/mediawiki/skins/MonoBook/+/519045 ? [15:16:42] (03PS1) 10Gehel: wdqs: publish full MDC in file based logs. 
[puppet] - 10https://gerrit.wikimedia.org/r/519046 [15:19:10] 10Operations, 10Traffic, 10Core Platform Team Backlog (Designing), 10MW-1.34-notes (1.34.0-wmf.6; 2019-05-21), and 6 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10mobrovac) [15:19:45] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [15:20:27] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [15:22:46] (03PS1) 10Matthias Mullie: [SDC] Enable other statements on testcommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519047 [15:23:16] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10ayounsi) New proposal after talking to @faidon. The one above focuses on dropping invalids. The one bellow adds visibility on Invalid (until we drop them) as well as unknown and valids. First push the following to cr4-ulsfo... 
[15:24:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [15:25:15] (03PS3) 10CRusnov: Add a passthrough configuration system [software/netbox] - 10https://gerrit.wikimedia.org/r/518785 (https://phabricator.wikimedia.org/T209182) [15:25:31] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10Volans) @fgiunchedi the sort should be fixed otherwise will keep alerting on icinga because of the changing text... [15:25:48] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Add a passthrough configuration system (031 comment) [software/netbox] - 10https://gerrit.wikimedia.org/r/518785 (https://phabricator.wikimedia.org/T209182) (owner: 10CRusnov) [15:27:46] (03PS1) 10Jbond: audit_hiera: This is a small script to audit the private repo [labs/private] - 10https://gerrit.wikimedia.org/r/519050 [15:27:49] (03PS1) 10Jbond: missing data: audited using audit_hiera.py script [labs/private] - 10https://gerrit.wikimedia.org/r/519051 [15:28:04] 10Operations, 10ops-eqiad: Degraded RAID on db1077 - https://phabricator.wikimedia.org/T226519 (10Marostegui) 05Open→03Declined This is not the RAID, this is the BBU which is broken - T225391#5261662 but the host is out of warranty [15:28:29] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.10/skins/MonoBook/includes/SkinMonoBook.php: T226503 Fix Notifications RL module dependency (duration: 00m 57s) [15:28:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:35] T226503: Notification icons gone on meta wiki when using Monobook: "Error: Unknown module: ext.echo.badgeicons" - https://phabricator.wikimedia.org/T226503 [15:29:56] 10Operations, 10observability, 
10Patch-For-Review, 10User-fgiunchedi: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10fgiunchedi) >>! In T210723#5282812, @Volans wrote: > @fgiunchedi the sort should be fixed otherwise will keep al... [15:32:17] (03PS1) 10CRusnov: netbox: Add icinga/cron for cable report [puppet] - 10https://gerrit.wikimedia.org/r/519054 [15:34:53] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10JobSnijders) > To be researched: I'm not sure yet if bringing the validator closer to the routers (eg. in POPs) brings significant improvements. If so and once T96852 is solved, we could bring them closer to the POP routers.... [15:35:01] (03PS4) 10Mforns: analytics::refinery::job::data_purge add deletion for data_quality_hourly [puppet] - 10https://gerrit.wikimedia.org/r/518069 (https://phabricator.wikimedia.org/T215863) [15:38:48] 10Operations, 10ops-eqiad, 10Analytics, 10DC-Ops, 10decommission: Reclaim/Decommission old Kafka analytics brokers: kafka1012,kafka1013,kafka1014,kafka1020,kafka1022,kafka1023 - https://phabricator.wikimedia.org/T226517 (10jbond) p:05Triage→03Normal [15:39:30] (03CR) 10Mforns: "OK, added the --skip-trash and corrected an imprecision in the regex. Tested again and looks good!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/518069 (https://phabricator.wikimedia.org/T215863) (owner: 10Mforns) [15:39:48] 10Operations, 10ops-ulsfo: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) [15:43:57] 10Operations, 10netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (10JobSnijders) >>! In T220669#5282795, @ayounsi wrote: > Once we're ready to drop invalids everywhere, move the above term to the `BGP_community_actions` policy. I've reviewed the proposed configuration and this looks good t... 
[15:47:01] James_F: was afk, no problem [15:47:14] (03PS1) 10Ema: varnishfetcherror: log BogoHeader [puppet] - 10https://gerrit.wikimedia.org/r/519056 (https://phabricator.wikimedia.org/T226375) [15:47:58] (03PS2) 10Jbond: missing data: audited using audit_hiera.py script [labs/private] - 10https://gerrit.wikimedia.org/r/519051 [15:48:51] (03CR) 10CRusnov: "Compiler output looks good." [puppet] - 10https://gerrit.wikimedia.org/r/519054 (owner: 10CRusnov) [15:51:12] tgr: https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/519040/ is on mwdebug1002 now [15:51:47] 10Operations, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Address recurrent service check time out for "HP RAID" on swift backend hosts - https://phabricator.wikimedia.org/T210723 (10Volans) Great! Thanks a lot. [15:53:15] (03PS1) 10Elukey: Set oozie as proxy for the Hadoop testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/519057 (https://phabricator.wikimedia.org/T212259) [15:53:18] doing a rename now to test it [15:53:49] (03CR) 10Volans: "Is there a task to discuss it? 
I have some general concerns, namely:" [labs/private] - 10https://gerrit.wikimedia.org/r/519050 (owner: 10Jbond) [15:53:56] (03CR) 10Elukey: [C: 03+2] Set oozie as proxy for the Hadoop testing cluster [puppet] - 10https://gerrit.wikimedia.org/r/519057 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [15:54:43] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:55:15] I see https://test.wikipedia.org/wiki/File:Riot-irc-join-2.png [15:55:15] in the logs [15:55:17] DBPerformance warning [15:55:20] Krinkle: I can't reproduce the bug on testwiki, with or without X-Wikimedia-Debug [15:55:21] but presumably pre-existing [15:56:14] aye, SpecialPageTranslationMovePage is getting in the middle there with a master-query permission check on GET [15:56:16] oh well [15:56:21] Alright, going ahead then [15:56:53] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.10/includes/: T226448 / 40e725b6502cd6 (duration: 01m 20s) [15:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:02] T226448: Fatal logged after renaming files: "LocalFile.php: Call to a member function purgeEverything() on boolean" - https://phabricator.wikimedia.org/T226448 [15:57:13] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/519054 (owner: 10CRusnov) [15:57:36] (03CR) 10CRusnov: [C: 03+2] netbox: Add icinga/cron for cable report [puppet] - 10https://gerrit.wikimedia.org/r/519054 (owner: 10CRusnov) [15:59:07] PROBLEM - Nginx local proxy to apache on mw1290 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:59:12] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.10/maintenance/: T226448 / 40e725b6502cd6 (duration: 01m 15s) [15:59:17] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:33] PROBLEM - Apache HTTP on mw1290 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [15:59:47] PROBLEM - HHVM rendering on mw1290 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1309 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:00:05] godog and _joe_: Time to snap out of that daydream and deploy Puppet SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190625T1600). [16:00:05] tgr: A patch you scheduled for Puppet SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:17] o/ [16:00:51] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 1.993 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:01:03] RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 76080 bytes in 0.272 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:01:14] (PS2) CRusnov: netbox: Add icinga/cron for cable report [puppet] - https://gerrit.wikimedia.org/r/519054 [16:01:35] RECOVERY - Nginx local proxy to apache on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:01:47] Looks like we've survived another scary heart attack where a server goes into critical just normally after a deployment. 
[16:01:50] * Krinkle resumes breath [16:04:30] 10Operations, 10Operations-Software-Development, 10netbox, 10Patch-For-Review: Netbox: cable termination names report - https://phabricator.wikimedia.org/T216469 (10crusnov) 05Open→03Resolved [16:04:33] PROBLEM - Nginx local proxy to apache on mw1248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:04:53] PROBLEM - HHVM rendering on mw1248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:05:23] PROBLEM - Apache HTTP on mw1248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [16:06:14] <_joe_> tgr: gimme 5 minutes, sorry [16:06:35] RECOVERY - Apache HTTP on mw1248 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 1.085 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:07:00] no rush [16:07:05] RECOVERY - Nginx local proxy to apache on mw1248 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:07:06] the patch is trivial [16:07:25] RECOVERY - HHVM rendering on mw1248 is OK: HTTP OK: HTTP/1.1 200 OK - 76080 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:07:50] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add permissive CORS headers for wikimedia.org/.well-known/matrix [puppet] - 10https://gerrit.wikimedia.org/r/518209 (https://phabricator.wikimedia.org/T223835) (owner: 10Gergő Tisza) [16:08:25] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [16:08:41] (03PS2) 10Jbond: audit_hiera: This is a small script to audit the private repo [labs/private] - 
10https://gerrit.wikimedia.org/r/519050 (https://phabricator.wikimedia.org/T226530) [16:09:16] <_joe_> tgr: yeah but I want to do my due diligence in terms of how to deploy it [16:09:19] (03CR) 10Jbond: "> Patch Set 1:" [labs/private] - 10https://gerrit.wikimedia.org/r/519050 (https://phabricator.wikimedia.org/T226530) (owner: 10Jbond) [16:10:19] 10Operations, 10ops-ulsfo: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) a:05RobH→03akosiaris Alex is working on a new partman for these hosts, so I'm assigning this task to him as its blocked on that for the OS installation. The hosts are remotely accessible, and... [16:10:57] (03CR) 10Ottomata: [C: 03+1] Replace profile::analytics::systemd_timer with kerberos::systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/518954 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [16:14:11] PROBLEM - puppet last run on netmon1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/netbox-reports] [16:14:22] heckin [16:14:25] looking at that [16:15:20] (03CR) 10Ottomata: [C: 03+1] camus: add support for kerberos [puppet] - 10https://gerrit.wikimedia.org/r/518958 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [16:16:24] 10Operations, 10WMDE-QWERTY-Team, 10serviceops, 10wikidiff2, and 2 others: Deploy Wikidiff2 version 1.8.2 with the timeout issue fixed - https://phabricator.wikimedia.org/T223391 (10awight) Smoke testing the diffs shows that nothing seems to have been broken by the upgrade. We haven't been able to verify... [16:16:35] 10Operations, 10ops-ulsfo: rack/setup/install ganeti400[123] - https://phabricator.wikimedia.org/T226444 (10RobH) a:05akosiaris→03RobH Actually, a bit of confusion we clarified with IRC chat. These are NOT the same config of partman he is working on, but I cannot install these until we re-enable puppet fo... 
[16:16:42] (03PS2) 10Giuseppe Lavagetto: Add permissive CORS headers for wikimedia.org/.well-known/matrix [puppet] - 10https://gerrit.wikimedia.org/r/518209 (https://phabricator.wikimedia.org/T223835) (owner: 10Gergő Tisza) [16:17:52] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add permissive CORS headers for wikimedia.org/.well-known/matrix [puppet] - 10https://gerrit.wikimedia.org/r/518209 (https://phabricator.wikimedia.org/T223835) (owner: 10Gergő Tisza) [16:20:14] <_joe_> tgr: I'm running puppet on mwdebug1002 [16:20:36] <_joe_> so you can activate x-w-d and try to see if it does what you need [16:21:07] <_joe_> my sanity tests look good [16:23:31] _joe_: I don't see any access-control headers [16:23:50] <_joe_> oh, what url are you testing? [16:24:15] RECOVERY - puppet last run on netmon1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:24:15] curl -sv -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' https://wikimedia.org/.well-known/matrix/server [16:24:21] curl -sv -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' https://www.wikimedia.org/.well-known/matrix/server [16:24:33] RECOVERY - Kafka Broker Under Replicated Partitions on kafka2001 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka2001 [16:24:41] (03PS1) 10RobH: decom bast4001 [puppet] - 10https://gerrit.wikimedia.org/r/519061 (https://phabricator.wikimedia.org/T178592) [16:25:33] !log upgrade and restart db1114 (test-s1) [16:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:38] incidentally, a query to https://www.wikimedia.org/.well-known/matrix/ results in a 301 redirect to the HTTP version of the same URL [16:25:48] doesn't affect me but it's weird [16:26:00] (03CR) 10RobH: [C: 03+2] decom bast4001 [puppet] - 10https://gerrit.wikimedia.org/r/519061 
(https://phabricator.wikimedia.org/T178592) (owner: 10RobH) [16:26:18] (03PS2) 10RobH: decom bast4001 [puppet] - 10https://gerrit.wikimedia.org/r/519061 (https://phabricator.wikimedia.org/T178592) [16:31:49] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [16:33:56] 10Operations, 10ops-ulsfo, 10decommission: decommission/replace bast4001.wikimedia.org - https://phabricator.wikimedia.org/T178592 (10RobH) [16:36:50] 10Operations, 10Analytics, 10Analytics-Kanban, 10Discovery, and 2 others: Make hadoop cluster able to push to swift - https://phabricator.wikimedia.org/T219544 (10Nuria) Closing ticket as workflow is deployed and available for people (cc @EBernhardson @bmansurov) to try. There is still another piece about... [16:37:08] (03PS2) 10RobH: setting ganeti400[123] production dns [dns] - 10https://gerrit.wikimedia.org/r/518834 (https://phabricator.wikimedia.org/T226444) [16:37:10] (03PS1) 10RobH: removing bast4001 dns [dns] - 10https://gerrit.wikimedia.org/r/519062 (https://phabricator.wikimedia.org/T178592) [16:37:19] (03PS2) 10RobH: removing bast4001 dns [dns] - 10https://gerrit.wikimedia.org/r/519062 (https://phabricator.wikimedia.org/T178592) [16:37:57] (03CR) 10RobH: [C: 03+2] setting ganeti400[123] production dns [dns] - 10https://gerrit.wikimedia.org/r/518834 (https://phabricator.wikimedia.org/T226444) (owner: 10RobH) [16:38:21] (03PS3) 10RobH: removing bast4001 dns [dns] - 10https://gerrit.wikimedia.org/r/519062 (https://phabricator.wikimedia.org/T178592) [16:40:43] RECOVERY - Kafka Broker Under Replicated Partitions on kafka2002 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka2002 [16:41:08] jouncebot: now [16:41:08] For the next 0 hour(s) 
and 18 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190625T1600) [16:41:11] jouncebot: next [16:41:12] In 0 hour(s) and 18 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190625T1700) [16:42:46] (03PS2) 10Reedy: Remove $wgFlaggedRevsLowProfile already moved to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518853 [16:42:50] (03CR) 10Reedy: [C: 03+2] Remove $wgFlaggedRevsLowProfile already moved to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518853 (owner: 10Reedy) [16:43:48] (03Merged) 10jenkins-bot: Remove $wgFlaggedRevsLowProfile already moved to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518853 (owner: 10Reedy) [16:45:35] !log reedy@deploy1001 Synchronized wmf-config/flaggedrevs.php: Remove some dupe config (duration: 00m 55s) [16:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:22] (03CR) 10jenkins-bot: Remove $wgFlaggedRevsLowProfile already moved to IS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518853 (owner: 10Reedy) [16:47:20] (03PS1) 10Jcrespo: mariadb: Upgrade db1114 (testing buster host) to MariaDB 10.3 [puppet] - 10https://gerrit.wikimedia.org/r/519066 (https://phabricator.wikimedia.org/T193224) [16:49:42] (03CR) 10Jcrespo: "I think this will break deb monitor because the package is not on repo." 
[puppet] - 10https://gerrit.wikimedia.org/r/519066 (https://phabricator.wikimedia.org/T193224) (owner: 10Jcrespo) [16:51:06] (03CR) 10Jcrespo: [C: 03+2] mariadb: Upgrade db1114 (testing buster host) to MariaDB 10.3 [puppet] - 10https://gerrit.wikimedia.org/r/519066 (https://phabricator.wikimedia.org/T193224) (owner: 10Jcrespo) [16:51:24] (03PS2) 10Jcrespo: mariadb: Upgrade db1114 (testing buster host) to MariaDB 10.3 [puppet] - 10https://gerrit.wikimedia.org/r/519066 (https://phabricator.wikimedia.org/T193224) [16:54:19] (03CR) 10RobH: [C: 03+2] removing bast4001 dns [dns] - 10https://gerrit.wikimedia.org/r/519062 (https://phabricator.wikimedia.org/T178592) (owner: 10RobH) [16:56:55] 10Operations, 10DBA, 10MediaWiki-Database, 10Patch-For-Review: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (10jcrespo) [[ https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20promet... [16:58:22] (03PS1) 10Cwhite: add debian folder [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/519068 [16:58:46] 10Operations, 10ops-ulsfo, 10Traffic: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327 (10RobH) [16:58:49] 10Operations, 10ops-ulsfo, 10decommission: decommission/replace bast4001.wikimedia.org - https://phabricator.wikimedia.org/T178592 (10RobH) 05Open→03Resolved [17:00:04] cscott, arlolra, subbu, and halfak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Graphoid / Parsoid / Citoid / ORES . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190625T1700). 
[17:00:08] (03PS1) 10Andrew Bogott: root keys: remove keys for some long-dormant users [labs/private] - 10https://gerrit.wikimedia.org/r/519069 [17:00:24] 10Operations, 10ops-ulsfo, 10Traffic: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327 (10RobH) [17:00:28] 10Operations, 10ops-ulsfo, 10Traffic, 10decommission: decom cp40(09|1[078]) - https://phabricator.wikimedia.org/T178815 (10RobH) 05Stalled→03Resolved [17:00:49] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] root keys: remove keys for some long-dormant users [labs/private] - 10https://gerrit.wikimedia.org/r/519069 (owner: 10Andrew Bogott) [17:01:26] !log finished migration of kafka2003 to kafka-main2003 — enabling alert notifications for kafka-main2003, and leaving kafka2003 disabled T225005 [17:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:32] T225005: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 [17:01:51] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team Backlog (Watching / External), and 2 others: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 (10herron) [17:02:08] (03PS1) 10Jcrespo: mariadb-package: Upgrade to the latest 10.1/10.3 options [software] - 10https://gerrit.wikimedia.org/r/519071 [17:02:10] (03PS1) 10Herron: Revert "kafka-main2003 disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/519070 [17:03:02] 10Operations, 10ops-ulsfo, 10decommission: decommission backup4001 - https://phabricator.wikimedia.org/T161904 (10RobH) [17:03:03] (03PS2) 10Herron: Revert "kafka-main2003 disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/519070 [17:04:08] (03CR) 10Herron: [C: 03+2] Revert "kafka-main2003 disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/519070 (owner: 10Herron) [17:05:56] (03PS1) 10RobH: decom frbackup4001 dns entries 
[dns] - 10https://gerrit.wikimedia.org/r/519072 (https://phabricator.wikimedia.org/T161904) [17:06:36] (03CR) 10RobH: [C: 03+2] decom frbackup4001 dns entries [dns] - 10https://gerrit.wikimedia.org/r/519072 (https://phabricator.wikimedia.org/T161904) (owner: 10RobH) [17:06:40] (03PS1) 10Jcrespo: mariadb: Prepare core for buster [puppet] - 10https://gerrit.wikimedia.org/r/519073 (https://phabricator.wikimedia.org/T193224) [17:08:17] 10Operations, 10DBA, 10MediaWiki-Database, 10Patch-For-Review: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (10Marostegui) >>! In T193224#5283230, @jcrespo wrote: > [[ https://grafana.wikimedia.org/d/... [17:08:33] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@ca96238]: undo: modify agents for T226471 [17:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:38] T226471: WDQS bans its own monitoring due to bad user agent - https://phabricator.wikimedia.org/T226471 [17:10:39] (03PS2) 10Zoranzoki21: Change name of Serbian Wikinews in InitialiseSettings.php (part 2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518391 (https://phabricator.wikimedia.org/T226315) [17:10:46] 10Operations, 10ops-ulsfo, 10decommission: decommission backup4001 - https://phabricator.wikimedia.org/T161904 (10RobH) 05Open→03Resolved [17:10:52] (03PS2) 10Zoranzoki21: Change name of Serbian Wikinews (part 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/518390 (https://phabricator.wikimedia.org/T226315) [17:10:59] 10Operations, 10DBA, 10MediaWiki-Database, 10Patch-For-Review: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished - https://phabricator.wikimedia.org/T193224 (10jcrespo) > Nice! Is it replicating from the master? It is, although tendril doesn't seem... 
[17:13:56] (03PS1) 10Alexandros Kosiaris: ganeti: Setup buster and a software RAID5 recipe [puppet] - 10https://gerrit.wikimedia.org/r/519075 (https://phabricator.wikimedia.org/T224603) [17:17:25] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10akosiaris) So, the controllers on those boxes can't do hardware RAID and hence the drivers sees them as AHCI. That's fine, we already have multiple boxes with software... [17:24:47] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@ca96238]: undo: modify agents for T226471 (duration: 16m 14s) [17:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:52] T226471: WDQS bans its own monitoring due to bad user agent - https://phabricator.wikimedia.org/T226471 [17:25:06] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [17:26:28] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (10Smalyshev) [17:29:42] (03PS2) 10Arturo Borrero Gonzalez: toolforge: k8s: introduce basic k8s node puppet code [puppet] - 10https://gerrit.wikimedia.org/r/519015 (https://phabricator.wikimedia.org/T215531) [17:29:50] 10Operations, 10Release Pipeline, 10serviceops, 10Core Platform Team (RESTBase Split (CDP2)), and 5 others: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 (10mobrovac) [17:31:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: introduce basic k8s node puppet code [puppet] - 10https://gerrit.wikimedia.org/r/519015 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero 
Gonzalez) [17:34:12] 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install (3) new osd ceph nodes - https://phabricator.wikimedia.org/T224188 (10Bstorm) The above question is aimed at @ayounsi and @faidon. [17:34:29] 10Operations, 10Dumps-Generation, 10procurement: consider getting a third dumpsdata server - https://phabricator.wikimedia.org/T219768 (10RobH) [17:34:42] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [17:36:00] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) [17:36:47] 10Operations, 10CirrusSearch, 10Discovery-Search, 10Epic: Epic: Deprecation warning on elasticsearch 6 - https://phabricator.wikimedia.org/T218994 (10dcausse) [17:45:17] (03PS1) 10Catrope: Enable GrowthExperiments homepage on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519080 (https://phabricator.wikimedia.org/T218237) [17:45:19] (03PS1) 10Catrope: Enable GrowthExperiments homepage for 50% of new users on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519081 [17:48:50] (03PS1) 10Catrope: Enable mobile homepage on cswiki, kowiki, viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519082 (https://phabricator.wikimedia.org/T215983) [17:49:47] !log jiji@deploy1001 Started deploy [cpjobqueue/deploy@eb8f692]: Migrating RecentChangesUpdate to PHP7 - T219148 [17:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:53] T219148: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 [17:51:24] !log jiji@deploy1001 Finished deploy [cpjobqueue/deploy@eb8f692]: Migrating RecentChangesUpdate to 
PHP7 - T219148 (duration: 01m 37s) [17:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:28] 10Operations, 10DC-Ops, 10Traffic: poll power data for redeployment of esams/knams - https://phabricator.wikimedia.org/T225720 (10RobH) [17:57:42] (03PS1) 10Herron: kafka2003 move to role::spare::system [puppet] - 10https://gerrit.wikimedia.org/r/519084 (https://phabricator.wikimedia.org/T225005) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190625T1800) [18:02:40] 10Operations, 10DC-Ops, 10Traffic: poll power data for redeployment of esams/knams - https://phabricator.wikimedia.org/T225720 (10RobH) Ok, using the data I've summarized from the above outputs, we have the following rough power draws at peak hours: === power figures for 2019-07-20 === @robh pulled the dat... [18:08:22] (03PS1) 10Arturo Borrero Gonzalez: toolforge: k8s: worker: disable swap [puppet] - 10https://gerrit.wikimedia.org/r/519085 (https://phabricator.wikimedia.org/T215531) [18:09:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: k8s: worker: disable swap [puppet] - 10https://gerrit.wikimedia.org/r/519085 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [18:22:26] (03PS2) 10Elukey: Replace profile::analytics::systemd_timer with kerberos::systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/518954 (https://phabricator.wikimedia.org/T212259) [18:27:44] (03PS1) 10Bstorm: toolforge: remove junk from k8s::apilb [puppet] - 10https://gerrit.wikimedia.org/r/519086 (https://phabricator.wikimedia.org/T215531) [18:28:18] (03PS3) 10Elukey: Replace profile::analytics::systemd_timer with kerberos::systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/518954 (https://phabricator.wikimedia.org/T212259) [18:29:07] (03CR) 10jerkins-bot: [V: 04-1] Replace profile::analytics::systemd_timer with kerberos::systemd_timer [puppet] - 
10https://gerrit.wikimedia.org/r/518954 (https://phabricator.wikimedia.org/T212259) (owner: 10Elukey) [18:29:48] ufff [18:29:58] (03CR) 10Papaul: [C: 03+2] ganeti: Setup buster and a software RAID5 recipe [puppet] - 10https://gerrit.wikimedia.org/r/519075 (https://phabricator.wikimedia.org/T224603) (owner: 10Alexandros Kosiaris) [18:32:27] (03PS4) 10Elukey: Replace profile::analytics::systemd_timer with kerberos::systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/518954 (https://phabricator.wikimedia.org/T212259) [18:34:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toolforge: remove junk from k8s::apilb [puppet] - 10https://gerrit.wikimedia.org/r/519086 (https://phabricator.wikimedia.org/T215531) (owner: 10Bstorm) [18:34:51] (03CR) 10Muehlenhoff: "The patch mentions buster, but doesn't enable Buster as the OS? If we want to mix stretch and buster we'll need to deploy T210289 first." [puppet] - 10https://gerrit.wikimedia.org/r/519075 (https://phabricator.wikimedia.org/T224603) (owner: 10Alexandros Kosiaris) [18:36:12] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [18:37:30] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 76084 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:38:27] (03PS5) 10Elukey: Replace profile::analytics::systemd_timer with kerberos::systemd_timer [puppet] - 10https://gerrit.wikimedia.org/r/518954 (https://phabricator.wikimedia.org/T212259) [18:42:41] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Addshore) Will this change also get rolled out to 3rd parties using the updater? / Is it in a certain release? 
[18:43:00] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [18:44:26] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 76086 bytes in 9.029 second response time https://wikitech.wikimedia.org/wiki/Application_servers [18:45:04] !log cutting the branch for 1.34.0-wmf.11 T220736 [18:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:10] T220736: 1.34.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T220736 [18:55:36] (CR) Bstorm: [C: +2] toolforge: remove junk from k8s::apilb [puppet] - https://gerrit.wikimedia.org/r/519086 (https://phabricator.wikimedia.org/T215531) (owner: Bstorm) [19:00:04] longma: How many deployers does it take to do MediaWiki train - American version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190625T1900). [19:02:30] longma: Good luck! [19:02:52] thanks :D [19:04:16] James_F: got a minute? can I PM? [19:04:26] hauskatze: Sure, any time. [19:04:35] James_F: thanks [19:07:39] !log looks like we are unblocked for wmf.10, deploying that first [19:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:04] !log deploying MediaWiki 1.34.0-wmf.10 to all wikis [19:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:29] \o/ [19:12:14] Operations, Traffic, Wikidata, Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (Smalyshev) No release yet, but if you check out Updater or WDQS build, you get the same behavior. 
[19:12:16] (03PS1) 1020after4: all wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519092 [19:12:18] (03CR) 1020after4: [C: 03+2] all wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519092 (owner: 1020after4) [19:13:17] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519092 (owner: 1020after4) [19:13:33] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.10 refs T220735 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519092 (owner: 1020after4) [19:16:17] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.10 refs T220735 [19:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:22] T220735: 1.34.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T220735 [19:17:04] PROBLEM - PHP opcache health on mwdebug1001 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:17:44] RECOVERY - PHP opcache health on mwdebug2001 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [19:18:43] 10Operations, 10ops-codfw, 10Patch-For-Review: rack/setup/ codfw: ganeti2009 - ganeti201[0-8] - https://phabricator.wikimedia.org/T224603 (10Papaul) @akosiaris are we going with Buster see @MoritzMuehlenhoff comments [19:27:48] twentyafterfour: T220735 resolved, then? [19:27:48] T220735: 1.34.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T220735 [19:28:20] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[19:28:27] James_F: I believe so, it looks good to me [19:32:44] PROBLEM - Check systemd state on ms-be1028 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:37:06] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:38:37] 10Operations, 10serviceops, 10Core Platform Team Backlog (Watching / External), 10Patch-For-Review, and 2 others: Use PHP7 to run all async jobs - https://phabricator.wikimedia.org/T219148 (10jijiki) [19:39:56] (03PS2) 10Ammarpad: Enable Page Previews as default on hewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508088 (https://phabricator.wikimedia.org/T222017) [19:40:09] (03PS3) 10Ammarpad: Enable Page Previews as default on hewikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508088 (https://phabricator.wikimedia.org/T222017) [19:40:43] (03CR) 10Ammarpad: Enable Page Previews as default on hewikivoyage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508088 (https://phabricator.wikimedia.org/T222017) (owner: 10Ammarpad) [19:41:06] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.34.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519094 [19:41:08] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.34.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519094 (owner: 10Jeena Huneidi) [19:42:01] (03Merged) 10jenkins-bot: testwikis wikis to 1.34.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519094 (owner: 10Jeena Huneidi) [19:42:16] (03CR) 10jenkins-bot: testwikis wikis to 1.34.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519094 (owner: 10Jeena Huneidi) [19:43:01] !log jhuneidi@deploy1001 Started scap: testwikis wikis to 1.34.0-wmf.11 [19:43:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:16] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is 
fully operational [19:50:52] PROBLEM - MegaRAID on db1072 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:50:53] ACKNOWLEDGEMENT - MegaRAID on db1072 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T226569 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [19:50:56] 10Operations, 10ops-eqiad: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T226569 (10ops-monitoring-bot) [19:52:38] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T226569 (10Marostegui) a:03Cmjohnson Can we get this disk replaced - this is m3 master. Thanks! [20:00:33] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [20:00:34] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:54] !log rebooting torrelay1001 for kernel security update [20:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:49] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [20:09:51] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:14] !log rebooting webperf hosts for kernel security update [20:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:30] RECOVERY - Check systemd state on ms-be1028 is OK: OK - running: The system is fully operational [20:12:41] 10Operations, 10Performance-Team, 10Traffic, 10Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (10Gilles) It does, when your IP 
gets hashed to a specific Varnish frontend, you get [20:18:22] PROBLEM - High CPU load on API appserver on mw1223 is CRITICAL: CRITICAL - load average: 61.94, 26.61, 15.60 [20:18:24] PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 65.01, 30.95, 19.84 [20:19:12] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 70.02, 29.39, 17.82 [20:19:41] 10Operations, 10Performance-Team, 10Traffic, 10Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (10Gilles) Remember that x-cache headers are read from right to left. Trying this out right now with a clear cache and a clear local st... [20:20:02] PROBLEM - High CPU load on API appserver on mw1222 is CRITICAL: CRITICAL - load average: 61.97, 28.55, 16.71 [20:20:04] PROBLEM - High CPU load on API appserver on mw1276 is CRITICAL: CRITICAL - load average: 68.81, 34.58, 21.46 [20:20:38] PROBLEM - High CPU load on API appserver on mw1284 is CRITICAL: CRITICAL - load average: 60.07, 28.29, 18.62 [20:20:38] PROBLEM - High CPU load on API appserver on mw1290 is CRITICAL: CRITICAL - load average: 67.75, 32.51, 20.11 [20:20:52] PROBLEM - High CPU load on API appserver on mw1221 is CRITICAL: CRITICAL - load average: 50.21, 22.10, 13.96 [20:20:54] PROBLEM - High CPU load on API appserver on mw1281 is CRITICAL: CRITICAL - load average: 62.44, 27.59, 18.47 [20:20:54] PROBLEM - High CPU load on API appserver on mw1283 is CRITICAL: CRITICAL - load average: 67.54, 30.29, 19.78 [20:21:12] PROBLEM - High CPU load on API appserver on mw1278 is CRITICAL: CRITICAL - load average: 63.49, 28.92, 19.81 [20:21:16] RECOVERY - High CPU load on API appserver on mw1223 is OK: OK - load average: 17.13, 23.82, 16.61 [20:21:16] RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 19.49, 27.66, 20.77 [20:21:25] 10Operations, 10Performance-Team, 10Traffic: Send peering requests to AS 
with the worst TTFB - https://phabricator.wikimedia.org/T219486 (10Gilles) So I should do that for that list? Are you ok with me requesting peering from all of these AS? Is there an existing email template? [20:21:32] RECOVERY - High CPU load on API appserver on mw1276 is OK: OK - load average: 26.76, 29.63, 20.88 [20:22:06] RECOVERY - High CPU load on API appserver on mw1284 is OK: OK - load average: 37.30, 31.50, 20.83 [20:22:06] RECOVERY - High CPU load on API appserver on mw1290 is OK: OK - load average: 24.78, 27.46, 19.47 [20:22:20] RECOVERY - High CPU load on API appserver on mw1221 is OK: OK - load average: 19.32, 19.66, 13.89 [20:22:22] RECOVERY - High CPU load on API appserver on mw1283 is OK: OK - load average: 30.32, 28.19, 20.01 [20:22:22] RECOVERY - High CPU load on API appserver on mw1281 is OK: OK - load average: 26.24, 25.23, 18.52 [20:22:56] RECOVERY - High CPU load on API appserver on mw1222 is OK: OK - load average: 14.09, 22.49, 16.56 [20:23:34] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 11.00, 23.27, 18.79 [20:23:39] 10Operations, 10DC-Ops, 10Traffic: poll power data for redeployment of esams/knams - https://phabricator.wikimedia.org/T225720 (10RobH) Summarizing IRC discussion between @bblack and @robh: The R440 CP systems will pull 250W each in our estimates (pulled from live data at peak in eqiad) The R440 lvs/misc/ga... 
[20:23:42] ^^^ [20:23:55] somehow we just had a huge spike of cpu on some api servers :-\ [20:24:08] RECOVERY - High CPU load on API appserver on mw1278 is OK: OK - load average: 21.49, 29.82, 22.21 [20:24:17] longma: ^ [20:24:36] !log jhuneidi@deploy1001 Finished scap: testwikis wikis to 1.34.0-wmf.11 (duration: 41m 35s) [20:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:43] no idea whether that is related to the train [20:24:43] hmm [20:24:45] maybe it was [20:24:52] hashar I thought it was okay since it recovered [20:25:01] or it is unrelated :) [20:25:46] FWIW, I've seen that happen on new deployment, hhvm flailing at the cache for a bit and then recovering [20:25:48] but it did happen during the scap sync [20:25:58] where "new deployment" == train [20:26:09] so similar to the web request timed out after 60 seconds, maybe [20:26:25] I think it's the other side of the 60 second timeout [20:26:30] yeah [20:26:45] should that happen for group 0? [20:26:54] speaking of them, I don't see the timeout anymore in logstash [20:27:25] longma: it shouldn't, but it does, has been my experience. [20:27:40] ah ok [20:27:41] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [20:27:42] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:16] !log rebooting vega for kernel security update [20:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:30] 10Operations, 10Commons, 10Wikimedia-Site-requests, 10media-storage, 10User-Urbanecm: Server-side upload request for Hurtigruten minutt for minutt videos - https://phabricator.wikimedia.org/T223052 (10Urbanecm) [20:31:14] blarg. 
Looks like gerrit is stuck behind a sendemail thread again :\ [20:32:15] oh ok so it's not just me [20:32:30] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [20:32:34] I'm going to have to restart it. [20:32:48] Yeh, we are seeing piling connections. [20:33:00] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [20:33:11] !log restarting gerrit due to T224448 [20:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:16] T224448: Gerrit http threads stuck behind sendemail thread - https://phabricator.wikimedia.org/T224448 [20:35:44] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 868 bytes in 0.173 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [20:35:55] !log gerrit back [20:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:44] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 25892 bytes in 0.503 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [20:37:38] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts] [20:37:38] PROBLEM - puppet last run on db2095 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [20:37:40] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [20:37:44] PROBLEM - puppet last run on releases2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI Composer] [20:38:06] PROBLEM - puppet last run on gerrit2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_All-Avatars] [20:38:08] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [20:38:24] PROBLEM - puppet last run on labsdb1012 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [20:38:42] PROBLEM - puppet last run on db1124 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [20:39:06] PROBLEM - puppet last run on labsdb1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [20:39:48] I am not sure if I should be concerned about the above but I think it's not related to the train? [20:40:00] longma: that's related to the gerrit restart [20:40:18] ah whew, so those will clear up next run [20:40:46] hopefully [20:40:56] > sudo cumin -b 15 -p 95 'R:git::clone' 'run-puppet-agent -q --failed-only' [20:41:06] is evidently the magic cumin, if you've got the superpowers [20:41:08] PROBLEM - puppet last run on labsdb1011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [20:41:08] PROBLEM - puppet last run on releases1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/tools/release],Exec[git_pull_jenkins CI Composer] [20:41:32] PROBLEM - puppet last run on labsdb1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [20:42:20] just reran manually on kafka2001.codfw.wmnet and it's fine [20:42:46] PROBLEM - puppet last run on cumin1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/cookbooks] [20:42:50] although I wonder what the 'downgrade to pson' is all about [20:43:11] going to toddle off to bed now... happy deploying [20:43:33] as always [20:43:34] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:43:37] have a good night [20:44:34] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [20:44:35] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:49] !log rebooting ununpentium for kernel security update [20:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:55] (03PS1) 10CDanis: dbctl: 'instance pool' now uses past percentage, instead of 100 [software/conftool] - 10https://gerrit.wikimedia.org/r/519129 [20:51:37] (03PS3) 10CRusnov: Add new dumpbackup.py script [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/518166 (https://phabricator.wikimedia.org/T223292) [20:52:54] (03PS4) 10CRusnov: Add new dumpbackup.py script 
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/518166 (https://phabricator.wikimedia.org/T223292) [20:53:58] (03CR) 10CDanis: "This is a little different than what we discussed, which was just more heavily documenting the old behavior, but upon reflection I think i" [software/conftool] - 10https://gerrit.wikimedia.org/r/519129 (owner: 10CDanis) [20:55:20] (03CR) 10CRusnov: "I believe I have addressed the changes requested." (039 comments) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/518166 (https://phabricator.wikimedia.org/T223292) (owner: 10CRusnov) [20:57:04] (03PS1) 10Herron: kafka-main: replace kafka2002 hardware with kafka-main2002 [puppet] - 10https://gerrit.wikimedia.org/r/519130 (https://phabricator.wikimedia.org/T225005) [20:57:28] (03PS2) 10Herron: kafka-main: replace kafka2002 hardware with kafka-main2002 [puppet] - 10https://gerrit.wikimedia.org/r/519130 (https://phabricator.wikimedia.org/T225005) [20:59:48] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:00:12] RECOVERY - puppet last run on labsdb1012 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [21:00:36] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:00:55] (03PS2) 10Cwhite: branched from tags/v2.0.0 and added debian directory [debs/file-read-backwards] (debian) - 10https://gerrit.wikimedia.org/r/519068 [21:00:57] (03CR) 10Herron: "cc akosiaris for the calico bits" [puppet] - 10https://gerrit.wikimedia.org/r/519130 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [21:02:10] (03CR) 10Herron: "Would like to move forward with this tomorrow (Weds) AM (Eastern)" [puppet] - 10https://gerrit.wikimedia.org/r/519130 
(https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [21:02:37] is gerrit still f'd? [21:02:43] (03PS5) 10CRusnov: Add new dumpbackup.py script [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/518166 (https://phabricator.wikimedia.org/T223292) [21:02:48] RECOVERY - puppet last run on labsdb1011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:02:48] RECOVERY - puppet last run on releases1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [21:03:09] !log contint1001: running puppet to clear a puppet alarm (due to Gerrit restart) [21:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:15] thcipriani yup [21:03:16] RECOVERY - puppet last run on labsdb1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:03:21] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team Backlog (Watching / External), and 3 others: Replace and expand codfw kafka main hosts (kafka200[123]) with kafka-main200[12345] - https://phabricator.wikimedia.org/T225005 (10herron) [21:03:22] Uh twentyafterfour i mean [21:03:35] apparently threads building up again behind sendmail [21:04:13] (03CR) 10Herron: "PCC LGTM https://puppet-compiler.wmflabs.org/compiler1002/17107/" [puppet] - 10https://gerrit.wikimedia.org/r/519084 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [21:04:52] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [21:04:52] RECOVERY - puppet last run on db2095 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:04:52] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:04:55] (03PS1) 10Thcipriani: Revert "Gerrit v2.15.14" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/519137 [21:04:58] RECOVERY - puppet last run on 
releases2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:05:20] RECOVERY - puppet last run on gerrit2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [21:05:29] (03CR) 10Paladox: [C: 03+2] Revert "Gerrit v2.15.14" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/519137 (owner: 10Thcipriani) [21:05:58] RECOVERY - puppet last run on db1124 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [21:06:03] (03PS6) 10CRusnov: Add new dumpbackup.py script [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/518166 (https://phabricator.wikimedia.org/T223292) [21:06:08] (03CR) 10Thcipriani: [V: 03+2] Revert "Gerrit v2.15.14" [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/519137 (owner: 10Thcipriani) [21:06:18] RECOVERY - puppet last run on labsdb1009 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:07:36] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@7b379a6]: revert Gerrit to 2.15.13 on gerrit2001 [21:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:47] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@7b379a6]: revert Gerrit to 2.15.13 on gerrit2001 (duration: 00m 10s) [21:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:14] (03PS1) 10MarcoAurelio: Initial commit [debs/coredns] - 10https://gerrit.wikimedia.org/r/519138 [21:09:39] (03CR) 10MarcoAurelio: [V: 03+2 C: 03+2] Initial commit [debs/coredns] - 10https://gerrit.wikimedia.org/r/519138 (owner: 10MarcoAurelio) [21:09:58] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [21:11:28] (03PS1) 10MarcoAurelio: Edit Project Config [debs/coredns] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/519139 [21:11:38] (03PS1) 10Thcipriani: Revert "gerrit: Re-enable 
the use of HTTP auth tokens" [puppet] - 10https://gerrit.wikimedia.org/r/519140 [21:11:53] (03Abandoned) 10MarcoAurelio: Edit Project Config [debs/coredns] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/519139 (owner: 10MarcoAurelio) [21:14:22] (03CR) 10Herron: [C: 03+2] Revert "gerrit: Re-enable the use of HTTP auth tokens" [puppet] - 10https://gerrit.wikimedia.org/r/519140 (owner: 10Thcipriani) [21:19:10] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@7b379a6]: revert Gerrit to 2.15.13 on cobalt (restart incoming) [21:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:21] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@7b379a6]: revert Gerrit to 2.15.13 on cobalt (restart incoming) (duration: 00m 11s) [21:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:15] !log gerrit back on 2.15.13 [21:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:00] PROBLEM - puppet last run on schema1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [21:26:28] (03PS1) 10Jeena Huneidi: group0 wikis to 1.34.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519148 [21:26:30] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.34.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519148 (owner: 10Jeena Huneidi) [21:26:50] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [21:27:25] (03Merged) 10jenkins-bot: group0 wikis to 1.34.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519148 (owner: 10Jeena Huneidi) [21:27:41] (03CR) 10jenkins-bot: group0 wikis to 1.34.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/519148 (owner: 10Jeena Huneidi) [21:30:13] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.34.0-wmf.11 [21:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:11] (03PS7) 10CRusnov: Add new dumpbackup.py script [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/518166 (https://phabricator.wikimedia.org/T223292) [21:42:42] (03CR) 10CRusnov: "Okay command-line is fixed after some testing. Everything works as expected now." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/518166 (https://phabricator.wikimedia.org/T223292) (owner: 10CRusnov) [21:52:16] PROBLEM - Check systemd state on notebook1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [21:53:14] RECOVERY - puppet last run on schema1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:54:02] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [22:12:41] 10Operations, 10Performance-Team, 10Traffic: Send peering requests to AS with the worst TTFB - https://phabricator.wikimedia.org/T219486 (10faidon) >>! In T219486#5284083, @Gilles wrote: > So I should do that for that list? Are you ok with me requesting peering from all of these AS? > > Is there an existing... [22:19:36] longma: Ah, you didn't mention the train task T220736 in your !log messages; I guess we should add that useful tip to the train instructions. 
:-) [22:19:36] T220736: 1.34.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T220736 [22:20:29] James_F: I did for the one that I wrote myself but perhaps I need to do something else to get the automated one to add it? [22:22:48] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:23:52] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [22:24:08] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [22:24:35] 10Operations, 10Operations-Software-Development: cumin could use randomization/splay options - https://phabricator.wikimedia.org/T164587 (10crusnov) >>! In T164587#3239084, @Volans wrote: > @BBlack Thanks for opening this feature request, because right now it's totally implementation dependent and actually I r... [22:25:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [22:29:42] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [22:31:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [22:37:21] longma: If you'd done `scap sync-wikiversions "group0 to 1.34.0-wmf.11 T220736"`, it'd have ended up tagging the task as it went, so people following the task would see after the fact where we were with which bit. [22:37:22] T220736: 1.34.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T220736 [22:38:15] ahh, I ran `~/release/bin/deploy-promote group0 1.34.0-wmf.11`; maybe it also takes the phab task argument [22:39:13] Oh, interesting.. [22:46:16] hrm, IIRC deploy-promote uses some envvar that I never use...we could probably make that better :) [22:47:14] PHABTASK=blah it looks like https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/release/+/master/bin/deploy-promote#182 [22:47:21] ahh [22:48:16] this is probably documented nowhere would be my guess. [22:48:22] Of course it uses an envvar, because those are always a good idea. 
;-) [22:48:40] :D [22:48:52] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.10/extensions/Echo/modules/nojs/: T226503 Fix badge icons in Monobook (duration: 00m 57s) [22:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:58] T226503: Notification icons gone on meta wiki when using Monobook: "Error: Unknown module: ext.echo.badgeicons" - https://phabricator.wikimedia.org/T226503 [22:49:12] would you have had us use some kind of parser? for arguments? unheard of! [22:49:27] :-D [22:49:44] MW has a pretty (ridiculously) advanced CLI arguments parser… [22:50:03] But why re-use when we can write a totally new tool in python?! ;-) [22:50:03] heh, that is the word on the street [22:50:09] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.11/extensions/Echo/modules/nojs/: T226503 Fix badge icons in Monobook (duration: 00m 56s) [22:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:29] python? fancy. This is some hacky shell we wrote once upon a time. I'd be happy if this was moved to a real language at some point. [22:51:49] I meant scap, but yes. :-) [23:00:05] MaxSem, RoanKattouw, and Niharika: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190625T2300). Please do the needful. [23:00:05] Zoranzoki21: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[23:02:00] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes - https://phabricator.wikimedia.org/T226589 (10Jdforrester-WMF) [23:02:32] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes in eqsin - https://phabricator.wikimedia.org/T226477 (10Jdforrester-WMF) [23:02:36] 10Operations, 10Traffic, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in esams - https://phabricator.wikimedia.org/T222937 (10Jdforrester-WMF) [23:02:39] 10Operations, 10Traffic: Replace Varnish backends with ATS on cache upload nodes - https://phabricator.wikimedia.org/T226589 (10Jdforrester-WMF) [23:02:42] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10Jdforrester-WMF) [23:11:07] (03CR) 10Volans: "> Patch Set 1:" (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/519129 (owner: 10CDanis) [23:50:23] (03PS1) 10Volans: dbconfig: remove hostname from the instance schema [software/conftool] - 10https://gerrit.wikimedia.org/r/519155 [23:50:25] (03PS1) 10Volans: dbconfig: unify MediaWiki objects into one [software/conftool] - 10https://gerrit.wikimedia.org/r/519156 [23:53:57] (03CR) 10Volans: "This is the new list of objects:" [software/conftool] - 10https://gerrit.wikimedia.org/r/519156 (owner: 10Volans) [23:56:29] (03CR) 10Smalyshev: [C: 03+1] refactor wikidata entity dumps into wikibase + wikidata specific bits [puppet] - 10https://gerrit.wikimedia.org/r/517670 (https://phabricator.wikimedia.org/T221917) (owner: 10ArielGlenn)