[00:00:04] twentyafterfour: Time to snap out of that daydream and deploy Phabricator update. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190606T0000). [00:02:05] (03CR) 10Volans: "Just to make it clear, the other clear alternative, that is more verbose and more clean, but I think unnecessary in this representation gi" [software/conftool] - 10https://gerrit.wikimedia.org/r/514633 (owner: 10Volans) [00:03:17] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [00:10:54] (03CR) 10CDanis: dbconfig: use lists for sectionLoads sections (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/514633 (owner: 10Volans) [00:13:15] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [00:25:33] PROBLEM - puppet last run on dns2001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [00:35:25] (03CR) 10CDanis: [C: 03+2] dbconfig: fix return value [software/conftool] - 10https://gerrit.wikimedia.org/r/514543 (owner: 10Volans) [00:37:54] (03Merged) 10jenkins-bot: dbconfig: fix return value [software/conftool] - 10https://gerrit.wikimedia.org/r/514543 (owner: 10Volans) [00:52:35] RECOVERY - puppet last run on dns2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [01:02:59] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [01:21:27] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[01:54:46] (03PS1) 10Kosta Harlan: GrowthExperiments (Beta): Switch on mobile homepage traffic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514636 [01:57:16] (03PS1) 10Kosta Harlan: GrowthExperiments (testwiki): Switch on mobile homepage feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514638 [01:57:48] (03PS2) 10Kosta Harlan: GrowthExperiments (Beta): Switch on mobile homepage feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514636 [02:02:45] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [02:15:35] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [03:02:21] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [03:12:19] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [04:02:11] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [04:05:59] RECOVERY - Maps - OSM synchronization lag - codfw on icinga1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 1.476e+04 https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1 [04:12:11] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[04:12:27] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think we have quite a bit of code that does use KVObject.query across other projects (I'm thinking spicerack for one, but probably other" [software/conftool] - 10https://gerrit.wikimedia.org/r/514544 (owner: 10Volans) [04:25:35] (03CR) 10Santhosh: Redirect Google Translate any wiki source to mobile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819) (owner: 10Santhosh) [04:26:29] (03PS6) 10Santhosh: Redirect Google Translate any wiki source to mobile [puppet] - 10https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819) [05:02:23] 10Operations, 10ops-eqiad, 10DBA: db1091 crashed - https://phabricator.wikimedia.org/T225060 (10Marostegui) So, the data is consistent on main tables ` archive logging page revision text user change_tag actor ipblocks comment ` Going to start repooling this host. [05:04:13] (03PS1) 10Marostegui: db1091: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/514643 (https://phabricator.wikimedia.org/T225060) [05:04:57] (03CR) 10Marostegui: [C: 03+2] db1091: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/514643 (https://phabricator.wikimedia.org/T225060) (owner: 10Marostegui) [05:06:12] (03PS1) 10Marostegui: db-eqiad.php: Slowly repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514644 [05:07:09] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Slowly repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514644 (owner: 10Marostegui) [05:07:56] (03Merged) 10jenkins-bot: db-eqiad.php: Slowly repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514644 (owner: 10Marostegui) [05:08:13] (03CR) 10jenkins-bot: db-eqiad.php: Slowly repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514644 (owner: 10Marostegui) [05:09:17] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Slowly repool db1091 after getting its BBU 
replaced T225060 (duration: 00m 56s) [05:09:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:23] T225060: db1091 crashed - https://phabricator.wikimedia.org/T225060 [05:11:41] !log Disable notifications db2042 - T225090 [05:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:11:46] T225090: Decommission db2042 - https://phabricator.wikimedia.org/T225090 [05:14:15] !log Stop MySQL on db2042 to copy its content to dbprov2001 as a temporary backup - T225090 [05:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:42] !log Remove db2042 from tendril and zarcillo [05:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:56] !log Remove db2042 from tendril and zarcillo T225090 [05:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:19:01] T225090: Decommission db2042 - https://phabricator.wikimedia.org/T225090 [05:25:41] (03PS1) 10Marostegui: mariadb: Prepare db2042 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/514646 (https://phabricator.wikimedia.org/T225090) [05:28:06] (03PS1) 10Marostegui: db-eqiad.php: More weight to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514647 [05:28:50] (03CR) 10Marostegui: [C: 03+2] mariadb: Prepare db2042 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/514646 (https://phabricator.wikimedia.org/T225090) (owner: 10Marostegui) [05:29:07] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More weight to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514647 (owner: 10Marostegui) [05:30:36] (03Merged) 10jenkins-bot: db-eqiad.php: More weight to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514647 (owner: 10Marostegui) [05:30:38] (03CR) 10jenkins-bot: db-eqiad.php: More weight to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514647 (owner: 10Marostegui) [05:32:21] !log marostegui@deploy1001 
Synchronized wmf-config/db-eqiad.php: More traffic to db1091 after getting its BBU replaced (duration: 00m 55s) [05:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:33:40] 10Operations, 10ops-codfw, 10decommission: Decommission db2042 - https://phabricator.wikimedia.org/T225090 (10Marostegui) a:05Marostegui→03RobH db2042 is ready for DCOPs to take over. [05:41:46] !log Upgrade MySQL on s6 codfw hosts in preparation for s6 codfw master failover - T221533 [05:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:41:52] T221533: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533 [05:44:14] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514649 [05:45:07] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514649 (owner: 10Marostegui) [05:45:56] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514649 (owner: 10Marostegui) [05:46:24] (03CR) 10jenkins-bot: db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514649 (owner: 10Marostegui) [05:47:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1091 after getting its BBU replaced (duration: 00m 55s) [05:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:21] (03PS1) 10Marostegui: db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514650 [05:58:56] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514650 (owner: 10Marostegui) [05:59:43] (03Merged) 10jenkins-bot: db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514650 (owner: 10Marostegui) [05:59:58] (03CR) 
10jenkins-bot: db-eqiad.php: More traffic to db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514650 (owner: 10Marostegui) [06:01:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic to db1091 after getting its BBU replaced (duration: 01m 01s) [06:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:34] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514651 (https://phabricator.wikimedia.org/T225060) [06:12:17] (03PS23) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [06:13:03] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Fully repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514651 (https://phabricator.wikimedia.org/T225060) (owner: 10Marostegui) [06:13:52] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514651 (https://phabricator.wikimedia.org/T225060) (owner: 10Marostegui) [06:14:06] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1091 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514651 (https://phabricator.wikimedia.org/T225060) (owner: 10Marostegui) [06:14:40] (03CR) 10Mathew.onipe: Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [06:15:46] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully repool db1091 after getting its BBU replaced (duration: 00m 54s) [06:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:32] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1091 crashed - https://phabricator.wikimedia.org/T225060 (10Marostegui) db1091 is fully repooled. 
I will remove db1135 from s4 after the SRE summit [06:20:32] (03PS1) 10Marostegui: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514652 (https://phabricator.wikimedia.org/T224852) [06:21:48] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514652 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui) [06:22:36] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514652 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui) [06:22:50] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1121 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514652 (https://phabricator.wikimedia.org/T224852) (owner: 10Marostegui) [06:23:50] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1121 for upgrade T224852 (duration: 00m 55s) [06:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:55] T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 [06:26:03] (03PS1) 10Jcrespo: Revert "mariadb: Depool labsdb1011 for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/514654 [06:26:46] (03CR) 10Jcrespo: [C: 03+2] Revert "mariadb: Depool labsdb1011 for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/514654 (owner: 10Jcrespo) [06:27:00] (03PS2) 10Jcrespo: Revert "mariadb: Depool labsdb1011 for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/514654 [06:28:25] PROBLEM - puppet last run on kubestagetcd1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_long_procs] [06:31:14] !log Start topology changes on s6 codfw to promote db2046 as master - T221533 [06:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:21] T221533: Decommission old coredb machines (<=db2042) - https://phabricator.wikimedia.org/T221533 [06:32:47] PROBLEM - puppet last run on authdns2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/puppet-enabled] [06:42:26] (03CR) 10Jcrespo: "Order of elements doesn't need to be preserved, those are listed in alphabetical order when appropriate (except the master, of course- whi" [software/conftool] - 10https://gerrit.wikimedia.org/r/514633 (owner: 10Volans) [06:46:23] (03PS1) 10Elukey: profile::mediawiki::mcrouter_wancache: set timeouts_until_tko to 10 [puppet] - 10https://gerrit.wikimedia.org/r/514656 (https://phabricator.wikimedia.org/T203786) [06:49:01] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/16879/" [puppet] - 10https://gerrit.wikimedia.org/r/514656 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [06:54:35] PROBLEM - puppet last run on wtp1027 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [06:55:09] RECOVERY - puppet last run on kubestagetcd1003 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [06:56:23] PROBLEM - Check systemd state on ms-be2037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[06:59:39] RECOVERY - puppet last run on authdns2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:02:41] PROBLEM - DPKG on kerberos1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [07:05:29] RECOVERY - DPKG on kerberos1001 is OK: All packages OK [07:06:53] PROBLEM - puppet last run on kerberos1001 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Package[exim4-config],Package[exim4-daemon-light],Exec[set debconf flag seen for wireshark-common/install-setuid] [07:07:08] (03PS1) 10Marostegui: mariadb: Promote db2046 to s6 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/514657 (https://phabricator.wikimedia.org/T221533) [07:11:36] (03PS1) 10Marostegui: db-codfw.php: Promote db2046 to s6 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514658 (https://phabricator.wikimedia.org/T221533) [07:12:17] RECOVERY - puppet last run on kerberos1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:13:09] (03CR) 10Marostegui: [C: 03+2] mariadb: Promote db2046 to s6 codfw master [puppet] - 10https://gerrit.wikimedia.org/r/514657 (https://phabricator.wikimedia.org/T221533) (owner: 10Marostegui) [07:13:42] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Promote db2046 to s6 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514658 (https://phabricator.wikimedia.org/T221533) (owner: 10Marostegui) [07:14:31] (03Merged) 10jenkins-bot: db-codfw.php: Promote db2046 to s6 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514658 (https://phabricator.wikimedia.org/T221533) (owner: 10Marostegui) [07:16:02] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Promote db2046 to s6 master as db2039 will be decommissioned T221533 (duration: 00m 55s) [07:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:08] T221533: Decommission old coredb 
machines (<=db2042) - https://phabricator.wikimedia.org/T221533 [07:16:27] (03CR) 10jenkins-bot: db-codfw.php: Promote db2046 to s6 codfw master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514658 (https://phabricator.wikimedia.org/T221533) (owner: 10Marostegui) [07:18:30] (03CR) 10Filippo Giunchedi: [C: 03+1] swift: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/514457 (owner: 10Muehlenhoff) [07:20:36] !log Stop MySQL on db1121 for upgrade, this will generate lag on labs hosts for s6 - T224852 [07:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:41] T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 [07:21:31] RECOVERY - puppet last run on wtp1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:22:16] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, one comment inline." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514447 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [07:25:34] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514659 [07:27:21] 10Operations, 10DBA: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 (10Marostegui) All hosts in codfw are now running 10.1.39 so we are ready for the failover from that front. 
[07:30:18] (03PS1) 10Giuseppe Lavagetto: conftool::scripts: add a safe-service-restart script [puppet] - 10https://gerrit.wikimedia.org/r/514660 (https://phabricator.wikimedia.org/T224857) [07:30:21] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: use the safe script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/514661 [07:31:11] (03CR) 10jerkins-bot: [V: 04-1] conftool::scripts: add a safe-service-restart script [puppet] - 10https://gerrit.wikimedia.org/r/514660 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [07:33:17] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514659 (owner: 10Marostegui) [07:34:11] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514659 (owner: 10Marostegui) [07:34:25] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1121" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514659 (owner: 10Marostegui) [07:35:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1121 after upgrade T224852 (duration: 00m 53s) [07:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:25] T224852: Failover s4 primary master: db1068 to db1081 - https://phabricator.wikimedia.org/T224852 [07:35:38] (03CR) 10Elukey: profile::kerberos::kadminserver: allow puppetmaster to rsync keytabs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514447 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [07:47:23] RECOVERY - Check systemd state on ms-be2037 is OK: OK - running: The system is fully operational [07:48:21] (03CR) 10Giuseppe Lavagetto: [C: 03+1] profile::mediawiki::mcrouter_wancache: set timeouts_until_tko to 10 [puppet] - 10https://gerrit.wikimedia.org/r/514656 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [07:48:56] (03CR) 10Muehlenhoff: [C: 03+1] profile::kerberos::kadminserver: 
allow puppetmaster to rsync keytabs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514447 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [07:49:46] (03PS4) 10Elukey: profile::kerberos::kadminserver: allow puppetmaster to rsync keytabs [puppet] - 10https://gerrit.wikimedia.org/r/514447 (https://phabricator.wikimedia.org/T212257) [07:55:25] (03PS1) 10Ema: Modify access rules [debs/varnish4] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/514662 [07:55:38] (03CR) 10Elukey: [C: 03+2] profile::kerberos::kadminserver: allow puppetmaster to rsync keytabs [puppet] - 10https://gerrit.wikimedia.org/r/514447 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [07:55:56] (03PS4) 10Elukey: profile::kerberos::kadminserver: add generate_keytabs.py [puppet] - 10https://gerrit.wikimedia.org/r/514471 (https://phabricator.wikimedia.org/T212257) [07:56:18] (03PS2) 10Giuseppe Lavagetto: mediawiki::php: use the safe script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/514661 [07:56:58] (03CR) 10Elukey: [C: 03+2] profile::kerberos::kadminserver: add generate_keytabs.py [puppet] - 10https://gerrit.wikimedia.org/r/514471 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [08:08:25] (03CR) 10Volans: "> Patch Set 1:" [software/conftool] - 10https://gerrit.wikimedia.org/r/514633 (owner: 10Volans) [08:08:58] (03PS1) 10Marostegui: mariadb: Provision db1132 into m2 [puppet] - 10https://gerrit.wikimedia.org/r/514663 (https://phabricator.wikimedia.org/T222682) [08:09:54] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Provision db1132 into m2 [puppet] - 10https://gerrit.wikimedia.org/r/514663 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [08:10:39] PROBLEM - puppet last run on kerberos1001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [08:10:54] buuuu this is me --^ [08:13:12] elukey: what's the issue? 
[08:13:18] (03CR) 10Volans: "replies to cdanis inline" (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/514633 (owner: 10Volans) [08:14:13] ah yes, the rsync definition [08:14:17] moritzm: it complains about the format of the rsync module [08:15:38] (03CR) 10Volans: "> Patch Set 2: Code-Review-1" [software/conftool] - 10https://gerrit.wikimedia.org/r/514544 (owner: 10Volans) [08:16:53] (03CR) 10Marostegui: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1001/16883/ the -1 is a known issue that will be tackled once the refactoring i" [puppet] - 10https://gerrit.wikimedia.org/r/514663 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [08:18:34] (03PS1) 10Hashar: Inherit from operations/debs instead of All-Projects [debs/varnish4] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/514664 [08:19:25] (03Abandoned) 10Ema: Modify access rules [debs/varnish4] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/514662 (owner: 10Ema) [08:19:55] (03CR) 10Hashar: [V: 03+2 C: 03+2] "That saves ema!" 
[debs/varnish4] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/514664 (owner: 10Hashar) [08:21:57] hashar: <3 [08:22:10] (03CR) 10Gehel: [C: 04-1] Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [08:22:19] (03CR) 10Ema: [C: 03+2] Add lintian override: postinst-must-call-ldconfig [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514514 (owner: 10Ema) [08:22:31] (03CR) 10Ema: [C: 03+2] Add 0019-vary-stevedore-mem-leak.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/513976 (owner: 10Ema) [08:22:47] \o/ [08:23:23] (03CR) 10Ema: [C: 03+2] Add 0020-assert-error-http1_minimal_response.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/513977 (https://phabricator.wikimedia.org/T224694) (owner: 10Ema) [08:23:41] (03CR) 10Ema: [C: 03+2] Add 0021-dont-test-gunzip-partial.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514043 (owner: 10Ema) [08:23:57] (03CR) 10Ema: [C: 03+2] Add 0022-deref-objcore-synth-err.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514315 (owner: 10Ema) [08:24:12] (03CR) 10Ema: [C: 03+2] Add 0023-pass-delivery-is-no-err.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514318 (owner: 10Ema) [08:24:29] PROBLEM - MariaDB Slave Lag: m1 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 763.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:24:38] (03CR) 10Ema: [C: 03+2] Drop 0001-gethdr_extrachance.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514441 (owner: 10Ema) [08:25:06] (03CR) 10Ema: [C: 03+2] Add 0025-extrachance-one-retry.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514443 (owner: 10Ema) [08:25:20] (03CR) 10Ema: [C: 03+2] Add 0024-vbt-get-force-fresh.patch [debs/varnish4] (debian-wmf) - 
10https://gerrit.wikimedia.org/r/514442 (owner: 10Ema) [08:25:34] (03CR) 10Ema: [C: 03+2] Add 0026-transient-full-cache_req_body-panic.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514496 (owner: 10Ema) [08:25:47] (03CR) 10Ema: [C: 03+2] Add 0027-assert-error-vca_make_session.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514526 (owner: 10Ema) [08:25:53] RECOVERY - MariaDB Slave Lag: m1 on db2078 is OK: OK slave_sql_lag Replication lag: 46.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [08:25:59] (03CR) 10Ema: "recheck" [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514529 (owner: 10Ema) [08:30:21] (03PS2) 10Muehlenhoff: ntp: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/514429 [08:31:00] (03PS1) 10Elukey: profile::kerberos::kadminserver: fix rsync module's ensure arg [puppet] - 10https://gerrit.wikimedia.org/r/514668 [08:31:08] (03PS2) 10Marostegui: mariadb: Provision db1132 into m2 [puppet] - 10https://gerrit.wikimedia.org/r/514663 (https://phabricator.wikimedia.org/T222682) [08:31:10] moritzm: --^ Bad Luca is Bad [08:31:26] (03CR) 10Muehlenhoff: [C: 03+2] ntp: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/514429 (owner: 10Muehlenhoff) [08:32:04] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Provision db1132 into m2 [puppet] - 10https://gerrit.wikimedia.org/r/514663 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [08:32:18] (03CR) 10Muehlenhoff: [C: 03+1] profile::kerberos::kadminserver: fix rsync module's ensure arg [puppet] - 10https://gerrit.wikimedia.org/r/514668 (owner: 10Elukey) [08:32:25] (03PS3) 10Marostegui: mariadb: Provision db1132 into m2 [puppet] - 10https://gerrit.wikimedia.org/r/514663 (https://phabricator.wikimedia.org/T222682) [08:32:27] ah, I had also totally missed that [08:33:01] the puppet error message was really clear and pointing in the right direction [08:33:04] -.- 
[08:33:13] (03PS2) 10Elukey: profile::kerberos::kadminserver: fix rsync module's ensure arg [puppet] - 10https://gerrit.wikimedia.org/r/514668 [08:33:21] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Provision db1132 into m2 [puppet] - 10https://gerrit.wikimedia.org/r/514663 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [08:33:30] (03CR) 10Marostegui: [V: 03+2 C: 03+2] mariadb: Provision db1132 into m2 [puppet] - 10https://gerrit.wikimedia.org/r/514663 (https://phabricator.wikimedia.org/T222682) (owner: 10Marostegui) [08:34:51] (03CR) 10Ema: [C: 03+2] Add 0028-panic-return-cond-fetch.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514529 (owner: 10Ema) [08:34:59] (03CR) 10Ema: [C: 03+2] Add 0029-ban-lurker-bo-backoff.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514533 (owner: 10Ema) [08:35:05] (03CR) 10Ema: [C: 03+2] Add 0030-startup-show-version.patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514538 (owner: 10Ema) [08:35:25] (03PS2) 10Muehlenhoff: sslcert: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/514432 [08:35:58] (03PS3) 10Volans: icinga: manage metamonitor known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/514444 (https://phabricator.wikimedia.org/T222074) [08:36:55] (03CR) 10Elukey: [C: 03+2] profile::kerberos::kadminserver: fix rsync module's ensure arg [puppet] - 10https://gerrit.wikimedia.org/r/514668 (owner: 10Elukey) [08:37:02] (03PS3) 10Elukey: profile::kerberos::kadminserver: fix rsync module's ensure arg [puppet] - 10https://gerrit.wikimedia.org/r/514668 [08:37:05] (03CR) 10Elukey: [V: 03+2 C: 03+2] profile::kerberos::kadminserver: fix rsync module's ensure arg [puppet] - 10https://gerrit.wikimedia.org/r/514668 (owner: 10Elukey) [08:37:18] (03PS4) 10Muehlenhoff: icinga: manage metamonitor known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/514444 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [08:38:32] !log Stop MySQL 
on db1117:3322 - this will trigger haproxy alerts - T222682 [08:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:37] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682 [08:39:16] (03CR) 10Muehlenhoff: [C: 03+2] sslcert: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/514432 (owner: 10Muehlenhoff) [08:39:24] (03PS3) 10Muehlenhoff: sslcert: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/514432 [08:41:57] PROBLEM - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [08:42:21] ^ expected [08:42:49] ACKNOWLEDGEMENT - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy [08:42:49] ACKNOWLEDGEMENT - haproxy failover on dbproxy1007 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy [08:43:19] RECOVERY - puppet last run on kerberos1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:50:35] (03PS2) 10Giuseppe Lavagetto: conftool::scripts: add a safe-service-restart script [puppet] - 10https://gerrit.wikimedia.org/r/514660 (https://phabricator.wikimedia.org/T224857) [08:50:37] (03PS3) 10Giuseppe Lavagetto: mediawiki::php: use the safe script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/514661 [08:51:23] (03PS2) 10Muehlenhoff: cpufrequtils: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/514434 [08:51:30] (03CR) 10jerkins-bot: [V: 04-1] conftool::scripts: add a safe-service-restart script [puppet] - 10https://gerrit.wikimedia.org/r/514660 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [08:52:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/514444 
(https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [08:52:49] (03CR) 10Muehlenhoff: [C: 03+2] cpufrequtils: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/514434 (owner: 10Muehlenhoff) [08:53:12] (03PS1) 10Ema: varnish (5.1.3-1wm10) stretch-wikimedia; urgency=medium [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514670 (https://phabricator.wikimedia.org/T224694) [08:55:37] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [08:55:46] ^ expected [08:56:20] ACKNOWLEDGEMENT - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1 Marostegui expected https://wikitech.wikimedia.org/wiki/HAProxy [08:57:21] (03CR) 10jerkins-bot: [V: 04-1] varnish (5.1.3-1wm10) stretch-wikimedia; urgency=medium [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514670 (https://phabricator.wikimedia.org/T224694) (owner: 10Ema) [08:58:27] (03PS2) 10Muehlenhoff: swift: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/514457 [09:00:26] (03PS1) 10Petar.petkovic: Remove Content Translation event logging config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514672 [09:02:47] (03PS24) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [09:05:02] (03CR) 10Muehlenhoff: [C: 03+2] swift: Remove support for Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/514457 (owner: 10Muehlenhoff) [09:08:57] (03PS4) 10Giuseppe Lavagetto: mediawiki::php: use the safe script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/514661 [09:11:01] (03PS3) 10Muehlenhoff: Remove support for Ubuntu/trusty in monitoring/metrics base classes [puppet] - 10https://gerrit.wikimedia.org/r/498130 [09:13:29] (03CR) 10Muehlenhoff: [C: 03+2] Remove support for Ubuntu/trusty in monitoring/metrics base classes 
[puppet] - 10https://gerrit.wikimedia.org/r/498130 (owner: 10Muehlenhoff) [09:14:29] (03PS11) 10Mathew.onipe: wdqs: add WDQS restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) [09:14:31] (03PS4) 10Mathew.onipe: add WDQS reboot cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/512915 (https://phabricator.wikimedia.org/T224385) [09:15:05] (03CR) 10Mathew.onipe: Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [09:16:05] (03CR) 10Mathew.onipe: wdqs: add WDQS restart cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [09:16:49] (03PS1) 10Effie Mouzeli: WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) [09:17:41] (03CR) 10jerkins-bot: [V: 04-1] WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) (owner: 10Effie Mouzeli) [09:19:01] (03CR) 10Gehel: [C: 03+2] Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [09:19:29] (03Abandoned) 10Muehlenhoff: hhvm: Remove support for upstart [puppet] - 10https://gerrit.wikimedia.org/r/486443 (owner: 10Muehlenhoff) [09:19:49] (03Abandoned) 10Muehlenhoff: base/ntp: Remove trusty/Ubuntu support [puppet] - 10https://gerrit.wikimedia.org/r/500403 (owner: 10Muehlenhoff) [09:20:24] (03PS5) 10Muehlenhoff: Remove support for Ubuntu in apt/debmonitor base classes [puppet] - 10https://gerrit.wikimedia.org/r/498134 [09:20:58] (03PS12) 10Gehel: wdqs: add WDQS restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 
10Mathew.onipe) [09:22:07] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10hashar) [09:22:10] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001: DISK WARNING - free space: /srv 88397 MB (10% inode=94%): - https://phabricator.wikimedia.org/T219850 (10hashar) 05Open→03Resolved I cleaned up some images yesterday: 2019-06-05 19:57 cont... [09:24:51] !log gehel@cumin2001 START - Cookbook sre.postgresql.postgres-init [09:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:39] !log updating qemu on ganeti2004 for some tests [09:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:51] 10Operations, 10Traffic, 10Performance-Team (Radar): Add profiling for Varnish and VCL - https://phabricator.wikimedia.org/T175710 (10Krinkle) [09:29:04] (03PS5) 10Giuseppe Lavagetto: mediawiki::php: use the safe script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/514661 [09:30:14] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::php: use the safe script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/514661 (owner: 10Giuseppe Lavagetto) [09:30:26] (03PS3) 10Volans: selectors: do not pre-compile the regex [software/conftool] - 10https://gerrit.wikimedia.org/r/514544 [09:30:28] (03PS2) 10Volans: dbconfig: use lists for sectionLoads sections [software/conftool] - 10https://gerrit.wikimedia.org/r/514633 [09:30:32] (03PS2) 10Ema: varnish (5.1.3-1wm10) stretch-wikimedia; urgency=medium [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514670 (https://phabricator.wikimedia.org/T224694) [09:30:57] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [09:30:58] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:31:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:22] (03PS4) 10Volans: types: do not pre-compile regex in SchemaRule [software/conftool] - 10https://gerrit.wikimedia.org/r/514544 [09:31:24] (03PS3) 10Volans: dbconfig: use lists for sectionLoads sections [software/conftool] - 10https://gerrit.wikimedia.org/r/514633 [09:31:59] !log rebooting mwdebug2002 for some tests [09:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:15] (03PS3) 10Ema: varnish (5.1.3-1wm10) stretch-wikimedia; urgency=medium [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514670 (https://phabricator.wikimedia.org/T224694) [09:35:48] (03CR) 10Mvolz: Enable reftabs on testwikidata (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) (owner: 10Mvolz) [09:36:58] (03PS3) 10Giuseppe Lavagetto: conftool::scripts: add a safe-service-restart script [puppet] - 10https://gerrit.wikimedia.org/r/514660 (https://phabricator.wikimedia.org/T224857) [09:37:00] (03PS6) 10Giuseppe Lavagetto: mediawiki::php: use the safe script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/514661 [09:37:40] (03CR) 10Mobrovac: [C: 04-1] "For new services that go onto k8s, we don't create roles/profiles any more as we don't run puppet inside k8s nodes." [puppet] - 10https://gerrit.wikimedia.org/r/514490 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [09:38:00] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Kanban): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10hashar) The new disks can be shown as sdc and sdd. Currently I think we have 3 RAID 1 arrays, with LVM on t... 
[09:38:16] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::php: use the safe script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/514661 (owner: 10Giuseppe Lavagetto) [09:39:04] (03CR) 10Mobrovac: [C: 04-1] "You need to create a deployment chart in https://gerrit.wikimedia.org/r/#/admin/projects/operations/deployment-charts" [puppet] - 10https://gerrit.wikimedia.org/r/514490 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [09:39:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool::scripts: add a safe-service-restart script [puppet] - 10https://gerrit.wikimedia.org/r/514660 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [09:41:49] (03PS7) 10Giuseppe Lavagetto: mediawiki::php: use the safe script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/514661 [09:42:23] (03PS3) 10Lucas Werkmeister (WMDE): Stop using wmg variables for Score extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502269 [09:42:54] (03CR) 10jerkins-bot: [V: 04-1] mediawiki::php: use the safe script to restart php-fpm [puppet] - 10https://gerrit.wikimedia.org/r/514661 (owner: 10Giuseppe Lavagetto) [09:44:05] (03PS6) 10Lucas Werkmeister (WMDE): Specify $wgWBRepoSettings['conceptBaseUri'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477521 [09:45:13] (03PS2) 10Lucas Werkmeister (WMDE): Fix wgImportSources setting for wikidata dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502479 [09:45:15] (03PS2) 10Lucas Werkmeister (WMDE): Fix WBRepoCanonicalUriProperty setting for testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502480 [09:47:07] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] "this is a bug of the linter." [puppet] - 10https://gerrit.wikimedia.org/r/514661 (owner: 10Giuseppe Lavagetto) [09:47:25] (03CR) 10Alexandros Kosiaris: [C: 04-2] "Yup, we no longer use puppet for new services. 
This is fully unneeded" [puppet] - 10https://gerrit.wikimedia.org/r/514490 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [09:50:09] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:50:21] RECOVERY - haproxy failover on dbproxy1007 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [09:50:51] (03PS2) 10Mvolz: Enable reftabs on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) [09:51:11] (03PS5) 10Volans: icinga: manage metamonitor known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/514444 (https://phabricator.wikimedia.org/T222074) [09:51:41] (03CR) 10jerkins-bot: [V: 04-1] Enable reftabs on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) (owner: 10Mvolz) [09:52:12] (03PS3) 10Mvolz: Enable reftabs on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) [09:52:23] (03CR) 10Volans: [C: 03+2] icinga: manage metamonitor known_hosts [puppet] - 10https://gerrit.wikimedia.org/r/514444 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [09:53:52] jouncebot: next [09:53:52] In 1 hour(s) and 6 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190606T1100) [09:55:35] (03CR) 10Lucas Werkmeister (WMDE): "better safe than sorry IMO :) this looks okay to me, but Addshore knows more about the Wikibase config" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) (owner: 10Mvolz) [09:57:37] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: fix restart script [puppet] - 10https://gerrit.wikimedia.org/r/514676 [09:57:56] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mediawiki::php: fix restart script [puppet] - 
10https://gerrit.wikimedia.org/r/514676 (owner: 10Giuseppe Lavagetto) [09:59:16] !log mobrovac@deploy1001 scap-helm mathoid upgrade production stable/mathoid -f mathoid-values.yaml [namespace: mathoid, clusters: eqiad,codfw] [09:59:17] !log mobrovac@deploy1001 scap-helm mathoid cluster eqiad completed [09:59:18] !log mobrovac@deploy1001 scap-helm mathoid cluster codfw completed [09:59:18] !log mobrovac@deploy1001 scap-helm mathoid finished [09:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:25] (03PS4) 10Mvolz: Enable reftabs on testwikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514461 (https://phabricator.wikimedia.org/T199197) [10:03:49] PROBLEM - Mathoid LVS eqiad on mathoid.svc.eqiad.wmnet is CRITICAL: //{format}/ (mass-energy equivalence (complete)) is CRITICAL: Test mass-energy equivalence (complete) returned the unexpected status 404 (expecting: 200): //{format}/ (mass-energy equivalence (svg)) is CRITICAL: Test mass-energy equivalence (svg) returned the unexpected status 404 (expecting: 200): //{format}/ (mass-energy equivalence (mml)) is CRITICAL: Test mas [10:03:49] nce (mml) returned the unexpected status 404 (expecting: 200): //{format}/ (mass-energy equivalence (texvcinfo)) is CRITICAL: Test mass-energy equivalence (texvcinfo) returned the unexpected status 404 (expecting: 200): //{format}/ (Invalid command (texvcinfo)) is CRITICAL: Test Invalid command (texvcinfo) returned the unexpected status 404 (expecting: 400): //_info (retrieve service info) is CRITICAL: Test retrieve service info [10:03:49] pected status 404 (expecting: 200): // (spec from root) is CRITICAL: Test spec from root returned the unexpected status 404 (expecting: 200): // 
(mass-energy equivalence (json)) is CRITICAL: Test mass-energy equivalence (json) returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mathoid [10:05:01] PROBLEM - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is CRITICAL: //{format}/ (mass-energy equivalence (complete)) is CRITICAL: Test mass-energy equivalence (complete) returned the unexpected status 404 (expecting: 200): //{format}/ (mass-energy equivalence (svg)) is CRITICAL: Test mass-energy equivalence (svg) returned the unexpected status 404 (expecting: 200): //{format}/ (mass-energy equivalence (mml)) is CRITICAL: Test mas [10:05:01] nce (mml) returned the unexpected status 404 (expecting: 200): //{format}/ (mass-energy equivalence (texvcinfo)) is CRITICAL: Test mass-energy equivalence (texvcinfo) returned the unexpected status 404 (expecting: 200): //{format}/ (Invalid command (texvcinfo)) is CRITICAL: Test Invalid command (texvcinfo) returned the unexpected status 404 (expecting: 400): //_info (retrieve service info) is CRITICAL: Test retrieve service info [10:05:01] pected status 404 (expecting: 200): // (spec from root) is CRITICAL: Test spec from root returned the unexpected status 404 (expecting: 200): // (mass-energy equivalence (json)) is CRITICAL: Test mass-energy equivalence (json) returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mathoid [10:06:37] RECOVERY - Mathoid LVS eqiad on mathoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mathoid [10:07:47] jouncebot: next [10:07:48] In 0 hour(s) and 52 minute(s): European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190606T1100) [10:07:51] RECOVERY - Mathoid LVS codfw on mathoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mathoid [10:10:01] !log rollbacked last deployment of mathoid to revision 16 [10:10:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:19] !log disable puppet on mw1* and mw[2163,2235,2255,2271] as prep step for mcrouter config deploy [10:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:46] (03PS2) 10Elukey: profile::mediawiki::mcrouter_wancache: set timeouts_until_tko to 10 [puppet] - 10https://gerrit.wikimedia.org/r/514656 (https://phabricator.wikimedia.org/T203786) [10:13:48] (03PS1) 10Urbanecm: Add new namespaces for several Thai projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514678 (https://phabricator.wikimedia.org/T216322) [10:14:32] (03CR) 10Elukey: [C: 03+2] profile::mediawiki::mcrouter_wancache: set timeouts_until_tko to 10 [puppet] - 10https://gerrit.wikimedia.org/r/514656 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [10:17:16] (03PS1) 10Volans: icinga: fix permission of contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/514679 (https://phabricator.wikimedia.org/T222074) [10:19:51] !log rolling restart of mcrouter on mw1* hosts to pick up config change (batch of 5 hosts, depool/run-puppet/pool) [10:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:09] I left aside the mcrouter's codfw proxies (mw[2163,2235,2255,2271]) because they are a bit delicate [10:22:19] will do them as soon as mw1* is completed [10:22:57] (03PS1) 10Mobrovac: Remove trailing slash in base path [software/service-checker] - 10https://gerrit.wikimedia.org/r/514681 [10:23:51] (03CR) 10jerkins-bot: [V: 04-1] Remove trailing slash in base path [software/service-checker] - 10https://gerrit.wikimedia.org/r/514681 (owner: 10Mobrovac) [10:28:13] (03PS2) 10Mobrovac: Remove trailing slash in base path [software/service-checker] - 10https://gerrit.wikimedia.org/r/514681 [10:28:18] (03PS2) 10Effie Mouzeli: WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) 
[10:28:36] (03CR) 10jerkins-bot: [V: 04-1] WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) (owner: 10Effie Mouzeli) [10:29:20] (03PS3) 10Effie Mouzeli: WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) [10:29:37] (03CR) 10jerkins-bot: [V: 04-1] WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) (owner: 10Effie Mouzeli) [10:29:48] (03CR) 10Ema: [C: 03+2] varnish (5.1.3-1wm10) stretch-wikimedia; urgency=medium [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/514670 (https://phabricator.wikimedia.org/T224694) (owner: 10Ema) [10:30:01] !log varnish 5.1.3-1wm10 uploaded to stretch-wikimedia T224694 [10:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:06] T224694: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694 [10:33:21] (03CR) 10Ema: [C: 03+1] "Outstanding work!" 
[puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [10:34:41] (03CR) 10Vgutierrez: [C: 03+2] ATS: Provide a TLS terminator profile [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [10:34:51] (03PS67) 10Vgutierrez: ATS: Provide a TLS terminator profile [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [10:36:26] (03PS4) 10Effie Mouzeli: WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) [10:36:59] (03CR) 10jerkins-bot: [V: 04-1] WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) (owner: 10Effie Mouzeli) [10:38:17] (03PS7) 10Ema: Redirect Google Translate any wiki source to mobile [puppet] - 10https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819) (owner: 10Santhosh) [10:38:32] (03PS5) 10Effie Mouzeli: WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) [10:39:29] (03CR) 10jerkins-bot: [V: 04-1] WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) (owner: 10Effie Mouzeli) [10:40:25] (03CR) 10Ema: [C: 03+2] Redirect Google Translate any wiki source to mobile [puppet] - 10https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819) (owner: 10Santhosh) [10:43:27] !log mobrovac@deploy1001 scap-helm mathoid upgrade production stable/mathoid -f mathoid-values.yaml [namespace: mathoid, clusters: eqiad,codfw] [10:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:32] !log mobrovac@deploy1001 scap-helm mathoid cluster eqiad completed [10:43:35] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:38] !log mobrovac@deploy1001 scap-helm mathoid cluster codfw completed [10:43:38] !log mobrovac@deploy1001 scap-helm mathoid finished [10:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:20] (03CR) 10Vgutierrez: [C: 04-1] ATS: add hardening features to systemd unit (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/510168 (owner: 10Ema) [10:49:25] 10Operations, 10Traffic, 10Performance-Team (Radar): Add profiling for Varnish and VCL - https://phabricator.wikimedia.org/T175710 (10Krinkle) Maybe something to discuss with Traffic and possibly collaborate on in a future quarter. [10:50:26] rolling restart of mcrouter on mw1* completed [10:55:21] !log restart mcrouter on mw2163 (codfw mcrouter proxy) [10:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:44] I am checking with marostegui the effects of --^ [10:56:45] jouncebot: refresh [10:56:46] I refreshed my knowledge about deployments. [10:56:49] thx [10:56:49] (03CR) 10Muehlenhoff: [C: 03+1] icinga: fix permission of contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/514679 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [10:57:17] (03PS2) 10Volans: icinga: fix permission of contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/514679 (https://phabricator.wikimedia.org/T222074) [10:58:22] (03CR) 10Volans: [C: 03+2] icinga: fix permission of contacts.cfg [puppet] - 10https://gerrit.wikimedia.org/r/514679 (https://phabricator.wikimedia.org/T222074) (owner: 10Volans) [11:00:04] Amir1, Lucas_WMDE, and Urbanecm: #bothumor I ♥ Unicode. All rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190606T1100). 
[11:00:04] Pchelolo, ottomata, Amir1, Urbanecm, and Lucas_WMDE: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:11] here [11:00:18] o/ [11:01:11] o/ [11:01:20] Pchelolo: are you a deployer? [11:01:29] nope :( [11:01:34] okay [11:01:38] I can do it [11:02:57] (03PS11) 10Ema: varnish: cache_upload rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224884) (owner: 10Jbond) [11:03:40] I think I’ll +2 the second change right now as well, these will take long enough to go through CI without waiting for the first build before starting the second one… [11:05:00] Pchelolo: no backports for wmf.7 btw? [11:05:55] o/ [11:07:29] Pchelolo: you _are_ a deployer ;) https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/admin/data/data.yaml#76 [11:07:33] Lucas_WMDE: wmf-8 is coming to the last group soon, so it's ok that it will go with a train [11:07:39] ok [11:07:57] zeljkof: heh, indeed :) but I don't really know how to swat mw changes.. [11:08:06] wmf.8 should be at all wikis in 2 hours, if there are no problems [11:08:25] Pchelolo: there's docs for that ;) https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers [11:08:27] (03CR) 10Volans: [C: 03+1] "LGTM, does the compiler agrees?" [puppet] - 10https://gerrit.wikimedia.org/r/498134 (owner: 10Muehlenhoff) [11:08:52] I'll try one day when I feel it's a lucky day :) [11:08:54] Pchelolo: I'm more than glad to pair with you on your first deployment, and be there for several next ones [11:09:25] thank you. that would be cool and I will come to you for that [11:13:53] (03CR) 10Giuseppe Lavagetto: "The structure LGTM, some comments, the more important one about the spread-out of such jobs." 
(035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) (owner: 10Effie Mouzeli) [11:16:33] (03PS1) 10Jbond: installer: ensure facter3 and puppet5 components exists early [puppet] - 10https://gerrit.wikimedia.org/r/514689 [11:16:55] (03PS1) 10Giuseppe Lavagetto: safe-service-restart: better error handling [puppet] - 10https://gerrit.wikimedia.org/r/514690 [11:17:22] alright, the first backport was merged, going ahead [11:19:13] Pchelolo: the CirrusSearch backport is on mwdebug1002, can you test it? [11:19:32] Lucas_WMDE: no, unfortunately not : [11:20:10] ok [11:20:16] I’m just quickly checking that it’s not completely broken [11:20:20] looks sane, deploying [11:22:18] !log lucaswerkmeister-wmde@deploy1001 Synchronized php-1.34.0-wmf.8/extensions/CirrusSearch/: SWAT: [[gerrit:514566|Fix event validation error for cirrussearch-request event]] (duration: 01m 06s) [11:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:36] now waiting for gate-and-submit-swat on the EventBus backport [11:23:02] !log gehel@cumin2001 END (PASS) - Cookbook sre.postgresql.postgres-init (exit_code=0) [11:23:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:36] \o/ gehel :) [11:29:18] (03PS2) 10Jbond: installer: ensure facter3 and puppet5 components exists early [puppet] - 10https://gerrit.wikimedia.org/r/514689 [11:30:45] (03CR) 10Muehlenhoff: installer: ensure facter3 and puppet5 components exists early (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514689 (owner: 10Jbond) [11:31:45] (03PS3) 10Jbond: installer: ensure facter3 and puppet5 components exists early [puppet] - 10https://gerrit.wikimedia.org/r/514689 [11:32:11] Lucas_WMDE, sorry, I'M a bit late [11:32:15] let me know when I can deploy my change! 
[11:32:24] Pchelolo: looks like CI is failing on the second backport https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-docker/11550/console [11:32:34] Lucas_WMDE: ye... seing that.. [11:33:06] so I think we’ll skip that change and continue with the config changes [11:33:08] sorry [11:33:08] it doesn't seem to be related to the patch in question, but I think let's cancel SWAT for this one so that we fix CI for event bus first? [11:33:15] yeah [11:33:24] Amir1: you’re next, do you want to deploy or should I continue? [11:33:33] heh, ye. I'll update the deployment schedule saying it's not done [11:33:41] thank you Lucas_WMDE [11:33:48] Since you're already there, it would be great if you do it [11:33:52] sure [11:33:59] (03CR) 10Giuseppe Lavagetto: [C: 03+2] safe-service-restart: better error handling [puppet] - 10https://gerrit.wikimedia.org/r/514690 (owner: 10Giuseppe Lavagetto) [11:34:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514534 (owner: 10Ladsgroup) [11:35:38] (03Merged) 10jenkins-bot: Remove unused config variable wgWikibaseEnableSenses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514534 (owner: 10Ladsgroup) [11:36:13] it’s on mwdebug1002, testing [11:36:23] (03CR) 10jenkins-bot: Remove unused config variable wgWikibaseEnableSenses [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514534 (owner: 10Ladsgroup) [11:36:42] yup, senses are still there ^^ [11:36:44] deploying [11:37:38] (03PS1) 10Arturo Borrero Gonzalez: toolforge: migrate etcd code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) [11:38:12] (03CR) 10jerkins-bot: [V: 04-1] toolforge: migrate etcd code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [11:38:15] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: 
[[gerrit:514534|Remove unused config variable wgWikibaseEnableSenses]] (duration: 00m 55s) [11:38:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:32] alright, now it’s Urbanecm’s turn [11:38:37] same question, do you want to deploy or should I? [11:38:43] I'll deploy it, thanks Lucas_WMDE [11:38:46] ok [11:38:59] (03PS2) 10Urbanecm: Add new namespaces for several Thai projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514678 (https://phabricator.wikimedia.org/T216322) [11:39:05] (03CR) 10Urbanecm: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514678 (https://phabricator.wikimedia.org/T216322) (owner: 10Urbanecm) [11:39:51] (03PS2) 10Arturo Borrero Gonzalez: toolforge: migrate etcd code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) [11:40:00] (03Merged) 10jenkins-bot: Add new namespaces for several Thai projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514678 (https://phabricator.wikimedia.org/T216322) (owner: 10Urbanecm) [11:40:25] (03CR) 10jenkins-bot: Add new namespaces for several Thai projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514678 (https://phabricator.wikimedia.org/T216322) (owner: 10Urbanecm) [11:40:45] (03CR) 10jerkins-bot: [V: 04-1] toolforge: migrate etcd code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [11:41:10] 10Operations, 10observability, 10Performance-Team (Radar), 10User-Elukey: Consider adding per-shard metrics to the prometheus mcrouter exporter - https://phabricator.wikimedia.org/T225059 (10aaron) [11:41:23] (03PS1) 10Giuseppe Lavagetto: mediawiki: increase opcache space on canaries [puppet] - 10https://gerrit.wikimedia.org/r/514695 (https://phabricator.wikimedia.org/T224857) [11:41:56] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: increase opcache space on canaries 
[puppet] - 10https://gerrit.wikimedia.org/r/514695 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [11:43:18] (03PS12) 10Ema: varnish: cache_upload miss/pass rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224884) (owner: 10Jbond) [11:43:54] Urbanecm: you don’t seem to be logged in on the deployment server as far as I can see [11:44:06] (03PS1) 10Reedy: Remove $wgLexemeDisableCirrus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514697 (https://phabricator.wikimedia.org/T225183) [11:44:27] Lucas_WMDE, something went wrong here, works now [11:44:41] ok [11:44:47] (03CR) 10Vgutierrez: [C: 03+1] varnish: cache_upload miss/pass rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224884) (owner: 10Jbond) [11:45:21] (03CR) 10Ema: [C: 03+2] varnish: cache_upload miss/pass rate limit [puppet] - 10https://gerrit.wikimedia.org/r/513596 (https://phabricator.wikimedia.org/T224884) (owner: 10Jbond) [11:46:00] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[:gerrit:514678|Add new namespaces for several Thai projects]] (T216322) (duration: 00m 54s) [11:46:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:05] T216322: Create a new namespaces on thai wikimedia projects - https://phabricator.wikimedia.org/T216322 [11:47:28] (03PS6) 10Muehlenhoff: Remove support for Ubuntu in apt/debmonitor base classes [puppet] - 10https://gerrit.wikimedia.org/r/498134 [11:47:34] !log running mwscript namespaceDupes.php --wiki=thwikibooks --fix for T216322 [11:47:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:11] (03PS3) 10Arturo Borrero Gonzalez: toolforge: migrate etcd code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) [11:48:46] (03CR) 10jerkins-bot: [V: 04-1] toolforge: migrate etcd code from toollabs 
[puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [11:48:47] !log running mwscript namespaceDupes.php --wiki=thwikisource --fix (T216322) [11:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:03] (03CR) 10Muehlenhoff: [C: 03+2] Remove support for Ubuntu in apt/debmonitor base classes [puppet] - 10https://gerrit.wikimedia.org/r/498134 (owner: 10Muehlenhoff) [11:49:51] Lucas_WMDE, I'm done, if there's anything else, you can continue (or close the SWAT :)) [11:49:55] (03PS4) 10Arturo Borrero Gonzalez: toolforge: migrate etcd code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) [11:50:03] yes, there’s one more config change by me :) [11:50:05] thanks! [11:50:18] (03PS7) 10Lucas Werkmeister (WMDE): Specify $wgWBRepoSettings['conceptBaseUri'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477521 [11:50:31] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477521 (owner: 10Lucas Werkmeister (WMDE)) [11:50:52] (03CR) 10jerkins-bot: [V: 04-1] toolforge: migrate etcd code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [11:51:25] (03Merged) 10jenkins-bot: Specify $wgWBRepoSettings['conceptBaseUri'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477521 (owner: 10Lucas Werkmeister (WMDE)) [11:51:39] (03CR) 10jenkins-bot: Specify $wgWBRepoSettings['conceptBaseUri'] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477521 (owner: 10Lucas Werkmeister (WMDE)) [11:52:04] it’s on mwdebug1002, testing… [11:52:38] (03PS1) 10Ema: cp1075: stop passing gethdr_extrachance=0 [puppet] - 10https://gerrit.wikimedia.org/r/514699 (https://phabricator.wikimedia.org/T224694) [11:54:17] looks good, syncing [11:55:00] !log lucaswerkmeister-wmde@deploy1001 
scap failed: average error rate on 8/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [11:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:08] (03PS2) 10Ema: cp1075: do not pass gethdr_extrachance=0 [puppet] - 10https://gerrit.wikimedia.org/r/514699 (https://phabricator.wikimedia.org/T224694) [11:55:10] oops, errors [11:55:20] looots of errors in fatalmonitor too, eek [11:55:29] Undefined variable: wmgWBRepoConceptBaseUri [11:56:04] (03CR) 10CDanis: dbconfig: use lists for sectionLoads sections (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/514633 (owner: 10Volans) [11:56:42] is it possible that scap synced Wikibase.php (where the variable is used) before InitialiseSettings.php (where it should be defined)? [11:56:59] and why were there no errors on mwdebug1002 (that I could see)? [11:57:11] Because you pulled onto mwdebug [11:57:16] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. [11:57:18] it updates everything, not just a single file [11:57:33] Lucas_WMDE, well, everything's possible :). 
You can use scap sync-file to sync IS.php and then sync Wikibase.php [11:57:42] that'll guarantee IS.php is synced before Wikibase.php [11:58:08] well [11:58:12] there’s only two minutes left in SWAT [11:58:19] and fatalmonitor is still full of those errors [11:58:26] so I might not see new errors [11:58:29] so let’s not do that change now [11:58:36] I’ll revert it [11:58:43] okay, that's another possibility :) [11:58:44] (03CR) 10Ema: [C: 03+2] cp1075: do not pass gethdr_extrachance=0 [puppet] - 10https://gerrit.wikimedia.org/r/514699 (https://phabricator.wikimedia.org/T224694) (owner: 10Ema) [11:59:16] Just touch IS and sync it [12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190606T1200) [12:00:06] (03PS5) 10Arturo Borrero Gonzalez: toolforge: migrate etcd code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) [12:00:29] !log cp1075: upgrade varnish to 5.1.3-1wm10 T224694 [12:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:34] T224694: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694 [12:01:12] (03CR) 10jerkins-bot: [V: 04-1] toolforge: migrate etcd code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [12:01:42] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:01:49] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Specify $wgWBRepoSettings['conceptBaseUri']" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514700 [12:02:03] (03CR) 10Lucas 
Werkmeister (WMDE): [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514700 (owner: 10Lucas Werkmeister (WMDE)) [12:02:40] (03CR) 10Hashar: "12:00:36 Could not parse for environment *root*: illegal comma separated argument list at /srv/workspace/puppet/modules/profile/manifests/" [puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [12:03:00] (03Merged) 10jenkins-bot: Revert "Specify $wgWBRepoSettings['conceptBaseUri']" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514700 (owner: 10Lucas Werkmeister (WMDE)) [12:03:17] (03CR) 10jenkins-bot: Revert "Specify $wgWBRepoSettings['conceptBaseUri']" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514700 (owner: 10Lucas Werkmeister (WMDE)) [12:03:54] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:03:56] (03PS4) 10Jbond: installer: ensure facter3 and puppet5 components exists early [puppet] - 10https://gerrit.wikimedia.org/r/514689 [12:04:04] (03PS6) 10Arturo Borrero Gonzalez: toolforge: migrate etcd code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) [12:04:32] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:04:50] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/: SWAT: [[gerrit:514700|Revert "Specify $wgWBRepoSettings['conceptBaseUri']" (duration: 00m 56s) [12:04:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:00] !log EU SWAT done [12:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:07] (03CR) 10jerkins-bot: [V: 04-1] toolforge: migrate etcd code 
from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [12:05:38] Lucas_WMDE: are the mw exceptions under control? [12:06:00] should be fine now, I reverted the change [12:06:05] super thanks :) [12:06:09] just wanted to triple check [12:06:10] (and synced it, just to be sure, though it never made it past the canary hosts anyways) [12:06:28] fatalmonitor output is recovering, now mostly “entire web request took longer than 60 seconds” again [12:06:35] yep yep [12:06:36] just one “undefined variable” left at the top [12:06:43] looking into logstash now to see what caused this [12:06:52] I definitely missed commonswiki [12:07:05] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, we in fact need that for archive.debian.org, which by design doesn't get updated any more. This special case is actually why t" [puppet] - 10https://gerrit.wikimedia.org/r/514555 (owner: 10Jbond) [12:07:09] but the undefined variables are strange… the wiki column for those shows as “-” [12:07:15] what does that mean? [12:07:25] no idea :) [12:07:52] I’ll create a Phab task and try to figure it out there :) [12:07:58] this doesn’t need an incident report right? [12:08:32] (03CR) 10Jbond: [C: 03+2] pbuilder: disable Acquire::Check-Valid-Until on repos [puppet] - 10https://gerrit.wikimedia.org/r/514555 (owner: 10Jbond) [12:08:50] <_joe_> well [12:08:58] <_joe_> was it user-visible? 
[12:09:01] <_joe_> I think so [12:09:15] (03PS7) 10Arturo Borrero Gonzalez: toolforge: migrate etcd code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) [12:09:26] _joe_, just for users who used the canary host [12:09:28] (IIRC) [12:09:30] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [12:09:35] [12:55:00] !log lucaswerkmeister-wmde@deploy1001 scap failed: average error rate on 8/11 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/db09a36be5ed3e81155041f7d46ad040 for details) [12:09:43] That would/should suggest it wasn't fully deployed [12:09:44] <_joe_> Urbanecm: no, it was deployed to all servers [12:09:52] <_joe_> definitely [12:10:11] _joe_, see above, Reedy says "it wasn't fully deployed" [12:10:27] !log restart mcrouter on mw2235 [12:10:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:33] <_joe_> no, it definitely was, I am saying. [12:10:33] * Reedy looks at the logstash link [12:10:50] _joe_: was it? I thought it wasn’t [12:10:50] <_joe_> fatalmonitor has errors from all servers. [12:11:04] !log cp1075: repool with varnish 5.1.3-1wm10 T224694 [12:11:06] the errors on that logstash link... [12:11:07] [{exception_id}] {exception_url} ErrorException from line 309 of /srv/mediawiki/php-1.34.0-wmf.7/includes/debug/MWDebug.php: PHP Warning: MediaWiki\Storage\SqlBlobStore::fetchBlob: Bad data in text row 106803. 
[Called from MediaWiki\Storage\SqlBlobStore [12:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:09] T224694: cp3041 - Varnish frontend child restarted icinga alert - https://phabricator.wikimedia.org/T224694 [12:11:11] <_joe_> at least to the canary hosts [12:11:18] <_joe_> which is like 10% of production [12:11:22] <_joe_> not one host [12:11:24] (03CR) 10Hashar: "i usually run the tests locally and serially with:" [puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [12:11:32] <_joe_> so this helped reduce impact [12:11:36] well yes, the canary hosts is what I meant [12:11:38] <_joe_> but you can see it here [12:11:43] (03PS2) 10Jbond: pbuilder: disable Acquire::Check-Valid-Until on repos [puppet] - 10https://gerrit.wikimedia.org/r/514555 [12:11:51] <_joe_> https://grafana.wikimedia.org/d/GuHySj3mz/php7-transition?refresh=30s&panelId=14&fullscreen&orgId=1&from=now-30m&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver [12:12:19] <_joe_> 99.95 might not seem like a big deal, but it's user-noticeable [12:14:06] clarification question – did scap automatically revert the change on the canary hosts when it detected the high error rate, or were they only fixed by my second sync? 
[12:16:13] (03CR) 10Muehlenhoff: [C: 03+1] "Looks fine, two nits inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/514689 (owner: 10Jbond) [12:16:22] I’ll start an incident report [12:16:41] I think it's only on your second sync with how we use scap in this way [12:16:50] okay, that’s good to kno [12:16:54] *know [12:16:59] I should have done that second sync sooner then [12:17:01] 10Operations, 10Analytics, 10User-Elukey: notebook1003:/srv/ 2% disk space left - https://phabricator.wikimedia.org/T224682 (10ayounsi) Back to warning: > DISK WARNING - free space: /srv 6010 MB (4% inode=83%): [12:17:03] (03CR) 10CDanis: [C: 03+1] types: do not pre-compile regex in SchemaRule [software/conftool] - 10https://gerrit.wikimedia.org/r/514544 (owner: 10Volans) [12:17:13] Because scap doesn't really know what the "previous" state of the file(s) were [12:17:58] oh, right [12:18:20] In other ways we use scap (for deploying services and stuff)... it can do "rollbacks" [12:18:53] (03PS5) 10Jbond: installer: ensure facter3 and puppet5 components exists early [puppet] - 10https://gerrit.wikimedia.org/r/514689 [12:20:54] (03CR) 10Muehlenhoff: [C: 03+1] installer: ensure facter3 and puppet5 components exists early [puppet] - 10https://gerrit.wikimedia.org/r/514689 (owner: 10Jbond) [12:22:15] (03PS6) 10Jbond: installer: ensure facter3 and puppet5 components exists early [puppet] - 10https://gerrit.wikimedia.org/r/514689 [12:23:32] <_joe_> !log running puppet, restarting php-fpm on the canaries to pick up the new opcache size [12:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:58] PROBLEM - DPKG on wezen is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:29:03] (03CR) 10Jbond: [C: 03+2] installer: ensure facter3 and puppet5 components exists early [puppet] - 10https://gerrit.wikimedia.org/r/514689 (owner: 10Jbond) [12:29:44] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 4 minutes 
ago with 0 failures [12:31:02] ^^ looking at wezen [12:32:15] https://wikitech.wikimedia.org/wiki/Incident_documentation/20190606-wikibase [12:32:20] in a meeting for a few minutes, will be back afterwards [12:32:29] <_joe_> Lucas_WMDE: thanks! [12:33:07] (03PS1) 10Mathew.onipe: postgresql: change systemd unit name [cookbooks] - 10https://gerrit.wikimedia.org/r/514705 [12:37:44] RECOVERY - DPKG on wezen is OK: All packages OK [12:40:13] (03PS1) 10Giuseppe Lavagetto: mediawiki::php: increase opcache everywhere [puppet] - 10https://gerrit.wikimedia.org/r/514706 (https://phabricator.wikimedia.org/T224857) [12:40:15] (03PS8) 10Arturo Borrero Gonzalez: toolforge: migrate etcd code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514693 (https://phabricator.wikimedia.org/T215531) [12:40:57] (03CR) 10Sbisson: [C: 03+2] GrowthExperiments (Beta): Switch on mobile homepage feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514636 (owner: 10Kosta Harlan) [12:41:51] (03Merged) 10jenkins-bot: GrowthExperiments (Beta): Switch on mobile homepage feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514636 (owner: 10Kosta Harlan) [12:42:07] (03CR) 10jenkins-bot: GrowthExperiments (Beta): Switch on mobile homepage feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514636 (owner: 10Kosta Harlan) [12:43:59] 10Operations, 10ops-eqiad, 10decommission: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10jbond) im going to reimage this server to test the following change https://gerrit.wikimedia.org/r/c/operations/puppet/+/514689 [12:44:14] !log reimage neodymium [12:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:43] bye bye neodymium, you were great [12:44:59] it will still be around just reimaging to test a change :) [12:45:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: migrate etcd code from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/514693 
(https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [12:45:08] jbond42: nah, it will never be the same! [12:45:13] lol [12:47:19] 10Operations, 10ops-eqiad, 10decommission: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jbond on cumin1001.eqiad.wmnet for hosts: ` ['neodymium.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906061246_... [12:50:52] (03PS1) 10CDanis: dbctl: validate the instance given to section set-master [software/conftool] - 10https://gerrit.wikimedia.org/r/514708 [12:51:35] (03PS1) 10Jbond: sarin: reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/514709 [12:53:56] (03CR) 10Muehlenhoff: [C: 03+1] sarin: reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/514709 (owner: 10Jbond) [12:54:11] (03CR) 10Jbond: [C: 03+2] sarin: reimage to stretch [puppet] - 10https://gerrit.wikimedia.org/r/514709 (owner: 10Jbond) [12:54:37] (03PS1) 10Arturo Borrero Gonzalez: etcd: make monitoring optional [puppet] - 10https://gerrit.wikimedia.org/r/514710 [12:55:34] (03CR) 10jerkins-bot: [V: 04-1] etcd: make monitoring optional [puppet] - 10https://gerrit.wikimedia.org/r/514710 (owner: 10Arturo Borrero Gonzalez) [12:56:56] (03PS2) 10Arturo Borrero Gonzalez: etcd: make monitoring optional [puppet] - 10https://gerrit.wikimedia.org/r/514710 [13:00:04] zeljkof: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - European version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190606T1300). 
[13:02:56] thanks jouncebot, but train is blocked on T225197 [13:02:56] T225197: "PHP Warning: Cannot modify header information - headers already sent" from /w/thumb.php - https://phabricator.wikimedia.org/T225197 [13:03:08] (03PS2) 10CDanis: dbctl: validate the instance given to section set-master [software/conftool] - 10https://gerrit.wikimedia.org/r/514708 [13:03:27] (03CR) 10Volans: "I agree with the approach, although it would have been nicer to implement this validation within the Section class, it doesn't have access" (033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/514708 (owner: 10CDanis) [13:04:10] (03CR) 10Volans: dbctl: validate the instance given to section set-master (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/514708 (owner: 10CDanis) [13:04:43] zeljkof: I don't really think that's actually a train blocker [13:06:01] Reedy: feel free to comment on the patch and/or remove it from blockers [13:06:08] I'm just double checking in logstash [13:06:16] It's basically transient and has mostly gone away [13:06:20] doesn't look serious to me too, but new logspam does block train [13:07:15] no results in the last hour though [13:08:21] should I send the 20190606-wikibase incident report to ops@ ? most of the other incidents at https://wikitech.wikimedia.org/wiki/Incident_documentation don’t seem to have been posted there as far as I can see [13:09:07] Reedy: could it be caused by T224516? [13:09:08] T224516: Database primary master failover on s4 (commonswiki) - https://phabricator.wikimedia.org/T224516 [13:09:13] both mention commonswiki :) [13:09:23] er what? [13:09:35] T224516 has not been done yet [13:09:39] ah no, that's for June 19 [13:09:42] zeljkof: how could something that has not happened yet cause something? 
[13:09:42] yeah :) [13:09:50] yeah, sorry, didn't read the date correctly :) [13:10:00] There were some other DB changes around the time they were showing, but I don't think they're related [13:10:08] of course [13:10:17] we do db changes every hour [13:10:18] I got triggered by commonswiki [13:10:24] jynus__: even when you sleep? ;) [13:10:37] so there is always the question [13:10:44] (03PS3) 10CDanis: dbctl: validate the instance given to section set-master [software/conftool] - 10https://gerrit.wikimedia.org/r/514708 [13:10:49] spoiler: it is never those changes!!! :-D [13:10:54] PROBLEM - Check the Netbox report-s- puppetdb for fail status. on netmon1002 is CRITICAL: puppetdb.PuppetDB CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:11:05] (03CR) 10CDanis: dbctl: validate the instance given to section set-master (034 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/514708 (owner: 10CDanis) [13:16:53] (03CR) 10Gehel: [C: 03+2] wdqs: add WDQS restart cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [13:19:11] Reedy, _joe_: I suppose scap could also depool the canaries until the next sync? it doesn’t need the previous state of any files for that [13:19:22] though automatic repool on the next sync, just assuming that one fixes them again, is icky [13:19:48] actually, in that case you wouldn’t have a reference error rate for the next sync (unless you use a different set of canaries?) 
[13:19:52] probably a bad idea [13:20:03] that's kinda T104352 [13:20:04] T104352: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352 [13:20:08] <_joe_> yeah I was about to point that out [13:20:15] <_joe_> and there is a better, new ticket about that [13:20:17] <_joe_> :P [13:20:42] <_joe_> https://phabricator.wikimedia.org/T224857 [13:21:57] clean up your dupes then :P [13:22:24] <_joe_> it's related, not exactly the same thing [13:24:24] ok, so no more train blockers, starting the train in a few minutes [13:26:11] 10Operations, 10Deployments: Enable scap to roll back broken changes to MediaWiki - https://phabricator.wikimedia.org/T225207 (10Lucas_Werkmeister_WMDE) [13:31:36] !log gehel@cumin1001 START - Cookbook sre.wdqs.restart-wdqs [13:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:08] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.restart-wdqs (exit_code=0) [13:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:00] (03PS1) 10Zfilipin: all wikis to 1.34.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514714 [13:33:02] (03CR) 10Zfilipin: [C: 03+2] all wikis to 1.34.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514714 (owner: 10Zfilipin) [13:33:31] !log gehel@cumin1001 START - Cookbook sre.wdqs.restart-wdqs [13:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:00] (03Merged) 10jenkins-bot: all wikis to 1.34.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514714 (owner: 10Zfilipin) [13:34:11] PROBLEM - Host mr1-eqiad.oob is DOWN: PING CRITICAL - Packet loss = 100% [13:34:14] (03CR) 10jenkins-bot: all wikis to 1.34.0-wmf.8 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514714 (owner: 10Zfilipin) [13:34:25] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.restart-wdqs (exit_code=0) [13:34:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:49] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:17] !log gehel@cumin1001 START - Cookbook sre.wdqs.restart-wdqs [13:35:17] !log gehel@cumin1001 END (FAIL) - Cookbook sre.wdqs.restart-wdqs (exit_code=99) [13:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:35] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [13:35:52] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.34.0-wmf.8 [13:36:16] !log gehel@cumin1001 START - Cookbook sre.wdqs.restart-wdqs [13:36:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:57] (03PS1) 10Awight: New configuration to pull from Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514715 (https://phabricator.wikimedia.org/T224007) [13:37:17] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [13:38:21] (03CR) 10WMDE-Fisch: [C: 03+1] New configuration to pull from Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514715 (https://phabricator.wikimedia.org/T224007) (owner: 10Awight) [13:38:51] PROBLEM - Nginx local proxy to apache on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:39:09] PROBLEM - HHVM rendering on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Application_servers [13:39:25] PROBLEM - Apache HTTP on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:39:35] (03Abandoned) 10Mholloway: Add role/profile for wikifeeds service [puppet] - 10https://gerrit.wikimedia.org/r/514490 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [13:39:37] RECOVERY - Host mr1-eqiad.oob is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms [13:40:05] RECOVERY - Nginx local proxy to apache on mw1297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.428 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:40:15] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 2.40 ms [13:40:23] RECOVERY - HHVM rendering on mw1297 is OK: HTTP OK: HTTP/1.1 200 OK - 76219 bytes in 2.174 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:40:37] RECOVERY - Apache HTTP on mw1297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:42:23] train done, all wikis at wmf.8, nothing exploded. so far :P [13:42:43] \o/ [13:43:42] zeljkof: I see logspam caused by cirrus extension registration [13:44:12] dcausse: please create a task and add to next week's train as a blocker [13:44:22] or, is it serious? should I revert? 
[13:44:32] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [13:44:34] I have a patch for mw-config [13:44:35] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:50] !log rolling reboot of sessionstore hosts in codfw for kernel security update [13:44:52] (03PS3) 10DCausse: [cirrus] extension registration: don't assume default vars are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513556 (https://phabricator.wikimedia.org/T87892) [13:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:56] (03PS5) 10DCausse: [cirrus] Load cirrus using wfLoadExtension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513068 (https://phabricator.wikimedia.org/T87892) [13:44:58] (03PS4) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [1/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513605 (https://phabricator.wikimedia.org/T87892) [13:45:00] (03PS4) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [2/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513606 (https://phabricator.wikimedia.org/T87892) [13:45:02] (03PS5) 10DCausse: [cirrus] drop most wmgCirrusSearch* ephemeral config vars [3/3] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513607 (https://phabricator.wikimedia.org/T87892) [13:45:04] (03PS2) 10DCausse: [cirrus] remove unused wgCirrusSearchRequestEventSampling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513982 [13:45:06] (03PS3) 10DCausse: [cirrus] Enable UTR30 as a lookup method for ns prefixes on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512195 [13:45:16] (remember, no train next week) [13:45:22] 10Operations, 10ops-eqiad, 10decommission: Decommission neodymium - https://phabricator.wikimedia.org/T220503 
(10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['neodymium.eqiad.wmnet'] ` Of which those **FAILED**: ` ['neodymium.eqiad.wmnet'] ` [13:45:28] ah, forgot about no train next week [13:45:37] well, train blocker for the week after then [13:45:50] or, if you can fix it by then... :) [13:45:50] zeljkof: this is causing huge spam: PHP Notice: Undefined variable: wgCirrusSearchPoolCounterKey in /srv/mediawiki/wmf-config/CirrusSearch-production.php on line 86 [13:46:00] dcausse: should I revert? [13:46:08] How will reverting help? [13:46:09] zeljkof: if I can swat this https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/513556 [13:46:10] It's in mw-cofnig [13:46:11] (deployment of wmf.8 to all wikis) [13:46:29] Reedy: it's caused by extension registration and the change in how globals are declared [13:46:51] dcausse: please do swat now if that will fix the problem [13:46:58] ok [13:48:25] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:48:41] <_joe_> something bad is going on [13:48:46] (03CR) 10DCausse: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513556 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [13:49:36] _joe_: dcausse is swatting a config change, cirrus caused log spam after train deployment [13:49:38] (03Merged) 10jenkins-bot: [cirrus] extension registration: don't assume default vars are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513556 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [13:53:57] <_joe_> dcausse: urls like https://en.wikipedia.org/api/rest_v1/page/related/The_Short-Tempered_Clavier_and_other_dysfunctional_works_for_keyboard are failing to render with "cirrussearch-too-busy-error" errors [13:54:02] (03CR) 
10jenkins-bot: [cirrus] extension registration: don't assume default vars are set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/513556 (https://phabricator.wikimedia.org/T87892) (owner: 10DCausse) [13:54:09] <_joe_> that url is served from restbase [13:54:09] !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-production.php: fix logspam (duration: 00m 47s) [13:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:28] _joe_: the poolcounter key was messed up [13:54:45] <_joe_> ok [13:54:50] <_joe_> so this should now be fixed? [13:55:09] yes it should unless I did something wrong [13:55:23] <_joe_> let's see if the 5xx go back to normal [13:55:34] The last error I saw was from .54 [13:55:44] the example from before is working for me now [13:56:52] <_joe_> yeah don't trust the first time you get a good result [13:57:11] logspam is gone [13:57:56] <_joe_> I still see errors [13:58:16] yeah, restbase still erroring [13:58:50] hmmm [13:59:35] still getting the same erorr from mw it seems [13:59:35] <_joe_> the alerts have superseded though [13:59:58] logstash is also not showing more errors so far [14:00:04] affecting only /api/rest_v1/page/related/ it seems? [14:00:11] looks so [14:00:22] there we go, more errors just now [14:00:59] <_joe_> why is this not superseded? [14:01:01] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:01:08] errors still persist [14:01:21] 10Operations, 10ops-eqiad, 10decommission: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jbond on cumin1001.eqiad.wmnet for hosts: ` ['neodymium.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906061400_... 
[14:01:49] <_joe_> mobrovac: so requests to the api cluster spiked some minutes ago [14:01:49] you can see it ongoing at https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [14:01:59] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:02:43] <_joe_> https://grafana.wikimedia.org/d/000000327/apache-hhvm?panelId=27&fullscreen&orgId=1&from=now-1h&to=now [14:02:43] all pass/miss to https://en.wikipedia.org/api/rest_v1/page/related/* [14:03:07] <_joe_> jynus__: can you check if search works on the main site? [14:03:13] it works [14:03:15] _joe_: it works for me [14:03:17] oh [14:03:17] I tested the first thing [14:03:20] yeah [14:03:33] it is not cirrus, or at least not trivially [14:03:45] the related UI was broken in last version [14:03:48] (it could be a specific query, etc.) [14:03:54] so it just started to work again [14:04:02] <_joe_> it's the api telling restbase it's doing too many requests [14:04:09] <_joe_> dcausse: the UI of what? [14:04:25] so varnish needs warmup [14:04:27] User-Agent featured in the errors seems to be mostly WikipediaApp/2.7[...] [14:04:29] :-/ [14:04:41] dcausse: what was broken before? [14:04:42] _joe_: in mobile web the 3 related articles shown at the end of the page [14:04:44] yeah, all live requests [14:04:57] <_joe_> dcausse:oh ok so it's your change that caused this? 
[14:05:07] <_joe_> the requests seem to come from the mobile app though [14:05:15] <_joe_> not the mobile web [14:05:19] yes mobile web [14:05:22] the errors are now decreasing quite a lot [14:05:27] mobile app android specifically [14:05:35] <_joe_> the requests are going down too [14:05:46] so the pipeline varnish -> rb -> cirrus is warming up [14:06:12] and the error in pool counter (completely unrelated) really did not help [14:06:34] <_joe_> we're still having poolcounter errors though [14:06:35] (03Abandoned) 10Mholloway: Add nagios contact group for wikifeeds service [puppet] - 10https://gerrit.wikimedia.org/r/514489 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [14:07:06] <_joe_> "cirrussearch-too-busy-error" [14:07:09] <_joe_> still getting those [14:07:24] <_joe_> most are cached though [14:07:29] (03CR) 10Bstorm: [C: 03+1] "Looks good to me. Interested what Fsero thinks :)" [puppet] - 10https://gerrit.wikimedia.org/r/514710 (owner: 10Arturo Borrero Gonzalez) [14:07:39] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External), 10User-jijiki: Requesting access to deployment for Christoph Jauera (WMDE-Fisch) - https://phabricator.wikimedia.org/T211014 (10Ladsgroup) I just added @WMDE-Fisch to [[https://gerrit.wikimedia.org/... [14:08:06] _joe_: where do you see here errors? [14:08:16] s/here/these [14:08:28] <_joe_> on rb requests [14:08:30] dcausse: I just curled https://en.wikipedia.org/api/rest_v1/page/related/Andrasch_Starke and got "Search is currently too busy. Please try again later." 
[14:08:49] <_joe_> ema: now I got it correctly [14:08:59] <_joe_> the best way to see those dcausse is logstash [14:08:59] error rate now down to 6/s from 12/s [14:09:03] dcausse: https://logstash.wikimedia.org/goto/a03e506e5ac2afce741f81d8f7ff5274 [14:09:07] <_joe_> varnish request 5xx dashboard [14:09:07] PROBLEM - Host mr1-codfw.oob is DOWN: CRITICAL - Network Unreachable (216.117.46.36) [14:09:14] nice [14:09:19] <_joe_> it's oob [14:09:35] <_joe_> XioNoX ^^ [14:09:49] thx [14:10:03] not urgent/service impacting though [14:10:18] <_joe_> dcausse: next time we might want to go gradually across wikis probably [14:10:36] "CyrusOne carrier will be installing new OSP fiber from ZVC to CCA at the Dallas - Carrollton - Data Center " probably related [14:10:42] what are the expectations then? as soon as it is warmed up it will be fine? [14:10:57] marostegui: I think so [14:12:28] dcausse: is there anything that can be done to speed that up? [14:12:49] I downtimed the OOB alert for the time of the maintenance [14:13:07] I could increase the poolcounter for morelike (if elastic can survive) [14:13:18] <_joe_> dcausse: that's what I was about to ask [14:13:27] <_joe_> how do we regulate that? [14:13:39] <_joe_> it's recovering btw [14:13:52] <_joe_> I don't think it's needed /now/, but it would be good to know [14:13:52] poolcounter is the only component that saves elastic [14:14:10] so increasing it will push the load to elastic [14:14:59] CRITICAL: WebPageTest alerts ( https://grafana.wikimedia.org/d/000000318/webpagetest-alerts ) is alerting: First Visual Change Mobile [ALERT]. [14:15:22] could be related ^ [14:16:14] could? 
:-D [14:16:46] it makes sense that it is, but cannot be affirmative [14:16:51] <_joe_> it has been going on for the last couple days [14:16:55] <_joe_> on and off [14:16:59] ok [14:18:08] we are still getting errors, less, but still erroring [14:18:27] yup, going down to 4/s now [14:18:40] !log gehel@cumin1001 END (PASS) - Cookbook sre.wdqs.restart-wdqs (exit_code=0) [14:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:45] <_joe_> dcausse: it's always a balance [14:18:54] problem is that we are already pushing elastic quite hard https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?panelId=54&fullscreen&orgId=1&from=now-1h&to=now [14:19:03] we could use codfw perhaps [14:19:15] <_joe_> use both you mean? [14:19:27] not I can easily use both for these queries [14:19:35] I think I can route morelike to codfw [14:19:44] <_joe_> ok [14:19:45] but not split 50/50 [14:19:57] <_joe_> that and increase the poolcounter for those? [14:19:59] if (random() < 0.5) :P [14:20:10] :) [14:20:22] <_joe_> one day I'll be able to move these percentages via confctl [14:20:40] <_joe_> hopefully within the next year or so [14:21:02] (03CR) 10Alexandros Kosiaris: "Actually, this still applies as our monitoring configuration will still be in puppet, regardless of where and how the wikifeeds service wi" [puppet] - 10https://gerrit.wikimedia.org/r/514489 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [14:21:05] i think you said something similar last year [14:21:06] (03Restored) 10Alexandros Kosiaris: Add nagios contact group for wikifeeds service [puppet] - 10https://gerrit.wikimedia.org/r/514489 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [14:21:07] I am not sure about poolcounter, I prefer rest having issues than both rest and search... 
[14:21:42] <_joe_> it's mobile, not rest :) [14:22:08] yeah, they are all the same blob to me :-D [14:22:20] that thing I don't know about [14:22:21] !log dcausse@deploy1001 Synchronized wmf-config/CirrusSearch-common.php: fix logspam (duration: 00m 48s) [14:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:28] btw we are having very small amount of errors as of now [14:22:43] safer to wait then? [14:22:51] ~3/s now [14:22:56] it's going down, but rather slowly [14:23:10] yes... [14:24:14] baseline is about 0.2/s FYI [14:24:52] (03PS2) 10Alexandros Kosiaris: Add nagios contact group for wikifeeds service [puppet] - 10https://gerrit.wikimedia.org/r/514489 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [14:26:41] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: increase opcache everywhere [puppet] - 10https://gerrit.wikimedia.org/r/514706 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [14:27:28] I think we can wait, we are now down to 1/2 [14:27:55] around ~1.5/s now [14:28:01] so 3/2 :P [14:28:04] (03PS6) 10Effie Mouzeli: WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) [14:28:32] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add nagios contact group for wikifeeds service [puppet] - 10https://gerrit.wikimedia.org/r/514489 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [14:28:45] dbrant: o/ apparently the UA involved in the errors i mentioned are mostly like WikipediaApp/2.7[...] 
[14:28:51] (03PS2) 10Giuseppe Lavagetto: mediawiki::php: increase opcache everywhere [puppet] - 10https://gerrit.wikimedia.org/r/514706 (https://phabricator.wikimedia.org/T224857) [14:28:52] elastic is still serving 500morelike/s, when warmed up it's around 130 morelike/s [14:29:07] (03CR) 10jerkins-bot: [V: 04-1] WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) (owner: 10Effie Mouzeli) [14:29:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:29:30] but i understand from mobrovac that the issue has been pinned down to a CirrusSearch config change, so maybe there's something going on between the combination of the two [14:29:58] mdholloway: dbrant: the UA is WikipediaApp/2.7.50282-r-2019-05-24 (Android 9; Phone) Google Play for all of the logged errors i've seen [14:30:12] probably that's just the latest version [14:30:18] might be just a coincidence, though [14:30:21] yes it us [14:30:23] *is [14:30:32] <_joe_> it was a combination of the two yes [14:30:41] I'll start writing the incident report, morelike is used by mobileweb AND mobile apps [14:31:40] down to 1/s now [14:31:50] we should be out of the woods soon [14:34:20] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10StevenJ81) Is there a proposed timeline on this and T210752? Is there information I need to get from the contributor communities to help move... 
[14:36:37] ok i think the outage is over [14:37:26] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Wikimedia-Apache-configuration: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (10Tgr) @fsero this was feedback from modular.im support (and the modular.im config panel indeed checks for the .well-k... [14:37:33] uh not quite yet, false alarm, sorry [14:38:13] 10Operations, 10DNS, 10Matrix, 10Traffic, 10Wikimedia-Apache-configuration: Configure wikimedia.org to enable *:wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T223835 (10Tgr) [14:40:08] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10Urbanecm) @StevenJ81 Hi, all wiki creations are blocked on T212881. That's a technical problem, and a wiki cannot be technically created. [14:41:21] (03PS3) 10Mforns: analytics::refinery::job::refine Bump up refinery_jar_version [puppet] - 10https://gerrit.wikimedia.org/r/514616 [14:41:53] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 30.92 ms [14:43:13] !log restart mcrouter on mw2255 (codfw proxy) to pick up new config changes [14:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:56] !log updating qemu packages on ganeti hosts to deploy support for md_clear/MDS for Ganeti instances [14:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:34] 10Operations, 10Patch-For-Review, 10User-Ladsgroup, 10User-Urbanecm, 10Wiki-Setup (Create): Create Wikisource Hindi - https://phabricator.wikimedia.org/T218155 (10StevenJ81) Responded at T210752. 
[14:46:15] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational [14:48:43] (03PS1) 10Fsero: adding fake apache2modsec keyholder [labs/private] - 10https://gerrit.wikimedia.org/r/514736 [14:49:14] (03CR) 10Fsero: [V: 03+2 C: 03+2] adding fake apache2modsec keyholder [labs/private] - 10https://gerrit.wikimedia.org/r/514736 (owner: 10Fsero) [14:49:36] ok, it's over now for realz [14:49:48] the cirrus errors have ceased [14:51:05] (03PS4) 10Fsero: phabricator: Install php-mailparse [puppet] - 10https://gerrit.wikimedia.org/r/513713 (https://phabricator.wikimedia.org/T224752) (owner: 1020after4) [14:51:35] (03CR) 10Fsero: "PCC looks good https://puppet-compiler.wmflabs.org/compiler1001/16903/phab1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/513713 (https://phabricator.wikimedia.org/T224752) (owner: 1020after4) [14:52:46] 10Operations, 10ops-eqiad: Storage problems with new host db1133 - https://phabricator.wikimedia.org/T222731 (10Cmjohnson) You have successfully submitted request SR991779294. [14:53:55] (03CR) 10Fsero: [C: 03+2] phabricator: Install php-mailparse [puppet] - 10https://gerrit.wikimedia.org/r/513713 (https://phabricator.wikimedia.org/T224752) (owner: 1020after4) [14:54:50] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::refine Bump up refinery_jar_version [puppet] - 10https://gerrit.wikimedia.org/r/514616 (owner: 10Mforns) [14:54:52] (03PS4) 10Elukey: analytics::refinery::job::refine Bump up refinery_jar_version [puppet] - 10https://gerrit.wikimedia.org/r/514616 (owner: 10Mforns) [14:55:02] 10Operations, 10ops-eqiad, 10User-fgiunchedi: ms-be1033 not powering up - https://phabricator.wikimedia.org/T223518 (10Cmjohnson) The HP technician will be her June 7 @1000 Ashburn time. [14:55:12] fsero: thanks! 
[14:57:18] !log T224850 update views on labsdb1012 [14:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:24] T224850: Offer alternate views of the comment and actor tables which only check for supression in a single table in the Wiki Replicas - https://phabricator.wikimedia.org/T224850 [14:57:58] twentyafterfour: thanks to you, i would appreciate if you can test if its working [14:57:59] 10Operations, 10Mail, 10Phabricator, 10Patch-For-Review, and 2 others: Phabricator email comments not posted - https://phabricator.wikimedia.org/T224752 (10fsero) merged and applied [14:59:17] 10Operations, 10ops-eqiad: eqiad: rack and setup (3) dual CPU servers - https://phabricator.wikimedia.org/T225219 (10Cmjohnson) [14:59:25] fsero of course [14:59:35] 10Operations, 10ops-eqiad: eqiad: rack and setup (3) dual CPU servers - https://phabricator.wikimedia.org/T225219 (10Cmjohnson) [15:00:34] I see puppet ran and created requisite files [15:01:19] (03PS1) 10Jhedden: openstack: allow designate to access puppet encapi [puppet] - 10https://gerrit.wikimedia.org/r/514738 (https://phabricator.wikimedia.org/T224981) [15:01:25] (03PS2) 10Cmjohnson: Setting up mgmt ip for wmf5177/wmf5178 [dns] - 10https://gerrit.wikimedia.org/r/514328 (https://phabricator.wikimedia.org/T225219) [15:01:27] (03PS3) 10Cmjohnson: Setting up mgmt ip for wmf5177/wmf5178 [dns] - 10https://gerrit.wikimedia.org/r/514328 (https://phabricator.wikimedia.org/T225219) [15:01:37] 10Operations, 10Security-Team: apache modsec rules deployment with scap - https://phabricator.wikimedia.org/T224887 (10fsero) i also added a fake key on labs/private PCC runs where erroring because of this [15:01:54] (03CR) 10Cmjohnson: [C: 03+2] Setting up mgmt ip for wmf5177/wmf5178 [dns] - 10https://gerrit.wikimedia.org/r/514328 (https://phabricator.wikimedia.org/T225219) (owner: 10Cmjohnson) [15:02:46] <_joe_> !log rolling restart of php-fpm on {appservers,api} in eqiad, in groups of 4, staggered by 10 
minutes, to pick up the new opcache settings [15:02:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:21] (03CR) 10Bstorm: [C: 03+2] "> Patch Set 4:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/511043 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [15:03:31] (03PS5) 10Bstorm: dologmsg: add -h/--help option [puppet] - 10https://gerrit.wikimedia.org/r/511043 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [15:04:09] (03PS7) 10Effie Mouzeli: WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) [15:05:48] !log rolling reboot of sessionstore hosts in eqiad for kernel security update [15:05:50] 10Operations, 10ops-eqiad, 10decommission: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['neodymium.eqiad.wmnet'] ` Of which those **FAILED**: ` ['neodymium.eqiad.wmnet'] ` [15:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:58] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:06:01] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:08] (03CR) 10Bstorm: "Since I merged the other one, this will need a local rebase conflict fix before it can be merged." 
[puppet] - 10https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [15:08:15] (03CR) 10Andrew Bogott: [C: 04-1] "I think you'll also need to update things in the eqiad1 profile tree: profile::openstack::eqiad1::puppetmaster::backend and profile::opens" [puppet] - 10https://gerrit.wikimedia.org/r/514738 (https://phabricator.wikimedia.org/T224981) (owner: 10Jhedden) [15:09:14] (03CR) 10Bstorm: [C: 03+1] "I think this looks like it would work." [puppet] - 10https://gerrit.wikimedia.org/r/512338 (https://phabricator.wikimedia.org/T169287) (owner: 10Arturo Borrero Gonzalez) [15:09:54] (03PS5) 10Lucas Werkmeister (WMDE): dologmsg: extract variables from Toolforge dologmsg [puppet] - 10https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) [15:10:40] (03CR) 10jerkins-bot: [V: 04-1] WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) (owner: 10Effie Mouzeli) [15:11:24] (03Abandoned) 10MSantos: Restore cpu ratio for maps1004 [puppet] - 10https://gerrit.wikimedia.org/r/485222 (owner: 10MSantos) [15:12:27] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:12:31] (03PS2) 10Lucas Werkmeister (WMDE): dologmsg: fix variable [puppet] - 10https://gerrit.wikimedia.org/r/511750 [15:14:50] (03CR) 10Bstorm: cumin: Allow Puppet DB backend to be used within Labs projects that use it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [15:15:48] (03CR) 10Bstorm: "Since we moved to new hosts and roles (and added ferm!), I think this can be abandoned." 
[puppet] - 10https://gerrit.wikimedia.org/r/491007 (owner: 10Andrew Bogott) [15:16:29] (03Abandoned) 10Andrew Bogott: labsdb: add ::role::mariadb::ferm to the master role [puppet] - 10https://gerrit.wikimedia.org/r/491007 (owner: 10Andrew Bogott) [15:17:56] (03CR) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [15:18:30] 10Operations, 10ops-eqiad, 10DBA: db1062 (s7 db primary master) disk with predictive failure - https://phabricator.wikimedia.org/T224805 (10Cmjohnson) The server is out of warrant and we will need to order more 600GB disks. [15:18:35] (03PS25) 10Alex Monk: cumin: Allow Puppet DB backend to be used within Labs projects that use it [puppet] - 10https://gerrit.wikimedia.org/r/437052 [15:19:59] (03PS1) 1020after4: Increase priority of php-mailparse [puppet] - 10https://gerrit.wikimedia.org/r/514742 (https://phabricator.wikimedia.org/T224752) [15:20:22] fsero: didn't work. 
Follow up with a fix: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/514742/ [15:20:27] yuo] [15:21:11] (03CR) 10jerkins-bot: [V: 04-1] Increase priority of php-mailparse [puppet] - 10https://gerrit.wikimedia.org/r/514742 (https://phabricator.wikimedia.org/T224752) (owner: 1020after4) [15:21:47] (03CR) 10Paladox: [C: 04-1] Increase priority of php-mailparse (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514742 (https://phabricator.wikimedia.org/T224752) (owner: 1020after4) [15:21:58] twentyafterfour: fix your commit msg please :) [15:23:06] (03PS1) 10Bstorm: wikireplicas: depool labsdb1010 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/514744 (https://phabricator.wikimedia.org/T224850) [15:23:12] (03PS2) 1020after4: Increase priority of php-mailparse [puppet] - 10https://gerrit.wikimedia.org/r/514742 (https://phabricator.wikimedia.org/T224752) [15:23:14] (03PS8) 10Effie Mouzeli: WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) [15:23:41] (03PS2) 10Jhedden: openstack: allow designate to access puppet encapi [puppet] - 10https://gerrit.wikimedia.org/r/514738 (https://phabricator.wikimedia.org/T224981) [15:23:53] (03PS3) 1020after4: Increase priority of php-mailparse [puppet] - 10https://gerrit.wikimedia.org/r/514742 (https://phabricator.wikimedia.org/T224752) [15:24:28] (03CR) 1020after4: "fixed" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514742 (https://phabricator.wikimedia.org/T224752) (owner: 1020after4) [15:24:51] (03CR) 10jerkins-bot: [V: 04-1] WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) (owner: 10Effie Mouzeli) [15:25:00] (03CR) 10Paladox: [C: 03+1] Increase priority of php-mailparse [puppet] - 10https://gerrit.wikimedia.org/r/514742 (https://phabricator.wikimedia.org/T224752) (owner: 1020after4) [15:25:34] 10Operations, 
10Traffic, 10Patch-For-Review: Rate limit requests to cache_upload - https://phabricator.wikimedia.org/T224884 (10ema) 05Open→03Resolved [15:25:36] (03CR) 10Alex Monk: [C: 03+1] openstack: allow designate to access puppet encapi [puppet] - 10https://gerrit.wikimedia.org/r/514738 (https://phabricator.wikimedia.org/T224981) (owner: 10Jhedden) [15:27:38] (03CR) 10Fsero: [C: 03+2] Increase priority of php-mailparse [puppet] - 10https://gerrit.wikimedia.org/r/514742 (https://phabricator.wikimedia.org/T224752) (owner: 1020after4) [15:28:03] (03CR) 1020after4: [C: 03+1] "puppet compiler:" [puppet] - 10https://gerrit.wikimedia.org/r/514742 (https://phabricator.wikimedia.org/T224752) (owner: 1020after4) [15:28:37] RECOVERY - Long running screen/tmux on sessionstore1001 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [15:28:51] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime [15:28:52] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:23] 10Operations, 10Mail, 10Phabricator, 10Patch-For-Review, and 2 others: Phabricator email comments not posted - https://phabricator.wikimedia.org/T224752 (10mmodell) testing reply via email. [15:29:56] (03PS1) 10Elukey: profile::kerberos::kadminserver: add auth_users to rsync's module config [puppet] - 10https://gerrit.wikimedia.org/r/514749 (https://phabricator.wikimedia.org/T212257) [15:30:48] 10Operations, 10Mail, 10Phabricator, 10Patch-For-Review, and 2 others: Phabricator email comments not posted - https://phabricator.wikimedia.org/T224752 (10mmodell) 05Open→03Resolved a:03mmodell Working! 
Thanks @fsero [15:30:55] (03CR) 10Volans: "The compiler says that this will change the default backend on the labpuppetmaster hosts, see:" [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [15:31:03] (03PS2) 10Elukey: profile::kerberos::kadminserver: add auth_users to rsync's module config [puppet] - 10https://gerrit.wikimedia.org/r/514749 (https://phabricator.wikimedia.org/T212257) [15:31:51] (03CR) 10Andrew Bogott: [C: 03+1] "looks good! Let's see what the pcc says" [puppet] - 10https://gerrit.wikimedia.org/r/514738 (https://phabricator.wikimedia.org/T224981) (owner: 10Jhedden) [15:33:10] (03CR) 10Elukey: [C: 03+2] profile::kerberos::kadminserver: add auth_users to rsync's module config [puppet] - 10https://gerrit.wikimedia.org/r/514749 (https://phabricator.wikimedia.org/T212257) (owner: 10Elukey) [15:36:08] (03CR) 10Alex Monk: "my reading of that labspuppetmaster diff is that it will create a new variable called default_backend and set it to openstack, which is wh" [puppet] - 10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [15:39:50] (03PS1) 10Elukey: rsync::server::module: add check for secrets_file [puppet] - 10https://gerrit.wikimedia.org/r/514750 [15:41:23] (03CR) 10jerkins-bot: [V: 04-1] rsync::server::module: add check for secrets_file [puppet] - 10https://gerrit.wikimedia.org/r/514750 (owner: 10Elukey) [15:41:59] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['maps2002.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906061541_gehel_1... [15:42:30] ah I missed tests [15:42:49] (03CR) 10Mholloway: "My main motivation for this right now was to set up a beta cluster instance. Are there docs on how to do this with a new service on k8s?" 
[puppet] - 10https://gerrit.wikimedia.org/r/514490 (https://phabricator.wikimedia.org/T170455) (owner: 10Mholloway) [15:43:52] (03CR) 10Arturo Borrero Gonzalez: "Here is a PCC run, with results as expected: https://puppet-compiler.wmflabs.org/compiler1001/16914/" [puppet] - 10https://gerrit.wikimedia.org/r/514710 (owner: 10Arturo Borrero Gonzalez) [15:44:51] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [15:46:24] (03PS2) 10Bstorm: wikireplicas: depool labsdb1010 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/514744 (https://phabricator.wikimedia.org/T224850) [15:47:42] (03CR) 10Jhedden: "PCC run looks good https://puppet-compiler.wmflabs.org/compiler1002/16912/labpuppetmaster1001.wikimedia.org/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514738 (https://phabricator.wikimedia.org/T224981) (owner: 10Jhedden) [15:48:23] (03CR) 10Bstorm: [C: 03+2] wikireplicas: depool labsdb1010 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/514744 (https://phabricator.wikimedia.org/T224850) (owner: 10Bstorm) [15:48:28] (03PS2) 10Gehel: postgresql: change systemd unit name [cookbooks] - 10https://gerrit.wikimedia.org/r/514705 (owner: 10Mathew.onipe) [15:49:02] 10Operations, 10ops-eqiad, 10DBA: db1062 (s7 db primary master) disk with predictive failure - https://phabricator.wikimedia.org/T224805 (10jcrespo) I would suggest to take one out of the less important services and replace it here, I will see with @Marostegui where from. 
[15:50:20] (03PS1) 10Arturo Borrero Gonzalez: profile: etcd: make peer list configurable [puppet] - 10https://gerrit.wikimedia.org/r/514751 (https://phabricator.wikimedia.org/T215531) [15:50:46] (03CR) 10Gehel: [C: 03+2] postgresql: change systemd unit name [cookbooks] - 10https://gerrit.wikimedia.org/r/514705 (owner: 10Mathew.onipe) [15:55:54] (03CR) 10Elukey: "Need to fix tests and rubocop complaints." [puppet] - 10https://gerrit.wikimedia.org/r/514750 (owner: 10Elukey) [15:56:53] !log T224850 depooled labsdb1010 for view updates [15:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:57] T224850: Offer alternate views of the comment and actor tables which only check for supression in a single table in the Wiki Replicas - https://phabricator.wikimedia.org/T224850 [15:57:19] 10Operations, 10ops-eqiad, 10DBA: db1062 (s7 db primary master) disk with predictive failure - https://phabricator.wikimedia.org/T224805 (10jcrespo) The failure is predictictive, it should hold for some time. I suggest to wait for db1068 switch T224852, and once that is resolved use one of its good disks for... [16:00:04] godog and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190606T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:00:32] (03CR) 10Arturo Borrero Gonzalez: "PCC as expected: https://puppet-compiler.wmflabs.org/compiler1002/16915/" [puppet] - 10https://gerrit.wikimedia.org/r/514751 (https://phabricator.wikimedia.org/T215531) (owner: 10Arturo Borrero Gonzalez) [16:04:25] 10Operations, 10ops-codfw, 10Cloud-Services: rack/setup codfw: cloudbackup2001.codfw.wmnet and cloudbackup2002.codfw.wmnet - https://phabricator.wikimedia.org/T224528 (10Papaul) [16:07:33] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 42.86% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:09:44] (03CR) 10Smalyshev: [C: 03+1] Remove $wgLexemeDisableCirrus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/514697 (https://phabricator.wikimedia.org/T225183) (owner: 10Reedy) [16:10:16] (03CR) 10Bstorm: dologmsg: extract variables from Toolforge dologmsg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [16:13:29] (03CR) 10Thcipriani: [C: 03+1] nagios_common: update members of the gerrit contact group [puppet] - 10https://gerrit.wikimedia.org/r/512292 (owner: 10Dzahn) [16:15:46] (03PS1) 10Jbond: install_late: make sure lsb_release is installed before we use it [puppet] - 10https://gerrit.wikimedia.org/r/514784 [16:19:22] (03CR) 10Jbond: [C: 03+2] install_late: make sure lsb_release is installed before we use it [puppet] - 10https://gerrit.wikimedia.org/r/514784 (owner: 10Jbond) [16:20:56] (03PS3) 10Fsero: mcrouter: page 7 days before certs got expired [puppet] - 10https://gerrit.wikimedia.org/r/511397 (https://phabricator.wikimedia.org/T221346) [16:21:00] (03PS1) 10Fsero: ldap-requests: Requesting access to Logstash for Cstone [puppet] - 10https://gerrit.wikimedia.org/r/514790 (https://phabricator.wikimedia.org/T225010) [16:22:30] (03PS1) 10Giuseppe Lavagetto: 
mediawiki::php::monitoring: add checks for opcache status [puppet] - 10https://gerrit.wikimedia.org/r/514799 (https://phabricator.wikimedia.org/T224857) [16:27:54] (03PS6) 10Lucas Werkmeister (WMDE): dologmsg: extract variables from Toolforge dologmsg [puppet] - 10https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) [16:28:33] (03CR) 10Lucas Werkmeister (WMDE): dologmsg: extract variables from Toolforge dologmsg (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [16:31:26] (03PS7) 10Lucas Werkmeister (WMDE): dologmsg: extract variables from Toolforge dologmsg [puppet] - 10https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) [16:35:07] (03PS3) 10Jhedden: onboarding: add jhedden to prod icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/514102 (https://phabricator.wikimedia.org/T224192) [16:35:24] (03PS2) 10Jhedden: onboarding: add jhedden contact info and groups [puppet] - 10https://gerrit.wikimedia.org/r/514195 (https://phabricator.wikimedia.org/T224192) [16:39:55] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet has 28 failures. Last run 6 minutes ago with 28 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[cpjobqueue/deploy],Exec[chown /srv/deployment/cpjobqueue for deploy-service],Package[recommendation-api/deploy] [16:45:22] 10Operations, 10Analytics, 10EventBus, 10Wikimedia-Logstash: Move eventgate logs to new logging infrastructure - https://phabricator.wikimedia.org/T225129 (10fdans) p:05Triage→03Normal [16:45:30] (03CR) 10Fsero: [C: 03+2] ldap-requests: Requesting access to Logstash for Cstone [puppet] - 10https://gerrit.wikimedia.org/r/514790 (https://phabricator.wikimedia.org/T225010) (owner: 10Fsero) [16:45:44] (03Abandoned) 10Fsero: mcrouter: page 7 days before certs got expired [puppet] - 10https://gerrit.wikimedia.org/r/511397 (https://phabricator.wikimedia.org/T221346) (owner: 10Fsero) [16:45:55] 10Operations, 10Analytics, 10netops: Move cloudvirtan* hardware out of CloudVPS back into production Analytics VLAN. - https://phabricator.wikimedia.org/T225128 (10fdans) p:05Triage→03High [16:46:00] (03PS2) 10Fsero: ldap-requests: Requesting access to Logstash for Cstone [puppet] - 10https://gerrit.wikimedia.org/r/514790 (https://phabricator.wikimedia.org/T225010) [16:47:25] (03PS4) 10CDanis: dbctl: validate the instance given to section set-master [software/conftool] - 10https://gerrit.wikimedia.org/r/514708 [16:47:43] (03CR) 10Bstorm: [C: 03+2] "> Patch Set 6:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [16:48:12] (03CR) 10Bstorm: "Before I merge lemme just make sure we didn't make a mistake with the compiler..." 
[puppet] - 10https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [16:48:38] (03CR) 10Lucas Werkmeister (WMDE): "Please do, I’m not a Puppet expert at all :)" [puppet] - 10https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [16:48:44] (03PS9) 10Effie Mouzeli: WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) [16:49:46] (03CR) 10jerkins-bot: [V: 04-1] WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) (owner: 10Effie Mouzeli) [16:50:27] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:51:13] (03CR) 10Bstorm: "Evaluation Error: Error while evaluating a Function Call, Could not find template 'toolforge/dologmsg.erb' at /srv/jenkins-workspace/puppe" [puppet] - 10https://gerrit.wikimedia.org/r/510999 (https://phabricator.wikimedia.org/T222244) (owner: 10Lucas Werkmeister (WMDE)) [16:51:18] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['maps2002.codfw.wmnet'] ` [16:52:30] (03CR) 10Volans: [C: 03+2] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/514708 (owner: 10CDanis) [16:52:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/514790 (https://phabricator.wikimedia.org/T225010) (owner: 10Fsero) [16:55:21] (03Merged) 10jenkins-bot: dbctl: validate the instance given to section set-master [software/conftool] - 10https://gerrit.wikimedia.org/r/514708 (owner: 10CDanis) 
[16:55:54] (03CR) 10Effie Mouzeli: "LGTM, 1 minor comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514799 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [16:58:32] 10Operations, 10ops-eqiad, 10DBA: db1062 (s7 db primary master) disk with predictive failure - https://phabricator.wikimedia.org/T224805 (10Marostegui) Yeah, let's use an used disk to replace this one. And we can schedule s7 failover after s4. The new server is ready in s7 as well. I scheduled s4 first caus... [16:59:06] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Logstash for Cstone - https://phabricator.wikimedia.org/T225010 (10fsero) Hi Christine, this should be done, let us know if you have any futher problem [16:59:17] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Logstash for Cstone - https://phabricator.wikimedia.org/T225010 (10fsero) 05Open→03Resolved a:03fsero [16:59:19] 10Operations, 10ops-eqiad, 10DBA: db1062 (s7 db primary master) disk with predictive failure - https://phabricator.wikimedia.org/T224805 (10Marostegui) I think we should wait till the disk has fully failed [17:00:04] cscott, arlolra, subbu, and halfak: Your horoscope predicts another unfortunate Services – Graphoid / Parsoid / Citoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190606T1700). [17:00:12] nothing for parsoid today [17:00:19] (03PS2) 10Giuseppe Lavagetto: mediawiki::php::monitoring: add checks for opcache status [puppet] - 10https://gerrit.wikimedia.org/r/514799 (https://phabricator.wikimedia.org/T224857) [17:01:42] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. 
status - https://phabricator.wikimedia.org/T208323 (10jcrespo) [17:01:46] 10Operations, 10ops-eqiad, 10DBA: db1062 (s7 db primary master) disk with predictive failure - https://phabricator.wikimedia.org/T224805 (10jcrespo) 05Open→03Stalled p:05High→03Normal [17:02:25] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/compiler1002/16919/mw1261.eqiad.wmnet/ makes sense, and I also tested the script on a real appserver." [puppet] - 10https://gerrit.wikimedia.org/r/514799 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [17:02:27] (03PS1) 10Fsero: ldap_requests: adding rmaung to wmf [puppet] - 10https://gerrit.wikimedia.org/r/514818 (https://phabricator.wikimedia.org/T224744) [17:02:44] (03CR) 10Giuseppe Lavagetto: mediawiki::php::monitoring: add checks for opcache status (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/514799 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [17:03:42] (03CR) 10Effie Mouzeli: [C: 03+1] mediawiki::php::monitoring: add checks for opcache status [puppet] - 10https://gerrit.wikimedia.org/r/514799 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [17:03:50] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php::monitoring: add checks for opcache status [puppet] - 10https://gerrit.wikimedia.org/r/514799 (https://phabricator.wikimedia.org/T224857) (owner: 10Giuseppe Lavagetto) [17:03:57] (03CR) 10Fsero: [C: 03+2] ldap_requests: adding rmaung to wmf [puppet] - 10https://gerrit.wikimedia.org/r/514818 (https://phabricator.wikimedia.org/T224744) (owner: 10Fsero) [17:05:05] (03PS2) 10Fsero: ldap_requests: adding rmaung to wmf [puppet] - 10https://gerrit.wikimedia.org/r/514818 (https://phabricator.wikimedia.org/T224744) [17:05:22] (03PS1) 10Jbond: install_server: use echo as printf dosen't act as expected [puppet] - 10https://gerrit.wikimedia.org/r/514820 [17:05:53] (03PS2) 10Jbond: install_server: use echo as printf dosen't act as expected [puppet] - 
10https://gerrit.wikimedia.org/r/514820 [17:07:03] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [17:08:08] (03Abandoned) 10Jbond: install_server: use echo as printf dosen't act as expected [puppet] - 10https://gerrit.wikimedia.org/r/514820 (owner: 10Jbond) [17:09:01] (03PS1) 10Giuseppe Lavagetto: mediawiki::php::monitoring: fix path [puppet] - 10https://gerrit.wikimedia.org/r/514822 [17:09:11] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['maps2002.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906061709_gehel_3... [17:09:22] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['maps2002.codfw.wmnet'] ` [17:09:24] (03PS10) 10Effie Mouzeli: WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) [17:09:26] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mediawiki::php::monitoring: fix path [puppet] - 10https://gerrit.wikimedia.org/r/514822 (owner: 10Giuseppe Lavagetto) [17:09:28] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Request to add Rmaung to the ldap/wmf group - https://phabricator.wikimedia.org/T224744 (10fsero) 05Open→03Resolved you should have access now, comment here and reopen task otherwise :) [17:10:28] (03PS1) 10Jbond: install- late_command: quote printf paramater [puppet] - 10https://gerrit.wikimedia.org/r/514823 [17:10:49] (03PS2) 10Jbond: install- late_command: quote printf paramater [puppet] - 10https://gerrit.wikimedia.org/r/514823 [17:11:18] <_joe_> jbond42: hold a sec please :) [17:11:19] (03PS1) 10Giuseppe Lavagetto: 
mediawiki::monitoring::php: remove redundant reference to the runbook [puppet] - 10https://gerrit.wikimedia.org/r/514824 [17:11:52] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['maps2002.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906061711_gehel_3... [17:12:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::monitoring::php: remove redundant reference to the runbook [puppet] - 10https://gerrit.wikimedia.org/r/514824 (owner: 10Giuseppe Lavagetto) [17:12:18] (03PS1) 10Fsero: ldap_requests: adding Thea Skaff [puppet] - 10https://gerrit.wikimedia.org/r/514825 (https://phabricator.wikimedia.org/T224928) [17:12:20] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mediawiki::monitoring::php: remove redundant reference to the runbook [puppet] - 10https://gerrit.wikimedia.org/r/514824 (owner: 10Giuseppe Lavagetto) [17:13:04] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Request to add Rmaung to the ldap/wmf group - https://phabricator.wikimedia.org/T224744 (10Rmaung) I'm in, thanks so much for your help! [17:14:18] <_joe_> ok I should stop working when so tired. [17:14:25] <_joe_> Third time's the charm? 
[17:14:37] :) [17:14:48] see you next month _joe_ [17:14:53] :P [17:15:35] (03PS2) 10Fsero: ldap_requests: adding Thea Skaff [puppet] - 10https://gerrit.wikimedia.org/r/514825 (https://phabricator.wikimedia.org/T224928) [17:17:29] (03PS1) 10Giuseppe Lavagetto: mediawiki::monitoring::php: fix path of source [puppet] - 10https://gerrit.wikimedia.org/r/514828 [17:18:01] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mediawiki::monitoring::php: fix path of source [puppet] - 10https://gerrit.wikimedia.org/r/514828 (owner: 10Giuseppe Lavagetto) [17:19:59] (03CR) 10jerkins-bot: [V: 04-1] WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) (owner: 10Effie Mouzeli) [17:20:31] (03CR) 10Fsero: [C: 03+2] ldap_requests: adding Thea Skaff [puppet] - 10https://gerrit.wikimedia.org/r/514825 (https://phabricator.wikimedia.org/T224928) (owner: 10Fsero) [17:20:43] (03PS3) 10Fsero: ldap_requests: adding Thea Skaff [puppet] - 10https://gerrit.wikimedia.org/r/514825 (https://phabricator.wikimedia.org/T224928) [17:21:31] (03CR) 10Jbond: [C: 03+2] install- late_command: quote printf paramater [puppet] - 10https://gerrit.wikimedia.org/r/514823 (owner: 10Jbond) [17:21:39] (03PS3) 10Jbond: install- late_command: quote printf paramater [puppet] - 10https://gerrit.wikimedia.org/r/514823 [17:22:23] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/nrpe_check_opcache] [17:22:23] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/nrpe_check_opcache] [17:22:24] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/nrpe_check_opcache] [17:22:24] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/nrpe_check_opcache] [17:22:25] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/nrpe_check_opcache] [17:22:45] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/nrpe_check_opcache] [17:22:52] er? [17:23:15] _joe_: is this you [17:23:21] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/nrpe_check_opcache] [17:23:23] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/nrpe_check_opcache] [17:23:29] <_joe_> yes but [17:23:33] <_joe_> it should recover now [17:23:45] ack [17:23:46] <_joe_> I disabled puppet, basically masking the failures [17:23:55] PROBLEM - puppet last run on mw2153 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/nrpe_check_opcache] [17:23:59] 👍 [17:24:03] PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/nrpe_check_opcache] [17:24:09] PROBLEM - puppet last run on mw2226 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/nrpe_check_opcache] [17:24:21] <_joe_> sigh [17:24:24] <_joe_> so many? [17:24:39] <_joe_> just to be sure [17:24:44] <_joe_> I'm running puppet on one host [17:24:46] (03PS4) 10Fsero: ldap_requests: adding Thea Skaff [puppet] - 10https://gerrit.wikimedia.org/r/514825 (https://phabricator.wikimedia.org/T224928) [17:24:49] <_joe_> to verify it fixes things [17:24:52] <_joe_> and indeed it does [17:25:13] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/nrpe_check_opcache] [17:25:13] PROBLEM - puppet last run on mw2259 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/nrpe_check_opcache] [17:25:25] PROBLEM - puppet last run on mw2236 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/nrpe_check_opcache] [17:27:04] (03CR) 10Fsero: [V: 03+2 C: 03+2] ldap_requests: adding Thea Skaff [puppet] - 10https://gerrit.wikimedia.org/r/514825 (https://phabricator.wikimedia.org/T224928) (owner: 10Fsero) [17:27:06] ok nice [17:27:10] well maybe it's just running behind a bit [17:28:33] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:28:57] PROBLEM - Host backup2001 is DOWN: PING CRITICAL - Packet loss = 100% [17:29:37] one recovery [17:29:45] guess the rest must be close behind [17:30:23] RECOVERY - puppet last run on mw2259 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:30:43] !log restart mcrouter on mw2271 (codfw proxy) to pick up new config changes [17:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:26] 10Operations, 10Maps: reimage of maps2002 fails to "run preseeded command" - https://phabricator.wikimedia.org/T225238 (10Gehel) [17:34:57] (03PS11) 10Effie Mouzeli: WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) [17:35:04] 10Operations, 10Maps: Maps[12]004 /srv disk space is critical - https://phabricator.wikimedia.org/T224395 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['maps2002.codfw.wmnet'] ` [17:35:17] PROBLEM - puppet last run on mw2283 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 14 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/nrpe_check_opcache] [17:36:15] still? 
[17:38:03] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:38:03] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:38:04] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:38:04] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:38:04] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:38:30] (03PS12) 10Effie Mouzeli: WIP: mediawiki: check and restart php7 if needed [puppet] - 10https://gerrit.wikimedia.org/r/514673 (https://phabricator.wikimedia.org/T224491) [17:38:31] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:39:09] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:39:43] RECOVERY - puppet last run on mw2153 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:39:49] RECOVERY - puppet last run on mw2250 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:39:55] RECOVERY - puppet last run on mw2226 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:40:37] RECOVERY - puppet last run on mw2283 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:41:03] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:41:13] RECOVERY - puppet last run on mw2236 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [17:48:09] (03PS1) 10CRusnov: Add Cumin backend for accessing Netbox [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 [17:50:38] PROBLEM - PHP opcache health 
on mw1221 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers [17:52:06] ^ that is expected [17:52:51] 10Operations, 10Wikimedia-Mailing-lists: Request mailing list Chad - https://phabricator.wikimedia.org/T225240 (10Abdallahbigboy) [17:53:42] PROBLEM - PHP opcache health on mw1255 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers [17:54:35] <_joe_> yeah it will go away when I merge my next patch :D [17:54:51] (03CR) 10jerkins-bot: [V: 04-1] Add Cumin backend for accessing Netbox [software/cumin] - 10https://gerrit.wikimedia.org/r/514840 (owner: 10CRusnov) [17:55:14] (03PS1) 10Giuseppe Lavagetto: mediawiki::php::monitoring: fix some numbers [puppet] - 10https://gerrit.wikimedia.org/r/514842 [17:56:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php::monitoring: fix some numbers [puppet] - 10https://gerrit.wikimedia.org/r/514842 (owner: 10Giuseppe Lavagetto) [17:58:20] PROBLEM - PHP opcache health on mw1223 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers [17:58:52] PROBLEM - PHP opcache health on mw1330 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers [17:59:14] <_joe_> jijiki: can you also run puppet on the servers you restart? [17:59:20] <_joe_> it will make these alerts go away [17:59:30] PROBLEM - PHP opcache health on mw1231 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers [18:00:05] MaxSem, RoanKattouw, and Niharika: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190606T1800). [18:00:05] No GERRIT patches in the queue for this window AFAICS. 
[18:01:20] RECOVERY - PHP opcache health on mw1330 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers [18:01:56] PROBLEM - PHP opcache health on mw1242 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers [18:02:00] RECOVERY - PHP opcache health on mw1231 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers [18:02:28] <_joe_> jijiki: are you doing the restarts or should I? [18:03:00] I have been yes [18:04:06] PROBLEM - PHP opcache health on mw1332 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers [18:04:06] <_joe_> have you logged it? [18:04:28] RECOVERY - PHP opcache health on mw1242 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers [18:04:33] we have logged rolling restarts in all eqiad [18:04:59] !log Continuing rolling restarts of php-fpm in eqiad [18:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:54] RECOVERY - PHP opcache health on mw1255 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers [18:07:24] RECOVERY - PHP opcache health on mw1221 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers [18:07:24] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): WDQS internal servers started lagging behind - https://phabricator.wikimedia.org/T224829 (10debt) [18:09:04] (03CR) 10Herron: "> ah yes now I see the change in id in commons.yaml, missed that!" 
[puppet] - 10https://gerrit.wikimedia.org/r/514361 (https://phabricator.wikimedia.org/T225005) (owner: 10Herron) [18:09:52] PROBLEM - PHP opcache health on mw1251 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers [18:09:54] PROBLEM - PHP opcache health on mw1322 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers [18:10:11] (03CR) 10Andrew Bogott: [C: 03+1] "Yep, pcc output looks good" [puppet] - 10https://gerrit.wikimedia.org/r/514738 (https://phabricator.wikimedia.org/T224981) (owner: 10Jhedden) [18:11:10] _joe_: I will add puppet in the next set of servers [18:11:13] (03CR) 10Andrew Bogott: [C: 03+2] onboarding: add jhedden contact info and groups [puppet] - 10https://gerrit.wikimedia.org/r/514195 (https://phabricator.wikimedia.org/T224192) (owner: 10Jhedden) [18:11:26] <_joe_> ok, I'm off now. [18:11:35] (03CR) 10Andrew Bogott: [C: 03+2] onboarding: add jhedden to prod icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/514102 (https://phabricator.wikimedia.org/T224192) (owner: 10Jhedden) [18:12:18] bb joe [18:12:46] RECOVERY - PHP opcache health on mw1223 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers [18:14:31] (03PS3) 10Jhedden: openstack: allow designate to access puppet encapi [puppet] - 10https://gerrit.wikimedia.org/r/514738 (https://phabricator.wikimedia.org/T224981) [18:14:44] (03PS3) 10Andrew Bogott: onboarding: add jhedden contact info and groups [puppet] - 10https://gerrit.wikimedia.org/r/514195 (https://phabricator.wikimedia.org/T224192) (owner: 10Jhedden) [18:14:56] (03PS1) 10Bstorm: Revert "wikireplicas: depool labsdb1010 for view updates" [puppet] - 10https://gerrit.wikimedia.org/r/514845 [18:15:08] (03PS2) 10Bstorm: Revert "wikireplicas: depool labsdb1010 for view updates" [puppet] - 10https://gerrit.wikimedia.org/r/514845 [18:16:56] (03CR) 10Bstorm: [C: 03+2] Revert 
"wikireplicas: depool labsdb1010 for view updates" [puppet] - 10https://gerrit.wikimedia.org/r/514845 (owner: 10Bstorm) [18:17:27] (03PS4) 10Andrew Bogott: onboarding: add jhedden contact info and groups [puppet] - 10https://gerrit.wikimedia.org/r/514195 (https://phabricator.wikimedia.org/T224192) (owner: 10Jhedden) [18:18:48] (03PS4) 10Jhedden: openstack: allow designate to access puppet encapi [puppet] - 10https://gerrit.wikimedia.org/r/514738 (https://phabricator.wikimedia.org/T224981) [18:19:54] (03CR) 10Jhedden: [V: 03+2 C: 03+2] openstack: allow designate to access puppet encapi [puppet] - 10https://gerrit.wikimedia.org/r/514738 (https://phabricator.wikimedia.org/T224981) (owner: 10Jhedden) [18:20:09] (03PS5) 10Jhedden: openstack: allow designate to access puppet encapi [puppet] - 10https://gerrit.wikimedia.org/r/514738 (https://phabricator.wikimedia.org/T224981) [18:20:51] 10Operations, 10Continuous-Integration-Config: Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10hashar) I am assuming you will be able to talk about this during the SRE offsite next week? 
[18:21:17] (03CR) 10Jhedden: [V: 03+2 C: 03+2] openstack: allow designate to access puppet encapi [puppet] - 10https://gerrit.wikimedia.org/r/514738 (https://phabricator.wikimedia.org/T224981) (owner: 10Jhedden) [18:23:26] (03PS4) 10Andrew Bogott: onboarding: add jhedden to prod icinga contacts [puppet] - 10https://gerrit.wikimedia.org/r/514102 (https://phabricator.wikimedia.org/T224192) (owner: 10Jhedden) [18:24:15] !log T224850 repooled labsdb1010 after completing view run [18:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:21] T224850: Offer alternate views of the comment and actor tables which only check for supression in a single table in the Wiki Replicas - https://phabricator.wikimedia.org/T224850 [18:26:16] RECOVERY - PHP opcache health on mw1332 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers [18:32:28] PROBLEM - PHP opcache health on mw1249 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers [18:33:56] RECOVERY - PHP opcache health on mw1322 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers [18:38:13] !log shutting down backup2001 for 10G nic troubleshooting [18:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:06] !log mw1249 - sudo systemctl restart php7.2-fpm.service [18:39:06] RECOVERY - PHP opcache health on mw1249 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers [18:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:14] RECOVERY - PHP opcache health on mw1251 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers [18:47:38] RECOVERY - Host backup2001 is UP: PING OK - Packet loss = 0%, RTA = 36.17 ms [18:56:35] 10Operations, 10ops-codfw, 10decommission: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10jbond) going to 
re-image this server to stretch, testing changes to late_command.sh [18:58:01] !log reimage sarin to stretch [18:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:24] 10Operations, 10ops-codfw, 10decommission: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jbond on cumin1001.eqiad.wmnet for hosts: ` ['sarin.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201906061858_jbond_15... [19:11:18] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:37:32] (03PS3) 10Alex Monk: Puppet CAs: Make it easy to swap CAs by hiera change [puppet] - 10https://gerrit.wikimedia.org/r/506872 (https://phabricator.wikimedia.org/T220268) [19:42:38] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:47:48] !log performing rolling reboot of eqiad logstash hw for MDS security updates [19:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:08] (03PS1) 10Jbond: late_command: rollback puppet5 changes [puppet] - 10https://gerrit.wikimedia.org/r/514865 [19:49:16] PROBLEM - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/project/view/2773/ [19:49:48] PROBLEM - HTTPS-wikitech-static on wikitech-static.wikimedia.org is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/project/view/2773/ [20:03:34] RECOVERY - HTTPS-status-wikimedia-org on wikitech-static.wikimedia.org is OK: SSL OK - Certificate 
wikitech-static.wikimedia.org valid until 2019-06-23 23:01:36 +0000 (expires in 17 days) https://phabricator.wikimedia.org/project/view/2773/ [20:04:04] RECOVERY - HTTPS-wikitech-static on wikitech-static.wikimedia.org is OK: SSL OK - Certificate wikitech-static.wikimedia.org valid until 2019-06-23 23:01:36 +0000 (expires in 17 days) https://phabricator.wikimedia.org/project/view/2773/ [20:05:10] 10Operations, 10ops-codfw, 10decommission: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['sarin.codfw.wmnet'] ` Of which those **FAILED**: ` ['sarin.codfw.wmnet'] ` [20:06:00] (03PS5) 10Volans: types: do not pre-compile regex in SchemaRule [software/conftool] - 10https://gerrit.wikimedia.org/r/514544 [20:06:02] (03PS4) 10Volans: dbconfig: use lists of dicts for sectionLoads [software/conftool] - 10https://gerrit.wikimedia.org/r/514633 [20:06:04] (03PS1) 10Volans: dbconfig: pretty-print get actions [software/conftool] - 10https://gerrit.wikimedia.org/r/514869 [20:06:06] (03PS1) 10Volans: dbconfig: save live config before updating it [software/conftool] - 10https://gerrit.wikimedia.org/r/514870 [20:06:16] (03CR) 10jerkins-bot: [V: 04-1] dbconfig: use lists of dicts for sectionLoads [software/conftool] - 10https://gerrit.wikimedia.org/r/514633 (owner: 10Volans) [20:08:21] (03PS2) 10Volans: dbconfig: save live config before updating it [software/conftool] - 10https://gerrit.wikimedia.org/r/514870 [20:09:18] (03CR) 10Volans: "@_joe_ could you have a look at this latest PS?" 
[software/conftool] - 10https://gerrit.wikimedia.org/r/514544 (owner: 10Volans) [20:09:45] (03CR) 10Volans: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/514633 (owner: 10Volans) [20:11:07] (03PS2) 10Ottomata: Set LVS eventgate-* service to critical: true [puppet] - 10https://gerrit.wikimedia.org/r/514575 [20:11:09] (03PS1) 10Ottomata: Add monitoring::alerts::kafka_topic_throughput and use it for eventgate validation alerts [puppet] - 10https://gerrit.wikimedia.org/r/514871 (https://phabricator.wikimedia.org/T225203) [20:11:26] (03CR) 10Volans: "replies inline" (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/514633 (owner: 10Volans) [20:12:18] (03CR) 10jerkins-bot: [V: 04-1] Add monitoring::alerts::kafka_topic_throughput and use it for eventgate validation alerts [puppet] - 10https://gerrit.wikimedia.org/r/514871 (https://phabricator.wikimedia.org/T225203) (owner: 10Ottomata) [20:15:49] (03PS2) 10Ottomata: Add monitoring::alerts::kafka_topic_throughput and use it for eventgate [puppet] - 10https://gerrit.wikimedia.org/r/514871 (https://phabricator.wikimedia.org/T225203) [20:16:31] (03CR) 10jerkins-bot: [V: 04-1] Add monitoring::alerts::kafka_topic_throughput and use it for eventgate [puppet] - 10https://gerrit.wikimedia.org/r/514871 (https://phabricator.wikimedia.org/T225203) (owner: 10Ottomata) [20:22:24] (03PS3) 10Ottomata: Add monitoring::alerts::kafka_topic_throughput and use it for eventgate [puppet] - 10https://gerrit.wikimedia.org/r/514871 (https://phabricator.wikimedia.org/T225203) [20:26:20] (03PS1) 10Bstorm: wikireplicas: depool labsdb1011 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/514882 (https://phabricator.wikimedia.org/T224850) [20:28:46] (03CR) 10Bstorm: [C: 03+2] wikireplicas: depool labsdb1011 for view updates [puppet] - 10https://gerrit.wikimedia.org/r/514882 (https://phabricator.wikimedia.org/T224850) (owner: 10Bstorm) [20:34:35] (03CR) 10Volans: [C: 03+1] "> Patch Set 25:" [puppet] - 
10https://gerrit.wikimedia.org/r/437052 (owner: 10Alex Monk) [20:39:51] (03PS3) 10Jforrester: Wikibase: Drop backwards-compatibility for dataSquidMaxage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512460 [20:40:37] (03CR) 10Jforrester: [C: 03+2] Wikibase: Drop backwards-compatibility for dataSquidMaxage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512460 (owner: 10Jforrester) [20:41:45] (03CR) 10Volans: "LGTM, just a couple of typos and a optional question." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/514459 (owner: 10Jbond) [20:42:08] (03PS5) 10Jforrester: De-duplicate …Squid variables now MW only uses the …Cdn ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496850 (https://phabricator.wikimedia.org/T104148) [20:44:19] (03PS4) 10Jbond: icinga: Add a script to parse and query the status.dat file [puppet] - 10https://gerrit.wikimedia.org/r/514459 [20:45:46] (03Merged) 10jenkins-bot: Wikibase: Drop backwards-compatibility for dataSquidMaxage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/512460 (owner: 10Jforrester) [20:49:05] !log jforrester@deploy1001 Synchronized wmf-config/Wikibase.php: Drop backwards-compatibility for dataSquidMaxage (duration: 00m 48s) [20:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:12] (03CR) 10Jforrester: [C: 03+2] De-duplicate …Squid variables now MW only uses the …Cdn ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496850 (https://phabricator.wikimedia.org/T104148) (owner: 10Jforrester) [20:50:42] (03Merged) 10jenkins-bot: De-duplicate …Squid variables now MW only uses the …Cdn ones [mediawiki-config] - 10https://gerrit.wikimedia.org/r/496850 (https://phabricator.wikimedia.org/T104148) (owner: 10Jforrester) [20:54:33] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [140.0] 
https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:54:51] (CR) CDanis: [C: +2] dbconfig: pretty-print get actions [software/conftool] - https://gerrit.wikimedia.org/r/514869 (owner: Volans) [20:55:09] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Stop setting wgUseSquid or using wgSquidServersNoPurge, duplicate existing values (duration: 00m 48s) [20:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:10] (CR) CDanis: [C: +1] "oops, lg but one nit" (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/514869 (owner: Volans) [20:56:55] (PS5) Jbond: icinga: Add a script to parse and query the status.dat file [puppet] - https://gerrit.wikimedia.org/r/514459 [20:57:14] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Stop setting wgSquidMaxage, MW now uses wgCdnMaxAge (duration: 00m 46s) [20:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:56] !log jforrester@deploy1001 Synchronized wmf-config/reverse-proxy.php: Stop setting wgSquidServersNoPurge, MW now uses wgCdnServersNoPurge (duration: 00m 47s) [20:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:59] (PS6) Jbond: icinga: Add a script to parse and query the status.dat file [puppet] - https://gerrit.wikimedia.org/r/514459 [21:01:29] !log T224850 depooled labsdb1011 [21:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:01:34] T224850: Offer alternate views of the comment and actor tables which only check for supression in a single table in the Wiki Replicas - https://phabricator.wikimedia.org/T224850 [21:01:57] (CR) Jbond: icinga: Add a script to parse and query the status.dat file (2 comments) [puppet] - https://gerrit.wikimedia.org/r/514459 (owner: Jbond) [21:02:31] (PS2) Volans: dbconfig: pretty-print get actions 
[software/conftool] - https://gerrit.wikimedia.org/r/514869 [21:02:33] (PS3) Volans: dbconfig: save live config before updating it [software/conftool] - https://gerrit.wikimedia.org/r/514870 [21:05:17] (CR) Jbond: icinga: Add a script to parse and query the status.dat file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/514459 (owner: Jbond) [21:08:17] (CR) Volans: dbconfig: pretty-print get actions (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/514869 (owner: Volans) [21:10:37] Operations, Access-Policy: Access Q re maint1002 - https://phabricator.wikimedia.org/T225253 (Iflorez) [21:13:21] (CR) jenkins-bot: Wikibase: Drop backwards-compatibility for dataSquidMaxage [mediawiki-config] - https://gerrit.wikimedia.org/r/512460 (owner: Jforrester) [21:13:23] Operations, SRE-Access-Requests: Access Q re maint1002 - https://phabricator.wikimedia.org/T225253 (Krenair) hieradata/role/common/mediawiki/maintenance.yaml `admin::groups: - restricted - deployment - ldap-admins - maintenance-log-readers - perf-roots` What data are you trying to get to exactly? [21:14:03] (CR) Volans: [C: +1] "LGTM, see reply inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/514459 (owner: Jbond) [21:14:19] (CR) CDanis: [C: +1] "looks great!" 
[software/conftool] - https://gerrit.wikimedia.org/r/514633 (owner: Volans) [21:18:53] (CR) jenkins-bot: De-duplicate …Squid variables now MW only uses the …Cdn ones [mediawiki-config] - https://gerrit.wikimedia.org/r/496850 (https://phabricator.wikimedia.org/T104148) (owner: Jforrester) [21:22:17] (PS7) Jbond: icinga: Add a script to parse and query the status.dat file [puppet] - https://gerrit.wikimedia.org/r/514459 [21:23:09] (CR) jerkins-bot: [V: -1] icinga: Add a script to parse and query the status.dat file [puppet] - https://gerrit.wikimedia.org/r/514459 (owner: Jbond) [21:23:18] (CR) Jbond: icinga: Add a script to parse and query the status.dat file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/514459 (owner: Jbond) [21:24:31] (PS8) Jbond: icinga: Add a script to parse and query the status.dat file [puppet] - https://gerrit.wikimedia.org/r/514459 [21:28:56] (PS1) Brion VIBBER: Specify the fluidsynth paths for TMH MIDI conversion [mediawiki-config] - https://gerrit.wikimedia.org/r/514960 (https://phabricator.wikimedia.org/T135597) [21:30:20] (PS1) Brion VIBBER: List deps for MIDI to Ogg/MP3 conversion for video scalers [puppet] - https://gerrit.wikimedia.org/r/514962 (https://phabricator.wikimedia.org/T135597) [21:54:19] (CR) CDanis: [C: +1] dbconfig: save live config before updating it (3 comments) [software/conftool] - https://gerrit.wikimedia.org/r/514870 (owner: Volans) [21:56:08] (CR) Jforrester: [C: +2] Specify the fluidsynth paths for TMH MIDI conversion [mediawiki-config] - https://gerrit.wikimedia.org/r/514960 (https://phabricator.wikimedia.org/T135597) (owner: Brion VIBBER) [22:03:03] (PS1) Andrew Bogott: wikitech-static: remove --group=dump arg [wikitech-static] - https://gerrit.wikimedia.org/r/514965 [22:04:26] (PS1) Andrew Bogott: get_images: remove group=dump [wikitech-static] - https://gerrit.wikimedia.org/r/514966 [22:05:16] (CR) Andrew 
Bogott: [V: +2 C: +2] get_images: remove group=dump [wikitech-static] - https://gerrit.wikimedia.org/r/514966 (owner: Andrew Bogott) [22:05:25] (Abandoned) Andrew Bogott: wikitech-static: remove --group=dump arg [wikitech-static] - https://gerrit.wikimedia.org/r/514965 (owner: Andrew Bogott) [22:07:35] (PS1) Andrew Bogott: import-wikitech.sh: /fully/qualify path to service command [wikitech-static] - https://gerrit.wikimedia.org/r/514967 [22:08:28] (CR) Andrew Bogott: [V: +2 C: +2] import-wikitech.sh: /fully/qualify path to service command [wikitech-static] - https://gerrit.wikimedia.org/r/514967 (owner: Andrew Bogott) [22:11:52] Operations, SRE-Access-Requests: Access Q re maint1002 - https://phabricator.wikimedia.org/T225253 (Iflorez) Hello @Krenair, Interested in translation data on Google Translated articles, including: translation_id, date(translation_last_updated_timestamp), translation_source_language, translatio... [22:16:46] Operations, SRE-Access-Requests: Access Q re maint1002 - https://phabricator.wikimedia.org/T225253 (Krenair) That's from ContentTranslation right? You should be able to get all this stuff already by being in the researcher group. I don't think the maintenance hosts store this either, they just have acces... [22:17:44] (PS1) CDanis: dbctl config: remove comment cruft [software/conftool] - https://gerrit.wikimedia.org/r/514968 [22:18:48] Operations, Traffic, User-notice: Rate limit requests in violation of User-Agent policy more aggressively - https://phabricator.wikimedia.org/T224891 (Quiddity) TechNews: I've [[https://meta.wikimedia.org/w/index.php?title=Tech/News/2019/24&diff=19140176&oldid=19140169&diffmode=source|added it to the... 
[22:20:05] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:28:12] Operations, Traffic, Goal, Patch-For-Review, User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942 (colewhite) Latest dashboard audit: 'varnish\..+\.backends' * "Media" * "API frontend summary" * "Experimental - backend 5xx" * "Maps performances" *... [22:28:26] (PS1) Bstorm: Revert "wikireplicas: depool labsdb1011 for view updates" [puppet] - https://gerrit.wikimedia.org/r/514969 [22:29:18] (CR) Bstorm: [C: +2] Revert "wikireplicas: depool labsdb1011 for view updates" [puppet] - https://gerrit.wikimedia.org/r/514969 (owner: Bstorm) [22:41:08] Operations, SRE-Access-Requests: Access Q re maint1002 - https://phabricator.wikimedia.org/T225253 (Iflorez) That's helpful, thank you [22:42:50] !log T224850 repooled labsdb1011 [22:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:55] T224850: Offer alternate views of the comment and actor tables which only check for supression in a single table in the Wiki Replicas - https://phabricator.wikimedia.org/T224850 [22:52:12] (PS1) Jhedden: wikireplicas: depool labsdb1009 for view updates [puppet] - https://gerrit.wikimedia.org/r/514971 (https://phabricator.wikimedia.org/T224850) [22:54:47] (CR) Bstorm: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/514971 (https://phabricator.wikimedia.org/T224850) (owner: Jhedden) [22:57:14] (CR) Jhedden: [C: +2] wikireplicas: depool labsdb1009 for view updates [puppet] - https://gerrit.wikimedia.org/r/514971 (https://phabricator.wikimedia.org/T224850) (owner: Jhedden) [23:00:05] MaxSem, RoanKattouw, and Niharika: That opportune time is upon us again. Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190606T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:01:19] Operations, Maps: reimage of maps2002 fails to "run preseeded command" - https://phabricator.wikimedia.org/T225238 (Gehel) Looking around at maps2002, I see an invalid apt source list (P8595) during late command: The problematic file is: ` root@maps2002:~# cat /etc/apt/sources.list.d/component-facter3.l... [23:03:20] !log T224850 depooled labsdb1009 [23:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:25] T224850: Offer alternate views of the comment and actor tables which only check for supression in a single table in the Wiki Replicas - https://phabricator.wikimedia.org/T224850 [23:13:21] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:20:35] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:20:38] (PS1) Bstorm: Revert "wikireplicas: depool labsdb1009 for view updates" [puppet] - https://gerrit.wikimedia.org/r/514982 [23:23:59] (PS5) Volans: dbconfig: use lists of dicts for sectionLoads [software/conftool] - https://gerrit.wikimedia.org/r/514633 [23:24:01] (PS3) Volans: dbconfig: pretty-print get actions [software/conftool] - https://gerrit.wikimedia.org/r/514869 [23:24:03] (PS4) Volans: dbconfig: save live config before updating it [software/conftool] - https://gerrit.wikimedia.org/r/514870 [23:24:05] (PS1) Volans: dbconfig: add config restore action [software/conftool] - https://gerrit.wikimedia.org/r/514983 [23:24:10] (CR) Volans: "done" (3 comments) [software/conftool] - 
https://gerrit.wikimedia.org/r/514870 (owner: Volans) [23:28:27] (CR) CDanis: [C: +1] dbconfig: save live config before updating it (2 comments) [software/conftool] - https://gerrit.wikimedia.org/r/514870 (owner: Volans) [23:28:56] (CR) CDanis: [C: +1] dbconfig: use lists of dicts for sectionLoads [software/conftool] - https://gerrit.wikimedia.org/r/514633 (owner: Volans) [23:29:31] (CR) CDanis: [C: +2] dbconfig: pretty-print get actions [software/conftool] - https://gerrit.wikimedia.org/r/514869 (owner: Volans) [23:39:08] (CR) CDanis: [C: +2] dbconfig: add config restore action [software/conftool] - https://gerrit.wikimedia.org/r/514983 (owner: Volans) [23:51:02] (PS2) Jforrester: Remove $wgLexemeDisableCirrus [mediawiki-config] - https://gerrit.wikimedia.org/r/514697 (https://phabricator.wikimedia.org/T225183) (owner: Reedy) [23:52:20] (CR) Jforrester: [C: +2] Remove $wgLexemeDisableCirrus [mediawiki-config] - https://gerrit.wikimedia.org/r/514697 (https://phabricator.wikimedia.org/T225183) (owner: Reedy) [23:52:46] (CR) Jforrester: [C: +2] "…" [mediawiki-config] - https://gerrit.wikimedia.org/r/514960 (https://phabricator.wikimedia.org/T135597) (owner: Brion VIBBER) [23:53:30] (Merged) jenkins-bot: Remove $wgLexemeDisableCirrus [mediawiki-config] - https://gerrit.wikimedia.org/r/514697 (https://phabricator.wikimedia.org/T225183) (owner: Reedy) [23:53:40] (Merged) jenkins-bot: Specify the fluidsynth paths for TMH MIDI conversion [mediawiki-config] - https://gerrit.wikimedia.org/r/514960 (https://phabricator.wikimedia.org/T135597) (owner: Brion VIBBER) [23:53:45] (CR) jenkins-bot: Remove $wgLexemeDisableCirrus [mediawiki-config] - https://gerrit.wikimedia.org/r/514697 (https://phabricator.wikimedia.org/T225183) (owner: Reedy) [23:53:54] (CR) jenkins-bot: Specify the fluidsynth paths for TMH MIDI conversion [mediawiki-config] - 
https://gerrit.wikimedia.org/r/514960 (https://phabricator.wikimedia.org/T135597) (owner: Brion VIBBER) [23:54:06] (PS2) Jforrester: Remove unused preference 'T47877-buster' [mediawiki-config] - https://gerrit.wikimedia.org/r/514193 (owner: Bartosz Dziewoński) [23:54:11] (CR) Jforrester: [C: +2] Remove unused preference 'T47877-buster' [mediawiki-config] - https://gerrit.wikimedia.org/r/514193 (owner: Bartosz Dziewoński) [23:56:01] !log jforrester@deploy1001 Synchronized wmf-config/Wikibase.php: Remove T225183 (duration: 00m 48s) [23:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:07] T225183: $wgLexemeUseCirrus and $wgLexemeDisableCirrus both set to true for production - https://phabricator.wikimedia.org/T225183 [23:56:57] (CR) Bstorm: [C: +2] Revert "wikireplicas: depool labsdb1009 for view updates" [puppet] - https://gerrit.wikimedia.org/r/514982 (owner: Bstorm) [23:57:26] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Specify the fluidsynth paths for TMH MIDI conversion T135597 (duration: 00m 47s) [23:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:31] T135597: Move MIDI to audio conversion from Score into TimedMediaHandler - https://phabricator.wikimedia.org/T135597 [23:58:14] (Abandoned) Jforrester: [DNM] Rename JADE to Jade [mediawiki-config] - https://gerrit.wikimedia.org/r/480284 (https://phabricator.wikimedia.org/T212182) (owner: Awight) [23:59:06] (Merged) jenkins-bot: Remove unused preference 'T47877-buster' [mediawiki-config] - https://gerrit.wikimedia.org/r/514193 (owner: Bartosz Dziewoński) [23:59:20] (CR) jenkins-bot: Remove unused preference 'T47877-buster' [mediawiki-config] - https://gerrit.wikimedia.org/r/514193 (owner: Bartosz Dziewoński) [23:59:52] (PS2) Jforrester: flaggedrevs: Declare cswikinews permissions in the standard way [mediawiki-config] - 
https://gerrit.wikimedia.org/r/511932 (owner: Legoktm)