[01:27:30] Operations, DBA, MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851 (TTO)
[02:25:12] Operations, DBA, MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851 (GeoffreyT2000) >>! In T135851#2312924, @jcrespo wrote: > -1 disagreeing with the solution. > > This can happen also on master failover-which your solution will not protec...
[02:28:51] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.13) (duration: 11m 41s)
[02:28:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:03:13] !log l10nupdate@deploy1001 scap sync-l10n completed (1.32.0-wmf.14) (duration: 15m 33s)
[03:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:13:42] !log l10nupdate@deploy1001 ResourceLoader cache refresh completed at Mon Jul 30 03:13:42 UTC 2018 (duration 10m 29s)
[03:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:18:57] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0
[03:26:48] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 906.74 seconds
[03:28:18] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 22 probes of 309 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[03:33:27] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 13 probes of 309 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[03:43:48] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 273.72 seconds
[04:30:19] need to report an error to site admins
[04:30:30] Error 500 on this link
[04:30:33] https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Jupiter_diagram.svg/5000px-Jupiter_diagram.svg.png
[04:31:07] Request from xxxxxx (removed IP) via cp1064 cp1064, Varnish XID 272915322 Error: 500, Internal Server Error at Mon, 30 Jul 2018 04:27:37 GMT
[04:32:12] Operations, DBA, MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851 (GeoffreyT2000) Open>Resolved a:GeoffreyT2000 This is more of a MySQL bug than a MediaWiki bug. Anyway, this bug was actually originally reported in 2003 as [[h...
[04:46:32] (PS1) Smalyshev: Revert "Revert "Enable kafka poller on test hosts"" [puppet] - https://gerrit.wikimedia.org/r/449109 (https://phabricator.wikimedia.org/T189458)
[05:00:28] !log Deploy schema change on db1062 (s7 primary master) T144010 T51190 T199368
[05:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:38] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190
[05:00:39] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010
[05:00:39] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368
[05:05:45] Operations, Toolforge: Upload python-pykube deb to apt.wikimedia.org - https://phabricator.wikimedia.org/T200660 (Legoktm)
[05:08:02] Operations, Toolforge: Upload python-pykube deb to apt.wikimedia.org - https://phabricator.wikimedia.org/T200660 (Legoktm)
[05:11:32] Operations, Toolforge: Upload python-pykube deb to apt.wikimedia.org - https://phabricator.wikimedia.org/T200660 (Legoktm)
[05:13:36] (CR) Legoktm: [C: -1] [WIP] Add php72 base and web images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/449033 (https://phabricator.wikimedia.org/T188318) (owner: Legoktm)
[05:14:12] (CR) Legoktm: [C: -1] "> E: Unable to locate package toollabs-webservice" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/449033 (https://phabricator.wikimedia.org/T188318) (owner: Legoktm)
[05:21:59] RECOVERY - mysqld processes on pc2006 is OK: PROCS OK: 1 process with command name mysqld
[05:25:37] Operations, ops-codfw, DBA: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 (Marostegui) a:Papaul Looks like it had some memory errors: ``` /admin1/system1/logs1/log1-> show record1 properties CreationTimestamp = 20180729082203.000000-300 ElementName = System Event Log En...
[05:28:25] !log Deploy schema change on db1068 (commonswiki master) - T51190
[05:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:28:29] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190
[05:32:03] a
[05:32:04] a
[05:34:21] Operations, DBA, Epic: DB meta task for next DC failover issues - https://phabricator.wikimedia.org/T189107 (Marostegui)
[05:34:31] Hello paravoid :)
[05:34:34] Operations, DBA, Patch-For-Review: mysql user and group should be a system user/group - https://phabricator.wikimedia.org/T100501 (Marostegui)
[05:39:12] (PS1) Muehlenhoff: Extend access for jdcc [puppet] - https://gerrit.wikimedia.org/r/449111
[06:11:58] (PS2) Muehlenhoff: Extend access for jdcc [puppet] - https://gerrit.wikimedia.org/r/449111
[06:12:08] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 38 probes of 331 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[06:13:31] (CR) Muehlenhoff: [C: 2] Extend access for jdcc [puppet] - https://gerrit.wikimedia.org/r/449111 (owner: Muehlenhoff)
[06:17:17] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 0 probes of 331 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[06:28:39] (PS1) Muehlenhoff: Record extented MOU date for mkroetzsch [puppet] - https://gerrit.wikimedia.org/r/449116
[06:30:05] (CR) Muehlenhoff: [C: 2] Record extented MOU date for mkroetzsch [puppet] - https://gerrit.wikimedia.org/r/449116 (owner: Muehlenhoff)
[06:31:35] (PS1) Marostegui: db1120: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/449118 (https://phabricator.wikimedia.org/T196376)
[06:32:07] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/mysql-ps1.sh]
[06:33:10] (PS2) Marostegui: db1120: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/449118 (https://phabricator.wikimedia.org/T196376)
[06:33:33] (PS1) Jcrespo: Test MariaDB 10.3 on core test hosts [puppet] - https://gerrit.wikimedia.org/r/449120 (https://phabricator.wikimedia.org/T193224)
[06:33:49] (CR) Marostegui: [C: 2] db1120: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/449118 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[06:34:14] (PS2) Jcrespo: Test MariaDB 10.3 on core test hosts [puppet] - https://gerrit.wikimedia.org/r/449120 (https://phabricator.wikimedia.org/T193224)
[06:35:01] (CR) Jcrespo: [C: 2] Test MariaDB 10.3 on core test hosts [puppet] - https://gerrit.wikimedia.org/r/449120 (https://phabricator.wikimedia.org/T193224) (owner: Jcrespo)
[06:39:44] (PS1) Marostegui: db-eqiad,db-codfw.php: Add db1120 to x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/449123 (https://phabricator.wikimedia.org/T196376)
[06:42:49] (PS1) Jcrespo: Revert "Test MariaDB 10.3 on core test hosts" [puppet] - https://gerrit.wikimedia.org/r/449124
[06:43:08] (CR) Jcrespo: "Package not ready yet for wmf-production." [puppet] - https://gerrit.wikimedia.org/r/449124 (owner: Jcrespo)
[06:43:59] (CR) Jcrespo: [C: 2] Revert "Test MariaDB 10.3 on core test hosts" [puppet] - https://gerrit.wikimedia.org/r/449124 (owner: Jcrespo)
[06:47:49] (PS4) Jcrespo: Packages for MySQL 8.0.12 and MariaDB 10.3.8 [software] - https://gerrit.wikimedia.org/r/448854
[06:57:38] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:15:30] (CR) Muehlenhoff: "Looks good, two comments." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/448503 (owner: Jcrespo)
[07:23:12] (CR) Volans: "I agree with the change and thanks for taking care of this. There are only a couple of side-effects to take into account, not a blocker." [puppet] - https://gerrit.wikimedia.org/r/448503 (owner: Jcrespo)
[07:25:32] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Add db1120 to x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/449123 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[07:26:31] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Add db1120 to x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/449123 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[07:28:13] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Add db1120 to x1 T196376 (duration: 00m 59s)
[07:28:15] (CR) jenkins-bot: db-eqiad,db-codfw.php: Add db1120 to x1 [mediawiki-config] - https://gerrit.wikimedia.org/r/449123 (https://phabricator.wikimedia.org/T196376) (owner: Marostegui)
[07:28:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:18] T196376: Productionize old/temporary eqiad sanitariums - https://phabricator.wikimedia.org/T196376
[07:29:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Add db1120 to x1 T196376 (duration: 00m 54s)
[07:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:15] !log Deploy schema change on db2045 (s8 codfw master) this will generate lag on s8 codfw T144010 T51190 T199368
[07:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:21] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190
[07:32:21] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010
[07:32:21] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368
[07:36:34] (PS1) Marostegui: db-eqiad.php: Pool db1120 with some weight [mediawiki-config] - https://gerrit.wikimedia.org/r/449129
[07:36:48] Operations, Toolforge: Upload python-pykube deb to apt.wikimedia.org - https://phabricator.wikimedia.org/T200660 (zhuyifei1999) I wonder if T159892 & T197930 should be done instead
[07:39:03] !log installing mercurial security updates
[07:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:46] Operations, Toolforge: Upload python-pykube deb to apt.wikimedia.org - https://phabricator.wikimedia.org/T200660 (Legoktm) >>! In T200660#4460684, @zhuyifei1999 wrote: > I wonder if T159892 & T197930 should be done instead Oh, I wasn't aware of that. Is there an estimate of when that will be resolved?
[07:43:59] !log installing opencv security updates
[07:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:05] (CR) Marostegui: [C: 2] db-eqiad.php: Pool db1120 with some weight [mediawiki-config] - https://gerrit.wikimedia.org/r/449129 (owner: Marostegui)
[07:51:23] (Merged) jenkins-bot: db-eqiad.php: Pool db1120 with some weight [mediawiki-config] - https://gerrit.wikimedia.org/r/449129 (owner: Marostegui)
[07:51:42] (CR) jenkins-bot: db-eqiad.php: Pool db1120 with some weight [mediawiki-config] - https://gerrit.wikimedia.org/r/449129 (owner: Marostegui)
[07:52:44] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Give some traffic to db1120 to on x1 (duration: 00m 53s)
[07:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:27] Operations, Toolforge: Upload python-pykube deb to apt.wikimedia.org - https://phabricator.wikimedia.org/T200660 (Legoktm) We discussed a bit on IRC. Moving to the official kubernetes python client needs both backporting work as well as development work in the `webservice` command. Since that seems like...
[07:56:42] Operations, Toolforge: Please add php-imagick and php-redis packages to apt.wikimedia.org thirdparty/php72 - https://phabricator.wikimedia.org/T200666 (Legoktm)
[07:58:22] (CR) Legoktm: [C: -1] [WIP] Add php72 base and web images (1 comment) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/449033 (https://phabricator.wikimedia.org/T188318) (owner: Legoktm)
[07:58:22] (PS2) Muehlenhoff: Move declaration of diamond package out of diamond class [puppet] - https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454)
[08:07:08] !log elukey@deploy1001 Started deploy [eventlogging/analytics@54d43e4]: Band aid for T200630
[08:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:12] T200630: Eventlogging's processors stopped working - https://phabricator.wikimedia.org/T200630
[08:07:13] !log elukey@deploy1001 Finished deploy [eventlogging/analytics@54d43e4]: Band aid for T200630 (duration: 00m 05s)
[08:07:15] (PS1) Volans: Fix the -o/--output option (bytes->str conversion) [software/cumin] - https://gerrit.wikimedia.org/r/449130 (https://phabricator.wikimedia.org/T200622)
[08:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:53] (PS1) Volans: Fix debugging log message conversion [software/cumin] - https://gerrit.wikimedia.org/r/449131
[08:08:24] RECOVERY - Check status of defined EventLogging jobs on eventlog1002 is OK: OK: All defined EventLogging jobs are runnning.
[08:08:32] (CR) Volans: "Thanks for reporting this issue and the quick fix, I've sent a separate CR as it required fixing both the json and txt choices of the -o/-" [software/cumin] - https://gerrit.wikimedia.org/r/448985 (https://phabricator.wikimedia.org/T200622) (owner: Alex Monk)
[08:08:47] elukey: \o/
[08:09:21] fingers crossed
[08:09:29] works nicely in labs
[08:09:32] * volans waiting the -1 for the 2 CRs above
[08:09:34] nice!
[08:10:34] (CR) jerkins-bot: [V: -1] Fix the -o/--output option (bytes->str conversion) [software/cumin] - https://gerrit.wikimedia.org/r/449130 (https://phabricator.wikimedia.org/T200622) (owner: Volans)
[08:10:34] (CR) jerkins-bot: [V: -1] Fix debugging log message conversion [software/cumin] - https://gerrit.wikimedia.org/r/449131 (owner: Volans)
[08:12:07] (CR) Volans: "Tests are passing, the failing prospector is due to https://github.com/PyCQA/prospector/issues/263 but I'd prefer to wait few days and see" [software/cumin] - https://gerrit.wikimedia.org/r/449130 (https://phabricator.wikimedia.org/T200622) (owner: Volans)
[08:12:14] (CR) Volans: "Tests are passing, the failing prospector is due to https://github.com/PyCQA/prospector/issues/263 but I'd prefer to wait few days and see" [software/cumin] - https://gerrit.wikimedia.org/r/449131 (owner: Volans)
[08:22:54] (CR) Gehel: [C: 1] "LGTM" [software/cumin] - https://gerrit.wikimedia.org/r/449131 (owner: Volans)
[08:29:31] (CR) Gehel: [C: 1] "LGTM. Minor comment about test inline." (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/449130 (https://phabricator.wikimedia.org/T200622) (owner: Volans)
[08:32:33] Operations, media-storage, User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (fgiunchedi) Thanks @Krinkle! Indeed still another case of the 32nd bit flipping, interestingly on a codfw host where we haven't been seeing this yet: ``` sb_...
[08:33:23] (PS2) ArielGlenn: move hewiki to 'big wikis' list for xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/447220 (https://phabricator.wikimedia.org/T200146)
[08:34:38] (CR) ArielGlenn: [C: 2] move hewiki to 'big wikis' list for xml/sql dumps [puppet] - https://gerrit.wikimedia.org/r/447220 (https://phabricator.wikimedia.org/T200146) (owner: ArielGlenn)
[08:37:41] (CR) Volans: "I've left some comments on the changes, but in general the script needs to be reviewed all and tested with Py3 because 2to3 can automatica" (11 comments) [puppet] - https://gerrit.wikimedia.org/r/441209 (owner: Dzahn)
[08:44:12] Operations, Performance-Team, Traffic, Wikimedia-General-or-Unknown: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (ArielGlenn) Google gets updates from us more than once a day; I don't know how their updat...
[08:49:36] !log Deploy schema change on dbstore1002:s8 T144010 T51190 T199368
[08:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:46] T51190: Truncate SHA-1 indexes - https://phabricator.wikimedia.org/T51190
[08:49:47] T144010: Drop eu_touched in production - https://phabricator.wikimedia.org/T144010
[08:49:47] T199368: Convert UNIQUE INDEX to PK in Production - https://phabricator.wikimedia.org/T199368
[08:59:34] RECOVERY - Filesystem available is greater than filesystem size on ms-be2040 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops
[09:00:24] that's me ^
[09:00:42] and sadly a lie, the filesystem is unmounted now
[09:01:56] :(
[09:05:58] (CR) Volans: "Thanks for the review, see my reply inline" (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/449130 (https://phabricator.wikimedia.org/T200622) (owner: Volans)
[09:07:14] !log akosiaris@deploy1001 scap-helm mathoid upgrade -h [namespace: mathoid, clusters: eqiad,codfw]
[09:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:19] !log akosiaris@deploy1001 scap-helm mathoid cluster eqiad completed
[09:07:20] !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed
[09:07:20] !log akosiaris@deploy1001 scap-helm mathoid finished
[09:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:53] heh
[09:07:59] this shouldn't have logged
[09:08:01] anyway...
[09:08:18] heads up I am gonna do a mathoid chart upgrade from 0.0.5 => 0.0.9. The thing has been tested locally and works fine, but you never know
[09:08:34] !log upgrade mathoid helm chart from 0.0.5 to 0.0.9
[09:08:34] ack
[09:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:12] !log akosiaris@deploy1001 scap-helm mathoid upgrade production stable/mathoid [namespace: mathoid, clusters: eqiad,codfw]
[09:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:09] (CR) Volans: [V: 2 C: 2] Fix the -o/--output option (bytes->str conversion) [software/cumin] - https://gerrit.wikimedia.org/r/449130 (https://phabricator.wikimedia.org/T200622) (owner: Volans)
[09:13:38] !log akosiaris@deploy1001 scap-helm mathoid cluster eqiad completed
[09:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:40] (CR) Volans: [V: 2 C: 2] Fix debugging log message conversion [software/cumin] - https://gerrit.wikimedia.org/r/449131 (owner: Volans)
[09:13:54] (PS2) Volans: Updated PyPI URLs to the new website [software/cumin] - https://gerrit.wikimedia.org/r/447408
[09:14:28] (CR) jenkins-bot: Fix the -o/--output option (bytes->str conversion) [software/cumin] - https://gerrit.wikimedia.org/r/449130 (https://phabricator.wikimedia.org/T200622) (owner: Volans)
[09:14:56] !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed
[09:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:14] (CR) jenkins-bot: Fix debugging log message conversion [software/cumin] - https://gerrit.wikimedia.org/r/449131 (owner: Volans)
[09:16:35] (CR) jerkins-bot: [V: -1] Updated PyPI URLs to the new website [software/cumin] - https://gerrit.wikimedia.org/r/447408 (owner: Volans)
[09:17:08] (PS9) Jcrespo: dbtree: move dbtree outside of mwmaint hosts [puppet] - https://gerrit.wikimedia.org/r/445597 (https://phabricator.wikimedia.org/T192092)
[09:18:16] (CR) Volans: [V: 2 C: 2] "Tests are passing, the failing prospector is due to https://github.com/PyCQA/prospector/issues/263 but I'd prefer to wait few days and see" [software/cumin] - https://gerrit.wikimedia.org/r/447408 (owner: Volans)
[09:18:35] (CR) Vgutierrez: [C: 1] Initial structure [software/spicerack] - https://gerrit.wikimedia.org/r/448046 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[09:19:07] (CR) Jcrespo: [C: 2] dbtree: move dbtree outside of mwmaint hosts [puppet] - https://gerrit.wikimedia.org/r/445597 (https://phabricator.wikimedia.org/T192092) (owner: Jcrespo)
[09:20:45] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet operation_type={create_container,run_podsandbox,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:21:14] (CR) jerkins-bot: [V: -1] Updated PyPI URLs to the new website [software/cumin] - https://gerrit.wikimedia.org/r/447408 (owner: Volans)
[09:21:25] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet operation_type={stop_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:21:35] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet operation_type={create_container,remove_container,start_container,stop_podsandbox} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:21:53] (CR) Jdlrobson: [C: 1] "I plan to deploy this sometime this week (possibly as 2 patches as a safety precaution), given this feature doesn't seem to be applicable " [mediawiki-config] - https://gerrit.wikimedia.org/r/442738 (https://phabricator.wikimedia.org/T173949) (owner: Jdlrobson)
[09:21:54] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:21:55] PROBLEM - Disk space on elastic1020 is CRITICAL: DISK CRITICAL - free space: /srv 50700 MB (10% inode=99%)
[09:22:35] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:22:45] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[09:25:25] RECOVERY - Disk space on elastic1020 is OK: DISK OK
[09:25:30] !log akosiaris@deploy1001 scap-helm mathoid finished
[09:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:26:02] (CR) jenkins-bot: Updated PyPI URLs to the new website [software/cumin] - https://gerrit.wikimedia.org/r/447408 (owner: Volans)
[09:27:48] (PS5) Gehel: Delete unused code in elasticsearch module [puppet] - https://gerrit.wikimedia.org/r/445320 (owner: EBernhardson)
[09:29:31] (CR) Gehel: [C: 2] Delete unused code in elasticsearch module [puppet] - https://gerrit.wikimedia.org/r/445320 (owner: EBernhardson)
[09:30:24] so dbtree is currently not working
[09:30:27] this is expected
[09:30:54] !log migrate dbtree to dbmonitor1001
[09:30:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:13] !log uploaded intel-microcode 20180703 for jessie-wikimedia/stretch-wikimedia to apt.wikimedia.org (tested successfully on a number of canary hosts for approx two weeks)
[09:31:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:41] (PS6) Gehel: Split elasticsearch::log::hot_threads into two pieces [puppet] - https://gerrit.wikimedia.org/r/447565 (https://phabricator.wikimedia.org/T198351) (owner: EBernhardson)
[09:32:41] PROBLEM - puppet last run on dbmonitor1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 38 seconds ago with 2 failures. Failed resources (up to 3 shown): File[/srv/dbtree]
[09:34:19] Operations, ops-codfw, DBA: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 (Peachey88)
[09:35:21] PROBLEM - puppet last run on elastic2011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/diamond/collectors/WMFElastic/WMFElastic.py]
[09:39:01] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 49.18, 33.23, 19.52
[09:42:21] PROBLEM - High CPU load on API appserver on mw1230 is CRITICAL: CRITICAL - load average: 45.31, 36.07, 23.65
[09:43:11] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 38.25, 35.35, 23.62
[09:47:13] (CR) Gehel: [C: 2] "Looks good and puppet compiler agrees: https://puppet-compiler.wmflabs.org/compiler02/11901/" [puppet] - https://gerrit.wikimedia.org/r/447565 (https://phabricator.wikimedia.org/T198351) (owner: EBernhardson)
[09:47:57] elukey: FYI high load on the APIs again, but I see that the dashboard I was looking at on grafana doesn't exist anymore :(
[09:48:02] not sure where has been renamed/merged
[09:48:32] PROBLEM - puppet last run on dbmonitor2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 22 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/dbtree]
[09:48:51] RECOVERY - High CPU load on API appserver on mw1230 is OK: OK - load average: 10.60, 24.39, 23.30
[09:49:04] ^^ jynus related to what you're doing (dbtree)
[09:49:41] volans: no idea either :)
[09:50:27] I am seeing a lot of dashboards with "API" in their names, maybe it was split into multiple ones?
[09:50:32] RECOVERY - puppet last run on elastic2011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[09:50:43] I checked all of them and didn't find the same graph I was looking for
[09:50:51] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 17.95, 25.24, 23.67
[09:51:16] the old dashboard was called api-requests, I just wanted to see if there was the same correlation with that parsoid metric
[09:51:33] ah yeah
[09:52:00] gehel: yes
[09:52:02] working on it
[09:52:06] (PS1) Jcrespo: mariadb: Fix mwdeploy user on tendril/dbtree [puppet] - https://gerrit.wikimedia.org/r/449136 (https://phabricator.wikimedia.org/T192092)
[09:52:09] jynus: ok, thanks!
[09:52:39] (CR) jerkins-bot: [V: -1] mariadb: Fix mwdeploy user on tendril/dbtree [puppet] - https://gerrit.wikimedia.org/r/449136 (https://phabricator.wikimedia.org/T192092) (owner: Jcrespo)
[09:53:17] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/449136 (https://phabricator.wikimedia.org/T192092) (owner: Jcrespo)
[09:54:48] (PS1) Ladsgroup: Enable reading from change_tag_def everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/449137 (https://phabricator.wikimedia.org/T199334)
[09:57:46] (PS2) Jcrespo: mariadb: Fix mwdeploy user on tendril/dbtree [puppet] - https://gerrit.wikimedia.org/r/449136 (https://phabricator.wikimedia.org/T192092)
[09:59:11] PROBLEM - Host kubestage1002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:59:17] !log reboot kubestage1002 for kubernetes upgrade
[09:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:50] RECOVERY - Host kubestage1002 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[10:01:22] !log reboot kubestage1001 for kubernetes upgrade
[10:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:17] (Abandoned) Addshore: Revert "Disable search integration with Article Placeholder temporarily" [mediawiki-config] - https://gerrit.wikimedia.org/r/436247 (https://phabricator.wikimedia.org/T195751) (owner: Addshore)
[10:02:42] (CR) Ema: [C: 2] varnish: drop -Werror from cc_command [puppet] - https://gerrit.wikimedia.org/r/448525 (https://phabricator.wikimedia.org/T200445) (owner: Ema)
[10:02:54] (PS5) Ema: varnish: drop -Werror from cc_command [puppet] - https://gerrit.wikimedia.org/r/448525 (https://phabricator.wikimedia.org/T200445)
[10:03:28] (CR) Jcrespo: [C: 2] mariadb: Fix mwdeploy user on tendril/dbtree [puppet] - https://gerrit.wikimedia.org/r/449136 (https://phabricator.wikimedia.org/T192092) (owner: Jcrespo)
[10:03:36] !log upgrade kubernetes staging cluster to 1.9.9
[10:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:00] (PS6) Ema: varnish: drop -Werror from cc_command [puppet] - https://gerrit.wikimedia.org/r/448525 (https://phabricator.wikimedia.org/T200445)
[10:04:17] (PS1) Volans: CHANGELOG: add changelogs for release v3.0.2 [software/cumin] - https://gerrit.wikimedia.org/r/449140
[10:04:34] !log akosiaris@deploy1001 scap-helm help [namespace: help, clusters: eqiad,codfw]
[10:04:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:40] great
[10:04:49] need to fix this...
[10:04:52] !log akosiaris@deploy1001 scap-helm help cluster eqiad completed
[10:04:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:56] !log akosiaris@deploy1001 scap-helm help cluster codfw completed
[10:04:56] !log akosiaris@deploy1001 scap-helm help finished
[10:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:12] it should not be logging the help command... and it seems to get stuck as well
[10:07:13] (CR) jerkins-bot: [V: -1] CHANGELOG: add changelogs for release v3.0.2 [software/cumin] - https://gerrit.wikimedia.org/r/449140 (owner: Volans)
[10:07:44] (PS1) Marostegui: db-eqiad.php: Depool all the hosts in row B [mediawiki-config] - https://gerrit.wikimedia.org/r/449141 (https://phabricator.wikimedia.org/T183585)
[10:08:01] RECOVERY - puppet last run on dbmonitor1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:08:25] (CR) Volans: [V: 2 C: 2] "Tests are passing, the failing prospector is due to https://github.com/PyCQA/prospector/issues/263 but I'd prefer to wait few days and see" [software/cumin] - https://gerrit.wikimedia.org/r/449140 (owner: Volans)
[10:09:39] (CR) Vgutierrez: "I got nothing but a nitpick, Gehel already did an amazing review :)" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/448047 (https://phabricator.wikimedia.org/T199079) (owner: Volans)
[10:09:43] (CR) jenkins-bot: CHANGELOG: add changelogs for release v3.0.2 [software/cumin] - https://gerrit.wikimedia.org/r/449140 (owner: Volans)
[10:10:30] (PS2) Marostegui: db-eqiad.php: Depool all the hosts in row B [mediawiki-config] - https://gerrit.wikimedia.org/r/449141 (https://phabricator.wikimedia.org/T183585)
[10:10:47] (CR) Marostegui: [C: -2] "To be deployed on Tuesday" [mediawiki-config] - https://gerrit.wikimedia.org/r/449141 (https://phabricator.wikimedia.org/T183585) (owner: Marostegui)
[10:12:21] Operations, media-storage, User-fgiunchedi: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 (fgiunchedi) I've inquired upstream, one of the suggested approaches is to run with page poisoning. I'll do that on one host in codfw, also this issue will be l...
[10:14:41] (PS1) Jcrespo: dbtree: Make dbtree work again on debmonitor active host [puppet] - https://gerrit.wikimedia.org/r/449142 (https://phabricator.wikimedia.org/T192092)
[10:15:59] (CR) Volans: [C: -1] "typo inline?" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/449142 (https://phabricator.wikimedia.org/T192092) (owner: Jcrespo)
[10:17:31] (PS3) Prtksxna: Remove obsolete $wgPopupsBetaFeature [mediawiki-config] - https://gerrit.wikimedia.org/r/444574
[10:22:20] PROBLEM - swift-account-server on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server
[10:22:20] PROBLEM - swift-object-auditor on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor
[10:22:21] PROBLEM - swift-object-server on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
[10:22:21] PROBLEM - swift-container-auditor on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[10:22:31] PROBLEM - swift-container-updater on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater
[10:22:31] PROBLEM - swift-object-updater on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater
[10:22:31] PROBLEM - swift-account-replicator on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator
[10:22:40] PROBLEM - swift-account-auditor on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor
[10:22:40] PROBLEM - swift-container-replicator on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator
[10:22:46] sorry that's me, expired downtime
[10:22:50] PROBLEM - swift-container-server on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server
[10:25:23] !log run xfs_repair on sdc1 on ms-be2040 - T199198
[10:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:27] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198
[10:27:10] (PS1) Volans: Upstream release v3.0.2 [software/cumin] (debian) - https://gerrit.wikimedia.org/r/449146
[10:29:10] RECOVERY - puppet last run on dbmonitor2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:29:13] (CR) Volans: [C: -1] "sorry, missed one typo in the commit meesage" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/449142 (https://phabricator.wikimedia.org/T192092) (owner: Jcrespo)
[10:30:01] (CR) jerkins-bot: [V: -1] Upstream release v3.0.2 [software/cumin] (debian) - https://gerrit.wikimedia.org/r/449146 (owner: Volans)
[10:32:32] (CR) Volans: [V: 2 C: 2] "Debian package build correctly and tests during the package build pass. The tox failure is known and due to a bug upstream" [software/cumin] (debian) - https://gerrit.wikimedia.org/r/449146 (owner: Volans)
[10:35:31] (CR) jerkins-bot: [V: -1] Upstream release v3.0.2 [software/cumin] (debian) - https://gerrit.wikimedia.org/r/449146 (owner: Volans)
[10:36:18] (PS1) Ema: cp2006: reimage as stretch [puppet] - https://gerrit.wikimedia.org/r/449148 (https://phabricator.wikimedia.org/T200445)
[10:38:10] (CR) Ema: [C: 2] cp2006: reimage as stretch [puppet] - https://gerrit.wikimedia.org/r/449148 (https://phabricator.wikimedia.org/T200445) (owner: Ema)
[10:39:36] (CR) Lucas Werkmeister (WMDE): [C: 1] "WikibaseQualityConstraints part looks good to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/449017 (owner: Matěj Suchánek)
[10:40:28] (Abandoned) Ladsgroup: Add federation-related configs for clients [mediawiki-config] - https://gerrit.wikimedia.org/r/409622 (https://phabricator.wikimedia.org/T186955) (owner: Ladsgroup)
[10:40:49] Operations, ops-codfw: wtp2011 memory correctable errors - https://phabricator.wikimedia.org/T200678 (fgiunchedi)
[10:41:41] PROBLEM - puppet last run on puppetmaster1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:41:44] Operations, Traffic, Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp2006.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201807...
[10:50:07] !log uploaded cumin_3.0.2-1_amd64.deb to apt.wikimedia.org jessie-wikimedia
[10:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:10] RECOVERY - puppet last run on puppetmaster1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[11:00:04] jan_drewniak: It is that lovely time of the day again! You are hereby commanded to deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180730T1100).
[11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for European Mid-day SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180730T1100).
[11:00:05] tgr, matej_suchanek, CFisch_WMDE, and Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:12] o/ [11:00:27] o/ [11:00:50] I can SWAT today [11:00:55] \o/ [11:01:01] tgr, Amir1: the two of you are deployers, right? want to deploy your commits? [11:01:07] I can [11:01:27] Amir1: go ahead then, let us know when you are done [11:01:31] !log upgrading cumin to 3.0.2 on sarin [11:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:37] cool [11:01:38] can do it as well [11:02:00] tgr: ok, you are after Amir1 then, please stand by [11:02:36] I'll start merging the core backports in the meantime, that tends to take forever [11:02:52] tgr: yes, good idea [11:02:53] (03CR) 10Ladsgroup: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449137 (https://phabricator.wikimedia.org/T199334) (owner: 10Ladsgroup) [11:04:12] (03Merged) 10jenkins-bot: Enable reading from change_tag_def everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449137 (https://phabricator.wikimedia.org/T199334) (owner: 10Ladsgroup) [11:04:27] (03CR) 10Jcrespo: "LOL" [puppet] - 10https://gerrit.wikimedia.org/r/449142 (https://phabricator.wikimedia.org/T192092) (owner: 10Jcrespo) [11:07:36] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2006.codfw.wmnet'] ``` and were **ALL** successful. [11:08:27] argh, `scap pull` [11:08:38] argh, `scap pull` failed at mwdebug1002 :/ [11:08:41] (03CR) 10jenkins-bot: Enable reading from change_tag_def everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449137 (https://phabricator.wikimedia.org/T199334) (owner: 10Ladsgroup) [11:09:22] Amir1: does `scap pull` work for you at mwdebug1002? 
[11:09:33] yup, I just did [11:09:50] Amir1: hm, maybe it's just me then :/ [11:10:18] maybe we hit race condition [11:10:25] could be [11:10:30] anyway, mine is all goat, moving forward [11:10:49] !log upgrading cumin to 3.0.2 on the remaining cumin masters (neodymium, labpuppetmaster*) [11:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:54] (03PS5) 10Gergő Tisza: Configure group management for interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440676 [11:11:41] (03PS2) 10Jcrespo: dbtree: Make dbtree work again on debmonitor active host [puppet] - 10https://gerrit.wikimedia.org/r/449142 (https://phabricator.wikimedia.org/T192092) [11:12:05] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:449137|Enable reading from change_tag_def everywhere (T199334)]] (duration: 00m 55s) [11:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:09] T199334: Temporarily add config and use it to use change_tag_def table instead of change_tag table for Special:Tags - https://phabricator.wikimedia.org/T199334 [11:12:43] (03CR) 10Gergő Tisza: [C: 032] Configure group management for interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440676 (owner: 10Gergő Tisza) [11:12:47] I'm done [11:13:06] Amir1: great [11:13:16] tgr: ready to deploy your commits? 
[11:13:37] I'm ready, jenkins is not [11:13:42] :D [11:13:46] (03PS3) 10Jcrespo: dbtree: Make dbtree work again on debmonitor active host [puppet] - 10https://gerrit.wikimedia.org/r/449142 (https://phabricator.wikimedia.org/T192092) [11:13:58] (03Merged) 10jenkins-bot: Configure group management for interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440676 (owner: 10Gergő Tisza) [11:14:03] all the other SWAT tasks are core patches though so I don't think they can be parallelized [11:14:17] (03CR) 10jenkins-bot: Configure group management for interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/440676 (owner: 10Gergő Tisza) [11:15:01] tgr: yes, looks like this swat is just two config changes, four are backports [11:15:10] tgr mine is on an extension if that makes a difference [11:15:28] I think some of them are in extensions, so it would be safe to merge them in parallel and deploy one by one [11:16:35] extension patches also generate a core patch so not sure how that would work out [11:16:35] (03PS4) 10Jcrespo: dbtree: Make dbtree work again on debmonitor active host [puppet] - 10https://gerrit.wikimedia.org/r/449142 (https://phabricator.wikimedia.org/T192092) [11:17:01] CFisch_WMDE: looks like your patch and matej_suchanek's are in different extensions, I'll review them and merge, it will take some time for CI [11:17:25] zeljkof: great, thanks [11:17:31] fine [11:17:51] tgr: a commit in an extension creates a commit in core? 
I was not aware of that [11:18:31] yeah, extensions are submodules of core, you need a core commit to update the commit id of the submodule [11:18:42] jenkins does that automatically [11:18:53] this is only true for wmf branches of course [11:19:11] (03PS5) 10Jcrespo: dbtree: Make dbtree work again on dbmonitor1001 [puppet] - 10https://gerrit.wikimedia.org/r/449142 (https://phabricator.wikimedia.org/T192092) [11:19:26] tgr: in that case I'll wait for you then [11:19:44] let me know when you are done and I'll continue [11:22:31] (03CR) 10Volans: [C: 031] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/449142 (https://phabricator.wikimedia.org/T192092) (owner: 10Jcrespo) [11:24:27] (03PS6) 10Jcrespo: dbtree: Make dbtree work again on dbmonitor1001 [puppet] - 10https://gerrit.wikimedia.org/r/449142 (https://phabricator.wikimedia.org/T192092) [11:25:57] (03PS10) 10Gehel: Make cirrus specific elasticsearch profile [puppet] - 10https://gerrit.wikimedia.org/r/447566 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [11:26:06] (03CR) 10Jcrespo: [C: 032] dbtree: Make dbtree work again on dbmonitor1001 [puppet] - 10https://gerrit.wikimedia.org/r/449142 (https://phabricator.wikimedia.org/T192092) (owner: 10Jcrespo) [11:27:38] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Deskana) There's conflicting information about how Google updates their index. On the one... [11:28:53] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Deskana) This isn't about search engine optimisation in the strictest sense, but... 
[11:30:47] (03CR) 10Gehel: [C: 032] "This is a noop, and ppc agrees: https://puppet-compiler.wmflabs.org/compiler02/11904/" [puppet] - 10https://gerrit.wikimedia.org/r/447566 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [11:30:59] (03PS11) 10Gehel: Make cirrus specific elasticsearch profile [puppet] - 10https://gerrit.wikimedia.org/r/447566 (https://phabricator.wikimedia.org/T198351) (owner: 10EBernhardson) [11:33:11] !log tgr@deploy1001 Started scap: T190015 Create separate user group for editing sitewide CSS/JavaScript that does not include administrators by default [11:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:01] (03PS8) 10Gehel: Split per-cluster config out of elasticsearch::curator [puppet] - 10https://gerrit.wikimedia.org/r/447567 (https://phabricator.wikimedia.org/T180807) (owner: 10EBernhardson) [11:39:15] !log upgrading intel-microcode to 20180703 on the servers which have microcode updates enabled (reboots to pick up the new version will be coordinated separately) [11:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:54] !log pool cp2006, upgraded to stretch T200445 [11:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:58] T200445: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 [11:41:04] (03CR) 10Volans: "I agree with the idea of getting the primary from conftool and that is not worth to integrate a confctl client here as this is temporary a" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924) (owner: 10Jcrespo) [11:42:19] tgr: just checking, what's your status? any ETA on when you will be done? 
[11:42:44] (03CR) 10Gehel: [C: 032] "This is a noop and puppet compiler agrees: https://puppet-compiler.wmflabs.org/compiler02/11905/" [puppet] - 10https://gerrit.wikimedia.org/r/447567 (https://phabricator.wikimedia.org/T180807) (owner: 10EBernhardson) [11:46:51] zeljkof: scap is running [11:46:53] (03CR) 10Jcrespo: "Will do some of the changes proposed and get a compiler output, although most of puppet refactoring is trivial, and not related to the mai" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924) (owner: 10Jcrespo) [11:47:07] tgr: cool, thanks [11:47:19] I've heard it takes about 15 minutes these days so should be done any time soon [11:47:50] 15 minutes for swat to deploy a patch? it's usually a minute or so [11:47:54] zeljkof: you can merge the other patches (sorry, should have said so when I started scap) [11:48:13] I needed a full scap, since it includes new interface messages [11:48:25] scap sync, I mean [11:48:34] tgr: ok, can I merge two patches from two extensions at the same time, or should I merge them one by one? [11:49:46] you can merge them at the same time, that means they will be both committed in core, which makes a revert more difficult [11:50:05] but a revert is pretty much never needed so probably worth the risk [11:52:05] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Imarlier) @ArielGlenn @Deskana AFAICT, they're using ChangeProp to update the inf... 
[11:52:23] thanks, we are short on time, so I would like to get them both, I'll risk the revert and merge them both [11:58:42] zeljkof: I can finish those patches if you are in a hurry [11:59:23] (03PS1) 10Vgutierrez: varnishkafka: pin package to stretch-backports on stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/449163 (https://phabricator.wikimedia.org/T200445) [12:00:10] tgr: I mean, swat window is about to end, I'm not in a hurry, but thanks, I'll finish SWAT, as soon as CI is done [12:00:25] (03PS2) 10Vgutierrez: varnishkafka: pin librdkafka1 package to stretch-backports on stretch [puppet] - 10https://gerrit.wikimedia.org/r/449163 (https://phabricator.wikimedia.org/T200445) [12:01:18] zeljkof: there's no other window for the next 5 hours though [12:01:48] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/449163 (https://phabricator.wikimedia.org/T200445) (owner: 10Vgutierrez) [12:01:52] tgr: I need to catch up on wmf.14 :/ [12:02:09] oh, right, forgot about that [12:02:20] super fast reviewer badge for moritzm <3 [12:02:39] should be a minute or two, but with all the recent problems I'm not holding my breath [12:03:43] hehe :-) [12:06:00] *meow* [12:06:41] (03CR) 10Ema: [C: 031] varnishkafka: pin librdkafka1 package to stretch-backports on stretch [puppet] - 10https://gerrit.wikimedia.org/r/449163 (https://phabricator.wikimedia.org/T200445) (owner: 10Vgutierrez) [12:07:35] matej_suchanek, CFisch_WMDE: please stand by, your commits should be merged soon(tm) [12:07:49] Wohooo :-) [12:07:56] PROBLEM - puppet last run on db2052 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[intel-microcode] [12:08:14] thanks for your hardwork! 
[12:10:24] matej_suchanek, CFisch_WMDE: both commits are merged, I'll ping you when they are at mwdebug1002 for testing, let me know if you need help testing there [12:11:17] PROBLEM - puppet last run on es2016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[intel-microcode] [12:11:51] I'm asking for help [12:13:06] RECOVERY - puppet last run on db2052 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:16:22] 10Operations, 10Wikimedia-Mailing-lists: Change digest function of wikimedia-l@ so it send emails only once a day - https://phabricator.wikimedia.org/T141566 (10Aklapper) I emailed the three list admins on 2018-06-21 and have not seen any reply yet. [12:16:27] RECOVERY - puppet last run on es2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:16:37] tgr: are you still around? `git rebase` failed, security patch could not be rebased :/ [12:16:49] matej_suchanek: I'll send you the docs [12:17:06] zeljkof: here [12:17:17] is that something I did? [12:17:17] Reedy: are you around? 
security patch failed to rebase :( [12:18:03] Can't be that bad [12:18:03] tgr: I don't know, the error message is `error: unable to create backing store for newly created file languages/i18n/qqq.json` [12:18:29] That doesn't sound like a git error [12:18:42] tgr, Reedy: I've merged these two https://gerrit.wikimedia.org/r/c/mediawiki/extensions/AbuseFilter/+/448856 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FileImporter/+/448801 [12:18:57] scap passed the canaries so I'm probably not syncing anything obviously wrong like a merge conflict [12:19:24] I'm at this step https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#mediawiki/extensions_and_mediawiki/skins [12:19:31] doing `git rebase` [12:19:45] `you@deploy1001:/srv/mediawiki-staging/php-[VERSION]$ git rebase` [12:20:06] (03PS11) 10Filippo Giunchedi: Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [12:20:35] Reedy: what's the best way to share my terminal output, private phab paste? [12:20:44] do you need it at all? [12:20:47] (03CR) 10Vgutierrez: [C: 032] varnishkafka: pin librdkafka1 package to stretch-backports on stretch [puppet] - 10https://gerrit.wikimedia.org/r/449163 (https://phabricator.wikimedia.org/T200445) (owner: 10Vgutierrez) [12:21:07] you'll probably get the same error message as I did [12:21:20] ooh, didn't know about that page, it looks handy [12:21:39] not sure how safe it is to rebase while scap runs, though [12:21:45] yeah... [12:21:56] (03CR) 10Filippo Giunchedi: [C: 032] Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [12:22:56] tgr: is scap running? while I was rebasing? 
[12:23:10] (03PS12) 10Filippo Giunchedi: Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [12:23:14] should I just abort the rebase and try again? [12:23:18] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] Front Thumbor instances with Haproxy [puppet] - 10https://gerrit.wikimedia.org/r/417233 (https://phabricator.wikimedia.org/T187765) (owner: 10Gilles) [12:23:23] yeah, that would be best [12:23:34] (03PS3) 10Alexandros Kosiaris: grafana: Allow skipping instantiation of grafana-admin [puppet] - 10https://gerrit.wikimedia.org/r/442313 (https://phabricator.wikimedia.org/T170150) [12:23:39] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] grafana: Allow skipping instantiation of grafana-admin [puppet] - 10https://gerrit.wikimedia.org/r/442313 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [12:23:50] ok, aborted, trying again [12:24:11] zeljkof: I mean, try again when scap is done [12:24:52] it's taking super long, not sure what's up with that [12:24:53] tgr: I'm not running scap, if that is what you mean. how do I know if it's running? [12:25:04] tgr: you are running scap? [12:25:20] yeah, scap sends an IRC message when it is done [12:25:38] tgr: ah, I have missed that :/ ok, waiting then [12:26:13] normally it is 15-20 mins (or at least was when I last used it), has been running for almost an hour now :( [12:26:16] matej_suchanek, CFisch_WMDE: sorry, more delay, I have to wait for scap run to finish before I can continue [12:26:35] zeljkof: ok no worries [12:27:09] same here, take what you need [12:27:19] matej_suchanek: the docs on how to test at mwdebug1002 https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug#Staging_changes [12:28:14] Reedy: are you familiar with scap internals? is rsync the only stage where it reads from /srv/mediawiki-staging? 
[12:28:28] I'm honestly not familiar [12:28:41] the docs make it look more complicated than it is, you just need to install a browser extension, enable it and pick mwdebug1002 in the list of servers, then you just go to a wmf wiki and you will be redirected (by the extension) to mwdebug1002, you can test the fix, then disable the extension [12:28:48] matej_suchanek: ^ [12:29:41] ok, I've already installed it [12:32:39] matej_suchanek: let me know if you have questions, as soon as scap is done I'll continue [12:33:16] I have prepared a test case [12:33:22] !log tgr@deploy1001 Finished scap: T190015 Create separate user group for editing sitewide CSS/JavaScript that does not include administrators by default (duration: 60m 10s) [12:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:05] zeljkof: ^ done, sorry for taking forever [12:34:13] (03PS1) 10BBlack: wikimediafounation.org A TTLs: 60s for move today [dns] - 10https://gerrit.wikimedia.org/r/449171 (https://phabricator.wikimedia.org/T198922) [12:34:21] (03PS3) 10Alexandros Kosiaris: grafana-admin: Remove from production [puppet] - 10https://gerrit.wikimedia.org/r/442312 (https://phabricator.wikimedia.org/T170150) [12:34:22] in hindsight I should probably have asked for a separate deploy window [12:34:25] tgr: no problem, thanks for letting me know! continuing with swat [12:34:27] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] grafana-admin: Remove from production [puppet] - 10https://gerrit.wikimedia.org/r/442312 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [12:34:37] although it's still abnormal that scap takes so long [12:35:39] tgr, Reedy: I still get rebase error :/ [12:35:55] It sounds like a file system error [12:36:24] can you paste it somewhere? 
[12:36:34] there is no rebase conflict as far as I can see [12:36:48] (03PS1) 10Filippo Giunchedi: haproxy: drop default puppet server [puppet] - 10https://gerrit.wikimedia.org/r/449175 (https://phabricator.wikimedia.org/T187765) [12:36:49] tgr: what's the best way to share the output? private phab paste? [12:36:59] git status says "all conflicts fixed" [12:37:14] public paste works fine too [12:37:17] (03PS1) 10Alexandros Kosiaris: Remove grafana-admin.wikimedia.org virtualhost [puppet] - 10https://gerrit.wikimedia.org/r/449176 (https://phabricator.wikimedia.org/T170150) [12:37:22] (03CR) 10Jcrespo: "I have some initial question, but I really would need an example config to evaluate how good it fits the proposed needs (or a trade off be" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422373 (https://phabricator.wikimedia.org/T197531) (owner: 10Giuseppe Lavagetto) [12:37:30] tgr: it mentions security patch [12:37:42] oh, yeah, you are right [12:37:46] that's why I'm reluctant to make it public [12:37:46] private paste then [12:38:00] ok, pasting [12:38:58] RECOVERY - swift-object-auditor on ms-be2040 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [12:39:08] RECOVERY - swift-container-updater on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [12:39:08] RECOVERY - swift-account-replicator on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [12:39:08] RECOVERY - swift-object-updater on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [12:39:08] RECOVERY - swift-account-auditor on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [12:39:18] RECOVERY - swift-container-replicator on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [12:39:37] RECOVERY 
- swift-object-server on ms-be2040 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [12:39:38] RECOVERY - swift-account-server on ms-be2040 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [12:39:44] (03CR) 10Filippo Giunchedi: [C: 031] "NOOP on dbproxy: https://puppet-compiler.wmflabs.org/compiler02/11910/" [puppet] - 10https://gerrit.wikimedia.org/r/449175 (https://phabricator.wikimedia.org/T187765) (owner: 10Filippo Giunchedi) [12:39:47] RECOVERY - swift-container-auditor on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:40:31] (03PS2) 10Alexandros Kosiaris: icinga: Bump max_concurrent_checks to 10k [puppet] - 10https://gerrit.wikimedia.org/r/445375 (https://phabricator.wikimedia.org/T199413) [12:40:41] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "Merging anyway, shouldn't not hurt." [puppet] - 10https://gerrit.wikimedia.org/r/445375 (https://phabricator.wikimedia.org/T199413) (owner: 10Alexandros Kosiaris) [12:40:43] !log reboot ms-be2040 with page_poisoning=1 - T199198 [12:40:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:47] T199198: Some swift filesystems reporting negative disk usage - https://phabricator.wikimedia.org/T199198 [12:40:58] (03PS2) 10Alexandros Kosiaris: Remove grafana-admin.wikimedia.org virtualhost [puppet] - 10https://gerrit.wikimedia.org/r/449176 (https://phabricator.wikimedia.org/T170150) [12:41:03] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove grafana-admin.wikimedia.org virtualhost [puppet] - 10https://gerrit.wikimedia.org/r/449176 (https://phabricator.wikimedia.org/T170150) (owner: 10Alexandros Kosiaris) [12:42:00] tgr, Reedy: https://phabricator.wikimedia.org/P7400 [12:42:19] error: insufficient permission for adding an object to repository database .git/objects [12:42:20] error: unable to create backing store for newly created file 
languages/i18n/qqq.json [12:42:25] File permissions are wrong in .git/objects [12:42:31] Someone has a bad umask, presumably [12:42:58] Reedy: what do I do now? [12:43:18] RECOVERY - swift-container-server on ms-be2040 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [12:43:27] tgr: Looks like it might be (partially) your fault [12:43:27] drwxr-sr-x 2 tgr wikidev 4096 Jul 30 11:25 84 [12:43:46] There's a handful with no group write [12:43:48] PROBLEM - haproxy alive on thumbor2001 is CRITICAL: CRITICAL check_alive invalid response [12:43:49] can somebody from ops fix it? cc robh [12:44:09] tgr should be able to fix them [12:44:26] drwxr-sr-x 2 tgr wikidev 4096 Jul 30 11:25 07 [12:44:26] drwxr-sr-x 2 tgr wikidev 4096 Jul 30 11:25 3e [12:44:27] drwxr-sr-x 2 tgr wikidev 4096 Jul 30 11:25 4c [12:44:27] drwxr-sr-x 2 tgr wikidev 4096 Jul 30 11:25 4f [12:44:27] drwxr-sr-x 2 tgr wikidev 4096 Jul 30 11:25 6a [12:44:27] I can take a look too [12:44:29] drwxr-sr-x 2 tgr wikidev 4096 Jul 30 11:25 84 [12:44:31] ok, great; tgr do you know how to fix the problem? [12:44:41] godog: thanks! [12:44:46] godog: chmod -R g+w /srv/mediawiki-staging/php-1.32.0-wmf.14/.git/objects [12:44:47] if someone with root can do it, that's easier [12:44:53] which host is this? [12:44:55] deploy1001 [12:45:18] {{done}} [12:45:31] !log fix group write on deploy1001 /srv/mediawiki-staging/php-1.32.0-wmf.14/.git/objects [12:45:31] thx [12:45:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:41] np [12:45:47] (03PS1) 10Andrew Bogott: Local.php: Add $wgSitenotice for when no notice is specified on Wikitech [wikitech-static] - 10https://gerrit.wikimedia.org/r/449178 (https://phabricator.wikimedia.org/T200479) [12:45:59] godog: thanks! [12:46:10] ok, so I just abort the rebase and try again? 
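The permission problem diagnosed above (object directories under `.git/objects` created without group write, which then blocks other deployers' `git rebase`) can be checked for with a short script. This is only a minimal sketch and not a tool used in the log: the function name `dirs_missing_group_write` is hypothetical, and it only reports offending directories, leaving the actual fix to something like the `chmod -R g+w` that was run on deploy1001.

```python
import os
import stat
import sys


def dirs_missing_group_write(root):
    """Return a sorted list of directories under root (root included)
    whose mode lacks the group-write bit."""
    missing = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        mode = os.stat(dirpath).st_mode
        if not mode & stat.S_IWGRP:
            missing.append(dirpath)
    return sorted(missing)


if __name__ == "__main__":
    # Pass the directory to inspect on the command line, for example the
    # .git/objects path mentioned in the log.
    for path in sys.argv[1:]:
        for d in dirs_missing_group_write(path):
            print(d)
```

An empty result for the shared checkout would suggest that group members can add objects again; it says nothing about why the directories were created with the wrong mode in the first place (here, apparently a umask that differed inside a screen session).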
[12:46:13] my umask is 0002, FWIW [12:46:29] (03CR) 10Andrew Bogott: [V: 032 C: 032] Local.php: Add $wgSitenotice for when no notice is specified on Wikitech [wikitech-static] - 10https://gerrit.wikimedia.org/r/449178 (https://phabricator.wikimedia.org/T200479) (owner: 10Andrew Bogott) [12:46:52] except in screen it's apparently not :/ [12:46:55] ok, rebase worked fine this time! thanks godog cc tgr Reedy [12:47:08] yw zeljkof [12:47:32] matej_suchanek, CFisch_WMDE: problems solved, the commits will be ready for testing in a minute or two [12:47:46] yipee [12:48:14] tgr: Old screen session? [12:48:22] so apparently umask is set in profile.d instead of bashrc; I'll file a bug [12:48:36] thanks :) [12:48:49] not that I am aware [12:48:49] (03CR) 10BBlack: [C: 032] wikimediafounation.org A TTLs: 60s for move today [dns] - 10https://gerrit.wikimedia.org/r/449171 (https://phabricator.wikimedia.org/T198922) (owner: 10BBlack) [12:48:57] is there a way to check the age? [12:49:21] screen -ls [12:49:54] it says it was started today [12:50:05] OTOH if I manually start a screen, the umask is correct [12:50:09] matej_suchanek, CFisch_WMDE: your commits are at mwdebug1002, please test and let me know if I can deploy them [12:50:32] well, a mystery for another time [12:50:44] (03PS1) 10Jcrespo: haproxy: Fix bug on /run directory creation [puppet] - 10https://gerrit.wikimedia.org/r/449179 [12:50:55] PROBLEM - haproxy process on thumbor2001 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy [12:51:22] zeljkof: Thanks, checked, works, can be deployed! [12:53:01] CFisch_WMDE: ok, deploying [12:53:08] matej_suchanek: still around? [12:53:34] yes... 
but unfortunately I can't get my test case pass [12:54:26] ah, finally [12:54:59] !log zfilipin@deploy1001 Synchronized php-1.32.0-wmf.14/extensions/FileImporter/: SWAT: [[gerrit:448801|Fix flipped array indexes in template removal code (T200406)]] (duration: 00m 57s) [12:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:03] T200406: Template removal removes too much - https://phabricator.wikimedia.org/T200406 [12:55:15] (03PS1) 10Jcrespo: dbproxy-master: Fix hieradata reference typo [puppet] - 10https://gerrit.wikimedia.org/r/449180 [12:55:20] CFisch_WMDE: deployed! please test and thanks for deploying with #releng :) [12:55:36] matej_suchanek: the test passed? can I deploy? [12:55:58] yes, I had to bypass my browser cache since a resource file was updated [12:56:07] (03CR) 10Jcrespo: [C: 031] "Checked unused roles, too." [puppet] - 10https://gerrit.wikimedia.org/r/449175 (https://phabricator.wikimedia.org/T187765) (owner: 10Filippo Giunchedi) [12:56:42] matej_suchanek: just to make it clear, ok to deploy? [12:56:58] zeljkof: Works like a charm, thanks staying with us and finishing the patches :-)! 
[12:57:09] zeljkof: yes, it is [12:57:15] CFisch_WMDE: no problemo :) [12:57:19] matej_suchanek: ok, deploying [12:57:24] (03CR) 10Filippo Giunchedi: [C: 032] haproxy: drop default puppet server [puppet] - 10https://gerrit.wikimedia.org/r/449175 (https://phabricator.wikimedia.org/T187765) (owner: 10Filippo Giunchedi) [12:57:31] (03PS2) 10Filippo Giunchedi: haproxy: drop default puppet server [puppet] - 10https://gerrit.wikimedia.org/r/449175 (https://phabricator.wikimedia.org/T187765) [12:58:17] !log zfilipin@deploy1001 Synchronized php-1.32.0-wmf.14/extensions/AbuseFilter/: SWAT: [[gerrit:448856|Fix jQuery selector when editing filters (T200604)]] (duration: 00m 55s) [12:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:21] T200604: Cannot add a change tag to AbuseFilter - https://phabricator.wikimedia.org/T200604 [12:58:49] matej_suchanek: it's deployed! please test it again on production (disable the extension) and thanks for deploying with #releng ;) [12:59:13] (03PS2) 10Jcrespo: haproxy: Fix bug on /run directory creation [puppet] - 10https://gerrit.wikimedia.org/r/449179 [12:59:31] !log EU SWAT finished [12:59:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:49] zeljkof: it's working, thank you very much [13:01:34] RECOVERY - haproxy process on thumbor2001 is OK: PROCS OK: 2 processes with command name haproxy [13:06:21] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Begin execution of non-forward-secret ciphers deprecation - https://phabricator.wikimedia.org/T192555 (10Vgutierrez) [13:14:05] PROBLEM - haproxy alive on thumbor2001 is CRITICAL: CRITICAL check_alive invalid response [13:14:45] that's me ^ known [13:14:54] (03PS1) 10Filippo Giunchedi: haproxy: add stats socket to default config [puppet] - 10https://gerrit.wikimedia.org/r/449183 (https://phabricator.wikimedia.org/T187765) [13:17:09] (03CR) 10Filippo Giunchedi: [C: 032] haproxy: add stats socket to default config 
[puppet] - 10https://gerrit.wikimedia.org/r/449183 (https://phabricator.wikimedia.org/T187765) (owner: 10Filippo Giunchedi) [13:17:18] (03PS2) 10Filippo Giunchedi: haproxy: add stats socket to default config [puppet] - 10https://gerrit.wikimedia.org/r/449183 (https://phabricator.wikimedia.org/T187765) [13:18:36] (03PS1) 10BBlack: cacheproxy: enable scsi_mod.use_blk_mq [puppet] - 10https://gerrit.wikimedia.org/r/449184 (https://phabricator.wikimedia.org/T195923) [13:20:15] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:22:34] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:25:48] (03PS2) 10BBlack: cacheproxy: enable scsi_mod.use_blk_mq [puppet] - 10https://gerrit.wikimedia.org/r/449184 (https://phabricator.wikimedia.org/T195923) [13:25:54] (03CR) 10BBlack: [C: 032] cacheproxy: enable scsi_mod.use_blk_mq [puppet] - 10https://gerrit.wikimedia.org/r/449184 (https://phabricator.wikimedia.org/T195923) (owner: 10BBlack) [13:26:44] RECOVERY - haproxy alive on thumbor2001 is OK: OK check_alive uptime 314s [13:27:06] yay! [13:27:32] \o/ [13:27:38] rolling out to the rest [13:36:09] 10Operations, 10LDAP-Access-Requests: Add Lea Voget (WMDE) & Bmueller to the WMDE LDAP group - https://phabricator.wikimedia.org/T199967 (10herron) a:03RStallman-legalteam [13:40:40] 10Operations, 10LDAP-Access-Requests, 10User-Addshore: Give access to graphite and grafana-admin to Aleksey Bekh-Ivanov (WMDE) - https://phabricator.wikimedia.org/T199233 (10herron) 05Open>03declined Since there has been no activity here for several weeks I'll transition this to declined for now. If/whe... 
[13:42:57] jouncebot: next [13:42:57] In 3 hour(s) and 17 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180730T1700) [13:43:25] looks like there are no deployments in hours, I would like to catch up on .14 deployment right now if nobody has objections [13:44:54] (03PS1) 10Jcrespo: WMFMariaDB refactoring and adding tests [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/449185 [13:45:30] (03CR) 10jerkins-bot: [V: 04-1] WMFMariaDB refactoring and adding tests [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/449185 (owner: 10Jcrespo) [13:47:36] (03PS1) 10BBlack: cp1075: add to conftool/hieradata node lists [puppet] - 10https://gerrit.wikimedia.org/r/449187 (https://phabricator.wikimedia.org/T195923) [13:47:59] (03CR) 10BBlack: [C: 032] cp1075: add to conftool/hieradata node lists [puppet] - 10https://gerrit.wikimedia.org/r/449187 (https://phabricator.wikimedia.org/T195923) (owner: 10BBlack) [13:50:46] (03CR) 10Filippo Giunchedi: Create prometheus::resource_config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/446687 (owner: 10EBernhardson) [13:53:48] 10Operations, 10LDAP-Access-Requests, 10User-Addshore: Give access to graphite and grafana-admin to Aleksey Bekh-Ivanov (WMDE) - https://phabricator.wikimedia.org/T199233 (10Addshore) Ping @Aleksey_WMDE [13:56:26] (03PS1) 10Zfilipin: all wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449188 [13:56:28] (03CR) 10Zfilipin: [C: 032] all wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449188 (owner: 10Zfilipin) [13:57:56] (03Merged) 10jenkins-bot: all wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449188 (owner: 10Zfilipin) [13:58:24] (03CR) 10jenkins-bot: all wikis to 1.32.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449188 (owner: 10Zfilipin) [13:58:54] !log zfilipin@deploy1001 rebuilt and synchronized wikiversions files: 
all wikis to 1.32.0-wmf.14 [13:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:11] PROBLEM - HHVM rendering on mw2174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:02:01] RECOVERY - HHVM rendering on mw2174 is OK: HTTP OK: HTTP/1.1 200 OK - 81784 bytes in 0.277 second response time [14:02:39] 10Operations, 10Wikimedia-General-or-Unknown: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10Reedy) [14:02:41] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp1075 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:02:41] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1075 is CRITICAL: connect to address 10.64.0.130 and port 3125: Connection refused [14:02:41] PROBLEM - eventlogging Varnishkafka log producer on cp1075 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:02:54] hello cp1075! [14:03:06] 10Operations, 10monitoring, 10Patch-For-Review: Evaluate Grafana's LDAP group options and deprecate grafana-admin if possible - https://phabricator.wikimedia.org/T170150 (10akosiaris) 05stalled>03Resolved grafana-admin.wikimedia.org fully deprecated. I am resolving this. [14:03:19] 10Operations, 10ops-codfw, 10DBA: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 (10Marostegui) p:05Triage>03Normal [14:03:23] sorry! [14:03:35] also, there's some problem with stretch and the mtail package [14:03:44] uh? [14:04:10] E: There were unauthenticated packages and -y was used without --allow-unauthenticated [14:04:20] (it thinks mtail is unauth) [14:04:21] PROBLEM - Confd template for /etc/varnish/directors.frontend.vcl on cp1075 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:04:21] PROBLEM - puppet last run on cp1075 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[14:04:29] (03PS1) 10Filippo Giunchedi: WIP logstash: add 'id' to syslog input [puppet] - 10https://gerrit.wikimedia.org/r/449189 [14:04:32] ? [14:04:40] unauthenticated packageS? [14:04:42] RECOVERY - Confd template for /etc/varnish/directors.backend.vcl on cp1075 is OK: No errors detected [14:04:42] RECOVERY - eventlogging Varnishkafka log producer on cp1075 is OK: PROCS OK: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/eventlogging.conf [14:04:43] how ? [14:05:20] bblack: mtail did install fine on cp2006 [14:05:22] RECOVERY - Confd template for /etc/varnish/directors.frontend.vcl on cp1075 is OK: No errors detected [14:05:41] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1075 is OK: HTTP OK: HTTP/1.1 200 OK - 502 bytes in 0.001 second response time [14:06:07] ema: well it probably will install fine here too, if I do it manually and answer yes to "Install these packages without verification? [y/N]" [14:06:12] but otherwise it doesn't want to [14:06:35] Candidate: 3.0.0~rc5-1~bpo9+1 -> stretch-backports [14:06:42] PROBLEM - IPsec on cp1075 is CRITICAL: Strongswan CRITICAL - ok: 30 not-conn: cp2013_v4, cp2013_v6, cp2016_v4, cp2016_v6, cp2019_v4, cp2019_v6, cp3031_v4, cp3031_v6, cp3032_v4, cp3032_v6, cp3042_v4, cp3042_v6, cp4027_v4, cp4027_v6, cp4029_v4, cp4029_v6, cp4030_v4, cp4030_v6, cp4032_v4, cp4032_v6, cp5007_v4, cp5007_v6, cp5009_v4, cp5009_v6 [14:07:19] (03CR) 10Filippo Giunchedi: "PCC looks good https://puppet-compiler.wmflabs.org/compiler02/11912/logstash1007.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/449189 (owner: 10Filippo Giunchedi) [14:07:52] PROBLEM - traffic-pool service on cp1075 is CRITICAL: CRITICAL - Expecting active but unit traffic-pool is inactive [14:08:51] RECOVERY - IPsec on cp1075 is OK: Strongswan OK - 54 ESP OK [14:09:10] (03PS1) 10Gehel: [WIP] Extract progress bars from clustershell event handling. 
[software/cumin] - 10https://gerrit.wikimedia.org/r/449191 [14:09:21] RECOVERY - puppet last run on cp1075 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:10:44] (03PS4) 10Muehlenhoff: Decommission terbium [puppet] - 10https://gerrit.wikimedia.org/r/445423 (https://phabricator.wikimedia.org/T192092) [14:12:03] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Extract progress bars from clustershell event handling. [software/cumin] - 10https://gerrit.wikimedia.org/r/449191 (owner: 10Gehel) [14:12:08] bblack: I've run `apt update`, after which puppet installed mtail properly. Some race perhaps? [14:14:11] (03PS2) 10Gehel: [WIP] Extract progress bars from clustershell event handling. [software/cumin] - 10https://gerrit.wikimedia.org/r/449191 [14:15:10] ah [14:15:21] "puppet agent -t" on cmdline doesn't run the apt commands that run-puppet-agent runs first [14:15:24] that's why [14:16:41] (03PS1) 10Andrew Bogott: rough draft of etcd for wmcs [puppet] - 10https://gerrit.wikimedia.org/r/449192 [14:17:25] (03CR) 10jerkins-bot: [V: 04-1] rough draft of etcd for wmcs [puppet] - 10https://gerrit.wikimedia.org/r/449192 (owner: 10Andrew Bogott) [14:17:39] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Extract progress bars from clustershell event handling. [software/cumin] - 10https://gerrit.wikimedia.org/r/449191 (owner: 10Gehel) [14:21:31] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp2012.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073014... 
[14:24:37] (03PS1) 10Jcrespo: mariadb: Test MariaDB 10.3 on db1118 (core test host) [puppet] - 10https://gerrit.wikimedia.org/r/449196 (https://phabricator.wikimedia.org/T193224) [14:27:20] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10mark) I am a bit confused by this RFC/proposal as it stands now, as I feel it doesn't really reflect the discussion... [14:28:02] 10Operations, 10ops-codfw, 10DBA: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 (10jcrespo) There was already a BIOS upgrade at T139714; I would contact support directly, as suggested by robh here: https://phabricator.wikimedia.org/T139283#2430289 [14:32:18] (03PS5) 10Muehlenhoff: Decommission terbium [puppet] - 10https://gerrit.wikimedia.org/r/445423 (https://phabricator.wikimedia.org/T192092) [14:33:31] 10Operations, 10ops-codfw, 10DBA: pc2006 rebooted itself - https://phabricator.wikimedia.org/T200641 (10Marostegui) That sounds good to me. This is a racadm getsel so it can be sent to support: ``` /admin1-> racadm getsel Record: 1 Date/Time: Source: system Severity: Ok Descripti... 
[14:34:28] (03PS1) 10BBlack: optimized mke2fs for stretch cp installs [puppet] - 10https://gerrit.wikimedia.org/r/449201 (https://phabricator.wikimedia.org/T200445) [14:34:30] (03CR) 10Muehlenhoff: [C: 032] Decommission terbium [puppet] - 10https://gerrit.wikimedia.org/r/445423 (https://phabricator.wikimedia.org/T192092) (owner: 10Muehlenhoff) [14:34:34] (03PS1) 10BBlack: cp1075-99: define storage size [puppet] - 10https://gerrit.wikimedia.org/r/449202 (https://phabricator.wikimedia.org/T195923) [14:35:13] (03CR) 10jerkins-bot: [V: 04-1] optimized mke2fs for stretch cp installs [puppet] - 10https://gerrit.wikimedia.org/r/449201 (https://phabricator.wikimedia.org/T200445) (owner: 10BBlack) [14:36:23] (03PS1) 10Ema: cp2012: reimage as stretch [puppet] - 10https://gerrit.wikimedia.org/r/449203 (https://phabricator.wikimedia.org/T200445) [14:36:39] (03PS1) 10Andrew Bogott: Designate: install memcached on labtest designate host [puppet] - 10https://gerrit.wikimedia.org/r/449204 [14:38:34] (03CR) 10Ema: [C: 032] cp2012: reimage as stretch [puppet] - 10https://gerrit.wikimedia.org/r/449203 (https://phabricator.wikimedia.org/T200445) (owner: 10Ema) [14:38:53] (03PS2) 10Andrew Bogott: Designate: install memcached on labtest designate host [puppet] - 10https://gerrit.wikimedia.org/r/449204 [14:39:45] (03PS2) 10BBlack: optimized mke2fs for stretch cp installs [puppet] - 10https://gerrit.wikimedia.org/r/449201 (https://phabricator.wikimedia.org/T200445) [14:39:47] (03PS2) 10BBlack: cp1075-99: define storage size [puppet] - 10https://gerrit.wikimedia.org/r/449202 (https://phabricator.wikimedia.org/T195923) [14:40:45] (03PS2) 10Muehlenhoff: Remove terbium for tendril grants [puppet] - 10https://gerrit.wikimedia.org/r/445590 [14:41:05] (03CR) 10BBlack: [C: 032] optimized mke2fs for stretch cp installs [puppet] - 10https://gerrit.wikimedia.org/r/449201 (https://phabricator.wikimedia.org/T200445) (owner: 10BBlack) [14:41:14] (03CR) 10BBlack: [C: 032] cp1075-99: define storage size 
[puppet] - 10https://gerrit.wikimedia.org/r/449202 (https://phabricator.wikimedia.org/T195923) (owner: 10BBlack) [14:41:23] (03CR) 10Andrew Bogott: [C: 032] Designate: install memcached on labtest designate host [puppet] - 10https://gerrit.wikimedia.org/r/449204 (owner: 10Andrew Bogott) [14:46:04] (03PS3) 10BBlack: optimized mke2fs for stretch cp installs [puppet] - 10https://gerrit.wikimedia.org/r/449201 (https://phabricator.wikimedia.org/T200445) [14:46:10] (03CR) 10BBlack: [V: 032 C: 032] optimized mke2fs for stretch cp installs [puppet] - 10https://gerrit.wikimedia.org/r/449201 (https://phabricator.wikimedia.org/T200445) (owner: 10BBlack) [14:46:25] (03PS3) 10BBlack: cp1075-99: define storage size [puppet] - 10https://gerrit.wikimedia.org/r/449202 (https://phabricator.wikimedia.org/T195923) [14:46:28] (03CR) 10BBlack: [V: 032 C: 032] cp1075-99: define storage size [puppet] - 10https://gerrit.wikimedia.org/r/449202 (https://phabricator.wikimedia.org/T195923) (owner: 10BBlack) [14:46:37] (03CR) 10Herron: "FWIW testing with the example ID value "input/syslog/10514" works in both logstash and the prometheus exporter." [puppet] - 10https://gerrit.wikimedia.org/r/449189 (owner: 10Filippo Giunchedi) [14:48:44] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2012.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['cp2012.codfw.wmnet'] ``` [14:48:58] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp2012.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073014... 
[14:49:11] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2012.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['cp2012.codfw.wmnet'] ``` [14:50:00] 10Operations, 10Toolforge: Upload python-pykube deb to apt.wikimedia.org - https://phabricator.wikimedia.org/T200660 (10bd808) Do we actually use pycube from the deb or are we using the version that is embedded in https://phabricator.wikimedia.org/diffusion/OSTW/browse/master/submodules/ ? [14:50:30] (03PS1) 10Muehlenhoff: Remove conditionals for jessie in mediawiki_maintenance profile [puppet] - 10https://gerrit.wikimedia.org/r/449207 [14:51:21] (03PS3) 10Marostegui: db-eqiad.php: Depool all the hosts in row B [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449141 (https://phabricator.wikimedia.org/T183585) [14:52:46] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on sarin.codfw.wmnet for hosts: ``` cp2012.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/2018073014... [14:54:16] (03CR) 10Jcrespo: [C: 032] mariadb: Test MariaDB 10.3 on db1118 (core test host) [puppet] - 10https://gerrit.wikimedia.org/r/449196 (https://phabricator.wikimedia.org/T193224) (owner: 10Jcrespo) [14:54:19] (03CR) 10Volans: "Thanks for the fixes, added a couple of comments, nothing major." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [14:54:24] (03PS2) 10Jcrespo: mariadb: Test MariaDB 10.3 on db1118 (core test host) [puppet] - 10https://gerrit.wikimedia.org/r/449196 (https://phabricator.wikimedia.org/T193224) [14:56:18] (03PS1) 10Andrew Bogott: Designate: ipv6 access for memcached [puppet] - 10https://gerrit.wikimedia.org/r/449210 [14:56:48] (03CR) 10Volans: netbox: add psql dump cron and back it up (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/447842 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [14:57:45] (03CR) 10Andrew Bogott: [C: 032] Designate: ipv6 access for memcached [puppet] - 10https://gerrit.wikimedia.org/r/449210 (owner: 10Andrew Bogott) [14:57:53] (03CR) 10Muehlenhoff: "PCC: http://puppet-compiler.wmflabs.org/11913/" [puppet] - 10https://gerrit.wikimedia.org/r/449207 (owner: 10Muehlenhoff) [14:58:01] (03PS3) 10Jcrespo: mariadb: Test MariaDB 10.3 on db1118 (core test host) [puppet] - 10https://gerrit.wikimedia.org/r/449196 (https://phabricator.wikimedia.org/T193224) [14:58:03] (03PS2) 10Muehlenhoff: Remove conditionals for jessie in mediawiki_maintenance profile [puppet] - 10https://gerrit.wikimedia.org/r/449207 [14:59:51] (03CR) 10Muehlenhoff: [C: 032] Remove conditionals for jessie in mediawiki_maintenance profile [puppet] - 10https://gerrit.wikimedia.org/r/449207 (owner: 10Muehlenhoff) [14:59:58] (03PS3) 10Muehlenhoff: Remove conditionals for jessie in mediawiki_maintenance profile [puppet] - 10https://gerrit.wikimedia.org/r/449207 [15:00:17] (03Abandoned) 10Dzahn: switch terbium to a spare system [puppet] - 10https://gerrit.wikimedia.org/r/448816 (https://phabricator.wikimedia.org/T192092) (owner: 10Dzahn) [15:00:47] (03CR) 10Muehlenhoff: [V: 032 C: 032] Remove conditionals for jessie in mediawiki_maintenance profile [puppet] - 10https://gerrit.wikimedia.org/r/449207 (owner: 10Muehlenhoff) [15:01:00] can I merge, moritzm? 
[15:01:05] yes, please [15:01:18] done [15:01:32] thx [15:01:38] (finished now) [15:01:45] I need a patch applied to an extension https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180730T1800 [15:02:06] is it fine in the extension or does it need to be somewhere else? [15:04:12] (03PS1) 10Muehlenhoff: Remove mediawiki::packages::php5 [puppet] - 10https://gerrit.wikimedia.org/r/449216 [15:04:51] 10Operations, 10Performance-Team, 10Traffic, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Deskana) @Imarlier That's good info, thanks! [15:05:07] davidwbarratt: I hope someone with better release process knowledge can help you; if not, I see you have it scheduled at 18 UTC, so people may be able to help you then, in 3 hours [15:05:50] jynus no problem, just wanted to be prepared. :) [15:05:51] !log running ladsgroup@mwmaint1001:~$ mwscript extensions/ORES/maintenance/PurgeScoreCache.php on all wikis [15:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:41] davidwbarratt: I am going to guess that whatever is on the release needs to pull/cherry-pick the patch, but that is only what I would expect; I do not know how they handle that [15:06:56] (03PS1) 10Cmjohnson: Adding dns for cloudvirt1023-24 [dns] - 10https://gerrit.wikimedia.org/r/449217 (https://phabricator.wikimedia.org/T199125) [15:07:11] (03PS1) 10Andrew Bogott: Move memcached from labtest to labtestn [puppet] - 10https://gerrit.wikimedia.org/r/449218 [15:07:21] (03PS1) 10Muehlenhoff: mediawiki::php: Remove support for PHP 5 [puppet] - 10https://gerrit.wikimedia.org/r/449219 [15:07:22] jynus yeah that's what I would figure as well, but I'm not sure [15:08:04] (03CR) 10jerkins-bot: [V: 04-1] Move memcached from labtest to labtestn [puppet] - 10https://gerrit.wikimedia.org/r/449218 (owner: 10Andrew Bogott) [15:08:06] !log upgrade hp raid 
firmware on ms-be1028 - T141756 [15:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:10] T141756: audit / test / upgrade hp smartarray P840 firmware - https://phabricator.wikimedia.org/T141756 [15:11:12] (03PS2) 10Andrew Bogott: Move memcached from labtest to labtestn [puppet] - 10https://gerrit.wikimedia.org/r/449218 [15:11:20] (03PS2) 10Cmjohnson: Adding dns for cloudvirt1023-24 [dns] - 10https://gerrit.wikimedia.org/r/449217 (https://phabricator.wikimedia.org/T199125) [15:11:22] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Harej) Thank you for the summary, @mark. I am interested in this perspective that we can get the same user experien... [15:12:08] (03CR) 10Andrew Bogott: [C: 032] Move memcached from labtest to labtestn [puppet] - 10https://gerrit.wikimedia.org/r/449218 (owner: 10Andrew Bogott) [15:12:44] (03PS3) 10Gehel: Extract progress bars from clustershell event handling. 
[software/cumin] - 10https://gerrit.wikimedia.org/r/449191 [15:13:02] (03CR) 10Cmjohnson: [C: 032] Adding dns for cloudvirt1023-24 [dns] - 10https://gerrit.wikimedia.org/r/449217 (https://phabricator.wikimedia.org/T199125) (owner: 10Cmjohnson) [15:14:16] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10Cmjohnson) [15:14:54] !log reboot cp1075 [15:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:06] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (10Cmjohnson) [15:15:19] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Halfak) @mark, I think you have raise some good points here. I think there is a point of confusion around the "hop... [15:15:39] (03CR) 10jerkins-bot: [V: 04-1] Extract progress bars from clustershell event handling. [software/cumin] - 10https://gerrit.wikimedia.org/r/449191 (owner: 10Gehel) [15:15:41] (03PS1) 10Bstorm: osmdb: failing over to labsdb1006 [dns] - 10https://gerrit.wikimedia.org/r/449220 (https://phabricator.wikimedia.org/T197246) [15:16:13] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team, 10Patch-For-Review: rack/setup/install cloudelastic100[1-4].eqiad.wmnet systems - https://phabricator.wikimedia.org/T194186 (10Cmjohnson) [15:17:37] (03PS4) 10Gehel: Extract progress bars from clustershell event handling. 
[software/cumin] - 10https://gerrit.wikimedia.org/r/449191 [15:18:51] RECOVERY - traffic-pool service on cp1075 is OK: OK - traffic-pool is active [15:19:24] (03PS1) 10Andrew Bogott: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/449221 [15:20:02] PROBLEM - Memcached on labtestservices2001 is CRITICAL: connect to address 208.80.153.48 and port 11000: Connection refused [15:20:32] (03CR) 10Bstorm: [C: 032] osmdb: failing over to labsdb1006 [dns] - 10https://gerrit.wikimedia.org/r/449220 (https://phabricator.wikimedia.org/T197246) (owner: 10Bstorm) [15:20:34] (03CR) 10jerkins-bot: [V: 04-1] Extract progress bars from clustershell event handling. [software/cumin] - 10https://gerrit.wikimedia.org/r/449191 (owner: 10Gehel) [15:20:42] (03PS2) 10Bstorm: osmdb: failing over to labsdb1006 [dns] - 10https://gerrit.wikimedia.org/r/449220 (https://phabricator.wikimedia.org/T197246) [15:20:46] (03PS2) 10Andrew Bogott: labtestn designate memcached: firewall fixes [puppet] - 10https://gerrit.wikimedia.org/r/449221 [15:22:00] (03CR) 10Andrew Bogott: [C: 032] labtestn designate memcached: firewall fixes [puppet] - 10https://gerrit.wikimedia.org/r/449221 (owner: 10Andrew Bogott) [15:23:58] (03PS2) 10Dzahn: cluster::management: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448787 [15:26:45] (03PS1) 10Cmjohnson: Adding mgmt/prodcution dns rdb1009/10 [dns] - 10https://gerrit.wikimedia.org/r/449223 (https://phabricator.wikimedia.org/T196685) [15:27:01] (03PS1) 10Gehel: Fix integration tests setup. 
[software/cumin] - 10https://gerrit.wikimedia.org/r/449224 [15:27:31] !log restart and upgrade mariadb at db1118 [15:27:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:59] (03CR) 10Dzahn: [C: 032] cluster::management: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448787 (owner: 10Dzahn) [15:28:37] (03PS2) 10Cmjohnson: Adding mgmt/prodcution dns rdb1009/10 [dns] - 10https://gerrit.wikimedia.org/r/449223 (https://phabricator.wikimedia.org/T196685) [15:29:29] (03CR) 10jerkins-bot: [V: 04-1] Fix integration tests setup. [software/cumin] - 10https://gerrit.wikimedia.org/r/449224 (owner: 10Gehel) [15:29:34] (03CR) 10Cmjohnson: [C: 032] Adding mgmt/prodcution dns rdb1009/10 [dns] - 10https://gerrit.wikimedia.org/r/449223 (https://phabricator.wikimedia.org/T196685) (owner: 10Cmjohnson) [15:31:41] (03PS1) 10BBlack: Revert "optimized mke2fs for stretch cp installs" [puppet] - 10https://gerrit.wikimedia.org/r/449232 (https://phabricator.wikimedia.org/T200445) [15:31:43] (03PS1) 10BBlack: avoid data=writeback on nvme formatted w/o journal [puppet] - 10https://gerrit.wikimedia.org/r/449233 (https://phabricator.wikimedia.org/T195923) [15:32:05] (03CR) 10BBlack: [C: 032] Revert "optimized mke2fs for stretch cp installs" [puppet] - 10https://gerrit.wikimedia.org/r/449232 (https://phabricator.wikimedia.org/T200445) (owner: 10BBlack) [15:32:31] (03CR) 10BBlack: [V: 032 C: 032] Revert "optimized mke2fs for stretch cp installs" [puppet] - 10https://gerrit.wikimedia.org/r/449232 (https://phabricator.wikimedia.org/T200445) (owner: 10BBlack) [15:32:40] (03PS3) 10Dzahn: sca/scb: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448583 [15:35:13] (03CR) 10Dzahn: [C: 032] sca/scb: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448583 (owner: 10Dzahn) [15:35:46] (03CR) 10BBlack: [C: 032] avoid data=writeback on nvme formatted w/o journal [puppet] - 
10https://gerrit.wikimedia.org/r/449233 (https://phabricator.wikimedia.org/T195923) (owner: 10BBlack) [15:35:46] (03PS2) 10BBlack: avoid data=writeback on nvme formatted w/o journal [puppet] - 10https://gerrit.wikimedia.org/r/449233 (https://phabricator.wikimedia.org/T195923) [15:35:50] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Marostegui) >>! In T200297#4461793, @Halfak wrote: > > Past evidence (e.g. see Flow) suggests that it is not reali... [15:36:11] 10Operations: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918 (10MoritzMuehlenhoff) 05Open>03declined This is obsolete, these servers are gone for a while. [15:37:22] (03PS1) 10Jcrespo: mariadb: Better support MariaDB 10.2 and 10.3 config on production [puppet] - 10https://gerrit.wikimedia.org/r/449234 (https://phabricator.wikimedia.org/T193224) [15:38:00] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Better support MariaDB 10.2 and 10.3 config on production [puppet] - 10https://gerrit.wikimedia.org/r/449234 (https://phabricator.wikimedia.org/T193224) (owner: 10Jcrespo) [15:38:08] 10Operations, 10Traffic, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2012.codfw.wmnet'] ``` and were **ALL** successful. [15:42:06] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-July-September, 10User-Nikerabbit, and 2 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (10Nikerabbit) >>! In T195293#4414985, @jcrespo wrote... [15:42:48] (03PS2) 10Dzahn: docker::registry: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448810 [15:42:56] (03Abandoned) 10Alex Monk: Python 3: Fix another bytes vs. 
string problem breaking JSON formatting [software/cumin] - 10https://gerrit.wikimedia.org/r/448985 (https://phabricator.wikimedia.org/T200622) (owner: 10Alex Monk) [15:43:23] 10Operations, 10MediaWiki-extensions-Translate, 10Language-2018-July-September, 10User-Nikerabbit, and 2 others: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) - https://phabricator.wikimedia.org/T195293 (10jcrespo) Hey, you had an excuse, the rest of the p... [15:43:49] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack/setup/install rdb10[09|10].eqiad.wmnet - https://phabricator.wikimedia.org/T196685 (10Cmjohnson) [15:47:45] (03PS5) 10Gehel: Extract progress bars from clustershell event handling. [software/cumin] - 10https://gerrit.wikimedia.org/r/449191 [15:47:47] (03PS2) 10Gehel: Fix integration tests setup. [software/cumin] - 10https://gerrit.wikimedia.org/r/449224 [15:49:27] (03CR) 10Dzahn: [C: 032] docker::registry: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448810 (owner: 10Dzahn) [15:50:33] (03CR) 10jerkins-bot: [V: 04-1] Extract progress bars from clustershell event handling. [software/cumin] - 10https://gerrit.wikimedia.org/r/449191 (owner: 10Gehel) [15:50:39] (03CR) 10jerkins-bot: [V: 04-1] Fix integration tests setup. 
[software/cumin] - 10https://gerrit.wikimedia.org/r/449224 (owner: 10Gehel) [15:51:26] (03PS2) 10Jcrespo: mariadb: Better support MariaDB 10.2 and 10.3 config on production [puppet] - 10https://gerrit.wikimedia.org/r/449234 (https://phabricator.wikimedia.org/T193224) [15:52:11] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Better support MariaDB 10.2 and 10.3 config on production [puppet] - 10https://gerrit.wikimedia.org/r/449234 (https://phabricator.wikimedia.org/T193224) (owner: 10Jcrespo) [15:54:11] (03PS1) 10Cmjohnson: Adding mgmt/production dns authdns1001 [dns] - 10https://gerrit.wikimedia.org/r/449236 (https://phabricator.wikimedia.org/T196693) [15:54:48] 10Operations, 10ops-eqiad, 10DNS, 10Traffic, 10Patch-For-Review: rack/setup/install authdns1001.wikimedia.org - https://phabricator.wikimedia.org/T196693 (10Cmjohnson) [15:55:00] (03PS2) 10Dzahn: jobqueue_redis: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448809 [15:55:20] (03PS2) 10Cmjohnson: Adding mgmt/production dns authdns1001 [dns] - 10https://gerrit.wikimedia.org/r/449236 (https://phabricator.wikimedia.org/T196693) [15:56:26] (03CR) 10Dzahn: [C: 032] jobqueue_redis: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448809 (owner: 10Dzahn) [15:57:09] !log pool cp2012, upgraded to stretch T200445 [15:57:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:13] T200445: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 [15:59:11] (03CR) 10Jcrespo: [C: 04-1] "Need to change those = into ==s (lang mixup)" [puppet] - 10https://gerrit.wikimedia.org/r/449234 (https://phabricator.wikimedia.org/T193224) (owner: 10Jcrespo) [16:01:12] (03PS7) 10Dzahn: netbox: add psql dump cron and back it up [puppet] - 10https://gerrit.wikimedia.org/r/447842 (https://phabricator.wikimedia.org/T190184) [16:10:42] (03CR) 10EBernhardson: Create prometheus::resource_config (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/446687 (owner: 10EBernhardson) [16:10:44] (03PS3) 10EBernhardson: Create prometheus::resource_config [puppet] - 10https://gerrit.wikimedia.org/r/446687 [16:10:48] (03PS1) 10Imarlier: dumps: datahub no longer exists [puppet] - 10https://gerrit.wikimedia.org/r/449238 (https://phabricator.wikimedia.org/T200705) [16:11:24] (03CR) 10jerkins-bot: [V: 04-1] Create prometheus::resource_config [puppet] - 10https://gerrit.wikimedia.org/r/446687 (owner: 10EBernhardson) [16:23:06] (03PS1) 10Eevans: deployment-prep: Upgrade RESTBase Cassandra nodes to 3.11.2 [puppet] - 10https://gerrit.wikimedia.org/r/449242 (https://phabricator.wikimedia.org/T186750) [16:25:02] (03PS4) 10EBernhardson: Create prometheus::resource_config [puppet] - 10https://gerrit.wikimedia.org/r/446687 [16:49:07] (03PS1) 10Mobrovac: Beta: RESTBase: Add Proton URI [puppet] - 10https://gerrit.wikimedia.org/r/449246 (https://phabricator.wikimedia.org/T186748) [16:50:16] (03Abandoned) 10Alex Monk: Move some production apache config files to templates [puppet] - 10https://gerrit.wikimedia.org/r/322602 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk) [16:50:59] (03CR) 10Dzahn: "rebased manually. then: No changes between HEAD and origin/production. 
Submitting for review" [puppet] - 10https://gerrit.wikimedia.org/r/445580 (owner: 10Muehlenhoff) [16:51:02] (03Abandoned) 10Alex Monk: Use production apache config on beta [puppet] - 10https://gerrit.wikimedia.org/r/322603 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk) [16:51:24] (03Abandoned) 10Alex Monk: Get rid of old beta_sites class now just containing a load of ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/322604 (https://phabricator.wikimedia.org/T1256) (owner: 10Alex Monk) [16:52:11] (03Abandoned) 10Dzahn: Remove conditionals for older distros in mediawiki_maintenance profile [puppet] - 10https://gerrit.wikimedia.org/r/445580 (owner: 10Muehlenhoff) [16:52:13] (03CR) 1020after4: [C: 031] phabricator: set smtp-host to localhost [puppet] - 10https://gerrit.wikimedia.org/r/440910 (https://phabricator.wikimedia.org/T196916) (owner: 10Herron) [16:53:21] (03CR) 1020after4: [C: 031] Scap: scap_source correct gid [puppet] - 10https://gerrit.wikimedia.org/r/361796 (owner: 10Thcipriani) [16:54:58] _joe_, still around? [16:55:12] Krenair: he's still out today [16:55:20] oh ok [16:55:38] (03PS3) 10Dzahn: Disable Diamond on multatuli [puppet] - 10https://gerrit.wikimedia.org/r/445988 (owner: 10Muehlenhoff) [16:56:30] 10Operations, 10ops-eqiad: rack/setup/install syslog1001.eqiad.wmnet - https://phabricator.wikimedia.org/T200706 (10RobH) p:05Triage>03Normal [16:56:32] 10Operations, 10Wikimedia-Apache-configuration, 10Patch-For-Review, 10User-Joe: Re-organize the apache configuration for MediaWiki in puppet - https://phabricator.wikimedia.org/T196968 (10Krenair) Some of them are so similar to each other I believe we can do a huge merge: https://gerrit.wikimedia.org/r/#/c... [16:56:43] (03CR) 10Alex Monk: [C: 04-1] "See also https://phabricator.wikimedia.org/T196968" [puppet] - 10https://gerrit.wikimedia.org/r/322425 (owner: 10Alex Monk) [16:57:47] (03CR) 10Dzahn: [C: 032] "i'll try it. 
multatuli is just a test host" [puppet] - 10https://gerrit.wikimedia.org/r/445988 (owner: 10Muehlenhoff) [16:57:47] mobrovac, I think I already did https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449246/ somewhere in horizon [16:58:02] maybe under prefixes [16:58:07] mutante: no, this needs a different patch merged first [16:58:21] https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/446242/ [16:58:26] (03CR) 10ArielGlenn: [C: 031] "yay!" [puppet] - 10https://gerrit.wikimedia.org/r/449216 (owner: 10Muehlenhoff) [16:58:38] to fix puppet errors across 4 deployment-prep hosts [16:58:39] couldn't find it Krenair, but opted for putting it there to have all rb stuff in one place (i want to eventually move it all to horizon) [16:58:43] ok [16:58:52] moritzm: ow.. ok! [16:59:14] we are working now on deployment-restbase0x hosts btw, so ignore any alerts/problems there [16:59:17] I found it with PCC, but I think the test run is cleaned up by now [17:00:04] gehel: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180730T1700). 
[17:00:27] jouncebot: o/ [17:00:57] !log gehel@deploy1001 Started deploy [wdqs/wdqs@137780f]: new version of wdqs GUI (wdqs1009 only) [17:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:25] moritzm: i'll compile that other patch too [17:01:27] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@137780f]: new version of wdqs GUI (wdqs1009 only) (duration: 00m 30s) [17:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:42] !log gehel@deploy1001 Started deploy [wdqs/wdqs@137780f]: new version of wdqs GUI [17:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:29] i like that it's already "beta-picked" [17:04:00] (03CR) 10ArielGlenn: mediawiki::php: Remove support for PHP 5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/449219 (owner: 10Muehlenhoff) [17:05:01] 10Operations, 10ops-eqiad: rack/setup/install centrallog1001.eqiad.wmnet - https://phabricator.wikimedia.org/T200706 (10RobH) [17:05:18] (03CR) 1020after4: phabricator: Use the mysql native driver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/443045 (owner: 10Alexandros Kosiaris) [17:07:05] (03CR) 10Paladox: [C: 031] phabricator: Use the mysql native driver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/443045 (owner: 10Alexandros Kosiaris) [17:08:24] (03PS2) 10Gehel: Enable constraints fetching on internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/447741 (https://phabricator.wikimedia.org/T192567) (owner: 10Smalyshev) [17:08:27] (03CR) 10Dzahn: "partially another duplicate. 
rebasing to check" [puppet] - 10https://gerrit.wikimedia.org/r/438167 (owner: 10Muehlenhoff) [17:08:35] (03PS3) 10Dzahn: Remove obsolete compat code for PHP 5 [puppet] - 10https://gerrit.wikimedia.org/r/438167 (owner: 10Muehlenhoff) [17:09:30] (03PS4) 10Dzahn: Delete mediawiki::packages::php5 [puppet] - 10https://gerrit.wikimedia.org/r/438167 (owner: 10Muehlenhoff) [17:10:06] (03Abandoned) 10Dzahn: Delete mediawiki::packages::php5 [puppet] - 10https://gerrit.wikimedia.org/r/438167 (owner: 10Muehlenhoff) [17:11:12] (03CR) 10Dzahn: [C: 031] Remove mediawiki::packages::php5 [puppet] - 10https://gerrit.wikimedia.org/r/449216 (owner: 10Muehlenhoff) [17:11:31] (03PS1) 10Ayounsi: Repool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/449250 [17:11:42] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) Hi, it's great to see the activity on this RFC, thank you all for the input. To expand on our answer about... 
[17:11:59] !log gehel@deploy1001 Finished deploy [wdqs/wdqs@137780f]: new version of wdqs GUI (duration: 09m 17s) [17:12:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:04] (03CR) 10Ayounsi: [C: 032] Repool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/449250 (owner: 10Ayounsi) [17:12:23] (03CR) 10Gehel: [C: 032] Enable constraints fetching on internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/447741 (https://phabricator.wikimedia.org/T192567) (owner: 10Smalyshev) [17:12:25] (03PS2) 10Ayounsi: Repool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/449250 [17:12:36] (03CR) 10Ayounsi: [V: 032 C: 032] Repool ulsfo [dns] - 10https://gerrit.wikimedia.org/r/449250 (owner: 10Ayounsi) [17:12:52] !log repool ulsfo [17:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:47] (03CR) 10Dzahn: "what Ariel said, let's remove that outer "if stretch" as well, since we remove the inner one?" [puppet] - 10https://gerrit.wikimedia.org/r/449219 (owner: 10Muehlenhoff) [17:16:25] SMalyshev: new gui deployment completed, tests are green. Constraints loading enabled for internal cluster, logs looking good [17:16:43] gehel: great, thanks! [17:16:53] (03CR) 10Dzahn: [C: 031] Remove terbium for tendril grants [puppet] - 10https://gerrit.wikimedia.org/r/445590 (owner: 10Muehlenhoff) [17:17:45] (03CR) 10Krinkle: [C: 031] Remove terbium for tendril grants [puppet] - 10https://gerrit.wikimedia.org/r/445590 (owner: 10Muehlenhoff) [17:19:39] Anyone know what the plan is for icinga upgrades? 
[17:19:43] (03CR) 10Krinkle: mediawiki::php: Remove support for PHP 5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/449219 (owner: 10Muehlenhoff) [17:20:40] (03CR) 10Krinkle: [C: 031] Remove mediawiki::packages::php5 [puppet] - 10https://gerrit.wikimedia.org/r/449216 (owner: 10Muehlenhoff) [17:21:02] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) @Marostegui, I'd like to explore how we might be able to use x1 for storage. We aren't currently consideri... [17:21:04] (03CR) 10Dzahn: [C: 04-1] "compiling this on multatuli after having merge the change to remove diamond there. now: Duplicate declaration: File[/etc/diamond/diamond.c" [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [17:21:55] (03CR) 10Dzahn: [C: 04-1] "http://puppet-compiler.wmflabs.org/11914/multatuli.wikimedia.org/change.multatuli.wikimedia.org.err" [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [17:27:45] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. 
[17:29:21] ^ looks related to repooling ulsfo [17:30:17] XioNoX [17:31:57] i saw this specific one "70% GET drop" before [17:32:06] and then i looked at the graph and zoomed out a bit [17:32:16] and it didn't seem like an unusual pattern there [17:32:21] (03PS2) 10Mobrovac: deployment-prep: Upgrade RESTBase Cassandra nodes to 3.11.2 [puppet] - 10https://gerrit.wikimedia.org/r/449242 (https://phabricator.wikimedia.org/T186750) (owner: 10Eevans) [17:33:19] also the alert says “GET drop” and the graph says “GET % diff, per DC” [17:42:16] looking [17:42:24] but yeah, that would be my guess [17:43:15] yeah, repooling ulsfo meant codfw traffic dropped below the alert threshold, it will recover automatically in ~30min [17:44:36] cool thanks for looking at it [17:44:54] thx for the ping, dunno why I didn't see it right away [17:46:58] (03CR) 10Dzahn: [C: 04-1] "the File /etc/diamond/diamond.conf also has to be moved out of the class" [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [17:47:24] (03PS2) 10Ayounsi: Facter: add a v4 and v6 default routes fact [puppet] - 10https://gerrit.wikimedia.org/r/437771 [17:49:50] (03CR) 10Ayounsi: [C: 032] Facter: add a v4 and v6 default routes fact [puppet] - 10https://gerrit.wikimedia.org/r/437771 (owner: 10Ayounsi) [17:52:17] (03PS3) 10Dzahn: Move declaration of diamond package and config out of diamond class [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [17:56:16] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for Morning SWAT (Max 6 patches) deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180730T1800). [18:00:04] davidwbarratt, AaronSchulz, and brion: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:09] \o/ [18:00:17] here! [18:02:03] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) There's another interesting point here, from @mark, > According to T196547 there seems to be the expectatio... [18:03:03] (03CR) 10Dzahn: "amended to also move config file out of class, in addition to package" [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [18:05:07] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) I want to explicitly ask something that has come up among our team: Can we agree on guidelines for new cont... 
[18:05:23] (03PS3) 10Mobrovac: deployment-prep: Upgrade RESTBase Cassandra nodes to 3.11.2 [puppet] - 10https://gerrit.wikimedia.org/r/449242 (https://phabricator.wikimedia.org/T186750) (owner: 10Eevans) [18:06:10] I can SWAT [18:07:14] ok [18:07:18] davidwbarratt: could you +1 this backport if all looks good to you: https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMessages/+/449258/ [18:07:46] (03CR) 10Ppchelko: deployment-prep: Upgrade RESTBase Cassandra nodes to 3.11.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/449242 (https://phabricator.wikimedia.org/T186750) (owner: 10Eevans) [18:08:49] brion: seems like I'm getting some conflicts using the gerrit web interface to cherry-pick https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/TimedMediaHandler/+/449030/ to wmf.14, could you make a cherry pick for that manually please? [18:09:29] sure [18:09:54] (03PS5) 10Thcipriani: Make all wikis write to both nutcracker and mcrouter (3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447819 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [18:09:59] thank you! [18:10:00] !log krinkle@deploy1001 Started deploy [performance/navtiming@f79b313]: (no justification provided) [18:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:05] !log krinkle@deploy1001 Finished deploy [performance/navtiming@f79b313]: (no justification provided) (duration: 00m 05s) [18:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:10] (03CR) 10Dzahn: [C: 04-1] "still no.. 
duplicates fixed but it changes the content of diamond.conf http://puppet-compiler.wmflabs.org/11916/mwdebug1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [18:10:22] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447819 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [18:10:23] thcipriani looks good to me! [18:10:41] davidwbarratt: awesome, I've +2'd now we just need to wait on jenkins to work it's magic [18:11:15] (03PS4) 10Mobrovac: deployment-prep: Upgrade RESTBase Cassandra nodes to 3.11.2 [puppet] - 10https://gerrit.wikimedia.org/r/449242 (https://phabricator.wikimedia.org/T186750) (owner: 10Eevans) [18:11:42] (03Merged) 10jenkins-bot: Make all wikis write to both nutcracker and mcrouter (3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447819 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [18:11:47] thcipriani great! [18:11:58] (03CR) 10jenkins-bot: Make all wikis write to both nutcracker and mcrouter (3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/447819 (https://phabricator.wikimedia.org/T198239) (owner: 10Aaron Schulz) [18:12:23] thcipriani: crap, another patch in the middle didn't make the branch cutoff. :D let me adjust [18:12:25] (03CR) 10Krinkle: [C: 04-1] "Because we're not knowingly going to deploy a security weakness." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 (https://phabricator.wikimedia.org/T199919) (owner: 10Dbarratt) [18:12:46] (03PS1) 10Dzahn: Revert "Disable Diamond on multatuli" [puppet] - 10https://gerrit.wikimedia.org/r/449259 [18:12:54] AaronSchulz: your change is live on mwdebug1002, check please (if possible) [18:13:05] (03CR) 10Dbarratt: "> Patch Set 3:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 (https://phabricator.wikimedia.org/T199919) (owner: 10Dbarratt) [18:13:51] (03CR) 10Dzahn: [C: 032] "still needs a fix like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/446242/ which doesn't work yet" [puppet] - 10https://gerrit.wikimedia.org/r/449259 (owner: 10Dzahn) [18:14:28] (03CR) 10Dbarratt: "Fixed in https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/WikimediaMessages/+/449177" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 (https://phabricator.wikimedia.org/T199919) (owner: 10Dbarratt) [18:15:13] (03PS2) 10Dzahn: Revert "Disable Diamond on multatuli" [puppet] - 10https://gerrit.wikimedia.org/r/449259 [18:16:45] thcipriani: ok there's like 3 extra patches that need cherry picking on TimedMediaHandler :( ok to do them now or should I delay it to the next swat? [18:17:35] (no worries from my end if i have to wait!) [18:18:56] (03PS5) 10Dzahn: deployment-prep: Upgrade RESTBase Cassandra nodes to 3.11.2 [puppet] - 10https://gerrit.wikimedia.org/r/449242 (https://phabricator.wikimedia.org/T186750) (owner: 10Eevans) [18:19:06] brion: I'd prefer if you could move them to the later window as I'll have to leave right after this window (so I can't handle anything overrunning this window) [18:19:20] thcipriani: ok! 
[18:19:28] thank you, sorry for the delay :( [18:19:30] i'll update the deployment page [18:19:34] (03PS6) 10Dzahn: deployment-prep: Upgrade RESTBase Cassandra nodes to 3.11.2 [puppet] - 10https://gerrit.wikimedia.org/r/449242 (https://phabricator.wikimedia.org/T186750) (owner: 10Eevans) [18:19:34] no worries! [18:20:42] (03CR) 10Dzahn: [C: 032] deployment-prep: Upgrade RESTBase Cassandra nodes to 3.11.2 [puppet] - 10https://gerrit.wikimedia.org/r/449242 (https://phabricator.wikimedia.org/T186750) (owner: 10Eevans) [18:21:19] (03PS2) 10Dzahn: puppetmaster: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448587 [18:23:26] thcipriani: seems fine [18:23:56] AaronSchulz: thanks for checking, deploying everywhere [18:24:31] (ok added the full set of patches to calendar for the later window; off to make last breakfast!) [18:25:06] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [18:25:15] (03CR) 10Dzahn: [C: 032] puppetmaster: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/448587 (owner: 10Dzahn) [18:25:51] !log thcipriani@deploy1001 Synchronized wmf-config/mc.php: SWAT: [[gerrit:447819|Make all wikis write to both nutcracker and mcrouter (3)]] T198239 (duration: 00m 48s) [18:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:55] T198239: Rollout use of mcrouter for MediaWiki in production - https://phabricator.wikimedia.org/T198239 [18:25:57] ^ AaronSchulz live everywhere [18:26:16] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:26:16] PROBLEM - recommendation_api endpoints health on scb2006 is 
CRITICAL: /{domain}/v1/translation/articles/{source}{/seed} (normal source and target with seed) is CRITICAL: Test normal source and target with seed returned the unexpected status 404 (expecting: 200) [18:27:00] Krinkle: is your comment on https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/448146/ addressed? [18:27:06] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:27:06] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:27:06] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:27:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:27:35] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:27:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster=cache_text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:27:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:27:56] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: 
job=varnish-text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:28:35] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:28:36] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy [18:28:36] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy [18:28:45] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [18:28:46] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:29:04] AaronSchulz: ^ could this be related to your patch? 
[18:29:55] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:29:59] I just see a CirrusSearch increase in logs [18:30:05] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [18:30:05] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:30:23] thcipriani is Jenkins still busy? [18:30:26] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [18:30:28] AaronSchulz: looks like it might be: https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?refresh=5m&orgId=1 [18:30:36] AaronSchulz: I'm going to rollback, ok? [18:30:55] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:31:05] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:31:25] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:31:45] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:31:46] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:31:56] thcipriani: do you see anything in logstash? [18:32:05] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:32:15] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [18:32:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:32:37] (03PS2) 10Dzahn: Beta: RESTBase: Add Proton URI [puppet] - 10https://gerrit.wikimedia.org/r/449246 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [18:32:38] well...now it seems to have calmed down without me doing anything [18:32:52] LS and browsing around still show nothing interesting [18:32:55] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:32:55] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:33:05] AaronSchulz: nothing in particular, just a 5xx spike that was heavily correlated with deploy [18:33:07] I would expect any mc issues to cause some backend logging [18:33:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:33:38] it could only affect the frontend to the extent that it would the backend indirectly [18:34:06] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 52003 MB (10% inode=99%) [18:34:35] AaronSchulz: looks like it's back to normal now, sorry for the false alarm [18:34:47] (03CR) 10Dzahn: [C: 032] Beta: RESTBase: Add Proton URI [puppet] - 10https://gerrit.wikimedia.org/r/449246 (https://phabricator.wikimedia.org/T186748) (owner: 10Mobrovac) [18:36:14] what's going on? [18:36:40] I zee, mcrouter I guess [18:36:54] or not? 
[18:37:24] (03PS1) 10Mobrovac: Beta: RESTBase: Add Cassandra TLS config [puppet] - 10https://gerrit.wikimedia.org/r/449267 (https://phabricator.wikimedia.org/T186750) [18:37:35] RECOVERY - Disk space on elastic1027 is OK: DISK OK [18:37:56] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [18:38:35] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [18:38:35] bblack: I deployed a change to mc.php and there was a spike of 5xx errors, I started to roll back and by the time I got everything prepped, graphite errors started to recover, so I didn't roll back [18:39:15] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [18:39:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:40:16] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:40:25] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:40:29] bblack: it started after "Make all wikis write to both nutcracker and mcrouter" and has then recovered shortly after [18:40:52] ok [18:40:57] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 51765 MB (10% inode=99%) [18:40:59] for reference: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/447819/ [18:41:15] so the 5xx's, we think it's just a transitive thing from the change? [18:41:36] ^ AaronSchulz is that possible? [18:42:02] either that or crazy coincidence, but then we should look at what the random 5xx spike was [18:42:51] the graph seems to suggest so.. yes [18:43:22] it matches the deployment of the change [18:43:50] thcipriani: I wouldn't expect that, just a temporary bump in some cache regenerations due to working set size adjustment. I'm not sure what else it would be though. [18:44:17] (03PS2) 10Mobrovac: Beta: RESTBase: Add Cassandra TLS config [puppet] - 10https://gerrit.wikimedia.org/r/449267 (https://phabricator.wikimedia.org/T186750) [18:46:06] (03CR) 10Ppchelko: [C: 031] Beta: RESTBase: Add Cassandra TLS config [puppet] - 10https://gerrit.wikimedia.org/r/449267 (https://phabricator.wikimedia.org/T186750) (owner: 10Mobrovac) [18:46:46] does seem like that was the only thing that changed at that moment, so yes, likely transitive spike related to that change. [18:47:28] thcipriani: though as I mentioned, I'd expect some LS entries if that were the case. Mysql Grafana is a bit interesting. [18:48:56] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 52467 MB (10% inode=99%) [18:49:12] * ebernhardson wishes elasticsearch were smarter about where to put things wrt disk space ... 
[18:49:22] it will resolve itself, it's just noisy :P [18:49:47] AaronSchulz: I do see a lot of "Pool error on {key}: {error}" in the logs, FWIW [18:49:51] open connections had a quick spike at 18:25 [18:50:16] thcipriani: that is cirrus, right? [18:50:29] ah, yeah it is [18:53:02] davidwbarratt: your WikimediaMessages change is on mwdebug1002, can you check please? I'm afraid that's probably all that I have time left to deploy in this window if that looks alright on mwdebug. [18:53:28] umm, well I can't test that without the config change as well [18:53:35] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 52059 MB (10% inode=99%) [18:53:36] the config change enables the banner [18:53:47] but if you don't have time, no problem, I'll move it to the next window [18:54:24] davidwbarratt: please do. thank you, I will go ahead and deploy the WikimediaMessages change if it's behind a feature flag. [18:54:40] thcipriani yes it is, thanks! [18:54:57] thcipriani that will make the next deploy simpler. :) [18:55:28] (03CR) 10Ppchelko: [C: 04-1] "We already have `<% if env == 'beta' %>tls: { ca: '/dev/null' }<% endif %>` in the config template, so this is not needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/449267 (https://phabricator.wikimedia.org/T186750) (owner: 10Mobrovac) [18:56:47] thcipriani: for those cirrus errors, i dunno what went on but our query rate for fulltext went from ~450/s to a peak of 4.2k/s, and we rejected a bunch of them [18:57:12] starting at :21 and ending at :29 [18:57:55] !log thcipriani@deploy1001 Synchronized php-1.32.0-wmf.14/extensions/WikimediaMessages/WikimediaMessages.hooks.php: SWAT: [[gerrit:449177|Escape Special:Block Feedback Request Message]] T194301 (duration: 00m 49s) [18:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:59] T194301: Introduce temporary element on Special:Block UI to invite users to participate in the Partial Block consultation - https://phabricator.wikimedia.org/T194301 [18:58:02] ^ davidwbarratt change is live [18:58:30] thcipriani YAY! thanks! [18:58:42] sure thing [18:59:12] thcipriani: seems highly likely to be related to the deploy based on timing, but i have no clue why... [19:00:16] PROBLEM - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is CRITICAL: CRITICAL: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is alerting: 70% GET drop in 30min alert. [19:01:29] yeah, I'm not clear why that would happen either [19:01:56] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:02:36] RECOVERY - https://grafana.wikimedia.org/dashboard/db/varnish-http-requests grafana alert on einsteinium is OK: OK: https://grafana.wikimedia.org/dashboard/db/varnish-http-requests is not alerting. [19:02:43] VE is broken on beta, is it known? e.g. https://en.wikipedia.beta.wmflabs.org/wiki/T197213?veaction=edit gives me an error box with "Error loading data from server: HTTP 503." 
[19:02:54] the URL failing with a 503 is this: https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/T197213?redirect=false [19:03:03] Request from 185.157.12.102 via deployment-cache-text04 deployment-cache-text04, Varnish XID 85242916 [19:03:03] Error: 503, Backend fetch failed at Mon, 30 Jul 2018 19:02:08 GMT [19:08:16] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 51904 MB (10% inode=99%) [19:08:25] (03Abandoned) 10Mobrovac: Beta: RESTBase: Add Cassandra TLS config [puppet] - 10https://gerrit.wikimedia.org/r/449267 (https://phabricator.wikimedia.org/T186750) (owner: 10Mobrovac) [19:08:59] (03PS2) 10Dzahn: Remove mediawiki::packages::php5 [puppet] - 10https://gerrit.wikimedia.org/r/449216 (owner: 10Muehlenhoff) [19:09:38] MatmaRex: yes, RB is down, should be up in a matter of minutes [19:11:43] okay, thanks [19:12:01] MatmaRex: should be all good now [19:12:26] indeed! [19:12:31] yay [19:13:03] (03CR) 10Dzahn: [C: 032] "nothing includes this class anymore" [puppet] - 10https://gerrit.wikimedia.org/r/449216 (owner: 10Muehlenhoff) [19:13:28] D modules/mediawiki/manifests/packages/php5.pp :) [19:16:25] PROBLEM - Disk space on elastic1027 is CRITICAL: DISK CRITICAL - free space: /srv 52490 MB (10% inode=99%) [19:16:25] RECOVERY - Disk space on elastic1018 is OK: DISK OK [19:17:44] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@7aa39b7]: (no justification provided) [19:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:15] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@7aa39b7]: (no justification provided) (duration: 01m 32s) [19:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:55] !log ppchelko@deploy1001 Started deploy [restbase/deploy@9f9685b] (dev-cluster): Language variants for summaries T198465 [19:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:00] T198465: 
Enable language variants support for summary - https://phabricator.wikimedia.org/T198465 [19:21:05] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@b1e18ca]: (no justification provided) [19:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:34] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@b1e18ca]: (no justification provided) (duration: 01m 28s) [19:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:00] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@9f9685b] (dev-cluster): Language variants for summaries T198465 (duration: 03m 05s) [19:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:56] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10jcrespo) > Can we agree on guidelines for new content extensions, so that nobody needs to go through this discussio... 
[19:26:03] !log ppchelko@deploy1001 Started deploy [restbase/deploy@9f9685b]: Language variants for summaries T198465 [19:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:07] T198465: Enable language variants support for summary - https://phabricator.wikimedia.org/T198465 [19:26:15] (03PS6) 10Ayounsi: Add static routes with MTU 1450 to ipsec destinations [puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) [19:34:22] (03PS1) 10MarcoAurelio: Enable $wgCiteResponsiveReferences for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449277 (https://phabricator.wikimedia.org/T200707) [19:34:55] (03CR) 10Dzahn: "general comment on this, the code changes have been written by a machine, the 2to3 script" [puppet] - 10https://gerrit.wikimedia.org/r/441209 (owner: 10Dzahn) [19:36:01] (03PS2) 10MarcoAurelio: Enable $wgCiteResponsiveReferences for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449277 (https://phabricator.wikimedia.org/T200707) [19:36:46] (03CR) 10jerkins-bot: [V: 04-1] Enable $wgCiteResponsiveReferences for Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449277 (https://phabricator.wikimedia.org/T200707) (owner: 10MarcoAurelio) [19:37:05] RECOVERY - Disk space on elastic1027 is OK: DISK OK [19:39:13] (03CR) 10MarcoAurelio: "Warning: fork failed - Cannot allocate memory in /srv/composer/vendor/symfony/console/Application.php on line 959" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449277 (https://phabricator.wikimedia.org/T200707) (owner: 10MarcoAurelio) [19:39:50] thcipriani: AaronSchulz: this is what I saw at 18:23- queries on enwiki only multiplied by 4 for 5 minutes, but they seemed fast queries, not slow ones [19:40:35] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@9f9685b]: Language variants for summaries T198465 (duration: 14m 31s) [19:40:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:39] T198465: Enable language variants support for summary - https://phabricator.wikimedia.org/T198465 [19:40:39] (03CR) 10MarcoAurelio: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/449277 (https://phabricator.wikimedia.org/T200707) (owner: 10MarcoAurelio) [19:40:52] !log ppchelko@deploy1001 Started deploy [restbase/deploy@9f9685b]: Language variants for summaries T198465 attempt 2 [19:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:58] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@9f9685b]: Language variants for summaries T198465 attempt 2 (duration: 05m 06s) [19:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:03] T198465: Enable language variants support for summary - https://phabricator.wikimedia.org/T198465 [19:47:27] !log ppchelko@deploy1001 Started deploy [restbase/deploy@9f9685b]: Language variants for summaries T198465 attempt 3 [19:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:30] (03CR) 10Dzahn: postgresql: add class to create db backups (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [19:49:32] (03PS6) 10Dzahn: postgresql: add class to create db backups [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) [19:50:00] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Remove expiry date from Morten Warncke-Wang's production shell access - https://phabricator.wikimedia.org/T200723 (10Neil_P._Quinn_WMF) [19:50:57] jynus: at that same time search queries increased by about 4k/s, they would probably issue a couple index lookups before trying to query search (and being rejected because there were 10x more than usual) [19:51:12] (03CR) 10Paladox: postgresql: add class to create db backups (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [19:51:19] i mean a couple mysql index lookups, to see if the search term is a known title [19:51:59] i suppose 4k/s times, say, 4 or 5 lookups wouldn't be enough to cause that much of a spike. maybe lots more things affected than search [19:52:53] in QPS it went from 76K/s to 286.9K/s [19:53:52] !log ppchelko@deploy1001 Finished deploy [restbase/deploy@9f9685b]: Language variants for summaries T198465 attempt 3 (duration: 06m 25s) [19:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:56] T198465: Enable language variants support for summary - https://phabricator.wikimedia.org/T198465 [19:54:10] jynus: wowzers [19:54:31] note QPS is a bit meaningless [19:54:45] sure, the work per query varies significantly [19:55:01] SET autocommit = 0; counts as a query just the same as SELECT * FROM table LIMIT 1000; [19:55:17] but i mean an extra 4k search req/s couldn't possibly account for 200k extra mysql queries, maybe 20k :) [19:55:28] jouncebot: next [19:55:29] In 0 hour(s) and 4 minute(s): Services – Parsoid / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180730T2000) [19:55:39] oh interesting [19:59:28] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Technical-Debt, 10Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (10Krenair) [19:59:42] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Technical-Debt, 10Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (10Krenair) [19:59:52] although from what I can see, most of those were selects [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180730T2000). [20:00:47] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1080&var-port=9104&panelId=16&fullscreen&from=1532970927428&to=1532979440742 [20:03:16] (03PS7) 10Dzahn: postgresql: add class to create db backups [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) [20:03:19] (03CR) 10Dzahn: postgresql: add class to create db backups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [20:04:00] (03CR) 10jerkins-bot: [V: 04-1] postgresql: add class to create db backups [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [20:05:52] (03PS1) 10Herron: WIP: prometheus: add logstash exporter and gather logstash metrics [puppet] - 10https://gerrit.wikimedia.org/r/449283 (https://phabricator.wikimedia.org/T200362) [20:07:09] paladox: WARNING variable contains an uppercase letter (variable_is_lowercase) :... sad_trombone.wav [20:07:12] hehe [20:07:16] (03CR) 10Herron: [C: 04-2] "In addition to WIP this needs blacklisting of undesirable logstash config IDs" [puppet] - 10https://gerrit.wikimedia.org/r/449283 (https://phabricator.wikimedia.org/T200362) (owner: 10Herron) [20:08:33] (03PS8) 10Dzahn: postgresql: add class to create db backups [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) [20:08:35] 10Operations, 10LDAP-Access-Requests: Add Lea Voget (WMDE) & Bmueller to the WMDE LDAP group - https://phabricator.wikimedia.org/T199967 (10RStallman-legalteam) Will reach out to them individually to get the NDAs in place and circle back once those are signed and on file. Thanks! 
[20:08:41] but it was from " [20:08:41] On the Bleeding Edge of Puppet [20:08:43] :p [20:09:00] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@61c5f47]: ship missing bc alias for deployed daemon [20:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:52] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Technical-Debt, 10Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (10Krenair) [20:10:35] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@61c5f47]: ship missing bc alias for deployed daemon (duration: 01m 35s) [20:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:23] 10Operations, 10Analytics, 10Discovery-Search (Current work), 10Patch-For-Review, 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10Pchelolo) [20:14:35] PROBLEM - Host db1055 is DOWN: PING CRITICAL - Packet loss = 100% [20:15:18] uh... what's up? [20:15:44] mmmh is not in tendril [20:16:08] ahh, being decom T194118 [20:16:08] T194118: Decommission db1055 - https://phabricator.wikimedia.org/T194118 [20:16:12] volans: not db1055! [20:17:00] herron: ? [20:17:46] 10Operations, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10Jenkins, 10Patch-For-Review: Upgrade deployment-prep deployment servers to stretch - https://phabricator.wikimedia.org/T192561 (10Krenair) thcipriani, where are we with deployment-deploy01? I looked at it Fr... [20:17:49] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10Halfak) @Jcrespo, I can't find the policy you are referencing. Can you link to the specific lines that apply here?... [20:17:56] was kidding, since you said what’s up and it’s down. 
acking the alert [20:18:04] :D [20:18:10] it shouldn't be there though [20:18:29] I guess puppet has not run on the icinga host yet [20:19:47] ACKNOWLEDGEMENT - Host db1055 is DOWN: PING CRITICAL - Packet loss = 100% Herron Being decomissioned https://phabricator.wikimedia.org/T194118 [20:19:51] hmm. it's been removed from site.pp many days ago [20:19:56] robh: cc ^ [20:20:20] hrmm [20:20:27] i totally did the steps and puppet node clean/deactivated [20:20:30] it shouldnt be in there [20:20:32] it should have .. yea. that [20:20:43] i mean, i am suppppper careful about my checklist [20:20:46] because of this very reason, heh [20:20:56] i wonder if something else added it back, that randomly happens sometimes [20:21:01] removing again [20:21:10] actually checking my history to double check [20:22:08] well, i somehow missed it. good reason I +1'd that checklist. [20:22:14] removing now and doublechecking the rest of the list [20:22:29] Can someone check whether https://phabricator.wikimedia.org/T192893 is still active? [20:23:07] heh, i totally even removed the debmonitor for db1055, just missed the puppet deactivate [20:23:13] puppet clean was even done [20:23:28] robh: ack no prob [20:24:01] Krenair: i dont have a reminder about it until the first [20:24:14] so i suppose it is, are you asking so its disabled on the first? 
[20:24:42] mainly just interested in whether it got pulled early or left to expire [20:25:06] there is no expire [20:25:14] well, ok [20:25:17] i have to go in and spend a bunch of time individually removing them ;] [20:25:24] 'virtually' expire by way of calendar entries having people disable them eventually :) [20:25:28] unless someone else did it already, which i doubt, but checking [20:26:23] (03PS1) 10Herron: admin: remove expiry attributes of user nettrom [puppet] - 10https://gerrit.wikimedia.org/r/449290 (https://phabricator.wikimedia.org/T200723) [20:26:24] oh, i also stupidly didnt make a set of them for this user (which would have made removing them easier later) [20:26:25] heh [20:26:37] man i hate google search console, its utter shit [20:26:51] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Remove expiry date from Morten Warncke-Wang's production shell access - https://phabricator.wikimedia.org/T200723 (10herron) p:05Triage>03Normal [20:26:53] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Remove expiry date from Morten Warncke-Wang's production shell access - https://phabricator.wikimedia.org/T200723 (10herron) a:03herron [20:27:08] to be fair I'd imagine that we have the busiest google search console view out there [20:27:10] :) [20:27:13] Krenair: so i just checked en.wikipedia.org and they are still there [20:27:19] with the sheer amount of traffic and domains [20:27:23] were they supposed to be removed early? 
[20:27:31] it's not my place to say [20:27:36] but I was wondering if someone else had said that or not [20:27:51] the task points out removal on 2018-08-01 [20:27:57] and no one said differently to me [20:28:04] (unless they do, im removing them then) [20:28:35] interesting anyway, thanks [20:29:00] yeah may as well reopen it now to find out if removal on 1st is still ok [20:29:06] i was just going to wait until then but i have it open now heh [20:31:00] 10Operations, 10SRE-Access-Requests: Access to Google Search Console for Go Fish Digital - https://phabricator.wikimedia.org/T192893 (10RobH) 05Resolved>03Open a:05RobH>03Deskana Ok, this is set to expire on 2018-08-01. By expire, I mean my google calendar reminds me to manually login and pull up thes... [20:39:13] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@69ac0d5]: fix wrong yaml import in cli entry point [20:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:43] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@69ac0d5]: fix wrong yaml import in cli entry point (duration: 02m 30s) [20:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:04] 10Operations, 10Product-Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Remove expiry date from Morten Warncke-Wang's production shell access - https://phabricator.wikimedia.org/T200723 (10herron) Hi @Neil_P._Quinn_WMF, I've prepared a patch for this, however I don't see a staff entry for Morten as o... [20:46:17] 10Operations, 10Core-Platform-Team, 10WMF-JobQueue, 10MW-1.32-release-notes (WMF-deploy-2018-07-24 (1.32.0-wmf.14)), and 3 others: Exception "Job queue is read-only" - https://phabricator.wikimedia.org/T199594 (10mobrovac) 05Open>03Resolved The errors have completely disappeared as of this morning UTC. 
[20:56:58] !log ebernhardson@deploy1001 Started deploy [search/mjolnir/deploy@debb5b0]: Update spark.yaml with latest training configuration [20:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] bawolff and Reedy: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180730T2100). [21:01:10] !log ebernhardson@deploy1001 Finished deploy [search/mjolnir/deploy@debb5b0]: Update spark.yaml with latest training configuration (duration: 04m 12s) [21:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:22] (03PS1) 10Dduvall: Upgrade bazlets to latest revision [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/449332 [21:14:24] (03PS1) 10Dduvall: Merge branch 'stable-2.15' [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/449333 [21:14:26] (03PS1) 10Dduvall: Merge branch 'stable-2.15' [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/449334 [21:14:28] (03PS1) 10Dduvall: Merge branch 'stable-2.15' [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/449335 [21:14:30] (03PS1) 10Dduvall: Use anonymous project clone URLs [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/449336 [21:15:47] (03Abandoned) 10Dduvall: Upgrade bazlets to latest revision [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/449332 (owner: 10Dduvall) [21:16:04] (03Abandoned) 10Dduvall: Merge branch 'stable-2.15' [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/449333 (owner: 10Dduvall) [21:16:47] (03Abandoned) 10Dduvall: Merge branch 'stable-2.15' [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/449334 (owner: 10Dduvall) [21:17:00] 
(03Abandoned) 10Dduvall: Merge branch 'stable-2.15' [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/449335 (owner: 10Dduvall) [21:17:17] (03Abandoned) 10Dduvall: Use anonymous project clone URLs [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/449336 (owner: 10Dduvall) [21:17:57] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure, 10Technical-Debt, 10Tracking: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220 (10Krenair) [21:20:08] (03PS1) 10Dduvall: Use anonymous project clone URLs [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/449337 [21:20:31] volans: i think i can merge the backup class now..i also addressed the 2 new comments [21:20:49] and then i'd try using it on netbox [21:21:28] (03CR) 10Thcipriani: [V: 032 C: 032] Use anonymous project clone URLs [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/449337 (owner: 10Dduvall) [21:24:16] (03PS1) 10Thcipriani: Fork gerrit go-import plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/449338 [21:25:16] (03CR) 10Dduvall: [V: 032 C: 032] Use anonymous project clone URLs [software/gerrit/plugins/go-import] (stable-2.15) - 10https://gerrit.wikimedia.org/r/449337 (owner: 10Dduvall) [21:26:22] (03CR) 10Dduvall: [V: 032 C: 032] Fork gerrit go-import plugin [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/449338 (owner: 10Thcipriani) [21:27:35] (03PS7) 10Ayounsi: Add static routes with MTU 1450 to ipsec destinations [puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) [21:28:11] (03CR) 10jerkins-bot: [V: 04-1] Add static routes with MTU 1450 to ipsec destinations [puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) (owner: 10Ayounsi) [21:37:26] (03PS8) 10Ayounsi: Add static routes with MTU 1450 
to ipsec destinations [puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) [21:38:08] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Active [21:39:50] (03PS1) 10Thcipriani: Fork go-import plugin [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/449339 [21:42:38] RECOVERY - BGP status on cr1-eqsin is OK: BGP OK - up: 261, down: 0, shutdown: 0 [21:48:43] (03CR) 10Dduvall: [V: 032 C: 032] Fork go-import plugin [software/gerrit] (deploy/wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/449339 (owner: 10Thcipriani) [21:48:50] (03PS1) 10BBlack: wikimediafoundation.org: switch IPs to Automattic [dns] - 10https://gerrit.wikimedia.org/r/449341 (https://phabricator.wikimedia.org/T198922) [21:49:09] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 40 probes of 309 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [21:49:51] (03PS1) 10BBlack: wikimediafoundation.org: switch TTLs back to 10m [dns] - 10https://gerrit.wikimedia.org/r/449342 (https://phabricator.wikimedia.org/T198922) [21:51:41] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@10e3207]: Updating go-import plugin (gerrit2001 only) [21:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:51] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@10e3207]: Updating go-import plugin (gerrit2001 only) (duration: 00m 09s) [21:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:59] PROBLEM - BGP status on cr1-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active, AS6939/IPv4: Connect [21:55:48] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@10e3207]: Updating go-import plugin (cobalt) [21:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:58] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@10e3207]: Updating 
go-import plugin (cobalt) (duration: 00m 10s) [21:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:05] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) @jcrespo Thanks for the good tips, I agree with most of what you say. FWIW, I was working off of the exten... [21:57:38] RECOVERY - BGP status on cr1-eqsin is OK: BGP OK - up: 261, down: 0, shutdown: 0 [21:59:19] (03PS9) 10Dzahn: postgresql: add class to create db backups [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) [21:59:29] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 12 probes of 309 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [22:00:26] (03CR) 10Dzahn: [C: 032] "thanks for all the reviewing. since you said "nothing major" and i also addressed these 2 new comments, i will go ahead with this now" [puppet] - 10https://gerrit.wikimedia.org/r/447844 (https://phabricator.wikimedia.org/T190184) (owner: 10Dzahn) [22:02:16] (03PS9) 10Ayounsi: Add static routes with MTU 1450 to ipsec destinations [puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) [22:02:56] (03CR) 10jerkins-bot: [V: 04-1] Add static routes with MTU 1450 to ipsec destinations [puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) (owner: 10Ayounsi) [22:03:35] (03PS1) 10Dzahn: yubiauth: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449343 [22:04:13] (03PS2) 10Dzahn: yubiauth: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449343 [22:05:05] (03PS10) 10Ayounsi: Add static routes with MTU 1450 to ipsec destinations [puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) [22:05:07] (03Abandoned) 
10Dzahn: yubiauth: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449343 (owner: 10Dzahn) [22:06:59] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 31 probes of 309 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [22:10:59] !log restarting jenkins after plugin updates [22:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:33] (03PS1) 10Dzahn: failoid/configcluster:: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449344 [22:11:35] (03PS1) 10Dzahn: parsoid/thumbor::mediawiki: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449345 [22:11:37] (03PS1) 10Dzahn: aqs/poolcounter:: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449346 [22:11:39] (03PS1) 10Dzahn: cache::canary/pybaltest: use ::profile::base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/449347 [22:13:58] (03PS3) 10Thcipriani: Scap: update-interwiki-cache for labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/446507 (https://phabricator.wikimedia.org/T198844) [22:17:09] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 16 probes of 309 (alerts on 19) - https://atlas.ripe.net/measurements/11645088/#!map [22:18:26] (03PS11) 10Ayounsi: Add static routes with MTU 1450 to ipsec destinations [puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) [22:22:21] (03CR) 10Ayounsi: "A few changes to handle IPv6 routes properly." 
[puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) (owner: 10Ayounsi) [22:22:26] (03PS1) 10Dzahn: dbtree: update comments regarding required Apache [puppet] - 10https://gerrit.wikimedia.org/r/449348 [22:23:09] 10Operations, 10Performance-Team, 10Wikimedia-Mailing-lists: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (10Krinkle) [22:23:48] (03PS2) 10Dzahn: dbtree: update comments regarding required Apache [puppet] - 10https://gerrit.wikimedia.org/r/449348 [22:24:18] (03CR) 10Dzahn: [C: 032] "comments-only" [puppet] - 10https://gerrit.wikimedia.org/r/449348 (owner: 10Dzahn) [22:29:57] !log - puppet disabled on cp40* hosts - T195365 [22:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:01] T195365: cp intermittent IPsec MTU issue - https://phabricator.wikimedia.org/T195365 [22:30:04] (03CR) 10Ayounsi: [C: 032] Add static routes with MTU 1450 to ipsec destinations [puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) (owner: 10Ayounsi) [22:30:17] (03PS12) 10Ayounsi: Add static routes with MTU 1450 to ipsec destinations [puppet] - 10https://gerrit.wikimedia.org/r/437784 (https://phabricator.wikimedia.org/T195365) [22:30:22] (03PS1) 10Dzahn: tendril: move httpd out of module to role [puppet] - 10https://gerrit.wikimedia.org/r/449350 [22:32:00] (03CR) 10Dzahn: [C: 032] "follow-up: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/449350/" [puppet] - 10https://gerrit.wikimedia.org/r/449348 (owner: 10Dzahn) [22:34:38] 10Operations, 10Analytics, 10Discovery-Search (Current work), 10Patch-For-Review, 10Services (watching): Create kafka topic for mjolinr bulk daemon and decide on cluster - https://phabricator.wikimedia.org/T200215 (10EBernhardson) As requestsed, I've sent an email to ops list, cc'd to mobrovac, giving a... 
[22:35:30] !log applying static route + fixed MTU to cp4025 - T195365 [22:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:35:34] T195365: cp intermittent IPsec MTU issue - https://phabricator.wikimedia.org/T195365 [22:37:58] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Puppet has 21 failures. Last run 1 minute ago with 21 failures. Failed resources (up to 3 shown): Exec[ip route add 2620::861:103:10:64:32:100/128 via fe80::1 mtu lock 1450 dev eth0],Exec[ip route add 2620::861:103:10:64:32:101/128 via fe80::1 mtu lock 1450 dev eth0],Exec[ip route add 2620::861:103:10:64:32:102/128 via fe80::1 mtu lock 1450 dev eth0],Exec[ip route add 2 [22:37:58] 0:99/128 via fe80::1 mtu lock 1450 dev eth0] [22:38:33] (03CR) 10Dzahn: [C: 031] analytics_cluster::webserver: apache -> httpd module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn) [22:42:06] working on it [22:42:16] (03CR) 10Dzahn: "hashar, are you ok with me deploying this" [puppet] - 10https://gerrit.wikimedia.org/r/434427 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [22:46:45] (03CR) 10Krinkle: analytics_cluster::webserver: apache -> httpd module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/416742 (owner: 10Dzahn) [22:47:05] mutante: curious about libapache2-* package requirement. [22:47:17] (03PS6) 10EBernhardson: Drop query_clicks partitions after 90 days [puppet] - 10https://gerrit.wikimedia.org/r/419954 (https://phabricator.wikimedia.org/T189845) [22:47:49] Krinkle: yea, the httpd module configures modules but does not automatically install a needed package [22:47:49] 10Operations, 10Performance-Team, 10Wikimedia-Mailing-lists: Close performance@lists.wikimedia.org in favour of wikitech-l - https://phabricator.wikimedia.org/T200733 (10Krinkle) [22:48:02] the most popular modules come with apache anyways though [22:48:07] Right [22:48:20] so only for some.. 
but if then you need to add require_package too somewhere [22:48:27] (03CR) 10EBernhardson: [C: 031] "The dependent patch has been deployed, this can now be deployed as well." [puppet] - 10https://gerrit.wikimedia.org/r/419954 (https://phabricator.wikimedia.org/T189845) (owner: 10EBernhardson) [22:48:50] mutante: Yeah, should that be in the role or profile class? E.g. for libapache2-mod-php7.0. [22:49:15] I ran into a problem with this, and it actually caused webperf1002 to fail in beta but work in prod due to an ordering issue. [22:49:43] Specifically, the require_package('libapache2-mod-php7.0') causes Apache to get installed before the httpd class is seen. [22:50:05] if it uses require_package() then it can be in the profiles since it should not create a duplicate definition and then allows the profiles to be moved around [22:50:26] if it can only happen once per node without causing a conflict. it needs to move to role [22:50:51] Yeah, in profiles makes sense to me as well. [22:50:52] https://github.com/wikimedia/puppet/blob/a55d57b3cd3992c3579744d00e86306493a21218/modules/role/manifests/webperf/profiling_tools.pp#L17-L21 [22:51:01] See https://github.com/wikimedia/puppet/blob/a55d57/modules/role/manifests/webperf/profiling_tools.pp#L17-L21 and https://phabricator.wikimedia.org/T180761#4445849 [22:51:19] the class httpd{} itself though should be in the role. one webserver per server [22:51:26] which can have many sites [22:51:26] Yeah [22:51:44] the ordering issue i haven't run into ..hmm [22:52:25] Has to do with some internal defaults for mpm (one of: event, worker, prefork) - which vary depending on whether php is used. [22:54:38] 10Operations, 10Toolforge: Upload python-pykube deb to apt.wikimedia.org - https://phabricator.wikimedia.org/T200660 (10Legoktm) >>! In T200660#4461684, @bd808 wrote: > Do we actually use pycube from the deb or are we using the version that is embedded in https://phabricator.wikimedia.org/diffusion/OSTW/browse... 
[22:56:16] for future reference, anyone know a good way to cherry-pick a series of commits in gerrit ui? or should i manually cherry-pick and rewrite the commit-ids.... :P [22:56:17] (03PS1) 10Krinkle: webperf: Move require_package for PHP from role to XHGui profile [puppet] - 10https://gerrit.wikimedia.org/r/449367 (https://phabricator.wikimedia.org/T180761) [22:56:47] Krinkle: yea, i dont have a quick answer for that. maybe it can be fixed by defining the order with the "Chaining arrows" syntax [22:56:49] (03PS2) 10Krinkle: webperf: Move require_package for PHP from role to XHGui profile [puppet] - 10https://gerrit.wikimedia.org/r/449367 (https://phabricator.wikimedia.org/T180761) [22:57:01] well, besides moving it around :) [22:57:23] mutante: I'm not sure this will eliminate the error, but I imagine we should do this regardless, right? [22:57:35] I'm trying to find other examples of apache with php to see whether mpm is set there or whether it just works. [22:57:44] (series meaning they depend on each other in order, so can't be cherry-picked separately and then merged. the cherry-pick ui asks for a branch, but not for a commit on top of that branch) [22:57:46] it seems for mediawiki we set it via Hiera. [22:58:22] Krinkle: yes we should. or at least _after_ the httpd class. yes, let's move to profile [22:59:25] brion: aye, I've had the same issue. If they apply cleanly, you can cherry-pick, and then rebase>[x]>gerrit id of parent. But otherwise... either cherry-pick and merge one by one, or have to stage locally in git and then push the stack for review. [22:59:33] brion you can enter a commit id in the branch field [22:59:43] at least it does that on polygerrit's ui [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Evening SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180730T2300). 
[23:00:04] davidwbarratt, CFisch_WMDE, and brion: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:18] paladox: hmm, i'll try that :D [23:00:20] (03CR) 10Dzahn: [C: 032] webperf: Move require_package for PHP from role to XHGui profile [puppet] - 10https://gerrit.wikimedia.org/r/449367 (https://phabricator.wikimedia.org/T180761) (owner: 10Krinkle) [23:00:20] here! [23:00:23] here! [23:00:24] i was the one who worked on that cherry pick dialog :) [23:00:36] \o/ [23:01:04] ah, interesting, if you give it the previously cherry-picked unmerged commit hash, it will base on that and still submit to the same branch? [23:01:14] hmm, doesn't seem to grok it in the version we've got if im doing it right [23:01:40] polygerrit=1 ? [23:02:33] nope, still comes back with 'cannot find /refs/head/blah' [23:02:38] ah well :D [23:02:50] oh [23:03:02] i missed up the rebase dialog. [23:03:05] :D [23:04:33] (03CR) 10Dzahn: [C: 04-1] "so now we would also have to move the parameter defaults to not change the config generated from .erb.. where does it end ?:) do we want t" [puppet] - 10https://gerrit.wikimedia.org/r/446242 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [23:05:15] brion https://bugs.chromium.org/p/gerrit/issues/detail?id=9504 [23:05:30] i amended to seomthing that was already beta-picked.. what is it called if the cherries have to be picked a second time :p [23:06:00] pitting ? :P :) [23:06:04] paladox: thx, i starred it :D [23:06:10] your welcome :) [23:07:12] Soooo anyone doing SWAT? :-) [23:09:04] I can SWAT [23:09:32] Krinkle: is this still a -1 https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/448146/ ? 
[23:09:47] Krinkle seems that is supported when using the rest api https://gerrit-review.googlesource.com/Documentation/rest-api-changes.html#cherrypick-input [23:09:56] (03CR) 10BBlack: [C: 032] wikimediafoundation.org: switch IPs to Automattic [dns] - 10https://gerrit.wikimedia.org/r/449341 (https://phabricator.wikimedia.org/T198922) (owner: 10BBlack) [23:10:10] (03CR) 10Krinkle: "Thanks" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 (https://phabricator.wikimedia.org/T199919) (owner: 10Dbarratt) [23:10:12] thcipriani: nope [23:10:18] thanks [23:10:33] (03PS4) 10Thcipriani: Enable Special:Block Feedback Request (2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 (https://phabricator.wikimedia.org/T199919) (owner: 10Dbarratt) [23:10:43] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 (https://phabricator.wikimedia.org/T199919) (owner: 10Dbarratt) [23:10:45] brion hmm, apparently what you may want may actually be supported already in the rest api (just not the ui yet) [23:10:46] https://gerrit-review.googlesource.com/Documentation/rest-api-changes.html#cherrypick-input [23:10:50] nice [23:10:55] is base what you want? [23:11:13] paladox: yeah that sounds right [23:11:27] cherry-pick on top of specific base with the same target branch [23:11:41] though destination is still required. [23:12:02] (03Merged) 10jenkins-bot: Enable Special:Block Feedback Request (2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 (https://phabricator.wikimedia.org/T199919) (owner: 10Dbarratt) [23:12:36] davidwbarratt: ^ is live on mwdebug1002, check please [23:12:42] checking... [23:13:43] thcipriani looks good to me! [23:13:54] thanks for checking, going live [23:14:04] brion looks like it should be easy to add in polygerrit. 
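Note on the cherry-pick exchange above: a minimal sketch of the request paladox and brion are describing, assuming the CherryPickInput entity from the linked Gerrit REST documentation, where `destination` names the target branch and `base` is the commit the new change should sit on. The host name, change number, branch name, and SHA-1 here are hypothetical placeholders, and the sketch only builds the URL and JSON body; it does not send anything or handle authentication.

```python
import json
from urllib.parse import quote


def cherry_pick_request(host, change_id, revision, destination, base, message=None):
    """Build the URL and JSON body for Gerrit's cherry-pick REST endpoint.

    Assumption: the server's CherryPickInput accepts a `base` field, as the
    linked documentation describes. `destination` stays mandatory, matching
    the remark in the log.
    """
    url = "https://{}/a/changes/{}/revisions/{}/cherrypick".format(
        host, quote(change_id, safe=""), quote(revision, safe="")
    )
    body = {"destination": destination, "base": base}
    if message is not None:
        # Optional replacement commit message for the cherry-picked change.
        body["message"] = message
    return url, json.dumps(body, sort_keys=True)


# Hypothetical values, for illustration only.
url, body = cherry_pick_request(
    host="gerrit.example.org",
    change_id="12345",
    revision="current",
    destination="release-branch",
    base="0123456789abcdef0123456789abcdef01234567",
)
print(url)
print(body)
```

For a dependent series, each request after the first would presumably pass the commit created by the previous cherry-pick as `base`, so the stack keeps its order on the destination branch.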
[23:14:05] https://github.com/GerritCodeReview/gerrit/blob/be9c88a8031022ca0596d4fc61ba30477f418073/polygerrit-ui/app/elements/change/gr-change-actions/gr-change-actions.js#L975 [23:14:11] nice [23:15:32] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:448146|Enable Special:Block Feedback Request (2)]] T199919 (duration: 00m 49s) [23:15:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:37] T199919: Enable Partial Block banner on all Wikimedia wikis - https://phabricator.wikimedia.org/T199919 [23:15:41] ^ davidwbarratt live everywhere now [23:16:21] thcipriani looks fantastic! thanks! [23:16:27] yw [23:17:57] (03PS11) 10Krinkle: webperf: Move site vars to profile class params (set from Hiera) [puppet] - 10https://gerrit.wikimedia.org/r/443739 (https://phabricator.wikimedia.org/T195314) [23:18:34] brion: can you make the cherry-picks for those 3 patches? The gerrit UI seems to complain about it (sorry if this was talked about in scrollback and I missed it) [23:19:09] thcipriani: lemme try doing it manually, see if that works [23:19:20] I'd like to merge all 3, pull to mwdebug1002, have you test there, then do a full scap sync (since there are l10n changes it seems) [23:19:22] if not we may have to do them one at a time through the current gerrit ui [23:19:43] (03PS4) 10Dzahn: jenkins: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/434538 (https://phabricator.wikimedia.org/T194724) [23:19:43] thank you! 
[23:20:27] brion see https://gerrit-review.googlesource.com/c/gerrit/+/190770 [23:23:28] (03PS9) 10Krinkle: webperf: Rename webperf profiles for clarity [puppet] - 10https://gerrit.wikimedia.org/r/443752 (https://phabricator.wikimedia.org/T195314) [23:23:30] (03CR) 10jenkins-bot: Enable Special:Block Feedback Request (2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/448146 (https://phabricator.wikimedia.org/T199919) (owner: 10Dbarratt) [23:23:38] (03CR) 10Krinkle: "(Rebase)" [puppet] - 10https://gerrit.wikimedia.org/r/443752 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [23:23:46] CFisch_WMDE: your change live on mwdebug1002, check please [23:23:53] (03PS1) 10Ayounsi: ip -6 route show, don't try to find the IP, but keyword instead [puppet] - 10https://gerrit.wikimedia.org/r/449371 (https://phabricator.wikimedia.org/T195365) [23:23:59] thcipriani: [23:24:00] sighhhh having trouble pushing manually to branch [23:24:02] jepp [23:24:03] (03PS8) 10Krinkle: webperf: Rename role::xenon to profile::webperf::xenon [puppet] - 10https://gerrit.wikimedia.org/r/443757 (https://phabricator.wikimedia.org/T195312) [23:24:09] (03PS8) 10Krinkle: mediawiki: Change xenon interval for Beta Cluster from 10min to 30s [puppet] - 10https://gerrit.wikimedia.org/r/443762 [23:24:49] (03PS6) 10Krinkle: webperf: Enable xenondata_host on perfsite in Beta Cluster [puppet] - 10https://gerrit.wikimedia.org/r/443764 (https://phabricator.wikimedia.org/T195312) [23:24:55] thcipriani: Works, all fine, can go live! 
[23:25:04] CFisch_WMDE: thanks for checking, going live now [23:25:10] (03PS7) 10Krinkle: webperf: Split Redis from the rest of the arclamp profile [puppet] - 10https://gerrit.wikimedia.org/r/444331 (https://phabricator.wikimedia.org/T195312) [23:25:40] ah i think i'm doing it right now [23:26:25] (03PS5) 10Krinkle: webperf: Add arclamp profile to webperf::profiling_tools role [puppet] - 10https://gerrit.wikimedia.org/r/445066 (https://phabricator.wikimedia.org/T195312) [23:27:04] !log thcipriani@deploy1001 Synchronized php-1.32.0-wmf.14/extensions/RevisionSlider/modules: SWAT: [[gerrit:449198|RevisionSlider: Fix missing pin icon]] T200263 (duration: 00m 49s) [23:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:09] T200263: Pin icon does not show when RevisionSlider is expanded - https://phabricator.wikimedia.org/T200263 [23:27:15] ^ CFisch_WMDE change is live now [23:27:22] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler02/11925/" [puppet] - 10https://gerrit.wikimedia.org/r/449371 (https://phabricator.wikimedia.org/T195365) (owner: 10Ayounsi) [23:27:58] brion also upstream have some desgn mocks for polygerrit's change view https://groups.google.com/forum/#!topic/repo-discuss/H4pgIYhgEc4 :) [23:28:07] (03CR) 10Ayounsi: [C: 032] ip -6 route show, don't try to find the IP, but keyword instead [puppet] - 10https://gerrit.wikimedia.org/r/449371 (https://phabricator.wikimedia.org/T195365) (owner: 10Ayounsi) [23:28:35] thcipriani: Thanks :-)! 
[23:28:40] yw [23:28:45] thcipriani: ok i've got proper cherry-picks to branch now [23:29:00] brion: awesome, thank you, I'll check them out [23:29:05] thx [23:31:11] (03PS8) 10Dzahn: netbox: add psql dump cron and back it up [puppet] - 10https://gerrit.wikimedia.org/r/447842 (https://phabricator.wikimedia.org/T190184) [23:31:17] now we play the jenkins waiting game [23:31:29] later we'll play the scap waiting game [23:31:30] heheh [23:31:49] releng: big gamers [23:31:58] big waiting gamers [23:32:49] jouncebot: entertain [23:32:51] achievement unlocked: +2 on branch patch [23:34:28] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [23:35:11] !log re-enabling puppet on cp40* - T195365 [23:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:35:15] T195365: cp intermittent IPsec MTU issue - https://phabricator.wikimedia.org/T195365 [23:44:28] thcipriani: Can you run an extension maintenance script for testwiki? It's called populateDraftQueue.php, in extensions/PageTriage/maintenance [23:45:17] Niharika: sure [23:45:23] what's the new terbium called? [23:45:38] thcipriani: mwmaint1001.eqiad.wmnet [23:45:51] ah, thanks [23:45:55] We need to name it though. [23:46:13] mwmaint1001 just rolls off the tongue [23:46:16] Or maybe it's just terbium forever. [23:46:22] terbiumjr [23:46:42] Terbium the Second. [23:46:50] :D [23:48:17] you still have "deployment-terbium" in deployment-prep.. but rename it please ;) [23:49:05] can't really rename labs servers [23:49:11] can replace them [23:49:29] it's like the star trek transporter [23:49:36] Niharika: https://phabricator.wikimedia.org/P7402 [23:49:40] you can create a new machine and delete the old one ;) [23:49:40] sfkat server-formerly-known-as-terbium [23:50:02] thcipriani: Thank you! 
[23:50:05] yw :) [23:52:31] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Investigate and improve memory allocation rates of WDQS - https://phabricator.wikimedia.org/T181988 (10Smalyshev) 05Open>03Resolved [23:59:06] 10Operations, 10DBA, 10JADE, 10Scoring-platform-team, 10TechCom-RFC: Introduce a new namespace for collaborative judgments about wiki entities - https://phabricator.wikimedia.org/T200297 (10awight) @mark I think you're right that our future potential use cases involving bulk bots should be dropped and re... [23:59:52] thcipriani: all right, jenkins dance complete!