[01:40:58] 10Operations, 10Ops-Access-Requests: Please restore prod and gerrit access for J. Forrester - https://phabricator.wikimedia.org/T175505#3595204 (10Jdforrester-WMF) [02:34:44] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.17) (duration: 08m 10s) [02:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:41:43] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon Sep 11 02:41:43 UTC 2017 (duration 7m 0s) [02:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:41:36] (03PS1) 10Andrew Bogott: Add wmcs-roots to role::wmcs::openstack::main::control [puppet] - 10https://gerrit.wikimedia.org/r/377186 [03:42:09] (03CR) 10Andrew Bogott: [C: 032] Add wmcs-roots to role::wmcs::openstack::main::control [puppet] - 10https://gerrit.wikimedia.org/r/377186 (owner: 10Andrew Bogott) [03:50:00] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3595276 (10Jarekt) Two days ago I cleaned up [[ http... [04:25:49] (03PS1) 10Andrew Bogott: WIP: nova: turn off hourly instance usage audits [puppet] - 10https://gerrit.wikimedia.org/r/377187 [06:17:37] DB question: So if a database is doing a query that has a LIMIT, an ORDER BY and a join that's using a temporary table, would MySQL collect all the relevant rows into a temporary table, and then sort them, or would it be smart enough to insert rows into the temporary table in sorted (so far) order, and discard rows from the temporary table that clearly can't be returned because they sort too low, once... [06:17:39] ...there are more than LIMIT number of rows in the table?
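The two strategies contrasted in the question above can be sketched in plain Python. This is only an illustration of the idea and makes no claim about what MySQL or MariaDB actually does internally; the function names `sort_then_limit` and `bounded_top_n` are hypothetical, and rows are modelled as plain integers ordered descending.

```python
import heapq


def sort_then_limit(rows, limit):
    """Strategy 1: collect every row, sort them all, then keep `limit` rows."""
    return sorted(rows, reverse=True)[:limit]


def bounded_top_n(rows, limit):
    """Strategy 2: never hold more than `limit` rows at once.

    A min-heap keeps the smallest retained row at heap[0], so any incoming
    row that sorts lower than it can be discarded immediately.
    """
    if limit <= 0:
        return []
    heap = []
    for row in rows:
        if len(heap) < limit:
            heapq.heappush(heap, row)
        elif row > heap[0]:
            # The new row beats the weakest retained row: swap them.
            heapq.heapreplace(heap, row)
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    sample = [5, 1, 9, 3, 7, 9, 2]
    print(sort_then_limit(sample, 3))
    print(bounded_top_n(sample, 3))
```

Both functions return the same rows; the difference is that the second one holds at most `limit` rows in memory at any time, which is the property the question is asking about.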
[06:19:49] This isn't that important, I'm just trying to figure out how certain things work, and now I'm curious [06:56:07] (03PS2) 10Muehlenhoff: Remove remaining salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376702 [07:03:56] (03CR) 10Muehlenhoff: [C: 032] Remove remaining salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376702 (owner: 10Muehlenhoff) [07:07:59] 10Operations, 10Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#3595338 (10MoritzMuehlenhoff) [07:08:03] 10Operations, 10Security-Team, 10vm-requests, 10Patch-For-Review: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#3595336 (10MoritzMuehlenhoff) 05declined>03Open >>! In T138650#3592768, @Dzahn wrote: > @EddieGP Thanks, yea. I removed it. Should be all done now,... [07:09:09] <_joe_> bawolff: sorted so far order for a temp table (which thus has to be on disk) is extremely expensive [07:09:39] <_joe_> then you can set temp tables to be on tmpfs, but it would still be a horrible choice [07:10:01] <_joe_> also, don't think of temp tables as simple representations of the data [07:10:04] <_joe_> they're not [07:10:18] <_joe_> but jaime surely knows more details :) [07:10:42] I know they're terrible. Alas that's how watchlists work :s [07:10:45] (03PS1) 10Muehlenhoff: Update comment [puppet] - 10https://gerrit.wikimedia.org/r/377200 [07:11:45] I posted some stuff on T171027 and I'm eagerly awaiting jynus to tell me how I'm totally wrong about how dbs work :) [07:11:45] T171027: "2062 Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027 [07:13:16] From what I understand, temp tables will be in memory if they are small.
So if MariaDB discards rows as it collects them, that would have a significant positive impact on queries involving large temporary tables (albeit they would still be terrible) [07:14:09] If it only discards rows at the end, then the temp table would grow really big for queries that have a LIMIT but look at lots of rows [07:15:22] (03PS1) 10Muehlenhoff: Stop installing debdeploy-minion for new WMCS VPS instances [puppet] - 10https://gerrit.wikimedia.org/r/377202 [07:15:29] (03CR) 10Muehlenhoff: [C: 032] Update comment [puppet] - 10https://gerrit.wikimedia.org/r/377200 (owner: 10Muehlenhoff) [07:17:04] (03CR) 10Gehel: "We are already using gelf4j (available as the Debian package `liblogstash-gelf-java`) for a number of applications. A more direct log4j inp" [puppet] - 10https://gerrit.wikimedia.org/r/377058 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [07:20:00] (03PS1) 10Muehlenhoff: Stop creating salt grains for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/377203 [07:25:38] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[07:26:19] <_joe_> moritzm: ^^ [07:26:22] (03PS1) 10Giuseppe Lavagetto: mobileapps: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377204 [07:26:24] (03PS1) 10Giuseppe Lavagetto: mathoid: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377205 [07:26:26] (03PS1) 10Giuseppe Lavagetto: cxserver: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377206 [07:26:28] (03PS1) 10Giuseppe Lavagetto: apertium: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377207 [07:26:30] (03PS1) 10Giuseppe Lavagetto: trendingedits: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377208 [07:26:32] (03PS1) 10Giuseppe Lavagetto: eventstreams: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377209 [07:26:34] (03PS1) 10Giuseppe Lavagetto: pdfrender: switch to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377210 [07:26:36] (03PS1) 10Giuseppe Lavagetto: changeprop: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377211 [07:26:38] (03PS1) 10Giuseppe Lavagetto: graphoid: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377212 [07:26:55] <_joe_> akosiaris, mobrovac ^^ [07:27:01] <_joe_> see? I hold my promises :P [07:27:28] nice! [07:27:44] (03CR) 10jerkins-bot: [V: 04-1] trendingedits: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377208 (owner: 10Giuseppe Lavagetto) [07:28:13] (03CR) 10jerkins-bot: [V: 04-1] eventstreams: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377209 (owner: 10Giuseppe Lavagetto) [07:28:58] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3595350 (10Johnuniq) @Jarekt [[https://commons.wikim... 
[07:29:23] ah, sorry, merging [07:29:38] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [07:37:19] <_joe_> uhm [07:41:08] !log installing remaining libonig security updates [07:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:40] (03PS1) 10Mobrovac: CP-JobQueue: Add the service to SCB [puppet] - 10https://gerrit.wikimedia.org/r/377213 (https://phabricator.wikimedia.org/T175281) [07:59:35] !log installing systemd updates from stretch point update [07:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:20] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/377203 (owner: 10Muehlenhoff) [08:03:08] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active, AS1299/IPv6: Active [08:05:01] (03PS2) 10Giuseppe Lavagetto: trendingedits: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377208 [08:05:03] (03PS2) 10Giuseppe Lavagetto: eventstreams: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377209 [08:05:05] (03PS2) 10Giuseppe Lavagetto: pdfrender: switch to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377210 [08:05:07] (03PS2) 10Giuseppe Lavagetto: changeprop: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377211 [08:05:09] (03PS2) 10Giuseppe Lavagetto: graphoid: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377212 [08:05:58] PROBLEM - puppet last run on ms-be2037 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[initramfs-tools] [08:09:40] the BGP alert seems to be Telia, judging from the cr1-ulsfo logs, and afaics there is a maintenance scheduled [08:15:14] (03CR) 10Giuseppe Lavagetto: [C: 032] mobileapps: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377204 (owner: 10Giuseppe Lavagetto) [08:15:45] (03CR) 10Giuseppe Lavagetto: [C: 032] mathoid: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377205 (owner: 10Giuseppe Lavagetto) [08:16:21] !log reimage script tests continue on mc100[1-2] T166300 [08:16:24] (03CR) 10Giuseppe Lavagetto: [C: 032] cxserver: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377206 (owner: 10Giuseppe Lavagetto) [08:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:35] T166300: Remove Salt from wmf-auto-reimage / wmf-reimage - https://phabricator.wikimedia.org/T166300 [08:18:51] (03CR) 10Mobrovac: "PCC OK - https://puppet-compiler.wmflabs.org/compiler02/7768/" [puppet] - 10https://gerrit.wikimedia.org/r/377213 (https://phabricator.wikimedia.org/T175281) (owner: 10Mobrovac) [08:18:55] (03PS2) 10Muehlenhoff: Stop creating salt grains for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/377203 [08:20:38] (03CR) 10Muehlenhoff: [C: 032] Stop creating salt grains for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/377203 (owner: 10Muehlenhoff) [08:22:29] RECOVERY - BGP status on cr1-ulsfo is OK: BGP OK - up: 16, down: 0, shutdown: 0 [08:29:57] (03PS2) 10Giuseppe Lavagetto: apertium: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377207 [08:30:05] (03PS1) 10Muehlenhoff: Switch role::debdeploy::master role to Cumin-based variant [puppet] - 10https://gerrit.wikimedia.org/r/377214 [08:31:19] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - 
https://phabricator.wikimedia.org/T171392#3595457 (10Verdy_p) @Jarekt, your cleanup has change... [08:32:33] (03CR) 10Giuseppe Lavagetto: [C: 032] apertium: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377207 (owner: 10Giuseppe Lavagetto) [08:33:41] RECOVERY - puppet last run on ms-be2037 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [08:34:29] (03Abandoned) 10Muehlenhoff: Remove elasticsearch salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376241 (owner: 10Muehlenhoff) [08:34:31] (03PS3) 10Giuseppe Lavagetto: trendingedits: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377208 [08:34:50] (03Abandoned) 10Muehlenhoff: Remove labtest salt grains previously used by debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/376271 (owner: 10Muehlenhoff) [08:35:19] (03CR) 10Giuseppe Lavagetto: [C: 032] trendingedits: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377208 (owner: 10Giuseppe Lavagetto) [08:35:37] (03PS3) 10Giuseppe Lavagetto: eventstreams: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377209 [08:36:13] (03CR) 10Giuseppe Lavagetto: [C: 032] eventstreams: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377209 (owner: 10Giuseppe Lavagetto) [08:36:14] 10Operations, 10Cloud-Services: rack/setup/install labstore100[67].wikimedia.org - https://phabricator.wikimedia.org/T167984#3595460 (10fgiunchedi) @madhuvishy no problem! 
I've added a section about this to https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/HP_DL3N0_Gen9 too [08:36:53] (03PS3) 10Giuseppe Lavagetto: pdfrender: switch to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377210 [08:38:31] (03CR) 10Giuseppe Lavagetto: [C: 032] pdfrender: switch to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377210 (owner: 10Giuseppe Lavagetto) [08:38:55] (03PS3) 10Giuseppe Lavagetto: changeprop: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377211 [08:40:26] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3595465 (10Vachovec1) Looks like the module implemen... [08:40:45] (03CR) 10Giuseppe Lavagetto: [C: 032] changeprop: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377211 (owner: 10Giuseppe Lavagetto) [08:41:23] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 1.3 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/376674 (https://phabricator.wikimedia.org/T173580) (owner: 10Gilles) [08:41:48] (03PS3) 10Giuseppe Lavagetto: graphoid: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377212 [08:42:44] (03CR) 10Giuseppe Lavagetto: [C: 032] graphoid: convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377212 (owner: 10Giuseppe Lavagetto) [08:44:24] !log mobrovac@tin Started deploy [zotero/translators@a0c41c3]: Update translators to upstream e03695273 - T174992 [08:44:32] !log mobrovac@tin Finished deploy [zotero/translators@a0c41c3]: Update translators to upstream e03695273 - T174992 (duration: 00m 07s) [08:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:37] T174992: Update zotero translators - https://phabricator.wikimedia.org/T174992 [08:44:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:50] (03PS2) 10Filippo Giunchedi: Upgrade to 1.3 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/376674 (https://phabricator.wikimedia.org/T173580) (owner: 10Gilles) [08:45:21] (03CR) 10Hashar: [C: 04-1] "From a quick chat with Moritz last week, it seems we will publish the packages to apt.wikimedia.org under a standalone component such as " [puppet] - 10https://gerrit.wikimedia.org/r/374837 (owner: 10Hashar) [08:45:28] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 1.3 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/376674 (https://phabricator.wikimedia.org/T173580) (owner: 10Gilles) [08:51:53] (03CR) 10Mobrovac: "I think this will break CI because AFAIK it requires mathoid::packages to exist" [puppet] - 10https://gerrit.wikimedia.org/r/377205 (owner: 10Giuseppe Lavagetto) [08:51:57] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3595482 (10Verdy_p) >>! In T171392#3588777, @Anomie... 
[09:00:26] (03PS2) 10Filippo Giunchedi: Thumbor: enable new MAX_ANIMATED_GIF_AREA option [puppet] - 10https://gerrit.wikimedia.org/r/376676 (https://phabricator.wikimedia.org/T173580) (owner: 10Gilles) [09:03:22] (03CR) 10Filippo Giunchedi: [C: 032] Thumbor: enable new MAX_ANIMATED_GIF_AREA option [puppet] - 10https://gerrit.wikimedia.org/r/376676 (https://phabricator.wikimedia.org/T173580) (owner: 10Gilles) [09:06:15] (03CR) 10DCausse: [C: 031] logstash - configure new logstash100[7-9] nodes [puppet] - 10https://gerrit.wikimedia.org/r/376488 (https://phabricator.wikimedia.org/T175045) (owner: 10Gehel) [09:06:31] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/377214 (owner: 10Muehlenhoff) [09:08:35] (03PS1) 10Mobrovac: ChangeProp: Separate packages into profile::changeprop::packages [puppet] - 10https://gerrit.wikimedia.org/r/377218 [09:08:45] _joe_: ^ [09:10:03] (03CR) 10Giuseppe Lavagetto: [C: 032] ChangeProp: Separate packages into profile::changeprop::packages [puppet] - 10https://gerrit.wikimedia.org/r/377218 (owner: 10Mobrovac) [09:10:31] (03PS2) 10Ema: varnish: convert role::cache::instances into a class [puppet] - 10https://gerrit.wikimedia.org/r/376521 [09:11:05] (03CR) 10Ema: [V: 032 C: 032] varnish: convert role::cache::instances into a class [puppet] - 10https://gerrit.wikimedia.org/r/376521 (owner: 10Ema) [09:11:10] PROBLEM - Check systemd state on thumbor2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:11:44] that's me ^ silencing [09:11:52] (03PS2) 10Mobrovac: CP-JobQueue: Add the service to SCB [puppet] - 10https://gerrit.wikimedia.org/r/377213 (https://phabricator.wikimedia.org/T175281) [09:12:51] PROBLEM - Host mc1002 is DOWN: PING CRITICAL - Packet loss = 100% [09:13:11] this is me ^^^ [09:13:23] (03CR) 10Giuseppe Lavagetto: [C: 032] CP-JobQueue: Add the service to SCB [puppet] - 10https://gerrit.wikimedia.org/r/377213 (https://phabricator.wikimedia.org/T175281) (owner: 10Mobrovac) [09:14:00] RECOVERY - Host mc1002 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [09:15:07] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3595634 (10Verdy_p) If you want a similar kind of er... [09:18:50] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. 
Failed resources (up to 3 shown): Package[cpjobqueue/deploy] [09:19:48] (03PS3) 10Ema: prometheus: add aggregation rules for IPVS [puppet] - 10https://gerrit.wikimedia.org/r/376665 [09:19:57] (03CR) 10Ema: [V: 032 C: 032] prometheus: add aggregation rules for IPVS [puppet] - 10https://gerrit.wikimedia.org/r/376665 (owner: 10Ema) [09:21:01] (03PS1) 10Filippo Giunchedi: thumbor: add logstash config for codfw [puppet] - 10https://gerrit.wikimedia.org/r/377220 (https://phabricator.wikimedia.org/T150734) [09:22:00] (03PS2) 10Filippo Giunchedi: thumbor: add logstash config for codfw [puppet] - 10https://gerrit.wikimedia.org/r/377220 (https://phabricator.wikimedia.org/T150734) [09:24:20] (03PS2) 10MarcoAurelio: Update es.wiktionary logo from SVG version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374370 [09:24:47] (03CR) 10Muehlenhoff: Add a new debdeploy command query_version (0311 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376715 (owner: 10Muehlenhoff) [09:25:01] (03PS3) 10Muehlenhoff: Add a new debdeploy command query_version [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376715 [09:27:58] (03PS3) 10Filippo Giunchedi: thumbor: add logstash config for codfw [puppet] - 10https://gerrit.wikimedia.org/r/377220 (https://phabricator.wikimedia.org/T150734) [09:30:00] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [09:31:49] (03PS4) 10Filippo Giunchedi: thumbor: add logstash config for codfw [puppet] - 10https://gerrit.wikimedia.org/r/377220 (https://phabricator.wikimedia.org/T150734) [09:32:21] (03CR) 10Filippo Giunchedi: [C: 032] thumbor: add logstash config for codfw [puppet] - 10https://gerrit.wikimedia.org/r/377220 (https://phabricator.wikimedia.org/T150734) (owner: 10Filippo Giunchedi) [09:34:30] RECOVERY - Check systemd state on thumbor2001 is OK: OK - running: The system is fully operational [09:36:30] PROBLEM - puppet last run on scb2004 is CRITICAL: CRITICAL: 
Puppet has 1 failures. Last run 13 seconds ago with 1 failures. Failed resources (up to 3 shown) [09:36:40] PROBLEM - puppet last run on scb2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 20 seconds ago with 1 failures. Failed resources (up to 3 shown) [09:38:28] (03PS3) 10Elukey: Introduce role::analytics_cluster::coordinator [puppet] - 10https://gerrit.wikimedia.org/r/372131 (https://phabricator.wikimedia.org/T167790) [09:38:30] RECOVERY - puppet last run on scb2004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [09:38:40] RECOVERY - puppet last run on scb2003 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [09:39:34] !log mobrovac@tin Started deploy [cpjobqueue/deploy@e73a10f]: Initial deploy of cpjobqueue - T175281 [09:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:49] T175281: Separate off ChangePropagation for JobQueue as a new deployment - https://phabricator.wikimedia.org/T175281 [09:39:55] !log mobrovac@tin Finished deploy [cpjobqueue/deploy@e73a10f]: Initial deploy of cpjobqueue - T175281 (duration: 00m 20s) [09:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:20] !log roll-restart thumbor to apply changes for gif max animated area - T173580 [09:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:33] T173580: $wgMaxAnimatedGifArea is not honored by Thumbor - https://phabricator.wikimedia.org/T173580 [09:42:29] (03CR) 10Volans: [C: 031] "Thanks for all the fixes, see a couple of replies inline, both non-blocking, so the +1 in case you want to proceed as is." 
(033 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376715 (owner: 10Muehlenhoff) [09:42:59] (03PS4) 10Elukey: Introduce role::analytics_cluster::coordinator [puppet] - 10https://gerrit.wikimedia.org/r/372131 (https://phabricator.wikimedia.org/T167790) [09:43:01] 10Operations, 10monitoring: Review check_puppetrun frequency - https://phabricator.wikimedia.org/T173427#3595696 (10akosiaris) >>! In T173427#3592881, @herron wrote: >>! In T173427#3591926, @akosiaris wrote: >> >> It's about avoiding transient network failure positives and making sure the failure is not a tran... [09:45:34] (03PS1) 10Muehlenhoff: Remove salt grains from app server canaries [puppet] - 10https://gerrit.wikimedia.org/r/377222 [09:46:31] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Separate off ChangePropagation for JobQueue as a new deployment - https://phabricator.wikimedia.org/T175281#3595700 (10mobrovac) 05Open>03Resolved Everything is set up now, and the `cpjobqueue` service is live in production on the SCB c... 
[09:49:17] (03PS5) 10Elukey: Introduce role::analytics_cluster::coordinator [puppet] - 10https://gerrit.wikimedia.org/r/372131 (https://phabricator.wikimedia.org/T167790) [09:52:15] (03PS3) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 [09:52:44] (03CR) 10jerkins-bot: [V: 04-1] restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 (owner: 10ArielGlenn) [09:54:26] (03PS4) 10ArielGlenn: restructure dumps webserver, zim manifests to module/role/profile [puppet] - 10https://gerrit.wikimedia.org/r/376750 [09:55:35] 10Operations, 10Goal, 10Kubernetes, 10Services (watching), 10User-Joe: Build a slim container for fluentd - https://phabricator.wikimedia.org/T175527#3595715 (10Joe) [09:58:56] 10Operations, 10Thumbor, 10Performance-Team (Radar), 10User-fgiunchedi: thumbor1003 behaves differently than other thumbor hosts - https://phabricator.wikimedia.org/T174997#3595746 (10fgiunchedi) For some reason the `MemoryLimit=15%` change from https://gerrit.wikimedia.org/r/#/c/367373/ doesn't seem to be... 
[10:01:03] (03PS2) 10Alexandros Kosiaris: Revert "sshd_config: Increase MaxAuthTries" [puppet] - 10https://gerrit.wikimedia.org/r/376735 (https://phabricator.wikimedia.org/T172333) [10:02:30] (03CR) 10Muehlenhoff: Add a new debdeploy command query_version (032 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376715 (owner: 10Muehlenhoff) [10:07:12] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "sshd_config: Increase MaxAuthTries" [puppet] - 10https://gerrit.wikimedia.org/r/376735 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [10:07:21] (03PS1) 10Gilles: Upgrade to 1.4 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/377226 (https://phabricator.wikimedia.org/T173580) [10:08:32] (03PS2) 10Gilles: Upgrade to 1.4 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/377226 (https://phabricator.wikimedia.org/T173580) [10:09:56] (03CR) 10Elukey: [C: 032] "Looks good on https://puppet-compiler.wmflabs.org/compiler02/7774/" [puppet] - 10https://gerrit.wikimedia.org/r/372131 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [10:10:02] (03PS6) 10Elukey: Introduce role::analytics_cluster::coordinator [puppet] - 10https://gerrit.wikimedia.org/r/372131 (https://phabricator.wikimedia.org/T167790) [10:10:05] 10Operations, 10Thumbor, 10Performance-Team (Radar), 10User-fgiunchedi: thumbor1003 behaves differently than other thumbor hosts - https://phabricator.wikimedia.org/T174997#3595807 (10Gilles) I suspected it was something like that :) Should be an easy fix, then! [10:15:43] no op on analytics1003 [10:18:32] 10Operations, 10ops-eqiad: Run hardware checks on mw1294 - https://phabricator.wikimedia.org/T167406#3595868 (10MoritzMuehlenhoff) 05Open>03Resolved Yeah, let's close it. If it still happens, we can just as well reopen this task. Thanks! 
[10:28:34] (03PS2) 10Muehlenhoff: Switch role::debdeploy::master role to Cumin-based variant [puppet] - 10https://gerrit.wikimedia.org/r/377214 [10:28:41] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3595881 (10Johnuniq) @Vachovec1 The module I wrote w... [10:29:06] (03CR) 10Muehlenhoff: [C: 032] Switch role::debdeploy::master role to Cumin-based variant [puppet] - 10https://gerrit.wikimedia.org/r/377214 (owner: 10Muehlenhoff) [10:30:54] !log restarting mysql @ labsdb1001 [10:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:59] (03PS1) 10MarcoAurelio: Lift account creation restrictions for WM Taiwan 10th anniversary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377227 (https://phabricator.wikimedia.org/T175534) [10:35:06] (03CR) 10Volans: [C: 031] "ACK" (031 comment) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/376715 (owner: 10Muehlenhoff) [10:35:18] PROBLEM - mysqld processes on labsdb1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [10:35:35] <_joe_> ouch [10:35:50] _joe_: it's jaime, see above [10:36:17] <_joe_> heh, too much traffic in this channel [10:36:19] it actually failed on saturday, I am just cleaning up [10:36:32] (03CR) 10jerkins-bot: [V: 04-1] Lift account creation restrictions for WM Taiwan 10th anniversary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377227 (https://phabricator.wikimedia.org/T175534) (owner: 10MarcoAurelio) [10:36:33] PROBLEM - DPKG on labsdb1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:36:46] See T175487 [10:36:46] T175487: Significant replication lag for the s1, s2, and s4 wikis on labsdb100[13] - https://phabricator.wikimedia.org/T175487 [10:37:33] RECOVERY - DPKG on labsdb1001 is OK: All packages OK [10:41:28] RECOVERY 
- mysqld processes on labsdb1001 is OK: PROCS OK: 1 process with command name mysqld [10:41:29] (03PS2) 10MarcoAurelio: Lift account creation restrictions for WM Taiwan 10th anniversary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377227 (https://phabricator.wikimedia.org/T175534) [10:42:08] (03PS1) 10Muehlenhoff: Convert debdeploy to a profile and make it part of role::cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/377228 [10:47:40] 10Operations, 10Goal, 10Kubernetes, 10Services (watching), 10User-Joe: Build containers for statsd, prometheus-statsd-exporter - https://phabricator.wikimedia.org/T175539#3595929 (10Joe) [10:49:21] (03PS2) 10Muehlenhoff: Convert debdeploy to a profile and make it part of role::cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/377228 [10:57:50] (03CR) 10Giuseppe Lavagetto: [C: 031] Convert debdeploy to a profile and make it part of role::cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/377228 (owner: 10Muehlenhoff) [10:58:14] (03PS1) 10ArielGlenn: convert 'other' dump cron jobs into role/profile under dumps [puppet] - 10https://gerrit.wikimedia.org/r/377231 (https://phabricator.wikimedia.org/T175528) [10:58:28] (03CR) 10jerkins-bot: [V: 04-1] convert 'other' dump cron jobs into role/profile under dumps [puppet] - 10https://gerrit.wikimedia.org/r/377231 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [11:06:43] (03CR) 10Muehlenhoff: [C: 032] Convert debdeploy to a profile and make it part of role::cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/377228 (owner: 10Muehlenhoff) [11:08:43] 10Operations, 10Commons, 10MediaWiki-extensions-Scribunto, 10Patch-For-Review, 10Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3596005 (10Verdy_p) Note that one or the three bugs... 
[11:14:48] (03PS20) 10Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [11:15:36] (03CR) 10Matthias Mullie: [C: 031] "This is currently cherry-picked on beta & seems fine. I suspect this can be merged?" [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [11:26:36] 10Operations, 10monitoring, 10Patch-For-Review: Fix Icinga checks for test/decom servers - https://phabricator.wikimedia.org/T151632#3596051 (10akosiaris) @Dzahn, does the above sounds reasonable ? [11:30:17] (03PS1) 10Muehlenhoff: Remove old debdeploy/salt packages via puppet [puppet] - 10https://gerrit.wikimedia.org/r/377236 (https://phabricator.wikimedia.org/T164817) [11:33:31] (03CR) 10jerkins-bot: [V: 04-1] Remove old debdeploy/salt packages via puppet [puppet] - 10https://gerrit.wikimedia.org/r/377236 (https://phabricator.wikimedia.org/T164817) (owner: 10Muehlenhoff) [11:36:56] (03PS2) 10Muehlenhoff: Remove old debdeploy/salt packages via puppet [puppet] - 10https://gerrit.wikimedia.org/r/377236 (https://phabricator.wikimedia.org/T164817) [11:43:15] !log installing file/libmagic security updates (stretch-specific, does not affect older distros) [11:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:38] 10Operations, 10Goal, 10Kubernetes, 10Services (watching), 10User-Joe: Build a slim container for fluentd - https://phabricator.wikimedia.org/T175527#3596137 (10mobrovac) > we might want to improve upon that. Yeah, definitely. Also, I don't have a concrete use case in mind, but I'm going to guess that t... [11:47:35] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:01:06] (03PS1) 10Mobrovac: Kafka: Make all Kafka clients require the same set of packages [puppet] - 10https://gerrit.wikimedia.org/r/377238 [12:01:44] (03PS1) 10Muehlenhoff: Remove salt-based debdeploy code [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/377239 [12:07:41] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/377236 (https://phabricator.wikimedia.org/T164817) (owner: 10Muehlenhoff) [12:07:57] jouncebot: refresh [12:08:00] I refreshed my knowledge about deployments. [12:08:04] jouncebot: next [12:08:04] In 0 hour(s) and 51 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170911T1300) [12:09:30] (03CR) 10Mobrovac: "PCC looking good - https://puppet-compiler.wmflabs.org/compiler02/7780/" [puppet] - 10https://gerrit.wikimedia.org/r/377238 (owner: 10Mobrovac) [12:10:31] godog: are you doing something on rb1009? [12:10:49] seems there's a stopped service there [12:12:24] ah it's the metrics collector [12:12:50] mobrovac: o/ - qq: in eventstreams I can see Service::Packages['eventstreams'] [12:13:05] checking [12:13:16] line 64 [12:13:20] not part of the changeset [12:13:49] (03PS1) 10Muehlenhoff: Add library alias for src:file [puppet] - 10https://gerrit.wikimedia.org/r/377241 [12:13:57] ah, forgot to remove it while c/p the require hehe [12:14:04] thnx for checking elukey, amending [12:14:46] that require is not needed anyway [12:15:04] (03PS2) 10Mobrovac: Kafka: Make all Kafka clients require the same set of packages [puppet] - 10https://gerrit.wikimedia.org/r/377238 [12:15:12] (03CR) 10Muehlenhoff: [C: 032] Add library alias for src:file [puppet] - 10https://gerrit.wikimedia.org/r/377241 (owner: 10Muehlenhoff) [12:15:45] godog: wth? 
"Error: Could not find or load main class org.wikimedia.cassandra.metrics.service.Service" [12:20:41] ah, it seems git-fat failed or something [12:21:03] -rw-r--r-- 1 deploy-service deploy-service 74 Sep 6 09:31 /srv/deployment/cassandra/metrics-collector/lib/cassandra-metrics-collector-4.0.1-jar-with-dependencies.jar [12:21:19] 74 bytes, that's not right [12:22:06] oh but wait, aren't we using prometheus now for cass 3? [12:24:01] (03PS15) 10Ema: varnish::instance: fix template attributes scope [puppet] - 10https://gerrit.wikimedia.org/r/376242 [12:29:37] (03CR) 10Ema: "Looks ok finally!" [puppet] - 10https://gerrit.wikimedia.org/r/376242 (owner: 10Ema) [12:43:24] PROBLEM - puppet last run on db1054 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:48:24] "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Attempt to assign to a reserved variable name: 'trusted' on node db1054.eqiad.wmnet" [12:48:34] RECOVERY - puppet last run on db1054 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [12:48:50] I always ask why that happens someimtes and I forget immediatly later [12:51:04] jynus: it should get fixed when we'll upgrade puppetdb AFAIK, see T153246 and https://phabricator.wikimedia.org/T173427#3595696 in particular [12:51:05] T153246: Puppet failures with "Attempt to assign to a reserved variable name: 'trusted'" - https://phabricator.wikimedia.org/T153246 [12:55:10] bd0 [12:56:15] getting a coffee and I will come for the swat [12:57:21] O/ [12:57:26] Hashar im here [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170911T1300). Please do the needful. 
[13:00:04] geoffreytrang, Zppix, tabbycat, Amir1, and stephanebisson: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [13:00:14] o/ [13:00:16] o/ [13:00:18] present [13:00:26] here [13:00:37] o/ [13:01:08] I will do [config] 374063 Rename Wikisaurus namespace on Wiktionary to "Thesaurus" later on [13:01:11] it is a bit longy [13:01:20] (03CR) 10Hashar: [C: 032] Change logo for huwiktonary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377059 (https://phabricator.wikimedia.org/T175483) (owner: 10Zppix) [13:02:52] for Add Extension:Newsletter permissions to CommonSettings https://gerrit.wikimedia.org/r/#/c/376886/ [13:02:55] that adds the permissions [13:03:09] but iirc we also need to grant permission to some groups to add the permission dont we? [13:03:41] (03PS1) 10Giuseppe Lavagetto: citoid: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377250 [13:03:43] (03PS1) 10Giuseppe Lavagetto: role::scb: only include profiles, not roles [puppet] - 10https://gerrit.wikimedia.org/r/377251 [13:05:06] ah no that is for global groups bah [13:05:40] (03CR) 10Hashar: [C: 032] Add Extension:Newsletter permissions to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376886 (owner: 10MarcoAurelio) [13:05:52] Ci running a bit slow today eh? 
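Note on the 74-byte jar seen on restbase1009 at 12:21: that size is consistent with an unexpanded git-fat placeholder rather than a real archive. The sketch below assumes the placeholder layout git-fat is understood to write (a "#$# git-fat " cookie, a 40-character SHA-1, the size padded to 20 characters, and a newline), which adds up to 74 bytes. The function names are hypothetical helpers for illustration, not part of git-fat or scap.

```python
import re

# Assumed git-fat placeholder layout (not confirmed by this log):
#   "#$# git-fat " + 40 hex chars + " " + size right-aligned in 20 chars + "\n"
STUB_RE = re.compile(rb"#\$# git-fat ([0-9a-f]{40}) +(\d+)\n")


def make_stub(sha1_hex: str, size: int) -> bytes:
    """Build a placeholder in the assumed layout (illustration and tests only)."""
    return b"#$# git-fat %s %20d\n" % (sha1_hex.encode("ascii"), size)


def parse_git_fat_stub(data: bytes):
    """Return (sha1_hex, real_size) if data looks like a placeholder, else None."""
    match = STUB_RE.fullmatch(data)
    if match is None:
        return None
    return match.group(1).decode("ascii"), int(match.group(2))


def is_unexpanded(path: str) -> bool:
    """True if the file at path appears to be a placeholder instead of real content."""
    with open(path, "rb") as handle:
        # Read a little more than a placeholder could need; real jars are far larger.
        return parse_git_fat_stub(handle.read(128)) is not None
```

Running is_unexpanded() on the jar path from the ls output would be one way to confirm that the git-fat pull step never ran for that deploy, assuming the layout above holds.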
[13:06:09] someone +2ed a bunch of patches :/ [13:06:11] hashar: yep, it's for globalgrouppermissions [13:06:30] :/ indeed [13:07:12] (03CR) 10Hashar: [C: 032] Enable WikidataPageBanner for Russian Wikimedia chapter wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376897 (https://phabricator.wikimedia.org/T175356) (owner: 10MarcoAurelio) [13:09:06] (03CR) 10Addshore: Install composer for PHP imaages (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/369838 (https://phabricator.wikimedia.org/T172358) (owner: 10BryanDavis) [13:10:35] hashar: please if you could ping me when your ready for me that would be Great [13:11:58] (03CR) 10Hashar: "Looking at that new version, the text is a light gray and that is barely readable with the Vector gray background?! Do you have the sour" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374370 (owner: 10MarcoAurelio) [13:12:51] Amir1: are you sure about "Reduce wikiPageUpdaterDbBatchSize to 20" https://gerrit.wikimedia.org/r/#/c/376562/1 ?:) [13:13:17] hashar: yeah, I introduced the config variable :D [13:13:26] okkkkk [13:13:28] hashar: worst case scenario, it doesn't work [13:13:37] and the batch size stays at 50 [13:13:43] (03CR) 10Hashar: [C: 032] Reduce wikiPageUpdaterDbBatchSize to 20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376562 (https://phabricator.wikimedia.org/T173710) (owner: 10Ladsgroup) [13:14:00] hashar: https://commons.wikimedia.org/wiki/File:Wiktionary_logo-vector_es.svg [13:14:04] (03CR) 10Alexandros Kosiaris: [C: 04-1] role::scb: only include profiles, not roles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/377251 (owner: 10Giuseppe Lavagetto) [13:14:07] stephanebisson: and I am looking at your patch finally [13:14:14] hashar: btw. 
It's not testable at all (job-related config) [13:14:18] I don't mind halting that patch if there are issues with it [13:15:39] tabbycat: well I found the text color to be too light but maybe that is just me :] Was there a task / community request to add it ? [13:15:51] (03CR) 10MarcoAurelio: "> Do you have the source for the svg?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374370 (owner: 10MarcoAurelio) [13:16:25] ah https://es.wiktionary.org/w/index.php?title=Wikcionario:Caf%C3%A9&oldid=4716095#Logos [13:16:37] hashar: https://es.wiktionary.org/wiki/Wikcionario:Caf%C3%A9#Logos [13:16:41] yep, that's right [13:16:46] but I don't use Vector [13:16:51] so I couldn't notice [13:16:53] and you are active there so I guess it is all fine :] [13:17:00] or maybe it needs a bit more discussion there? [13:17:11] (03PS1) 10Muehlenhoff: Add additional alias [puppet] - 10https://gerrit.wikimedia.org/r/377252 [13:17:16] hashar: but if the text is not going to be visible... well I guess we can stop that one for now [13:17:29] let me take the issue to the community [13:17:39] sigh, that damn logo is giving me a headache :) [13:22:45] (03CR) 10Hashar: "That drops the graphoid::packages class which we use to provision the Graphoid binary dependencies on CI without provisioning the rest. (w" [puppet] - 10https://gerrit.wikimedia.org/r/377212 (owner: 10Giuseppe Lavagetto) [13:23:17] oh my.. [13:23:33] I am catching up with mails, and basically CI is severely throttled [13:24:17] hashar: i get why it's throttled but the throttle is a little ridiculous at this point no? [13:24:21] (03CR) 10Giuseppe Lavagetto: role::scb: only include profiles, not roles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/377251 (owner: 10Giuseppe Lavagetto) [13:24:48] (03CR) 10Muehlenhoff: [C: 032] Add additional alias [puppet] - 10https://gerrit.wikimedia.org/r/377252 (owner: 10Muehlenhoff) [13:25:24] (03CR) 10MarcoAurelio: "I've raised the issue to the community again.
Please do not deploy this today. Thank you." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374370 (owner: 10MarcoAurelio) [13:26:31] (03PS2) 10Andrew Bogott: WIP: nova: turn off hourly instance usage audits [puppet] - 10https://gerrit.wikimedia.org/r/377187 [13:26:33] (03PS1) 10Andrew Bogott: wt-static: reword alert messages [puppet] - 10https://gerrit.wikimedia.org/r/377253 [13:27:04] (03PS2) 10Giuseppe Lavagetto: citoid: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377250 [13:27:06] (03PS2) 10Giuseppe Lavagetto: role::scb: only include profiles, not roles [puppet] - 10https://gerrit.wikimedia.org/r/377251 [13:27:30] (03PS2) 10Andrew Bogott: wt-static: reword alert messages [puppet] - 10https://gerrit.wikimedia.org/r/377253 [13:27:54] so we're waiting for zuul right? [13:28:05] let me know when I can test my patches [13:28:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3466006 (10Cmjohnson) You have successfully submitted request SR953656459. 
[13:28:18] Yes tabbycat [13:29:42] (03PS3) 10Andrew Bogott: wt-static: reword alert messages [puppet] - 10https://gerrit.wikimedia.org/r/377253 [13:29:44] (03PS3) 10Andrew Bogott: WIP: nova: turn off hourly instance usage audits [puppet] - 10https://gerrit.wikimedia.org/r/377187 [13:30:40] (03CR) 10Andrew Bogott: [C: 032] wt-static: reword alert messages [puppet] - 10https://gerrit.wikimedia.org/r/377253 (owner: 10Andrew Bogott) [13:31:12] (03CR) 10Alexandros Kosiaris: [C: 04-1] role::scb: only include profiles, not roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377251 (owner: 10Giuseppe Lavagetto) [13:31:38] (03CR) 10Alexandros Kosiaris: [C: 031] role::scb: only include profiles, not roles [puppet] - 10https://gerrit.wikimedia.org/r/377251 (owner: 10Giuseppe Lavagetto) [13:33:58] (03Abandoned) 10Elukey: confluent::kafka: set kafka-authorizer log to DEBUG [puppet] - 10https://gerrit.wikimedia.org/r/376015 (https://phabricator.wikimedia.org/T173493) (owner: 10Elukey) [13:34:07] (03PS3) 10Giuseppe Lavagetto: citoid: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377250 [13:35:08] mobrovac: we are yeah, disabling cassandra-metrics-collector is one of the action items pending on T171772 [13:35:09] T171772: Prometheus metrics storage for RESTBase dev environment - https://phabricator.wikimedia.org/T171772 [13:35:34] (03CR) 10Giuseppe Lavagetto: [C: 032] citoid: move to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/377250 (owner: 10Giuseppe Lavagetto) [13:36:47] (03PS3) 10Giuseppe Lavagetto: role::scb: only include profiles, not roles [puppet] - 10https://gerrit.wikimedia.org/r/377251 [13:37:29] Finally some movement in zuul [13:37:34] this is not right :S I mean, we've consumed already half the SWAT windown in waiting for zuul [13:38:05] instances managed to spawn [13:38:12] (03CR) 10Giuseppe Lavagetto: [C: 032] role::scb: only include profiles, not roles [puppet] - 10https://gerrit.wikimedia.org/r/377251 (owner: 10Giuseppe 
Lavagetto) [13:38:13] so it is catching up on the four mediawiki-config patches [13:38:18] We need to disable c2 for non swatters around the swat windows xD [13:38:34] (03Merged) 10jenkins-bot: Change logo for huwiktonary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377059 (https://phabricator.wikimedia.org/T175483) (owner: 10Zppix) [13:38:36] (03Merged) 10jenkins-bot: Add Extension:Newsletter permissions to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376886 (owner: 10MarcoAurelio) [13:38:38] (03Merged) 10jenkins-bot: Enable WikidataPageBanner for Russian Wikimedia chapter wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376897 (https://phabricator.wikimedia.org/T175356) (owner: 10MarcoAurelio) [13:38:44] (03CR) 10jenkins-bot: Change logo for huwiktonary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377059 (https://phabricator.wikimedia.org/T175483) (owner: 10Zppix) [13:38:51] \o/ [13:38:56] Yay [13:39:07] (03Merged) 10jenkins-bot: Reduce wikiPageUpdaterDbBatchSize to 20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376562 (https://phabricator.wikimedia.org/T173710) (owner: 10Ladsgroup) [13:39:17] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 1.4 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/377226 (https://phabricator.wikimedia.org/T173580) (owner: 10Gilles) [13:39:18] * bf5719864 - Change logo for huwiktonary (19 hours ago) [13:39:19] is on mwdebug1001 [13:39:32] Looking hashar [13:40:21] Were good hashar [13:40:40] bah [13:40:45] too many authentication failures [13:40:52] !log hashar@tin Synchronized static/images/project-logos/huwiktionary.png: Change logo for huwiktonary - T175483 (duration: 00m 28s) [13:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:05] T175483: Change the logo of huwiktionary - https://phabricator.wikimedia.org/T175483 [13:41:08] Ty [13:41:14] (03CR) 10jenkins-bot: Enable WikidataPageBanner for Russian Wikimedia 
chapter wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376897 (https://phabricator.wikimedia.org/T175356) (owner: 10MarcoAurelio) [13:41:28] 10Operations, 10Mail: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#3596642 (10herron) Is the established upgrade process a rebuild or dist-upgrade? In either case I'm thinking we should pull each server from the MX record in DNS while we upgrade and validate to avoid unexpected... [13:42:18] Rgmrmgr [13:42:22] the deploy doesn't work properly [13:42:40] ? [13:43:17] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 28s) [13:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:33] so on hold for now [13:43:44] (03PS5) 10Rush: openstack: re-enable notify and subscribe for nova [puppet] - 10https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494) [13:43:47] ?? Hashar wym [13:44:04] it's commonsettings if you're doing mine [13:44:19] (03PS6) 10Rush: openstack: re-enable notify and subscribe for nova [puppet] - 10https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494) [13:44:43] so it is on hold again :/ [13:44:58] there is some trouble on the cluster that we gotta sort out [13:45:12] Whats on hold? [13:46:37] (03PS2) 10BBlack: [WIP] stabilize backend storage patterns [puppet] - 10https://gerrit.wikimedia.org/r/376751 [13:46:39] (03PS1) 10BBlack: VCL: Improve cache status stats calcuations [puppet] - 10https://gerrit.wikimedia.org/r/377255 [13:46:48] <_joe_> s/on the cluster/with the way scap works/ [13:47:11] 10Operations, 10Mail: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#3591354 (10MoritzMuehlenhoff) >>! In T175361#3596642, @herron wrote: > Is the established upgrade process a rebuild or dist-upgrade? A dist-upgrade is generally fine, but a reimage has some benefits: There's e.g... 
[13:48:19] (03PS3) 10Muehlenhoff: Remove old debdeploy/salt packages via puppet [puppet] - 10https://gerrit.wikimedia.org/r/377236 (https://phabricator.wikimedia.org/T164817) [13:48:50] (03PS3) 10MarcoAurelio: Lift account creation restrictions for WM Taiwan 10th anniversary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377227 (https://phabricator.wikimedia.org/T175534) [13:48:55] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 3 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [13:50:06] ^ [13:50:26] (03CR) 10Muehlenhoff: [C: 032] Remove old debdeploy/salt packages via puppet [puppet] - 10https://gerrit.wikimedia.org/r/377236 (https://phabricator.wikimedia.org/T164817) (owner: 10Muehlenhoff) [13:50:36] (03CR) 10Andrew Bogott: [C: 031] openstack: re-enable notify and subscribe for nova [puppet] - 10https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [13:50:55] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 28s) [13:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:20] well [13:51:31] nothing can be deployed [13:51:40] (03PS7) 10Rush: openstack: re-enable notify and subscribe for nova [puppet] - 10https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494) [13:52:29] (03CR) 10Rush: [C: 032] openstack: re-enable notify and subscribe for nova [puppet] - 10https://gerrit.wikimedia.org/r/376708 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [13:52:41] 10Operations, 10Mail: Upgrade mx1001/mx2001 to stretch - https://phabricator.wikimedia.org/T175361#3596777 (10herron) Ok, reimage sounds good to me. It would also be a good opportunity for some hands on with server builds. 
[13:52:45] (03PS4) 10MarcoAurelio: Lift account creation restrictions for WM Taiwan 10th anniversary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377227 (https://phabricator.wikimedia.org/T175534) [13:53:17] (03PS5) 10MarcoAurelio: Lift account creation restrictions for WM Taiwan 10th anniversary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377227 (https://phabricator.wikimedia.org/T175534) [13:53:52] hashar: so isn't scap working? [13:54:46] (03CR) 10Muehlenhoff: [C: 032] Remove salt-based debdeploy code [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/377239 (owner: 10Muehlenhoff) [13:54:46] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 27s) [13:54:56] (03PS5) 10Rush: openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [13:55:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:28] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [13:55:34] !log hashar@tin Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 27s) [13:55:40] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3596789 (10elukey) a:05RobH>03Cmjohnson Assigning back to Chris as discussed on IRC: we'd need to move the Kafka Jumbo hosts out... 
[13:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:31] I am cancelling the whole SWAT and will revert the patches that got merged without deploy [13:56:41] the whole stack is screwed up (CI / deploy script) [13:57:00] meh, another hour of my life lost :) [13:57:31] stephanebisson: sorry your patch will not make it in this window :(( [13:57:48] hashar: will we have an incident report? [13:57:54] (03CR) 10Elukey: [C: 031] Remove salt grains from app server canaries [puppet] - 10https://gerrit.wikimedia.org/r/377222 (owner: 10Muehlenhoff) [13:58:18] hashar: yeah, I hope things get back to normal soon [13:58:22] (03PS2) 10BBlack: VCL: Improve cache status stats calcuations [puppet] - 10https://gerrit.wikimedia.org/r/377255 [13:58:24] (03PS3) 10BBlack: [WIP] stabilize backend storage patterns [puppet] - 10https://gerrit.wikimedia.org/r/376751 [13:59:01] (03PS1) 10Hashar: Revert "Change logo for huwiktonary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377257 [13:59:08] (03PS1) 10Hashar: Revert "Add Extension:Newsletter permissions to CommonSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377258 [13:59:14] (03PS1) 10Hashar: Revert "Enable WikidataPageBanner for Russian Wikimedia chapter wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377259 [13:59:20] (03PS1) 10Hashar: Revert "Reduce wikiPageUpdaterDbBatchSize to 20" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377260 [14:00:32] (03PS1) 10Gehel: update BUILD_VERSION [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/377263 [14:01:16] (03PS2) 10Rush: rabbitmq: Add drain_queue utility script [puppet] - 10https://gerrit.wikimedia.org/r/377039 (https://phabricator.wikimedia.org/T170492) (owner: 10BryanDavis) [14:01:24] (03PS1) 10Filippo Giunchedi: thumbor: use memorysize_mb fact for unit MemoryLimit [puppet] - 10https://gerrit.wikimedia.org/r/377264 (https://phabricator.wikimedia.org/T121388) [14:01:25] hashar: thanks 
anyway [14:02:04] hashar: hey, why revert? [14:02:16] is there any problem? [14:02:33] (03PS2) 10Hashar: Revert "Change logo for huwiktonary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377257 [14:02:35] (03PS2) 10Hashar: Revert "Add Extension:Newsletter permissions to CommonSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377258 [14:02:37] (03PS2) 10Hashar: Revert "Enable WikidataPageBanner for Russian Wikimedia chapter wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377259 [14:02:39] (03PS2) 10Hashar: Revert "Reduce wikiPageUpdaterDbBatchSize to 20" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377260 [14:03:43] (03CR) 10Hashar: [C: 032] "Revert since it has not been deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377257 (owner: 10Hashar) [14:03:45] (03PS1) 10Muehlenhoff: Remove package declarations for debdeploy-minion/debdeploy-common [puppet] - 10https://gerrit.wikimedia.org/r/377265 [14:03:47] (03CR) 10Hashar: [C: 032] "Revert since it has not been deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377258 (owner: 10Hashar) [14:03:49] (03CR) 10Hashar: [C: 032] "Revert since it has not been deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377259 (owner: 10Hashar) [14:03:52] (03CR) 10Hashar: [C: 032] "Revert since it has not been deployed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/377260 (owner: 10Hashar) [14:04:04] Amir1: scap isn't working apparently [14:04:14] (03PS2) 10Filippo Giunchedi: thumbor: use memorysize_mb fact for unit MemoryLimit [puppet] - 10https://gerrit.wikimedia.org/r/377264 (https://phabricator.wikimedia.org/T121388) [14:04:21] I thought there is something wrong with the jobqueue [14:04:22] * hashar orders a few tons of tomatoes seeds (poke addshore) [14:04:30] =o [14:04:52] well, thia all looks interesting [14:04:54] *this [14:05:08] 10Operations, 10monitoring, 10Patch-For-Review: add pdu redundancy checking to server/router/switch checks in icinga - https://phabricator.wikimedia.org/T109903#3596818 (10herron) If the power supply ipmi sensor approach seems worthwhile to folks could we arrange a time to test pulling cables and power suppl... [14:05:31] stephanebisson: Amir1: Zppix: tabbycat: basically I have reverted all the merged patches for this SWAT and end up deploying none. We are unable to deploy patches on the cluster right now, at least not for SWAT. 
[14:05:42] !log European SWAT cancelled due to deploy infrastructure issue [14:05:50] (03PS3) 10Rush: rabbitmq: Add drain_queue utility script [puppet] - 10https://gerrit.wikimedia.org/r/377039 (https://phabricator.wikimedia.org/T170492) (owner: 10BryanDavis) [14:05:52] (03PS2) 10Rush: rabbitmq: remove orphan files [puppet] - 10https://gerrit.wikimedia.org/r/377040 (owner: 10BryanDavis) [14:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:15] (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: Add drain_queue utility script [puppet] - 10https://gerrit.wikimedia.org/r/377039 (https://phabricator.wikimedia.org/T170492) (owner: 10BryanDavis) [14:06:59] (03CR) 10jerkins-bot: [V: 04-1] Remove package declarations for debdeploy-minion/debdeploy-common [puppet] - 10https://gerrit.wikimedia.org/r/377265 (owner: 10Muehlenhoff) [14:07:34] (03CR) 10jerkins-bot: [V: 04-1] thumbor: use memorysize_mb fact for unit MemoryLimit [puppet] - 10https://gerrit.wikimedia.org/r/377264 (https://phabricator.wikimedia.org/T121388) (owner: 10Filippo Giunchedi) [14:07:34] this deploy > https://twitter.com/sadserver/status/657969482783158272 :( [14:07:45] (03CR) 10DCausse: [V: 032 C: 032] update BUILD_VERSION [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/377263 (owner: 10Gehel) [14:07:50] (03PS4) 10Rush: rabbitmq: Add drain_queue utility script [puppet] - 10https://gerrit.wikimedia.org/r/377039 (https://phabricator.wikimedia.org/T170492) (owner: 10BryanDavis) [14:07:52] (03PS3) 10Rush: rabbitmq: remove orphan files [puppet] - 10https://gerrit.wikimedia.org/r/377040 (owner: 10BryanDavis) [14:08:00] (03PS2) 10Muehlenhoff: Remove package declarations for debdeploy-minion/debdeploy-common [puppet] - 10https://gerrit.wikimedia.org/r/377265 [14:08:45] tin.eqiad.wmnet /srv/mediawiki-staging is a dirty state right now. 
Gotta fetch a few patches that are pending [14:08:46] (03PS3) 10Filippo Giunchedi: thumbor: use memorysize_mb fact for unit MemoryLimit [puppet] - 10https://gerrit.wikimedia.org/r/377264 (https://phabricator.wikimedia.org/T121388) [14:08:49] (03CR) 10Rush: [C: 032] rabbitmq: Add drain_queue utility script [puppet] - 10https://gerrit.wikimedia.org/r/377039 (https://phabricator.wikimedia.org/T170492) (owner: 10BryanDavis) [14:08:51] (03CR) 10jerkins-bot: [V: 04-1] Remove package declarations for debdeploy-minion/debdeploy-common [puppet] - 10https://gerrit.wikimedia.org/r/377265 (owner: 10Muehlenhoff) [14:11:04] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/7786/thumbor1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/377264 (https://phabricator.wikimedia.org/T121388) (owner: 10Filippo Giunchedi) [14:11:10] (03PS4) 10Filippo Giunchedi: thumbor: use memorysize_mb fact for unit MemoryLimit [puppet] - 10https://gerrit.wikimedia.org/r/377264 (https://phabricator.wikimedia.org/T121388) [14:11:13] (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: remove orphan files [puppet] - 10https://gerrit.wikimedia.org/r/377040 (owner: 10BryanDavis) [14:12:47] (03PS3) 10Muehlenhoff: Remove package declarations for debdeploy-minion/debdeploy-common [puppet] - 10https://gerrit.wikimedia.org/r/377265 [14:12:52] (03CR) 10Rush: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/377040 (owner: 10BryanDavis) [14:14:24] (03CR) 10jerkins-bot: [V: 04-1] thumbor: use memorysize_mb fact for unit MemoryLimit [puppet] - 10https://gerrit.wikimedia.org/r/377264 (https://phabricator.wikimedia.org/T121388) (owner: 10Filippo Giunchedi) [14:14:57] (03CR) 10Filippo Giunchedi: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/377264 (https://phabricator.wikimedia.org/T121388) (owner: 10Filippo Giunchedi) [14:16:05] (03CR) 10jerkins-bot: [V: 04-1] rabbitmq: remove orphan files [puppet] - 10https://gerrit.wikimedia.org/r/377040 (owner: 10BryanDavis) [14:16:15] 
(03CR) 10Filippo Giunchedi: [C: 032] thumbor: use memorysize_mb fact for unit MemoryLimit [puppet] - 10https://gerrit.wikimedia.org/r/377264 (https://phabricator.wikimedia.org/T121388) (owner: 10Filippo Giunchedi) [14:16:51] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3596865 (10jcrespo) Hey, Chris, did you see this^. This is not an emergency, but the service affected (External Store) is relatively important (a... [14:17:16] godog: lmk when you have a few minutes? I have more-or-less the same dumb questions that I had last week :/ [14:17:25] 10Operations, 10Operations-Software-Development, 10Goal, 10Patch-For-Review, 10Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3596869 (10MoritzMuehlenhoff) [14:17:27] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Migrate debdeploy to cumin - https://phabricator.wikimedia.org/T164817#3596867 (10MoritzMuehlenhoff) 05Open>03Resolved debdeploy is now based on cumin. The old version based on salt has been removed. There's some additional feature work... 
[14:18:01] (03Merged) 10jenkins-bot: Revert "Change logo for huwiktonary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377257 (owner: 10Hashar) [14:18:03] (03Merged) 10jenkins-bot: Revert "Add Extension:Newsletter permissions to CommonSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377258 (owner: 10Hashar) [14:18:11] 10Operations, 10Operations-Software-Development, 10Goal, 10Patch-For-Review, 10Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3245193 (10MoritzMuehlenhoff) [14:18:21] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3596872 (10Cmjohnson) @jcrespo sorry I missed that Friday...I will take a look [14:18:46] (03PS1) 10Alexandros Kosiaris: Revert "Revert "sshd_config: Increase MaxAuthTries"" [puppet] - 10https://gerrit.wikimedia.org/r/377268 [14:19:38] (03PS2) 10Alexandros Kosiaris: Revert "Revert "sshd_config: Increase MaxAuthTries"" [puppet] - 10https://gerrit.wikimedia.org/r/377268 (https://phabricator.wikimedia.org/T172333) [14:19:48] (03PS3) 10Alexandros Kosiaris: Revert "Revert "sshd_config: Increase MaxAuthTries"" [puppet] - 10https://gerrit.wikimedia.org/r/377268 (https://phabricator.wikimedia.org/T172333) [14:20:19] I am waiting for two mediawiki-config patches to merge and I will rebase tin.eqiad.wmnet:/srv/mediawiki-staging to be in a clean state [14:20:23] (03PS4) 10Alexandros Kosiaris: Revert "Revert "sshd_config: Increase MaxAuthTries"" [puppet] - 10https://gerrit.wikimedia.org/r/377268 (https://phabricator.wikimedia.org/T172333) [14:20:27] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "Revert "sshd_config: Increase MaxAuthTries"" [puppet] - 10https://gerrit.wikimedia.org/r/377268 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [14:20:29] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Revert "Revert "sshd_config: Increase MaxAuthTries"" [puppet] 
- 10https://gerrit.wikimedia.org/r/377268 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [14:20:42] (03PS1) 10Alexandros Kosiaris: Revert "Revert "Revert "sshd_config: Increase MaxAuthTries""" [puppet] - 10https://gerrit.wikimedia.org/r/377269 [14:21:23] (03PS2) 10Alexandros Kosiaris: Revert "Revert "Revert "sshd_config: Increase MaxAuthTries""" [puppet] - 10https://gerrit.wikimedia.org/r/377269 (https://phabricator.wikimedia.org/T172333) [14:22:26] (03Merged) 10jenkins-bot: Revert "Enable WikidataPageBanner for Russian Wikimedia chapter wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377259 (owner: 10Hashar) [14:22:28] (03Merged) 10jenkins-bot: Revert "Reduce wikiPageUpdaterDbBatchSize to 20" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377260 (owner: 10Hashar) [14:23:19] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [14:23:42] !log tin.eqiad.wmnet: reverted all four mediawiki-config that got merged but left undeployed and rebased the workspace [14:23:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:39] !log tin.eqiad.wmnet: reverted patch "WLFilters: Respect default values" that got merged but left undeployed [14:24:51] so at least the deployment server is in a clean state [14:24:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:36] (03PS3) 10BBlack: VCL: Improve cache status stats calcuations [puppet] - 10https://gerrit.wikimedia.org/r/377255 [14:25:38] (03PS4) 10BBlack: [WIP] stabilize backend storage patterns [puppet] - 10https://gerrit.wikimedia.org/r/376751 [14:28:55] 10Operations, 10ops-eqdfw, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3596894 (10Cmjohnson) Swapped the backplane with another from a similar server w/same specs and the error came back. Swapped the cables... 
[14:29:01] (03CR) 10jerkins-bot: [V: 04-1] [WIP] stabilize backend storage patterns [puppet] - 10https://gerrit.wikimedia.org/r/376751 (owner: 10BBlack) [14:29:11] (03CR) 10Ema: [C: 031] VCL: Improve cache status stats calcuations [puppet] - 10https://gerrit.wikimedia.org/r/377255 (owner: 10BBlack) [14:29:43] (03PS1) 10MarcoAurelio: Add Extension:Newsletter permissions to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377274 [14:30:11] hashar: I'm restoring the patches so we can try to merge them in a next window [14:30:17] is that okay? [14:31:07] (03CR) 10BBlack: [C: 032] VCL: Improve cache status stats calcuations [puppet] - 10https://gerrit.wikimedia.org/r/377255 (owner: 10BBlack) [14:31:29] (03PS2) 10MarcoAurelio: Add Extension:Newsletter permissions to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377274 [14:31:34] (03PS3) 10MarcoAurelio: Add Extension:Newsletter permissions to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377274 [14:32:38] (03PS3) 10DCausse: Upgrade plugins to elastic 5.5.2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376477 (https://phabricator.wikimedia.org/T175159) [14:34:38] (03PS1) 10MarcoAurelio: Revert "Revert "Enable WikidataPageBanner for Russian Wikimedia chapter wiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377275 [14:35:36] (03PS2) 10MarcoAurelio: Enable WikidataPageBanner for Russian Wikimedia chapter wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377275 (https://phabricator.wikimedia.org/T175356) [14:35:51] (03PS3) 10MarcoAurelio: Enable WikidataPageBanner for Russian Wikimedia chapter wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377275 (https://phabricator.wikimedia.org/T175356) [14:38:22] (03CR) 10MarcoAurelio: [C: 04-1] "Concerns confirmed. We need to upload a new version." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/374370 (owner: 10MarcoAurelio) [14:38:58] (03PS3) 10MarcoAurelio: Update es.wiktionary logo from SVG version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374370 [14:40:06] !log roll-restart thumbor to apply https://gerrit.wikimedia.org/r/#/c/377264/ and upgrade to 1.4 - T173580 T174997 [14:40:14] andrewbogott: sure, shoot! [14:40:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:19] T174997: thumbor1003 behaves differently than other thumbor hosts - https://phabricator.wikimedia.org/T174997 [14:40:19] T173580: $wgMaxAnimatedGifArea is not honored by Thumbor - https://phabricator.wikimedia.org/T173580 [14:41:46] godog: so mostly this comes down to me being baffled by the query docs. If you visit http://andrewclient.puppet.wmflabs.org/labs/graph and select probe_http_status_code you'll see 5 different states... [14:42:08] First, I need a query that just picks one of them, based on url [14:42:57] (and in theory I to apply a label to each of those states, but I'm guessing that's grafana's job) [14:43:21] (03PS1) 10Muehlenhoff: Also add Section for the source package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/377277 [14:46:20] (03CR) 10DCausse: [V: 032 C: 032] Also add Section for the source package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/377277 (owner: 10Muehlenhoff) [14:46:35] andrewbogott: yup, so to select a particular label you can use a selector, essentially what's written here https://prometheus.io/docs/querying/basics/#time-series-selectors [14:46:52] in grafana you can also turn that into a dropdown basically [14:46:57] (03PS4) 10DCausse: Upgrade plugins to elastic 5.5.2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376477 (https://phabricator.wikimedia.org/T175159) [14:47:45] 10Operations, 10ops-codfw: mw2256 - hardware issue - https://phabricator.wikimedia.org/T163346#3596984 (10Papaul) 
a:05elukey>03Papaul [14:47:58] godog: ok, so probe_http_status_code{instance="labcontrol1001.wikimedia.org:5000/v3"} [14:48:01] that seems to work! [14:48:52] next… for uptime % you suggested that I do number-of-successes/number-of-tests, which makes sense. And I assume number-of-tests is http_requests_total [14:49:32] But I'm not clear on how to get number-of-successes… and also, I'm concerned that that ratio tells me uptime-since-whenever whereas I'd much rather have uptime-last-30-days or similar. Think that's possible? [14:49:53] (at some point I will have a breakthrough and this will start to feel like programming :) ) [14:51:10] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3597000 (10Papaul) @jcrespo on db2010 I have 5 bad disks is there any particular order you will want them replaced? [14:51:49] fyi T175576 [14:51:49] T175576: Investigate today's EU SWAT scap & infrastructure issues - https://phabricator.wikimedia.org/T175576 [14:52:16] andrewbogott: more like how many times probe_success was 1 over how many times probe_success was either 0 or 1 [14:52:29] andrewbogott: also really there should be a task for this I think, it'll come handy in the future [14:52:49] (03CR) 10Rush: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/377040 (owner: 10BryanDavis) [14:52:53] https://phabricator.wikimedia.org/T167556 [14:53:42] nice, I'll subscribe [14:54:18] godog: so probe_http_status_code{instance="labcontrol1001.wikimedia.org:5000/v3"}[1w] gets me all the probes in a week, I assume there's some way to get a count of them? [14:54:48] (03CR) 10Ottomata: [C: 031] "Sounds fine to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/377238 (owner: 10Mobrovac) [14:54:57] 10Operations, 10ops-codfw: Degraded RAID on ms-be2023 - https://phabricator.wikimedia.org/T174777#3597024 (10Papaul) 05Open>03Resolved This package shows delivered so I am resolving this task [14:55:58] 10Operations, 10ops-codfw, 10media-storage: Degraded RAID on ms-be2024 - https://phabricator.wikimedia.org/T174534#3597029 (10Papaul) 05Open>03Resolved This servers looks good so far so resolving the task for now. [14:56:08] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3597034 (10jcrespo) 5? wow. I would say 1 at a time, and we check they rebuild correctly. Do not necessarily wait, we can do a couple per day when you are around (normally it takes a few hours to rebuild eac... [14:56:26] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3597035 (10Papaul) p:05Triage>03High [14:56:47] which isn't count() :( [14:56:50] andrewbogott: yeah, that'd be the count_over_time function, also I think using probe_success would be better as the status code is harder to define whether it succeeded or not [14:57:00] yeah the full list is https://prometheus.io/docs/querying/functions/ [14:57:27] godog: probe_success won't work for me, because it regards 300 as failure but for some of my tests 300 is a good outcome for my purposes [14:57:43] So I'm going to have to actually check the return code, which seems not so bad so far [14:58:05] (to get a 200 for those services I have to pass in auth creds which I'd rather avoid) [14:58:15] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3587259 (10Papaul) Ok i have I have disk in slot 5 replaced [14:58:40] (03CR) 10Mobrovac: "> Sounds fine to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/377238 (owner: 10Mobrovac) [14:58:51] andrewbogott: heh the blackbox_exporter has config options to regard what status codes are valid [14:59:00] for a given http test that is [14:59:04] oh, well, that would be nice [14:59:06] I'll look at that [14:59:21] (03PS4) 10Rush: rabbitmq: remove orphan files [puppet] - 10https://gerrit.wikimedia.org/r/377040 (owner: 10BryanDavis) [14:59:45] so now I can count # of results, next question is how to count # of results with given value [14:59:53] (03PS6) 10Rush: openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [15:00:22] which is apparently not probe_http_status_code{instance="labcontrol1001.wikimedia.org:5000/v3",value=xxx} [15:01:04] (03PS1) 10Muehlenhoff: Also add Priority to source package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/377280 [15:01:07] (03CR) 10Rush: [C: 032] rabbitmq: remove orphan files [puppet] - 10https://gerrit.wikimedia.org/r/377040 (owner: 10BryanDavis) [15:01:41] (03CR) 10Gehel: [C: 032] Also add Priority to source package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/377280 (owner: 10Muehlenhoff) [15:01:41] probably sth like probe_http{...} == 200 [15:01:44] (03CR) 10Gehel: [V: 032 C: 032] Also add Priority to source package [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/377280 (owner: 10Muehlenhoff) [15:03:05] that works but now I can't make the [30d] part work [15:03:08] (03CR) 10Daniel Kinzler: "Also consider Ic095e2eba" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376562 (https://phabricator.wikimedia.org/T173710) (owner: 10Ladsgroup) [15:03:22] 10Operations, 10Thumbor, 10Patch-For-Review, 10Performance-Team (Radar), 10User-fgiunchedi: thumbor1003 behaves differently than other thumbor hosts - https://phabricator.wikimedia.org/T174997#3597083 (10fgiunchedi) 05Open>03Resolved Indeed, the 
latency now is the same across all hosts and I've deplo... [15:06:19] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3586259 (10mobrovac) IMHO, `updateBetaFeaturesUserCounts` is the perfect candidate here. It's very lightweight (one `SELECT`, one `U... [15:06:52] (03PS1) 10Bartosz Dziewoński: Fix case of MediaWiki\Auth\AbstractPreAuthenticationProvider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377281 [15:07:35] andrewbogott: you'll have to paste the query and the error somewhere, I don't know how to debug otherwise [15:09:34] godog: the console is public, so you can see for yourself of course :) [15:09:38] probe_success{instance="labcontrol1001.wikimedia.org:5000/v3"} == 1 works [15:10:12] but where do I put the [30d] now? probe_success{instance="labcontrol1001.wikimedia.org:5000/v3"}[30d] == 1 gets me "binary expression must contain only scalar and instant vector types" [15:14:19] 10Operations, 10ops-ulsfo, 10Traffic: cp4021 memory hardware issue - DIMM B1 - https://phabricator.wikimedia.org/T175585#3597148 (10BBlack) [15:14:37] 10Operations, 10ops-ulsfo, 10Traffic: cp4021 memory hardware issue - DIMM B1 - https://phabricator.wikimedia.org/T175585#3597162 (10BBlack) [15:16:38] andrewbogott: so it is the count_over_time() that wants [30d], since probe_success is 0 or 1 you can do sum_over_time(...) / count_over_time(...) [15:17:10] oh, clever :) [15:18:28] ok, that should get me what I need once I figure out how to get those 300s to look like success. thanks! [15:19:07] Next step is to get this stuff into grafana, which I will bug you about in a bit if it doesn't just work. 
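[editor's note] The uptime calculation discussed above amounts to dividing the number of successful probes in a window by the total number of probes in that window, which in PromQL would look roughly like sum_over_time(probe_success{instance="..."}[30d]) / count_over_time(probe_success{instance="..."}[30d]). The following is only a minimal Python sketch of that arithmetic, not the Prometheus implementation; the function names, the sample status codes, and the accepted-code sets are hypothetical and chosen for illustration. It also shows why treating 300 as a good outcome changes the ratio.

```python
from typing import Iterable, List, Set


def to_success_flags(status_codes: Iterable[int], accepted: Set[int]) -> List[int]:
    """Map each HTTP status code to 1 (accepted) or 0 (not accepted).

    Hypothetical stand-in for a probe_success-style 0/1 series.
    """
    return [1 if code in accepted else 0 for code in status_codes]


def availability(flags: Iterable[int]) -> float:
    """Return sum(flags) / count(flags), the sum-over-count idea from the chat."""
    values = list(flags)
    if not values:
        raise ValueError("no samples in the window")
    return sum(values) / len(values)


# Made-up probe results for one window (illustrative only).
window = [200, 200, 300, 503, 200]

# Strict: only 200 counts as success.
strict = availability(to_success_flags(window, {200}))

# Tolerant: 300 is also a good outcome, as for the unauthenticated keystone check.
tolerant = availability(to_success_flags(window, {200, 300}))

print(f"strict={strict:.2f} tolerant={tolerant:.2f}")
```

Because the flags are only ever 0 or 1, the sum is the number of successes and the count is the number of probes, so no separate "count where value == 1" query is needed.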
[15:20:03] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp4028.* [15:20:06] np, yeah I suspect you'll need an additional "module" in blackbox_exporter configuration that considers 300s too as success [15:20:10] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp4022.* [15:20:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:04] godog: that's something that will land in /etc/prometheus/blackbox.yml? [15:28:21] andrewbogott: yup [15:32:07] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3597231 (10Cmjohnson) @jcrespo es1019 is back up. It seemed to be stuck in some weird state, The fans were blowing but nothing else worked. Pull... [15:34:34] (03PS4) 10Andrew Bogott: labmon: prometheus classes to monitor the keystone api endpoint [puppet] - 10https://gerrit.wikimedia.org/r/375452 [15:34:36] (03PS1) 10Andrew Bogott: prometheus: Add additional blackbox module http_tolerant_connect [puppet] - 10https://gerrit.wikimedia.org/r/377288 (https://phabricator.wikimedia.org/T167556) [15:34:55] godog: ^ [15:35:15] (03PS2) 10Andrew Bogott: prometheus: Add additional blackbox module http_tolerant_connect [puppet] - 10https://gerrit.wikimedia.org/r/377288 (https://phabricator.wikimedia.org/T167556) [15:36:14] RECOVERY - MegaRAID on db2010 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [15:36:40] 10Operations, 10Traffic: Multiple 503 Errors - https://phabricator.wikimedia.org/T175473#3597248 (10Esc3300) [15:37:36] 10Operations, 10Traffic: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932#3597250 (10Esc3300) [15:38:11] 10Operations, 10Traffic: Multiple 503 Errors - https://phabricator.wikimedia.org/T175473#3597252 (10BBlack) [15:38:16] 
10Operations, 10Traffic: Recurrent 'mailbox lag' critical alerts and 500s - https://phabricator.wikimedia.org/T174932#3597253 (10BBlack) [15:38:53] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3597259 (10jcrespo) @Cmjohnson As I said, this is not an emergency- I will make the server catch up but after that I would like to do a more deta... [15:39:30] (03PS5) 10DCausse: Upgrade plugins to elastic 5.5.2 [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/376477 (https://phabricator.wikimedia.org/T175159) [15:42:03] (03CR) 10Filippo Giunchedi: "Naming bikeshed, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377288 (https://phabricator.wikimedia.org/T167556) (owner: 10Andrew Bogott) [15:43:51] (03CR) 10Andrew Bogott: "I was hoping you'd have a better name :)" [puppet] - 10https://gerrit.wikimedia.org/r/377288 (https://phabricator.wikimedia.org/T167556) (owner: 10Andrew Bogott) [15:46:04] (03PS3) 10Andrew Bogott: prometheus: Add additional blackbox module http_tolerant_connect [puppet] - 10https://gerrit.wikimedia.org/r/377288 (https://phabricator.wikimedia.org/T167556) [15:46:06] (03PS5) 10Andrew Bogott: labmon: prometheus classes to monitor the keystone api endpoint [puppet] - 10https://gerrit.wikimedia.org/r/375452 [15:48:38] (03PS4) 10Andrew Bogott: prometheus: Add additional blackbox module http_200_300_connect [puppet] - 10https://gerrit.wikimedia.org/r/377288 (https://phabricator.wikimedia.org/T167556) [15:48:40] (03PS6) 10Andrew Bogott: labmon: prometheus classes to monitor the keystone api endpoint [puppet] - 10https://gerrit.wikimedia.org/r/375452 [15:49:32] (03CR) 10Filippo Giunchedi: [C: 031] prometheus: Add additional blackbox module http_200_300_connect [puppet] - 10https://gerrit.wikimedia.org/r/377288 (https://phabricator.wikimedia.org/T167556) (owner: 10Andrew Bogott) [15:49:54] (03CR) 10Andrew 
Bogott: [C: 032] prometheus: Add additional blackbox module http_200_300_connect [puppet] - 10https://gerrit.wikimedia.org/r/377288 (https://phabricator.wikimedia.org/T167556) (owner: 10Andrew Bogott) [15:51:40] (03CR) 10Andrew Bogott: [C: 032] labmon: prometheus classes to monitor the keystone api endpoint [puppet] - 10https://gerrit.wikimedia.org/r/375452 (owner: 10Andrew Bogott) [15:53:03] RECOVERY - IPMI Temperature on es1019 is OK: Sensor Type(s) Temperature Status: OK [15:53:11] (03PS1) 10Rush: designate private to match for changeset 376848 [labs/private] - 10https://gerrit.wikimedia.org/r/377289 [15:54:44] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:54:49] (03CR) 10Rush: [V: 032 C: 032] designate private to match for changeset 376848 [labs/private] - 10https://gerrit.wikimedia.org/r/377289 (owner: 10Rush) [15:55:13] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 35 seconds ago with 2 failures. Failed resources (up to 3 shown): Package[prometheus],Package[nginx-full] [15:55:33] PROBLEM - Check systemd state on labmon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:57:40] godog: do I need prometheus::web installed for prometheus to feed to grafana? It turns out there's already something using port 80 on the box I'm trying to add this to [15:58:05] andrewbogott: is labmon1001 issue you?^ [15:58:11] chasemp: yes [15:58:13] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:58:19] kk, no worries then just checking [15:58:35] (03PS7) 10Rush: openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [15:59:33] andrewbogott: that's the nginx reverse proxy yeah, it'd be simpler if grafana can talk to prometheus via a reverse proxy [16:00:22] (03PS1) 10Muehlenhoff: Remove salt grains for LVS [puppet] - 10https://gerrit.wikimedia.org/r/377291 [16:00:24] (03PS1) 10Muehlenhoff: Remove Puppet code which is obsolete with Salt grain removal [puppet] - 10https://gerrit.wikimedia.org/r/377292 [16:01:15] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Separate off ChangePropagation for JobQueue as a new deployment - https://phabricator.wikimedia.org/T175281#3588533 (10GWicke) Yay! 🎆 [16:01:24] godog: so I guess we don't have grafana and prometheus running together on any other boxes? [16:01:59] andrewbogott: in a meeting [16:02:04] 'k [16:07:14] 10Operations, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3597396 (10BBlack) So far, other nodes are testing ok on this front. This is likely a node-specific early hardware failure. [16:07:30] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3597398 (10BBlack) [16:07:33] ACKNOWLEDGEMENT - Check systemd state on labmon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. andrew bogott This is a grafana v. prometheus issue, Im looking at it. [16:07:33] ACKNOWLEDGEMENT - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages andrew bogott This is a grafana v. prometheus issue, Im looking at it. [16:07:33] ACKNOWLEDGEMENT - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[nginx-full] andrew bogott This is a grafana v. prometheus issue, Im looking at it. [16:08:25] (03PS1) 10Andrew Bogott: Revert "labmon: prometheus classes to monitor the keystone api endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/377295 [16:09:46] (03CR) 10Andrew Bogott: [C: 032] Revert "labmon: prometheus classes to monitor the keystone api endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/377295 (owner: 10Andrew Bogott) [16:10:49] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: rack/setup/install wdqs100[45].eqiad.wmnet - https://phabricator.wikimedia.org/T171210#3597399 (10Gehel) initial data import is done, wdqs100[45] can now be pooled. [16:11:33] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:12:49] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3597404 (10Gehel) [16:13:25] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint: decommission wdqs100[12] - https://phabricator.wikimedia.org/T175595#3597417 (10Gehel) [16:18:22] (03PS1) 10RobH: Revert "revoke James F's ssh key" [puppet] - 10https://gerrit.wikimedia.org/r/377297 (https://phabricator.wikimedia.org/T175505) [16:18:26] (03PS2) 10RobH: Revert "revoke James F's ssh key" [puppet] - 10https://gerrit.wikimedia.org/r/377297 (https://phabricator.wikimedia.org/T175505) [16:18:47] (03CR) 10jerkins-bot: [V: 04-1] Revert "revoke James F's ssh key" [puppet] - 10https://gerrit.wikimedia.org/r/377297 (https://phabricator.wikimedia.org/T175505) (owner: 10RobH) [16:19:35] https://integration.wikimedia.org/ci/job/operations-puppet-tests-docker/4903/console [16:19:39] that doesnt seem like a normal error [16:19:48] 16:18:44 mv: cannot stat ‘/tmp/cache/puppet/.tox/log/*’: No such file or directory [16:20:01] this seems like the errors we got a couple of weeks ago that were an issue 
with the CI testing? [16:20:48] robh it is a commit message error :) [16:20:49] 16:18:44 The following errors were found: [16:20:49] 16:18:44 Line 3: Line should be <=100 characters [16:20:53] nm it's my commit [16:20:59] sorry, i updated in other chan not here [16:21:33] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:21:41] AuthorDate: 2017-09-11 16:18:12 +0000 [16:21:51] paladox: line 3 is less, but my bug and commit ids should swap i think [16:22:18] robh viewing the diff seems to show stuff the editor does not show. :) [16:22:28] if you click edit, it will show you the correct line [16:22:47] https://gerrit.wikimedia.org/r/#/c/377297/2//COMMIT_MSG [16:22:48] ? [16:22:58] click the edit icon [16:23:04] next to (diffusion) [16:23:14] line 3 is line 9 in the diff [16:24:02] bah, these commit message rules suck cuz the revert window doesn't wrap the lines [16:24:17] introducing a commit message error within gerrit's editing of said message. [16:24:42] (03PS3) 10RobH: Revert "revoke James F's ssh key" [puppet] - 10https://gerrit.wikimedia.org/r/377297 (https://phabricator.wikimedia.org/T175505) [16:25:04] rephrase: i clicked revert and entered a message, but gerrit doesn't wrap the commit message lines like my editor [16:25:14] so it introduced its own line wrap error in the commit message check. [16:25:29] (03CR) 10RobH: [C: 032] Revert "revoke James F's ssh key" [puppet] - 10https://gerrit.wikimedia.org/r/377297 (https://phabricator.wikimedia.org/T175505) (owner: 10RobH) [16:25:53] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Please restore prod and gerrit access for J. Forrester - https://phabricator.wikimedia.org/T175505#3595204 (10RobH) https://gerrit.wikimedia.org/r/#/c/377297/ is pushed back live restoring shell access.
[16:30:10] (03PS8) 10Rush: openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [16:36:15] robh if you file a bug here https://phabricator.wikimedia.org/project/view/330/ we can upstream it. [16:36:21] Most likely this will affect polygerrit only as we can ask them to develop plugin support to allow us to line wrap or use a config. [16:36:22] GWTUI is deprecated and is being removed very soon so if i filed a report against it, they will likely say no point as it is being removed very soon. [16:38:16] (03PS9) 10Rush: openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [16:39:11] 10Operations, 10ORES, 10Graphite, 10Scoring-platform-team (Current), 10User-fgiunchedi: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3415026 (10awight) An example of high-level metrics we want to keep forever: request rate over time. [16:43:01] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Please restore prod and gerrit access for J. Forrester - https://phabricator.wikimedia.org/T175505#3597617 (10RobH) I'm not sure how your gerrit access was revoked, since your jforrester username still as the WMF ldap group. @Bawolff was the one who d...
[16:45:08] (03PS1) 10Rush: openstack values from designate refactor [labs/private] - 10https://gerrit.wikimedia.org/r/377303 [16:46:21] (03PS2) 10Rush: openstack values from designate refactor [labs/private] - 10https://gerrit.wikimedia.org/r/377303 [16:46:52] (03CR) 10Rush: [V: 032 C: 032] openstack values from designate refactor [labs/private] - 10https://gerrit.wikimedia.org/r/377303 (owner: 10Rush) [16:53:53] (03CR) 10BBlack: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/376751 (owner: 10BBlack) [16:57:03] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational [16:58:02] (03PS2) 10Dzahn: admins: add rho to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/376746 (https://phabricator.wikimedia.org/T175204) [16:58:12] (03CR) 10Dzahn: [C: 032] admins: add rho to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/376746 (https://phabricator.wikimedia.org/T175204) (owner: 10Dzahn) [17:00:04] gehel: Respected human, time to deploy Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170911T1700). Please do the needful. [17:00:04] Smalyshev: A patch you scheduled for Wikidata Query Service weekly deploy is about to be deployed. Please be available during the process. [17:00:32] jouncebot: o/ [17:01:04] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:02:26] hashar: any idea on eta when deployments will work again? 
[17:03:05] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational [17:06:22] !log gehel@tin Started deploy [wdqs/wdqs@177d20a]: (no justification provided) [17:06:24] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2034530 [17:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:57] ^damn, forgot the justification again on that wdqs scap deploy... [17:07:19] !log wdqs weekly deployment (logging and throttling fixes) [17:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:45] (03PS1) 10Chad: Scap3: Go ahead and `scap deploy --init` a freshly provisioned repo [puppet] - 10https://gerrit.wikimedia.org/r/377304 [17:08:12] andrewbogott: to answer your question, no we don't have prometheus and grafana running on the same machine now, also prometheus::web uses nginx but it should be using apache really because of T151009 [17:08:12] T151009: Provide authenticated access to Prometheus native web interface - https://phabricator.wikimedia.org/T151009 [17:08:48] godog: oh! Well, I already have apache running on labmon, so if prometheus used apache as well then that would solve my problem [17:09:01] !log gehel@tin Finished deploy [wdqs/wdqs@177d20a]: (no justification provided) (duration: 02m 39s) [17:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:36] andrewbogott: indeed, essentially an apache virtualhost to do the same reverse-proxy that nginx does now will do the trick [17:09:44] SMalyshev: deployment completed, tests are green... [17:10:00] godog: should I ignore or pay attention to the ldap/pam comments on that bug? [17:10:28] SMalyshev: side note, we need to add the new wdqs100[45] to scap... [17:10:31] I guess that's unrelated to what I need to do just now anyway [17:11:10] gehel: cool, thanks! [17:11:16] gehel: how do you add that?
[17:11:29] SMalyshev: doing it, I'll send you the CR [17:11:42] andrewbogott: yeah only related to nginx, would be a non-issue with apache [17:12:33] SMalyshev: https://gerrit.wikimedia.org/r/#/c/377305/ [17:13:25] gehel: yup, merged [17:13:37] SMalyshev: thanks! [17:14:26] andrewbogott: I have to go now, though if you want to try and come up with a patch to prometheus::web that can optionally use apache I'm happy to review [17:14:37] great. Thanks for your help today! [17:14:43] gehel: pooling for 1005 doesn't seem to work [17:14:48] PROBLEM - HHVM rendering on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:14:53] Pooling wdqs1005.eqiad.wmnet from all services... [17:14:53] WARNING:etcd.client:etcd response did not contain a cluster ID [17:14:53] ERROR:conftool:Error when trying to set/pooled=yes on wdqs1005.eqiad.wmnet [17:14:53] ERROR:conftool:Failure writing to the kvstore: Backend error: The request requires user authentication : Insufficient credentials [17:15:12] np, I think all of this will also need to end up on wikitech at some point anyways [17:15:38] RECOVERY - HHVM rendering on mw1297 is OK: HTTP OK: HTTP/1.1 200 OK - 78368 bytes in 0.134 second response time [17:16:44] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Please restore prod and gerrit access for J. Forrester - https://phabricator.wikimedia.org/T175505#3595204 (10Paladox) @RobH he was disabled in gerrit by setting his account to in active. Though doing curl -X GET https://gerrit.wikimedia.org/r/account... [17:17:07] SMalyshev: checking... [17:17:27] gehel: Insufficient credentials usually means that sudo -i was not used ;) [17:17:47] SMalyshev: ^ (volans always has the answer! [17:18:06] I used sudo, sudo -i doesn't work for me [17:18:10] robh i think we can reuse line wrapping under the editor for polygerrit commit editor :). [17:18:35] SMalyshev: define "does not work" ?
[17:18:41] so when i file a task in phab for fixing other team stuff, and shit happens [17:18:46] i think back to before we had phab [17:18:47] gehel: asks for password, which I do not have [17:18:48] and im happy. [17:19:11] SMalyshev: ok, so I'm probably missing something in the sudoers rule, checking [17:19:30] the credentials are in root's home, that's why sudo -i is usually needed for conftool actions [17:19:31] those times before phabricator were dark times, full of mailing list only discussions and no way to track anything except via mediawiki pages that no one touched regularly... [17:19:57] that means that you need to be able to login as the user you are going to execute the command [17:20:43] 10Operations, 10monitoring: Monitor hardware thermal issues - https://phabricator.wikimedia.org/T125205#3597741 (10jcrespo) [17:20:46] 10Operations, 10monitoring, 10Patch-For-Review: Several hosts return "internal IPMI error" in the check_ipmi_temp check - https://phabricator.wikimedia.org/T167121#3597738 (10jcrespo) 05Open>03Resolved a:05Cmjohnson>03faidon IPMI seems responsive again, both the programatic calls and the SSH interfac... [17:20:59] volans: not sure I fully understand that - does it mean I can't do pool/depool unless I have full root access (not limited sudo)? [17:21:26] (03PS1) 10Jcrespo: Revert "mariadb: Depool es1019 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377306 [17:22:13] there are 2 ways to depool, on the master or on the host itself, fwiw. are you doing this on the host itself? [17:22:15] SMalyshev: I not fully sure but I think so for how it's configured right now [17:22:28] but you can at the host level [17:23:28] try "sudo depool" literally on wdqs1005 [17:23:33] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Please restore prod and gerrit access for J. 
Forrester - https://phabricator.wikimedia.org/T175505#3597743 (10RobH) a:03Bawolff @bawolff: Can you do one of the following: Preferred: Just tell us what you did and how to undo it, so we (myself mainly)... [17:23:48] im not sure the 'best' way to change gerrit permissions. [17:23:57] volans: he is trying at the host level, but he is refused "sudo -i". Probably missing an option on https://github.com/wikimedia/puppet/blob/production/modules/admin/data/data.yaml#L376-L377 [17:24:14] (03PS10) 10Rush: openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [17:24:40] (03CR) 10jerkins-bot: [V: 04-1] openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [17:24:41] gehel: sorry I always do from puppetmaster where sudo -i is required, but from what I'm seeing on the hosts it might be required anyway also on the host [17:25:43] volans: same for me (always on the puppetmaster). But if he needs "sudo -i", how can we allow that in sudoers (I'm trying to parse the man page, but it seems non obvious) [17:25:48] j.oe could tell you for sure with context on why it was done as it is [17:26:14] gehel: I don't think you can, do we have other non-root that can pool/depool? [17:26:43] volans: it looks like wdqs is the only one so far... [17:26:46] the current puppetization might just not allow it ;) [17:27:12] too bad :( [17:27:33] SMalyshev: so I'll do the pooling for this time, we'll see if we can find a way to give you that access... 
[17:28:53] (03PS1) 10Jcrespo: MariaDB: Repool db1019 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377307 (https://phabricator.wikimedia.org/T167121) [17:29:07] (03PS11) 10Rush: openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [17:29:26] (03CR) 10Jcrespo: [C: 04-1] "Do not merge until es1019 catches up" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377307 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo) [17:29:51] (03CR) 10Jcrespo: "Merge after 377307" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377306 (owner: 10Jcrespo) [17:29:59] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Please restore prod and gerrit access for J. Forrester - https://phabricator.wikimedia.org/T175505#3597753 (10Jdforrester-WMF) Confirmed that prod access is working again, thanks! [17:41:27] (03PS1) 10Framawiki: Enable Timeless on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377310 (https://phabricator.wikimedia.org/T154371) [17:44:35] (03PS12) 10Rush: openstack: designate as module/profile/role [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [17:48:50] (03CR) 10Dzahn: "sounds cool, and actually better if we don't have 2 support 2 ways to do a thing, yep" [puppet] - 10https://gerrit.wikimedia.org/r/374837 (owner: 10Hashar) [17:54:32] (03CR) 10Jcrespo: [C: 032] MariaDB: Repool db1019 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377307 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo) [17:54:47] (03CR) 10Jcrespo: [C: 04-1] MariaDB: Repool db1019 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377307 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo) [17:55:06] (03CR) 10Jcrespo: [V: 04-1 C: 04-1] MariaDB: Repool db1019 with low load after maintenance 
[mediawiki-config] - https://gerrit.wikimedia.org/r/377307 (https://phabricator.wikimedia.org/T167121) (owner: Jcrespo) [17:55:36] (PS2) Jcrespo: MariaDB: Repool es1019 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/377307 (https://phabricator.wikimedia.org/T167121) [17:56:00] (CR) Jcrespo: [C: 2] MariaDB: Repool es1019 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/377307 (https://phabricator.wikimedia.org/T167121) (owner: Jcrespo) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170911T1800). [18:00:04] James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [18:00:18] Heya. [18:00:27] Anyone around to SWAT? [18:00:35] I can SWAT. [18:00:50] I see the EU SWAT didn't happen but Zuul still seems under stress. [18:01:11] Niharika: But config patches in SWAT are special, so it should be fine. [18:01:26] Yeah. [18:01:59] … hopefully. :-) [18:02:45] (PS2) Niharika29: Enable responsive reference columns on Wikitionaries and Wikivoyages [mediawiki-config] - https://gerrit.wikimedia.org/r/376573 (owner: Jforrester) [18:03:23] (CR) Niharika29: [C: 2] Enable responsive reference columns on Wikitionaries and Wikivoyages [mediawiki-config] - https://gerrit.wikimedia.org/r/376573 (owner: Jforrester) [18:03:27] (PS1) Dzahn: admins: fix group membership for 'rho' [puppet] - https://gerrit.wikimedia.org/r/377313 (https://phabricator.wikimedia.org/T175204) [18:03:36] Niharika, James_F: is swat for config still broken? If so, any eta? 
[18:03:42] !log upgrading firmware on scs-ulsfo [18:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:21] (CR) Dzahn: [C: 2] admins: fix group membership for 'rho' [puppet] - https://gerrit.wikimedia.org/r/377313 (https://phabricator.wikimedia.org/T175204) (owner: Dzahn) [18:04:39] Zppix: The SWAT queue is empty according to https://integration.wikimedia.org/zuul/ so maybe it's fine. [18:04:59] Niharika: can i add my config patch from eu swat? [18:05:41] Zppix: Give me a few minutes to make sure it's not broken. I'll let you know. [18:05:48] Niharika: thanks [18:07:55] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3598009 (Dzahn) Open>Resolved Hello @RHo your access should work now and i can confirm your user has been created on stat1004 and sta... [18:07:59] (Merged) jenkins-bot: MariaDB: Repool es1019 with low load after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/377307 (https://phabricator.wikimedia.org/T167121) (owner: Jcrespo) [18:08:18] (PS3) Niharika29: Enable responsive reference columns on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/371630 (https://phabricator.wikimedia.org/T173176) (owner: Jforrester) [18:08:22] (CR) Niharika29: [C: 2] Enable responsive reference columns on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/371630 (https://phabricator.wikimedia.org/T173176) (owner: Jforrester) [18:09:25] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to Production shell access for rho (Rita Ho) - https://phabricator.wikimedia.org/T175204#3598019 (Dzahn) If you need to ssh directly to stat1004/stat1005, see the standard config example here: https://wikitech.wikimedia.org/wiki/Pr... 
[18:10:24] !log scs-ulsfo has successfully updated firmware T174475 [18:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:37] T174475: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475 [18:12:16] (CR) Hashar: [C: -1] "Yeah I went with aptly as a quick/dirty way because it looked less intimidating than reprepro on the paper :]" [puppet] - https://gerrit.wikimedia.org/r/374837 (owner: Hashar) [18:12:49] Operations, DC-Ops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3598042 (RobH) [18:13:56] Operations, DC-Ops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3563233 (RobH) [18:13:59] (PS13) Rush: openstack: designate as module/profile/role [puppet] - https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [18:14:22] (PS1) Dzahn: admins: delete unused group sectools-roots [puppet] - https://gerrit.wikimedia.org/r/377315 (https://phabricator.wikimedia.org/T138650) [18:15:22] (PS15) Rush: openstack: designate as module/profile/role [puppet] - https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [18:15:51] hrmm [18:16:33] whats up, rob [18:16:41] firmware upgrade failed? [18:17:26] Niharika: Isn't the wait such fun? [18:18:19] James_F: One of them is merged but there are some local changes on tin. [18:18:25] (CR) Dzahn: [C: 2] admins: delete unused group sectools-roots [puppet] - https://gerrit.wikimedia.org/r/377315 (https://phabricator.wikimedia.org/T138650) (owner: Dzahn) [18:18:44] jynus: marostegui There are some changes to db-eqiad.php on tin? [18:18:58] yeah [18:19:07] ignore those, I will merge after scap finishes [18:19:18] jynus: I have a couple of SWAT patches to merge. 
[18:19:20] It was taking 30 minutes to [18:19:25] verify [18:19:28] Operations, Security-Team, vm-requests, Patch-For-Review: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#3598079 (Dzahn) >>! In T138650#3595336, @MoritzMuehlenhoff wrote: > There's a remaining empty group sectools-roots in data.yaml, let's also remove thi... [18:19:33] Operations, Ops-Access-Requests: root access on security-tools instances for Darian Patrick - https://phabricator.wikimedia.org/T138873#3598081 (Dzahn) [18:19:36] Operations, Security-Team, vm-requests, Patch-For-Review: provide ganeti VM for security team sectools - https://phabricator.wikimedia.org/T138650#3598080 (Dzahn) Open>Resolved [18:19:44] so it went into that schedule [18:19:59] rebase without problem [18:20:09] I will file-sync later [18:21:28] James_F: You can test https://gerrit.wikimedia.org/r/#/c/376573/ now. [18:21:48] It's on mwdebug1002 (goatland for all further communications). [18:22:39] (PS16) Rush: openstack: designate as module/profile/role [puppet] - https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [18:22:41] Niharika: Yeah, LGTM. Wait for the enwiki one and do a single scap? [18:23:00] (CR) Dzahn: "yep, taking the https part anyways seems good :)" [puppet] - https://gerrit.wikimedia.org/r/374837 (owner: Hashar) [18:23:02] (PS4) Niharika29: Enable responsive reference columns on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/371630 (https://phabricator.wikimedia.org/T173176) (owner: Jforrester) [18:23:08] (CR) Niharika29: [C: 2] Enable responsive reference columns on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/371630 (https://phabricator.wikimedia.org/T173176) (owner: Jforrester) [18:23:24] Yup. [18:23:30] Cool. [18:24:15] I'm not sure why these patches are not in the swat queue. 
[18:24:42] !log updating firmware on scs-a1-codfw and scs-c1-codfw [18:24:46] Niharika: Do you have to say "SWAT" as a review comment when you +2 to trigger it? [18:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:59] James_F: Nope. [18:25:23] James_F: Or at least it wasn't like that until a while back. [18:25:31] Maybe thcipriani can verify^ [18:25:34] * James_F nods. [18:25:39] Operations, DC-Ops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3598114 (RobH) [18:25:40] Niharika: my merge took over 30 minutes on CI [18:25:56] that is why it went through in overtime [18:26:00] It looks like there's a lack of -jessie CI job runners. [18:26:18] jynus: It's incredibly slow. Recovering from a failure, apparently. It's no problem. I'll be done soon. [18:26:19] Did the mix of tasks / availability of runners change recently? [18:27:11] config patches? They've never gone through the gate-and-submit job queue. Although they are prioritized the same as if they were. [18:27:47] thcipriani: Don't they go through the gate-and-submit-swat? [18:28:03] Last I checked they don't [18:28:08] Only wmf/* branches do [18:28:10] not operations/config only patches for a wmf/[whatever] branch on MW core or extensions [18:28:16] Ah. [18:28:26] I filed a bug about wmf/* processing starving out mw-config [18:28:31] there's a task about this, it's trickier than it appears on the surface (as are all things) [18:28:33] Because config has its own high-priority queue anyway. [18:28:53] right, that too. It should be roughly functionally equivalent. [18:29:01] My usual MO as a swatter was to +2 the wmf/* ones first, because they take longer, then do config patches that have a quick turnaround [18:29:06] ^ [18:29:07] But "high-priority" doesn [18:29:09] Bah. [18:29:15] Zppix: I think I would defer; doing more patches right now might stretch this window beyond its assigned time. CI is incredibly slow. 
Your patch doesn't seem super urgent either. Is that alright? [18:29:16] But "high-priority" doesn't magically give us more jessie runners. [18:29:22] But that stopped working because if you have more than 1-2 wmf/* patches, the gate-and-submit-swat queue starves out everything else [18:29:40] !log firmware update on scs-a1-codfw and scs-c1-codfw has been started, awaiting completion T174475 [18:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:52] T174475: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475 [18:29:54] !log bump nodepool rate to 15 and max-servers to 18 (no restart as it picks up config periodically) [18:29:56] thcipriani: ^ [18:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:08] going to slowly bump it to see it survive [18:30:42] chasemp: thank you. I didn't realize you tweaked the number of available machines as well, I just thought the rate had been slowed. [18:31:27] both kind of work in conjunction to cause the stampede and I was being very conservative at the time mate [18:32:04] Niharika: no worries [18:33:30] !log firmware update on scs-a1-codfw and scs-c1-codfw has been complete and tested good T174475 [18:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:01] (Merged) jenkins-bot: Enable responsive reference columns on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/371630 (https://phabricator.wikimedia.org/T173176) (owner: Jforrester) [18:34:18] are we still in time for two patches? [18:34:32] Operations, DC-Ops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3598187 (RobH) [18:35:41] gehel: are 4/5 pooled? I still see pretty low qps on grafana [18:35:42] James_F: Your enwiki patch is on goatland too. [18:35:48] Yay. [18:35:58] SMalyshev: nope, not yet (still stuck in meetings...) 
[18:36:00] Niharika: added two patches to the calendar fyi [18:36:07] gehel: ah, ok then :) [18:36:27] tabbycat: CI is incredibly slow. Can you do them as part of evening SWAT? [18:36:27] SMalyshev: that's probably going to wait until tomorrow :( [18:36:36] gehel: np, ok [18:36:46] It took 30 minutes-ish to merge one. [18:37:10] Niharika: evening swat probably not, they were merged in EU SWAT, but we couldn't merge due to scap issues [18:37:16] Niharika: Manually pulled? [18:37:33] given that I'd suggest to manually submit if at all possible [18:37:52] James_F: Manually? [18:37:53] Niharika: Oh, no, just zuul being out of date. [18:38:22] tabbycat: How urgent are they? [18:38:36] Niharika: Yeah, LGTM. Let's roll. [18:39:07] Niharika: no urgency, but they're waiting for some time already and I couldn't have that merged before due to tech issues. Let's try today? [18:39:14] (PS1) Hashar: openstack: explicitly define default_log_levels [puppet] - https://gerrit.wikimedia.org/r/377321 [18:39:17] (PS1) Hashar: openstack: debug oslo.messaging nova.network.manager [puppet] - https://gerrit.wikimedia.org/r/377322 [18:40:28] !log niharika29@tin Synchronized wmf-config/InitialiseSettings.php: Enable responsive reference columns on wiktionaries, wikivoyages and enwiki https://gerrit.wikimedia.org/r/#/c/376573/, https://gerrit.wikimedia.org/r/#/c/371630/ (duration: 00m 46s) [18:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:10] why is CI soooooo slow today? [18:41:17] tabbycat: I'm not very comfortable with manually pulling patches. Can you wait until tomorrow then? [18:41:25] It's recovering from the downtime. [18:41:29] if I have to I guess I can [18:41:31] Lot of backlog. 
[18:41:43] tabbycat: because the rate of action and max number of instances was lowered this weekend, it is recovering [18:41:43] but it's annoying, it's the third time I have to re-schedule this [18:41:57] Operations, ops-eqiad, DC-Ops: scs-c1-eqiad unresponsive - https://phabricator.wikimedia.org/T175625#3598199 (RobH) [18:42:07] Thank you. [18:42:09] will CI be ready for evening swat? [18:42:17] because if not I won't bother [18:42:24] on re-scheduling again [18:43:03] tabbycat: I can't say for sure but will probably be fine for merging a couple of config patches. [18:43:06] who knows what is in store for the future, the problem isn't "CI" at this point, it's "too many patches" ;) (only so much our infra can process with a backlog) [18:43:25] James_F: Forgot to say - it's live. [18:43:39] Niharika: Yeah. Thank you. :-) [18:43:42] greg-g: I guess aborting all pending jobs is not an option, at least those who ain't merges? [18:44:16] tabbycat: Those tests are important for developers. :) [18:44:17] (rescheduling) [18:49:15] PROBLEM - Check systemd state on restbase2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:51:30] ● cassandra-metrics-collector.service loaded failed failed cassandra metrics collector [18:52:04] PROBLEM - Check systemd state on restbase1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:52:38] !log restbase - java[12418]: Error: Could not find or load main class org.wikimedia.cassandra.metrics.service.Service [18:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:10] (PS17) Rush: openstack: designate as module/profile/role [puppet] - https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [18:53:32] godog: ^ [18:53:40] cassandra3? [18:54:13] James_F: is Forrestbot down? 
T175462 was not tagged :) [18:54:13] T175462: Inability to lock via setglobalaccountstatus API in mw 1.29 - https://phabricator.wikimedia.org/T175462 [18:54:14] PROBLEM - Disk space on restbase-dev1004 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%) [18:54:57] Operations, Cassandra, Epic, Goal, and 2 others: End of August milestone: Cassandra 3 cluster in production - https://phabricator.wikimedia.org/T169939#3531163 (Dzahn) 14:49 < icinga-wm> PROBLEM - Check systemd state on restbase2003 is CRITICAL: CRITICAL - degraded: The system is operational but... [18:55:37] tabbycat: https://phabricator.wikimedia.org/p/ReleaseTaggerBot/ certainly looks like ReleaseTaggerBot isn't tagging things with wmf.18. [18:55:48] tabbycat: I don't have any access, though. File a task? [18:56:01] James_F: no longer your bot? [18:56:07] Operations, ops-eqiad, Patch-For-Review, User-Elukey, User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3598256 (Cmjohnson) @Joe mw1307-1328 are ready for install [18:56:10] I thought it was :O [18:56:18] tabbycat: Never was. It was named after my practice, not written by me. [18:56:24] ACKNOWLEDGEMENT - Check systemd state on restbase1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T169939 [18:56:24] ACKNOWLEDGEMENT - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T169939 [18:56:24] ACKNOWLEDGEMENT - Check systemd state on restbase1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T169939 [18:56:24] ACKNOWLEDGEMENT - Check systemd state on restbase2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn https://phabricator.wikimedia.org/T169939 [18:56:24] ACKNOWLEDGEMENT - Check systemd state on restbase2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T169939 [18:56:34] James_F: okay, I'll file a task then [18:56:36] thanks [18:56:50] tabbycat: valhallasw and other awesome people did the hard work so I could stop bothering. :-) [18:57:06] lol good option [18:57:32] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3598262 (Cmjohnson) [18:57:34] I agree. I enjoy being replaced by a bash script. [18:58:41] urandom: restbase-dev1004 is out of disk and your home is using 25G of 28G or so [18:58:57] urandom: but there is lots of space on /srv instead [18:59:08] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3501521 (Cmjohnson) All the on-site work has been completed for labvirts1019-20. @robh lmk if you want to take it from here [18:59:11] ./work_for_me.sh james_f :P :) [18:59:46] Operations, ops-eqiad, Cloud-Services, Patch-For-Review: rack/setup/install labvirt10(19|20).eqiad.wmnet - https://phabricator.wikimedia.org/T172538#3598266 (RobH) a:Cmjohnson>RobH [19:00:35] mutante: yeah, that's my bad [19:00:38] filed as T175626 [19:00:38] T175626: @ReleaseTaggerBot not working for wmf/1.30.0-wmf.18 et seq. - https://phabricator.wikimedia.org/T175626 [19:00:39] restbase-dev1004:/srv# mv /home/eevans/cassandra-1502141526-pid13410.hprof /srv/eevans/ [19:00:43] mutante: should be better now [19:00:47] * James_F grins. [19:00:48] urandom: ^ i just did that command [19:00:52] but it's working on it [19:00:53] oh crap [19:01:01] bad? [19:01:07] well... 
[19:01:15] RECOVERY - Disk space on restbase-dev1004 is OK: DISK OK [19:01:23] only in that i need to restart a transfer :) [19:01:27] but that's OK [19:01:32] !log restbase-dev1004:/srv# mv /home/eevans/cassandra-1502141526-pid13410.hprof /srv/eevans/ to free disk space [19:01:37] oh.. ok [19:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:11] mutante: it was using rsync, so it should be fine [19:02:39] urandom: ook..phew.. unrelated, did you see the alerts on the restbase non-dev hosts.. it seems that happened after Cassandra 3 upgrades: [19:02:42] https://phabricator.wikimedia.org/T169939#3598247 [19:02:51] that actually made me see the "dev" one [19:03:13] Error: Could not find or load main class org.wikimedia.cassandra.metrics.service.Service [19:04:01] yeah; TL;DR we need to deinstall cassandra-metrics-collector there, since we're going to use prometheus going forward, I think godog had that under scheduled maintenance, but I guess we didn't get that done before it expired [19:04:29] he did mention today in meeting how we want to replace more things with prometheus :) [19:04:33] ok, *nod* [19:04:39] prometheus all the things [19:04:43] yea [19:05:11] Operations, DC-Ops: update firmware on scs consoles - https://phabricator.wikimedia.org/T174475#3598284 (RobH) I've created a sub-task to troubleshoot scs-c1-eqiad. I'm not sure of the status of the two additional scs consoles in esams. @Mark: I've updated the firmware and tested it as functioning on... [19:05:29] (Abandoned) Halfak: Adds require ::icinga::plugins to ::ores::base [puppet] - https://gerrit.wikimedia.org/r/358240 (https://phabricator.wikimedia.org/T167602) (owner: Halfak) [19:05:49] (PS1) Halfak: Adds myspell-lv package to ores::base [puppet] - https://gerrit.wikimedia.org/r/377327 [19:06:20] akosiaris: Do you happen to know how the ORES code was installed on the new cluster? [19:06:37] Manually edited scap config, maybe? 
[19:08:21] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool es1019 with low load (duration: 00m 46s) [19:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:04] mutante: i put it back under maintenance, with a reference to the ticket (and i'll put together a gerrit, so it gets fixed before that expires again :)) [19:10:39] Operations, Ops-Access-Requests, Patch-For-Review: Please restore prod and gerrit access for J. Forrester - https://phabricator.wikimedia.org/T175505#3598317 (Bawolff) >>! In T175505#3597743, @RobH wrote: > @bawolff: Can you do one of the following: > > Preferred: Just tell us what you did and how t... [19:11:19] !log adjust nodepool rate to 10 [19:11:22] andrewbogott: ^ [19:11:24] Operations, Ops-Access-Requests, Patch-For-Review: Please restore prod and gerrit access for J. Forrester - https://phabricator.wikimedia.org/T175505#3598319 (Bawolff) a:Bawolff>None [19:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:38] bawolff: thx! [19:13:53] i just was curious so i could fix the same way for access revocation in the future =] [19:13:56] I left the bug open until James confirms he has everything back [19:14:13] Just in case it didn't work or we missed some other access that we took away [19:14:19] wfm [19:14:33] James_F: let us know if anything doesnt work right =] [19:14:59] (PS18) Rush: openstack: designate as module/profile/role [puppet] - https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) [19:15:57] Really it probably should have been ldap that gets revoked, not gerrit specifically. gerrit is much less sensitive than say logstash [19:16:57] Operations, Analytics, ChangeProp, EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3598347 (GWicke) >>! 
In T175210#3597099, @mobrovac wrote: > IMHO, `updateBetaFeaturesUserCounts` is the perfect candidate here. It... [19:17:03] yeah i was surprised to see the wmf flag still tagged to the account [19:17:06] (PS2) Halfak: Adds myspell-lv package to ores::base [puppet] - https://gerrit.wikimedia.org/r/377327 [19:17:14] so that is likely what we should do in future maybe [19:17:16] I suppose we really need a list somewhere of what and how to revoke everything during a compromise [19:17:29] thats what we have for offboarding, but not sure we have a compromise checklist, i just use offboarding [19:17:46] (CR) jerkins-bot: [V: -1] Adds myspell-lv package to ores::base [puppet] - https://gerrit.wikimedia.org/r/377327 (owner: Halfak) [19:17:47] well, modified offboarding, indeed, not ideal [19:17:57] (PS4) MarcoAurelio: Enable WikidataPageBanner for Russian Wikimedia chapter wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/377275 (https://phabricator.wikimedia.org/T175356) [19:18:06] urandom: perfect:) thank you [19:18:17] PROBLEM - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [19:18:21] (PS1) Cmjohnson: Updating dns entries for kafka-jumbo100[1-6] to reflect change in vlan T167992 [dns] - https://gerrit.wikimedia.org/r/377329 [19:18:24] crap [19:18:39] and those, too [19:19:30] So i guess for future reference, next time something like this happens we should use [19:19:32] ACKNOWLEDGEMENT - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T169939#3598247 [19:19:46] offboard-user -l username [19:20:27] or does that do other things [19:20:27] ACKNOWLEDGEMENT - Check systemd state on restbase2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
eevans See: T171772 [19:20:27] maybe it should have a "soft offboarding" option for temp disabling for these cases [19:20:35] * bawolff goes read what that script actually does [19:20:50] bawolff: i dont use the script really [19:22:03] so not sure what it does [19:22:18] i use the verbose offboard wiki page for step by step [19:22:26] mutante: that would be nice [19:22:48] also not sure just removing the ssh key is good enough [19:22:58] (PS3) Halfak: Adds myspell-lv package to ores::base Switches myspell-uk to aspell-uk (better package) [puppet] - https://gerrit.wikimedia.org/r/377327 [19:23:00] or if they should technically be absented during those periods [19:23:01]  [19:23:25] (CR) jerkins-bot: [V: -1] Adds myspell-lv package to ores::base Switches myspell-uk to aspell-uk (better package) [puppet] - https://gerrit.wikimedia.org/r/377327 (owner: Halfak) [19:23:45]  [19:24:03] (CR) Cmjohnson: [C: 2] Updating dns entries for kafka-jumbo100[1-6] to reflect change in vlan T167992 [dns] - https://gerrit.wikimedia.org/r/377329 (owner: Cmjohnson) [19:24:06] ?? 
[19:24:08] ¯\_(ツ)_/¯ [19:24:10] i think that removing just the key is good enough for these [19:24:28] urandom: ?:) [19:25:02] my irc connection go janky, and when i got it back i'd apparently left a couple of null messages [19:25:08] s/go/got/ [19:25:23] weird [19:26:22] heh, ok :) [19:29:17] PROBLEM - Host kafka-jumbo1002 is DOWN: PING CRITICAL - Packet loss = 100% [19:29:17] PROBLEM - Host kafka-jumbo1001 is DOWN: PING CRITICAL - Packet loss = 100% [19:37:37] Operations, ops-eqiad, Analytics, Analytics-Cluster, and 2 others: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3598393 (Cmjohnson) a:Cmjohnson>elukey @elukey updated dns entries and switch ports to reflect vlan-private1-row-eqiad [19:38:27] PROBLEM - Check Varnish expiry mailbox lag on cp1072 is CRITICAL: CRITICAL: expiry mailbox lag is 2006636 [19:43:30] (PS1) Dzahn: prometheus: replace deprecated parser functions with validate_legacy [puppet] - https://gerrit.wikimedia.org/r/377331 [19:43:51] (CR) jerkins-bot: [V: -1] prometheus: replace deprecated parser functions with validate_legacy [puppet] - https://gerrit.wikimedia.org/r/377331 (owner: Dzahn) [19:46:15] (CR) Dzahn: "recheck" [puppet] - https://gerrit.wikimedia.org/r/377331 (owner: Dzahn) [19:47:17] (CR) jerkins-bot: [V: -1] prometheus: replace deprecated parser functions with validate_legacy [puppet] - https://gerrit.wikimedia.org/r/377331 (owner: Dzahn) [19:54:42] (PS1) Andrew Bogott: prometheus::web to apache [puppet] - https://gerrit.wikimedia.org/r/377332 [19:56:46] (PS2) Andrew Bogott: prometheus::web to apache [puppet] - https://gerrit.wikimedia.org/r/377332 (https://phabricator.wikimedia.org/T151009) [19:57:35] (CR) Dzahn: prometheus::web to apache (1 comment) [puppet] - https://gerrit.wikimedia.org/r/377332 (https://phabricator.wikimedia.org/T151009) (owner: Andrew Bogott) [19:59:07] (CR) Dzahn: prometheus::web to 
apache (1 comment) [puppet] - https://gerrit.wikimedia.org/r/377332 (https://phabricator.wikimedia.org/T151009) (owner: Andrew Bogott) [20:00:04] No patches in the queue for this window. Wheeee! [20:07:18] robh, bawolff: Thanks! [20:07:41] welcome =] [20:07:56] Operations, Ops-Access-Requests: Please restore prod and gerrit access for J. Forrester - https://phabricator.wikimedia.org/T175505#3598458 (Jdforrester-WMF) Open>Resolved a:RobH Everything looks good. Thanks! [20:13:28] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.13 (duration: 02m 49s) [20:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:20] (PS3) Andrew Bogott: prometheus::web to apache [puppet] - https://gerrit.wikimedia.org/r/377332 (https://phabricator.wikimedia.org/T151009) [20:23:23] (CR) Andrew Bogott: prometheus::web to apache (1 comment) [puppet] - https://gerrit.wikimedia.org/r/377332 (https://phabricator.wikimedia.org/T151009) (owner: Andrew Bogott) [20:30:04] (PS1) Rush: nodepool: make puppet match current settings [puppet] - https://gerrit.wikimedia.org/r/377338 (https://phabricator.wikimedia.org/T170492) [20:30:12] (PS4) Andrew Bogott: prometheus::web to apache [puppet] - https://gerrit.wikimedia.org/r/377332 (https://phabricator.wikimedia.org/T151009) [20:32:01] (CR) Rush: [C: 2] nodepool: make puppet match current settings [puppet] - https://gerrit.wikimedia.org/r/377338 (https://phabricator.wikimedia.org/T170492) (owner: Rush) [20:37:34] mutante: for the puppet compiler does it write out results serially? I've been waiting to see the results for 7802 or 3 for awhile, 7801 is coming up on 2 hours or so [20:37:37] is it stuck? 
[20:42:09] (CR) Andrew Bogott: "I didn't notice this and self-merged https://gerrit.wikimedia.org/r/#/c/377253/3/modules/role/manifests/labs/openstack/nova/manager.pp whi" [puppet] - https://gerrit.wikimedia.org/r/377023 (owner: BryanDavis) [20:45:00] (CR) Andrew Bogott: [C: 1] "I'm happy with this if @Merlijn removes his -1. Low priority but not wrong." [puppet] - https://gerrit.wikimedia.org/r/375860 (https://phabricator.wikimedia.org/T174082) (owner: Merlijn van Deen) [20:50:10] (Abandoned) BryanDavis: wmcs: tweak wikitech-static test label [puppet] - https://gerrit.wikimedia.org/r/377023 (owner: BryanDavis) [20:50:58] sorry bd808 :( [20:51:08] andrewbogott: heh. no worries [20:51:21] (CR) Jforrester: "Just to check, was this done intentionally? Might be a good idea to have a Phab task for this?" [mediawiki-config] - https://gerrit.wikimedia.org/r/377274 (owner: MarcoAurelio) [20:51:32] I wasn't trying to pad my commit count. :) [20:52:25] no_justification hi, i am wondering why gerrit's proxy port is on http, when in gerrit's config it is proxy-https? [20:52:39] I will submit a patch to change that. [20:52:41] * no_justification shrugs [20:52:45] If it works who cares [20:53:14] (Draft1) Paladox: Gerrit: Make proxy http port configurable [puppet] - https://gerrit.wikimedia.org/r/377340 [20:53:18] (PS2) Paladox: Gerrit: Make proxy http port configurable [puppet] - https://gerrit.wikimedia.org/r/377340 [20:53:43] (CR) Merlijn van Deen: [labs] Rename wmflabs-* files to wmcs-* [puppet] - https://gerrit.wikimedia.org/r/375860 (https://phabricator.wikimedia.org/T174082) (owner: Merlijn van Deen) [20:54:14] Operations, ops-eqiad, Analytics-Cluster, Analytics-Kanban, User-Elukey: Analytics hosts showed high temperature alarms - https://phabricator.wikimedia.org/T132256#3598540 (Cmjohnson) @elukey we finished these...correct? 
[20:55:02] (PS3) Andrew Bogott: [labs] Rename wmflabs-* files to wmcs-* [puppet] - https://gerrit.wikimedia.org/r/375860 (https://phabricator.wikimedia.org/T174082) (owner: Merlijn van Deen) [20:55:16] (Draft1) Paladox: Gerrit: Switch to proxy https in apache [puppet] - https://gerrit.wikimedia.org/r/377341 [20:55:18] (PS2) Paladox: Gerrit: Switch to proxy https in apache [puppet] - https://gerrit.wikimedia.org/r/377341 [20:55:57] (CR) Chad: [C: -1] "Why would it be taken? This really should be the only service running on a given host. There is no need to add this complexity." [puppet] - https://gerrit.wikimedia.org/r/377340 (owner: Paladox) [20:56:13] (Abandoned) Paladox: Gerrit: Make proxy http port configurable [puppet] - https://gerrit.wikimedia.org/r/377340 (owner: Paladox) [20:56:37] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 1954 [20:58:20] (CR) Andrew Bogott: [C: 2] [labs] Rename wmflabs-* files to wmcs-* [puppet] - https://gerrit.wikimedia.org/r/375860 (https://phabricator.wikimedia.org/T174082) (owner: Merlijn van Deen) [20:58:48] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [20:58:57] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [20:58:58] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [20:59:07] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [20:59:08] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [20:59:17] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [20:59:38] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [20:59:48] PROBLEM - salt-minion processes on stat1005 is CRITICAL: Return code of 255 is out of bounds [21:00:04] No patches in the queue for this window. Wheeee! 
[21:00:24] (03CR) 10Chad: [C: 04-1] "No, this is wrong....we don't actually serve it with HTTPS on the backend." [puppet] - 10https://gerrit.wikimedia.org/r/377341 (owner: 10Paladox) [21:01:03] (03CR) 10Paladox: "> No, this is wrong....we don't actually serve it with HTTPS on the" [puppet] - 10https://gerrit.wikimedia.org/r/377341 (owner: 10Paladox) [21:01:32] (03PS1) 10RobH: adding documentation on renaming users [puppet] - 10https://gerrit.wikimedia.org/r/377342 [21:02:22] (03PS2) 10RobH: adding documentation on renaming users [puppet] - 10https://gerrit.wikimedia.org/r/377342 [21:03:06] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp4023.* [21:03:07] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [21:03:07] RECOVERY - Disk space on stat1005 is OK: DISK OK [21:03:07] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [21:03:17] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures [21:03:17] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [21:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:40] (03CR) 10RobH: [C: 032] adding documentation on renaming users [puppet] - 10https://gerrit.wikimedia.org/r/377342 (owner: 10RobH) [21:03:47] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [21:03:57] RECOVERY - salt-minion processes on stat1005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [21:03:57] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [21:04:07] RECOVERY - DPKG on stat1005 is OK: All packages OK [21:07:08] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [21:11:00] seems latency 
spikes on cirrus are not solved ... but when they happen now they are incredibly short. Going to be another pain to debug... [21:21:36] (03Abandoned) 10Paladox: Gerrit: Install and configure filebeats [puppet] - 10https://gerrit.wikimedia.org/r/377060 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [21:21:47] (03Abandoned) 10Paladox: Logstash: Add beats support [puppet] - 10https://gerrit.wikimedia.org/r/377058 (https://phabricator.wikimedia.org/T141324) (owner: 10Paladox) [21:24:38] (03CR) 10Thcipriani: [C: 031] "Cherry-picked on beta, seems to work as expected." [puppet] - 10https://gerrit.wikimedia.org/r/376571 (owner: 10Thcipriani) [21:27:48] (03CR) 10Paladox: prometheus: replace deprecated parser functions with validate_legacy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377331 (owner: 10Dzahn) [21:28:27] (03CR) 10Dereckson: [C: 031] Enable Timeless on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377310 (https://phabricator.wikimedia.org/T154371) (owner: 10Framawiki) [21:29:07] (03Abandoned) 10Paladox: Gerrit: Switch to proxy https in apache [puppet] - 10https://gerrit.wikimedia.org/r/377341 (owner: 10Paladox) [21:30:08] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 [21:30:27] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0 [21:30:56] (03PS1) 10Dereckson: Revert "Don't deploy Timeless on fr.wiktionary for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377346 [21:31:28] !log awight@tin Started deploy [ores/deploy@42c5663]: Try ORES filehandle fix on new cluster [21:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:58] (03PS1) 10Ottomata: Add class to periodically run dfsadmin -fetchImage [puppet/cdh] - 10https://gerrit.wikimedia.org/r/377350 [21:36:03] (03PS2) 
10Ottomata: Add class to periodically run dfsadmin -fetchImage [puppet/cdh] - 10https://gerrit.wikimedia.org/r/377350 [21:38:53] (03PS1) 10Ottomata: Fetch Hadoop NameNode fsimage backups daily and also save them in bacula [puppet] - 10https://gerrit.wikimedia.org/r/377352 [21:39:14] (03CR) 10jerkins-bot: [V: 04-1] Fetch Hadoop NameNode fsimage backups daily and also save them in bacula [puppet] - 10https://gerrit.wikimedia.org/r/377352 (owner: 10Ottomata) [21:40:22] (03PS2) 10Dzahn: prometheus: replace deprecated parser functions with validate_legacy [puppet] - 10https://gerrit.wikimedia.org/r/377331 [21:41:28] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp4025.* [21:41:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:14] !log awight@tin Finished deploy [ores/deploy@42c5663]: Try ORES filehandle fix on new cluster (duration: 13m 47s) [21:45:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:06] (03PS1) 10Dzahn: stdlib: fix quoting in validate_legacy example [puppet] - 10https://gerrit.wikimedia.org/r/377355 [21:51:52] 10Operations, 10Traffic, 10monitoring: prometheus -> grafana stats for per-numa-node meminfo - https://phabricator.wikimedia.org/T175636#3598644 (10BBlack) [21:59:30] 10Operations, 10Epic, 10Goal, 10Services (doing), and 2 others: End of September milestone: Migrate first production use case - https://phabricator.wikimedia.org/T175637#3598673 (10GWicke) [22:01:24] 10Operations, 10Epic, 10Goal, 10Services (doing), and 2 others: Services Q1 2017/18 goal: Begin migrating job queue processing to multi-DC enabled eventbus infrastructure. 
- https://phabricator.wikimedia.org/T169937#3413824 (10GWicke) [22:08:11] (03PS1) 10Dereckson: Enable Special:PageLanguage on mul.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377357 (https://phabricator.wikimedia.org/T175622) [22:11:46] (03Abandoned) 10Dereckson: Add urdu logo to mobile site [mediawiki-config] - 10https://gerrit.wikimedia.org/r/367946 (https://phabricator.wikimedia.org/T171769) (owner: 10محمد شعیب) [22:12:44] (03CR) 10Dereckson: "https://gerrit.wikimedia.org/r/#/c/374445/ has been abandoned. So you can proceed with this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374446 (https://phabricator.wikimedia.org/T174413) (owner: 10Ebe123) [22:14:15] (03PS2) 10Dereckson: Enable usage aspect C on elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/375544 (https://phabricator.wikimedia.org/T151717) (owner: 10Eranroz) [22:20:18] (03CR) 10Dereckson: [C: 031] Rename Wikisaurus namespace on Wiktionary to "Thesaurus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374063 (https://phabricator.wikimedia.org/T174264) (owner: 10GeoffreyT2000) [22:24:21] (03CR) 10Dereckson: [C: 031] Fix case of MediaWiki\Auth\AbstractPreAuthenticationProvider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377281 (owner: 10Bartosz Dziewoński) [22:24:36] (03CR) 10Dereckson: [C: 032] Fix case of MediaWiki\Auth\AbstractPreAuthenticationProvider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377281 (owner: 10Bartosz Dziewoński) [22:26:13] (03Merged) 10jenkins-bot: Fix case of MediaWiki\Auth\AbstractPreAuthenticationProvider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377281 (owner: 10Bartosz Dziewoński) [22:26:36] (03PS1) 10Smalyshev: Add more wikis to the list of the dumped. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/377358 (https://phabricator.wikimedia.org/T171807) [22:26:40] Say, on Tin, being able to run a git status would be cool [22:26:59] On Tin, to be able to `git status` on /srv/mediawiki/ would be nice [22:28:05] oh well, we don't really need that, as scap handles this directory [22:28:07] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), 10User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3598812 (10awight) @akosiaris > We'll need a manual puppet run on ores1001. Thanks! [22:28:39] robh: ^ That’s up for grabs, if you find idle time… [22:29:20] kick puppet, yea no worries [22:29:49] awight: so its disabled by alex on there though [22:30:04] and halfak [22:30:15] Puppet is disabled. alex/halfak stress tests [22:30:35] !log dereckson@tin Synchronized wmf-config/CommonSettings-labs.php: Fix MediaWiki\Auth\AbstractPreAuthenticationProvider case ([[Gerrit:377281]], no-op in prod) (duration: 00m 45s) [22:30:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:48] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp4026.* [22:30:50] robh: thx yeah that’s the reason it’s manual. [22:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:01] awight: ok, i just wanted to make sure that was noted and ok [22:31:09] I’m running stress tests now, so would love some new stuff from puppet. Yes ty! [22:31:23] MatmaRex: I've merged your fix, you can check logs aren't flooded anymore on labs :) [22:31:24] ok, its running now [22:32:06] MatmaRex: well, we still need to wait for the Jenkins task to deploy to labs [22:32:11] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), 10User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3560572 (10RobH) >>! 
In T174402#3598812, @awight wrote: > @akosiaris > We'll need a manual puppet... [22:32:11] and no errors [22:33:01] I’ve killed my current test, and will start fresh on your mark :D [22:33:53] robh: Confirmed that I see the puppet changes. Nicely {{done}} :-) [22:33:54] (03PS2) 10Dereckson: Add more wikis to the list of the dumped [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377358 (https://phabricator.wikimedia.org/T171807) (owner: 10Smalyshev) [22:33:58] (03CR) 10Dereckson: [C: 031] Add more wikis to the list of the dumped [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377358 (https://phabricator.wikimedia.org/T171807) (owner: 10Smalyshev) [22:34:16] awight: i left puppet enabled on the host though [22:34:24] if it needs to be disabled for whatever testing let me know [22:34:38] robh: That would be best, thanks for thinking of it. [22:35:10] I probably won’t need another run for weeks. [22:35:43] disabled with comment of Please see T174402 [22:35:43] T174402: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402 [22:36:32] (03CR) 10Dereckson: [C: 031] "I confirm this variable has been removed from the MobileFrontend extension." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/372838 (owner: 10Jdlrobson) [22:39:47] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:39:48] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:40:17] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:40:27] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:40:27] PROBLEM - salt-minion processes on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:40:28] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:40:28] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:40:37] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:40:44] robh: my fault—I also need a manual puppet run on ores1002.eqiad.wmnet [22:40:56] that should be all though. [22:41:10] stat1005 ran out of memory and killed things :S [22:41:57] ebernhardson: need someone to reboot it? [22:42:12] (03CR) 10Rush: "http://puppet-compiler.wmflabs.org/7805/" [puppet] - 10https://gerrit.wikimedia.org/r/376848 (https://phabricator.wikimedia.org/T171494) (owner: 10Rush) [22:42:12] awight: doing now =] [22:42:12] robh: no it seems to have killed enough things to free half its memory [22:42:25] wow, puppet was halted on these awhile [22:42:30] Dereckson: you can search https://logstash-beta.wmflabs.org for "AbstractPreAuthenticationProvider" after it's live and check if there are any recent entries [22:42:32] robh: it can wait until after stat*, no worries! 
[22:42:44] awight: im not sure them sitting for a month without puppet calls is ok [22:42:47] Dereckson: (also, thanks :) ) [22:42:52] they dont get security or user updates [22:43:02] i'll comment my concern on the task [22:43:24] +1, maybe we should switch to short disablement windows [22:43:37] awight: +1 [22:43:56] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current), 10User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3598841 (10RobH) So I just did the same, upon request, for ores1002. However, while it was runni... [22:44:00] I can ping at the end of the day, if that’s helpful. [22:44:09] awight: im not acknowledging them in icinga [22:44:20] i know i dont leave puppet disabled on hosts overnight [22:44:23] im too paranoid [22:44:32] :) I’m happy to contract that paranoia [22:44:41] but im VERY active in disabling puppet on the install servers all the time to hack at partman [22:44:56] so yeah, im around all day, ideally we turn puppet back on if possible =] [22:45:05] and im on clinic this week so im happy to toggle as needed! [22:46:25] updated my comment tor eflect irc chat =] [22:46:29] reflect even [22:47:07] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds [22:48:42] robh: last question: is it better to schedule the windows ahead of time? [22:49:19] we're just talking about disabling puppet for a workday while you guys do livehacking before puppetizing right? [22:49:36] cuz i do that for partman, since it requires a production environment (bare metal) to work properly [22:49:59] if its just during a day or so, then i'd simply ping the ops clinic on duty or anyone who is around. [22:50:06] if im around im always happy to toggle. [22:50:20] It’s specifically for stress testing. Okay we’ll do it ad-hoc, should only be one or two more sessions. 
[22:50:52] yeah seems perfect clinic duty territory =] [22:51:10] (but if they arent around and i am in future, happy to do so as well) [22:51:55] i dont think it needs to be more formalized than that, as long as we dont leave it disabled for days on end. Ideally I'd rather it never be disabled overnight, because if a user account is compromised [22:52:02] hosts not calling into puppet dont get said updates [23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170911T2300). Please do the needful. [23:00:05] tabbycat and Dereckson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:03:07] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100% [23:03:37] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [23:03:38] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up [23:03:38] RECOVERY - salt-minion processes on stat1005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:03:47] RECOVERY - DPKG on stat1005 is OK: All packages OK [23:04:07] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient [23:04:08] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [23:04:10] (03PS1) 10Dzahn: elasticsearch: replace validate_bool with validate_legacy [puppet] - 10https://gerrit.wikimedia.org/r/377366 [23:04:24] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: replace validate_bool with validate_legacy [puppet] - 10https://gerrit.wikimedia.org/r/377366 (owner: 10Dzahn) [23:05:52] jenkins-bot: well, if the issue is "23:04:22 java.lang.OutOfMemoryError: PermGen space" then it's not me :) 
[23:06:20] !log jenkins-bot says: java.lang.OutOfMemoryError: PermGen space [23:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:06:41] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/377366 (owner: 10Dzahn) [23:06:55] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: replace validate_bool with validate_legacy [puppet] - 10https://gerrit.wikimedia.org/r/377366 (owner: 10Dzahn) [23:08:17] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 86.63 ms [23:09:57] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational [23:10:20] I can SWAT this evening. [23:10:25] tabbycat: ping? [23:10:55] Zppix: Amir1: stephanebisson: ping? [23:12:42] (03CR) 10Dereckson: [C: 032] Revert "Don't deploy Timeless on fr.wiktionary for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377346 (owner: 10Dereckson) [23:17:07] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Mon 2017-09-11 23:17:00 UTC. [23:19:10] Well we can't SWAT without Jenkins either. [23:19:29] I don't see anything urgent enough to force a CR +2 without gating. 
[23:21:31] I updated the deployment table to explain that [23:28:38] RECOVERY - Check Varnish expiry mailbox lag on cp1072 is OK: OK: expiry mailbox lag is 235 [23:28:42] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/377366 (owner: 10Dzahn) [23:32:07] PROBLEM - Disk space on notebook1001 is CRITICAL: DISK CRITICAL - free space: / 346 MB (0% inode=80%) [23:33:06] (03PS1) 10Smalyshev: Add categories RDF dump into the index page [puppet] - 10https://gerrit.wikimedia.org/r/377369 (https://phabricator.wikimedia.org/T173892) [23:36:08] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/377366 (owner: 10Dzahn) [23:36:24] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: replace validate_bool with validate_legacy [puppet] - 10https://gerrit.wikimedia.org/r/377366 (owner: 10Dzahn) [23:39:46] (03CR) 10Dzahn: "" Unknown function validate_legacy " why? It exists and is used here without this error: https://gerrit.wikimedia.org/r/#/c/377331/2/modu" [puppet] - 10https://gerrit.wikimedia.org/r/377366 (owner: 10Dzahn) [23:40:34] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/377366 (owner: 10Dzahn) [23:40:57] Dereckson: it probably works again now [23:40:59] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: replace validate_bool with validate_legacy [puppet] - 10https://gerrit.wikimedia.org/r/377366 (owner: 10Dzahn) [23:41:08] even though i cant explain that -1 :) [23:42:51] (03CR) 10jenkins-bot: Enable responsive reference columns on Wikitionaries and Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376573 (owner: 10Jforrester) [23:43:46] (03CR) 10jenkins-bot: Revert "Enable WikidataPageBanner for Russian Wikimedia chapter wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377259 (owner: 10Hashar) [23:45:17] RECOVERY - Disk space on notebook1001 is OK: DISK OK [23:45:22] mutante: yes JenkinsBot voted +2 on the change above [23:45:42] (03CR) 10jenkins-bot: Revert "Add Extension:Newsletter 
permissions to CommonSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377258 (owner: 10Hashar) [23:46:22] (03CR) 10jenkins-bot: Enable responsive reference columns on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/371630 (https://phabricator.wikimedia.org/T173176) (owner: 10Jforrester) [23:47:30] (03CR) 10jenkins-bot: Fix case of MediaWiki\Auth\AbstractPreAuthenticationProvider [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377281 (owner: 10Bartosz Dziewoński) [23:48:24] (03CR) 10jenkins-bot: Add Extension:Newsletter permissions to CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376886 (owner: 10MarcoAurelio) [23:49:02] Dereckson: ok :) [23:49:07] (03CR) 10jenkins-bot: Reduce wikiPageUpdaterDbBatchSize to 20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376562 (https://phabricator.wikimedia.org/T173710) (owner: 10Ladsgroup) [23:49:50] (03CR) 10jenkins-bot: Revert "Change logo for huwiktonary" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377257 (owner: 10Hashar) [23:50:32] (03CR) 10jenkins-bot: Revert "Reduce wikiPageUpdaterDbBatchSize to 20" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377260 (owner: 10Hashar) [23:51:21] (03CR) 10jenkins-bot: MariaDB: Repool es1019 with low load after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377307 (https://phabricator.wikimedia.org/T167121) (owner: 10Jcrespo)