[00:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Time to snap out of that daydream and deploy Evening SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190226T0000).
[00:00:05] No GERRIT patches in the queue for this window AFAICS.
[00:00:39] Operations, Research-Programs, SRE-Access-Requests, Patch-For-Review: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (RobH) a:Afandian→None >>! In T209298#4979565, @RyanSteinberg wrote: > I'm reopening this...
[00:02:34] (PS1) RobH: adding kharlan to analytics-users group [puppet] - https://gerrit.wikimedia.org/r/492933 (https://phabricator.wikimedia.org/T216258)
[00:03:46] (CR) RobH: [C: +2] adding kharlan to analytics-users group [puppet] - https://gerrit.wikimedia.org/r/492933 (https://phabricator.wikimedia.org/T216258) (owner: RobH)
[00:05:09] Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to stat1007 for kharlan - https://phabricator.wikimedia.org/T216258 (RobH) Open→Resolved a:RobH As this is past the 3 business day wait and has all the approved requirements, I've merged this access live. Please allow u...
[00:05:29] Operations, Research-Programs, SRE-Access-Requests, Patch-For-Review: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (RobH) Open→Resolved a:RobH
[00:48:27] (PS1) 20after4: deployment-prep: swap db master / slave [mediawiki-config] - https://gerrit.wikimedia.org/r/492938 (https://phabricator.wikimedia.org/T216067)
[00:49:58] anyone able to review database config change for deployment-prep?
[00:50:01] https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/492938/
[00:52:42] (CR) Alex Monk: [C: +1] deployment-prep: swap db master / slave [mediawiki-config] - https://gerrit.wikimedia.org/r/492938 (https://phabricator.wikimedia.org/T216067) (owner: 20after4)
[00:53:00] twentyafterfour, it's just db-labs I doubt we need to bother the prod DBAs about it, and you're a deployer, so...
[01:00:31] twentyafterfour: yeah, jfdi right now with that
[01:09:37] (PS1) Revi: WIP: Add enwiki to azwiki import source [mediawiki-config] - https://gerrit.wikimedia.org/r/492942 (https://phabricator.wikimedia.org/T217104)
[01:10:17] (PS2) Revi: WIP: Add enwiki to azwiki import source [mediawiki-config] - https://gerrit.wikimedia.org/r/492942 (https://phabricator.wikimedia.org/T217104)
[01:11:02] (CR) jerkins-bot: [V: -1] WIP: Add enwiki to azwiki import source [mediawiki-config] - https://gerrit.wikimedia.org/r/492942 (https://phabricator.wikimedia.org/T217104) (owner: Revi)
[01:11:49] (PS3) Revi: WIP: Add enwiki to azwiki import source [mediawiki-config] - https://gerrit.wikimedia.org/r/492942 (https://phabricator.wikimedia.org/T217104)
[01:18:27] * twentyafterfour jfdi
[01:18:46] we have that whole no self-plus2 policy
[01:19:03] * twentyafterfour just doesn't wanna get teh banhammer
[01:19:28] (CR) 20after4: [C: +2] deployment-prep: swap db master / slave [mediawiki-config] - https://gerrit.wikimedia.org/r/492938 (https://phabricator.wikimedia.org/T216067) (owner: 20after4)
[01:19:48] I’m pretty sure no one will cause a problem over you self +2’ing over a beta change
[01:20:05] Also 2 people gave you the go ahead :)
[01:20:33] (Merged) jenkins-bot: deployment-prep: swap db master / slave [mediawiki-config] - https://gerrit.wikimedia.org/r/492938 (https://phabricator.wikimedia.org/T216067) (owner: 20after4)
[01:22:39] paladox: I know I'm not really actually worried ;)
[01:22:52] :)
[01:24:24] it's been a couple of years but IIRC that policy was never successfully enforced against operations/ repositories
[01:31:06] (CR) jenkins-bot: deployment-prep: swap db master / slave [mediawiki-config] - https://gerrit.wikimedia.org/r/492938 (https://phabricator.wikimedia.org/T216067) (owner: 20after4)
[01:45:05] doesn't the new policy specifically call out this case as allowed?
[01:45:22] last time i read it, there was an example of a deployer deploying a thing
[01:49:20] given that the change only affects beta and this is to unbreak beta, I'm sure it's an ok self-merge. It's still good to have review if someone is able to give it quick look ;)
[01:49:27] (which krenair did, so thanks!)
[02:44:30] (CR) GTirloni: [C: +2] labstores: make failures on these hosts page more [puppet] - https://gerrit.wikimedia.org/r/492761 (https://phabricator.wikimedia.org/T217068) (owner: Andrew Bogott)
[03:00:04] kart_: Time to snap out of that daydream and deploy . Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190226T0300).
[03:06:13] !log Manual run of unpublished ContentTranslation draft purge script (T216983)
[03:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:06:20] T216983: Run unpublished draft purge script for CX (Week of 24/02) - https://phabricator.wikimedia.org/T216983
[04:17:44] Operations, serviceops, Core Platform Team Backlog (Later), Patch-For-Review, Services (next): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (Tgr)
[04:25:32] !log Finished manual run of unpublished ContentTranslation draft purge script (T216983)
[04:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:25:37] T216983: Run unpublished draft purge script for CX (Week of 24/02) - https://phabricator.wikimedia.org/T216983
[05:05:01] (PS1) Aaron Schulz: Make foreign set/delete WAN cache operations asynchronous in mcrouter [puppet] - https://gerrit.wikimedia.org/r/492948
[05:17:32] (PS2) Aaron Schulz: Make foreign set/delete WAN cache operations asynchronous in mcrouter [puppet] - https://gerrit.wikimedia.org/r/492948
[05:17:46] (PS3) Aaron Schulz: Make foreign set/delete WAN cache operations asynchronous in mcrouter [puppet] - https://gerrit.wikimedia.org/r/492948
[05:20:42] (PS4) Aaron Schulz: Make foreign set/delete WAN cache operations asynchronous in mcrouter [puppet] - https://gerrit.wikimedia.org/r/492948
[05:58:23] tgr|away: the reason I suggested EU morning is because we have both DBAs online in case something goes wrong :)
[05:59:51] marostegui: if this counts as morning already, works for me
[06:00:00] tgr: go for it!!
[06:10:24] !log T215107 running mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki 'The_Photographer' 'Wilfredor'
[06:10:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:28] T215107: Global rename of The_Photographer → Wilfredor: supervision needed - https://phabricator.wikimedia.org/T215107
[06:11:01] (PS1) Marostegui: db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/492951 (https://phabricator.wikimedia.org/T187295)
[06:15:10] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/492951 (https://phabricator.wikimedia.org/T187295) (owner: Marostegui)
[06:16:11] (Merged) jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/492951 (https://phabricator.wikimedia.org/T187295) (owner: Marostegui)
[06:16:25] (CR) jenkins-bot: db-eqiad.php: Depool db1083 [mediawiki-config] - https://gerrit.wikimedia.org/r/492951 (https://phabricator.wikimedia.org/T187295) (owner: Marostegui)
[06:17:19] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1083 T187295 (duration: 00m 51s)
[06:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:23] T187295: Apply AbuseFilter patch-fix-index - https://phabricator.wikimedia.org/T187295
[06:17:38] !log Change abuse_filter_log indexes on db1083 - T187295
[06:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:27] (PS1) Marostegui: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/492952 (https://phabricator.wikimedia.org/T86342)
[06:25:59] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/492952 (https://phabricator.wikimedia.org/T86342) (owner: Marostegui)
[06:26:55] (Merged) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/492952 (https://phabricator.wikimedia.org/T86342) (owner: Marostegui)
[06:27:18] (CR) jenkins-bot: db-eqiad.php: Depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/492952 (https://phabricator.wikimedia.org/T86342) (owner: Marostegui)
[06:28:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1088 T86342 (duration: 00m 48s)
[06:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:13] T86342: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342
[06:28:46] tgr: how is it looking like?
[06:29:08] I think it didn't do anything
[06:29:30] the maintenance script just fires off a bunch of jobs and those don't have any logging so hard to tell
[06:29:35] tgr: indeed: https://logstash.wikimedia.org/goto/c9d8027a8ee41bbd16bab2d8137baada :(
[06:30:51] duh, I'm being stupid
[06:31:05] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - https://gerrit.wikimedia.org/r/492953
[06:31:25] the rename is marked as in progress because error handling got all messed up the last time
[06:31:47] so the job needs to be forced now
[06:31:51] Ah so it never "failed"?
[06:31:52] let me try again
[06:32:07] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - https://gerrit.wikimedia.org/r/492953
[06:32:28] yeah, there was a spectacular cascade of errors and the rollback operations failed in all three layers of the code that had them
[06:32:53] and setting the failure status would have happened after that, presumably
[06:33:16] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - https://gerrit.wikimedia.org/r/492953 (owner: Marostegui)
[06:34:07] !log T215107 running mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=commonswiki --logwiki=metawiki --ignorestatus 'The_Photographer' 'Wilfredor'
[06:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:10] T215107: Global rename of The_Photographer → Wilfredor: supervision needed - https://phabricator.wikimedia.org/T215107
[06:34:23] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - https://gerrit.wikimedia.org/r/492953 (owner: Marostegui)
[06:34:58] tgr: I can see the rename now on s4 commons master \o/
[06:35:24] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1083 (duration: 00m 46s)
[06:35:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:05] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/492954
[06:36:55] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/492954
[06:38:33] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1083" [mediawiki-config] - https://gerrit.wikimedia.org/r/492953 (owner: Marostegui)
[06:39:34] tgr: I think it is done on commons, I don't see any related queries on s4 master anymore
[06:40:11] (CR) Elukey: [C: +2] hadoop: allow yarn rmstore to be stored on HDFS [puppet/cdh] - https://gerrit.wikimedia.org/r/492697 (https://phabricator.wikimedia.org/T216952) (owner: Elukey)
[06:40:42] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/492954 (owner: Marostegui)
[06:41:41] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/492954 (owner: Marostegui)
[06:41:46] !log Pooling tthumbor1002
[06:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:05] It has generated lag on one host, I am going to depool it
[06:42:38] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1088 (duration: 00m 45s)
[06:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:55] (PS1) Marostegui: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/492955
[06:43:51] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (jijiki)
[06:44:01] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/492955 (owner: Marostegui)
[06:44:02] marostegui: yeah the job finished there
[06:44:16] and eswiki too so I guess the interesting part is done
[06:44:39] I noticed commonswiki.logging is different on db1097, so it is lagging behind
[06:44:42] I am depooling it
[06:44:59] We are aware of those schema differences on the logging table - hopefully we will get to them soon :(
[06:45:09] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1097:3314 (duration: 00m 45s)
[06:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:18] seems finished
[06:46:10] !log jiji@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided)
[06:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:17] !log jiji@deploy1001 Finished deploy [3d2png/deploy@ca39432]: (no justification provided) (duration: 00m 07s)
[06:46:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:38] tgr: some hosts in s4 codfw lagging behind, but that's kinda expected
[06:46:45] and they just recovered :)
[06:46:56] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/492956
[06:49:07] (PS1) Elukey: hadoop: store Yarn rmstore state on HDFS [puppet] - https://gerrit.wikimedia.org/r/492957 (https://phabricator.wikimedia.org/T216952)
[06:49:49] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1088" [mediawiki-config] - https://gerrit.wikimedia.org/r/492954 (owner: Marostegui)
[06:49:51] (CR) jenkins-bot: db-eqiad.php: Depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/492955 (owner: Marostegui)
[06:50:35] Operations, ops-eqiad, Analytics, Patch-For-Review, User-Elukey: rack/setup/install labsdb1012.eqiad.wmnet - https://phabricator.wikimedia.org/T215231 (elukey) @Cmjohnson would it be possible to move the host in a new Rack?
[06:50:52] !log Depool and reimage thumbor1003 and thumbor2003 - T214597
[06:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:50:55] T214597: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597
[06:51:17] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/492956 (owner: Marostegui)
[06:52:13] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/492956 (owner: Marostegui)
[06:53:10] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1097:3314 (duration: 00m 45s)
[06:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:33] (PS1) Elukey: hadoop: add missing '$' in variable reference [puppet/cdh] - https://gerrit.wikimedia.org/r/492958
[06:54:47] (CR) Elukey: [V: +2 C: +2] hadoop: add missing '$' in variable reference [puppet/cdh] - https://gerrit.wikimedia.org/r/492958 (owner: Elukey)
[06:55:47] (PS2) Elukey: hadoop: store Yarn rmstore state on HDFS [puppet] - https://gerrit.wikimedia.org/r/492957 (https://phabricator.wikimedia.org/T216952)
[07:01:37] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1097:3314" [mediawiki-config] - https://gerrit.wikimedia.org/r/492956 (owner: Marostegui)
[07:02:55] !log Deploy schema change on s2 codfw (this will generate lag on s2 codfw) T86342
[07:02:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:59] T86342: Dropping page.page_no_title_convert on wmf databases - https://phabricator.wikimedia.org/T86342
[07:05:34] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin1001.eqiad.wmnet for hosts: ` thumbor1003.eqiad.wmnet ` The log can be found in...
[07:05:59] (PS1) Elukey: hadoop: fix name of default variable argument [puppet/cdh] - https://gerrit.wikimedia.org/r/492959
[07:06:17] (CR) Elukey: [V: +2 C: +2] hadoop: fix name of default variable argument [puppet/cdh] - https://gerrit.wikimedia.org/r/492959 (owner: Elukey)
[07:08:30] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts: ` thumbor2003.codfw.wmnet ` The log can be found in...
[07:10:11] (PS1) Elukey: hadoop: set the default yarn rmstore zk path to undef [puppet/cdh] - https://gerrit.wikimedia.org/r/492960 (https://phabricator.wikimedia.org/T216952)
[07:11:35] (CR) Elukey: [V: +2 C: +2] hadoop: set the default yarn rmstore zk path to undef [puppet/cdh] - https://gerrit.wikimedia.org/r/492960 (https://phabricator.wikimedia.org/T216952) (owner: Elukey)
[07:12:21] (PS3) Elukey: hadoop: store Yarn rmstore state on HDFS [puppet] - https://gerrit.wikimedia.org/r/492957 (https://phabricator.wikimedia.org/T216952)
[07:14:03] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/14847/analytics1028.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/492957 (https://phabricator.wikimedia.org/T216952) (owner: Elukey)
[07:22:07] (PS1) Elukey: hadoop: set proper hdfs path for the Yarn rmstore (test cluster) [puppet] - https://gerrit.wikimedia.org/r/492961 (https://phabricator.wikimedia.org/T216952)
[07:22:57] (CR) Elukey: [C: +2] hadoop: set proper hdfs path for the Yarn rmstore (test cluster) [puppet] - https://gerrit.wikimedia.org/r/492961 (https://phabricator.wikimedia.org/T216952) (owner: Elukey)
[07:39:10] (PS2) Giuseppe Lavagetto: Add golang image [docker-images/production-images] - https://gerrit.wikimedia.org/r/492667
[07:39:54] (CR) Giuseppe Lavagetto: Add golang image (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/492667 (owner: Giuseppe Lavagetto)
[07:40:21] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Add golang image [docker-images/production-images] - https://gerrit.wikimedia.org/r/492667 (owner: Giuseppe Lavagetto)
[07:44:09] !log installing tiff security updates
[07:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:39] <_joe_> !log publishing golang:1.11.5-1 docker image
[07:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:47] !log removed /rmstore-analytics-test-hadoop from zookeeper main-eqiad - T216952
[07:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:50] T216952: Hadoop Yarn stores a ton of znodes related to running/old applications - https://phabricator.wikimedia.org/T216952
[07:57:09] ACKNOWLEDGEMENT - MD RAID on thumbor2003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.58. Check system logs on 10.192.16.58 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T217118
[07:57:15] Operations, ops-codfw: Degraded RAID on thumbor2003 - https://phabricator.wikimedia.org/T217118 (ops-monitoring-bot)
[08:03:35] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thumbor2003.codfw.wmnet'] ` and were **ALL** successful.
[08:04:20] !log jiji@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided)
[08:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:50] !log jiji@deploy1001 Finished deploy [3d2png/deploy@ca39432]: (no justification provided) (duration: 00m 30s)
[08:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:25] !log Pooling thumbor2003 - T214597
[08:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:29] T214597: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597
[08:08:12] (PS1) Alexandros Kosiaris: golang: Fix dependency on stretch image [docker-images/production-images] - https://gerrit.wikimedia.org/r/492965
[08:08:27] !log Depool and reimage thumbor2004 - T214597
[08:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:50] Operations, Release Pipeline, Core Platform Team Backlog (Watching / External), Patch-For-Review, and 2 others: Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (akosiaris) I 've just added a simple patch to blubber that should...
[08:23:54] (PS1) Vgutierrez: acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o [puppet] - https://gerrit.wikimedia.org/r/492966 (https://phabricator.wikimedia.org/T214921)
[08:24:32] (CR) Vgutierrez: [C: +1] acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o [puppet] - https://gerrit.wikimedia.org/r/492966 (https://phabricator.wikimedia.org/T214921) (owner: Vgutierrez)
[08:24:37] (CR) Marostegui: mariadb: Add the option of postprocessing backups (3 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: Jcrespo)
[08:27:23] (CR) Mathew.onipe: [C: +1] acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o [puppet] - https://gerrit.wikimedia.org/r/492966 (https://phabricator.wikimedia.org/T214921) (owner: Vgutierrez)
[08:28:03] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jiji on cumin2001.codfw.wmnet for hosts: ` thumbor2004.codfw.wmnet ` The log can be found in...
[08:28:13] (PS1) Marostegui: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/492967
[08:28:15] (CR) Vgutierrez: [C: +2] acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o [puppet] - https://gerrit.wikimedia.org/r/492966 (https://phabricator.wikimedia.org/T214921) (owner: Vgutierrez)
[08:29:15] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/492967 (owner: Marostegui)
[08:30:24] (Merged) jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/492967 (owner: Marostegui)
[08:31:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1080 (duration: 00m 46s)
[08:31:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:29] (CR) jenkins-bot: db-eqiad.php: Depool db1080 [mediawiki-config] - https://gerrit.wikimedia.org/r/492967 (owner: Marostegui)
[08:36:08] Operations, Cloud-VPS, cloud-services-team, Discovery-Search (Current work), Patch-For-Review: Setup elasticsearch on cloudelastic100[1-4] - https://phabricator.wikimedia.org/T214921 (Vgutierrez) The certificate has been issued successfully: `vgutierrez@acmechief1001:~$ sudo -i openssl x509 -...
[08:36:17] ACKNOWLEDGEMENT - MD RAID on thumbor1003 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.71. Check system logs on 10.64.48.71 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T217121
[08:36:21] Operations, ops-eqiad: Degraded RAID on thumbor1003 - https://phabricator.wikimedia.org/T217121 (ops-monitoring-bot)
[08:38:29] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - https://gerrit.wikimedia.org/r/492969
[08:40:36] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - https://gerrit.wikimedia.org/r/492969 (owner: Marostegui)
[08:41:43] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - https://gerrit.wikimedia.org/r/492969 (owner: Marostegui)
[08:42:44] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1080 (duration: 00m 45s)
[08:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:33] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (jijiki)
[08:44:43] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1080" [mediawiki-config] - https://gerrit.wikimedia.org/r/492969 (owner: Marostegui)
[08:44:48] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thumbor1003.eqiad.wmnet'] ` and were **ALL** successful.
[08:45:07] !log installing elfutils security updates
[08:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:20] !log jiji@deploy1001 Started deploy [3d2png/deploy@ca39432]: (no justification provided)
[08:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:09] !log jiji@deploy1001 Finished deploy [3d2png/deploy@ca39432]: (no justification provided) (duration: 00m 49s)
[08:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:00] (CR) Giuseppe Lavagetto: [V: +2 C: +2] golang: Fix dependency on stretch image [docker-images/production-images] - https://gerrit.wikimedia.org/r/492965 (owner: Alexandros Kosiaris)
[09:00:01] Operations, monitoring, Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (fgiunchedi) Update: while converting `global` instance with full retention `10500h` the migrator ran into a problem that would abort the migration. I...
[09:04:21] Operations, ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (fgiunchedi) >>! In T204567#4981395, @Papaul wrote: > @fgiunchedi we can do this tomorrow if thats okay with you. Thanks. Sounds good to me -- ping me on IRC when good to go!
[09:04:59] !log Pooling thumbor1003
[09:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:52] !log temporarilly stop dbstore1001:s1replication to perform new backup system test
[09:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:23] (PS1) Vgutierrez: Revert "acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o" [puppet] - https://gerrit.wikimedia.org/r/492974
[09:12:34] sigh
[09:13:00] (CR) jerkins-bot: [V: -1] Revert "acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o" [puppet] - https://gerrit.wikimedia.org/r/492974 (owner: Vgutierrez)
[09:13:08] WTF?
[09:13:37] Weird
[09:13:56] oh cool.. gerrit generates commit messages that not comply with our commit msg validator
[09:13:57] \o/
[09:14:52] (PS2) Vgutierrez: Revert "acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o" [puppet] - https://gerrit.wikimedia.org/r/492974 (https://phabricator.wikimedia.org/T214921)
[09:16:03] (CR) Vgutierrez: [C: +2] Revert "acme_chief: Issue a certificate for cloudelastic100[1-4].wm.o" [puppet] - https://gerrit.wikimedia.org/r/492974 (https://phabricator.wikimedia.org/T214921) (owner: Vgutierrez)
[09:20:38] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['thumbor2004.codfw.wmnet'] ` and were **ALL** successful.
[09:22:25] (CR) Alexandros Kosiaris: [C: +1] Use schemas from docker image, configure api-request stream [deployment-charts] - https://gerrit.wikimedia.org/r/492925 (https://phabricator.wikimedia.org/T214080) (owner: Ottomata)
[09:26:01] (PS2) Filippo Giunchedi: rsyslog: always use mmjsonparse when shipping to kafka [puppet] - https://gerrit.wikimedia.org/r/492632 (https://phabricator.wikimedia.org/T213189)
[09:26:01] (CR) Vgutierrez: [C: +2] certcentral: wipe certcentral from our puppet codebase [puppet] - https://gerrit.wikimedia.org/r/492744 (https://phabricator.wikimedia.org/T207389) (owner: Vgutierrez)
[09:26:01] (PS2) Vgutierrez: certcentral: wipe certcentral from our puppet codebase [puppet] - https://gerrit.wikimedia.org/r/492744 (https://phabricator.wikimedia.org/T207389)
[09:30:07] (CR) Alexandros Kosiaris: [V: +2 C: +1] Use schemas from docker image, configure api-request stream [deployment-charts] - https://gerrit.wikimedia.org/r/492925 (https://phabricator.wikimedia.org/T214080) (owner: Ottomata)
[09:30:07] (CR) Alexandros Kosiaris: [V: +2 C: +2] Use schemas from docker image, configure api-request stream [deployment-charts] - https://gerrit.wikimedia.org/r/492925 (https://phabricator.wikimedia.org/T214080) (owner: Ottomata)
[09:30:07] (CR) Alexandros Kosiaris: [V: +2 C: +2] "thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/492925 (https://phabricator.wikimedia.org/T214080) (owner: Ottomata)
[09:36:01] (PS10) Jcrespo: mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292)
[09:36:53] (CR) jerkins-bot: [V: -1] mariadb: Add the option of postprocessing backups [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: Jcrespo)
[12:17:11] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (jijiki)
[12:18:55] Operations, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (jijiki)
[12:19:50] Operations, Thumbor, serviceops, Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (jijiki)
[12:20:00] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (jijiki) Open→Resolved All servers have been upgraded to stretch, next episode on T216815 🍾
[12:21:53] Operations, Thumbor, serviceops, Patch-For-Review, User-jijiki: Thumbor upgrade to stretch plan - https://phabricator.wikimedia.org/T214597 (jijiki)
[12:21:59] Operations, Thumbor, serviceops, Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (jijiki)
[12:22:03] Operations, Multimedia, Thumbor, serviceops, and 2 others: Deploy 3d2png to thumbor servers (stretch) - https://phabricator.wikimedia.org/T216494 (jijiki) Open→Resolved a:jijiki Thanks to @gilles and @mobrovac, we can close this.
[12:23:09] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jijiki) [12:23:29] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jijiki) [12:23:36] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, 10User-jijiki: Investigate systemd hardening to replace Firejail for Thumbor - https://phabricator.wikimedia.org/T212941 (10jijiki) [12:23:38] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [12:23:46] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jijiki) [12:23:51] 10Operations, 10Thumbor, 10Wikimedia-Logstash, 10serviceops, 10User-jijiki: Stream Thumbor logs to logstash - https://phabricator.wikimedia.org/T212946 (10jijiki) [12:24:03] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jijiki) [12:24:52] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [12:26:12] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [12:31:23] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) 05Open→03Resolved a:03jijiki All servers have been upgraded to stretch, next episode on T216815 🍾 [12:36:19] (03PS1) 10Ladsgroup: labs: Enable musical notation datatype in wikidatawiki in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493009 (https://phabricator.wikimedia.org/T216730) [12:37:20] 10Operations, 10Thumbor, 
10serviceops, 10Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10Gilles) [12:38:30] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10Gilles) [12:41:29] (03PS1) 10Ladsgroup: Enable musical notation datatype on testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493010 (https://phabricator.wikimedia.org/T216730) [12:41:31] (03PS1) 10Ladsgroup: Enable musical notation datatype in wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493011 (https://phabricator.wikimedia.org/T216730) [12:42:45] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.18; 2019-02-19), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) Took a look to some metrics and so far ev... 
[12:52:01] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) [12:52:11] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [12:52:18] 10Operations, 10Commons, 10Thumbor, 10media-storage, 10Performance-Team (Radar): Jessie rsvg/cairo can't render specific SVG file on Commons - https://phabricator.wikimedia.org/T170628 (10jijiki) [12:52:54] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (10jijiki) [12:53:01] 10Operations, 10Commons, 10Thumbor, 10media-storage, 10Performance-Team (Radar): Jessie rsvg/cairo can't render specific SVG file on Commons - https://phabricator.wikimedia.org/T170628 (10jijiki) 05Stalled→03Resolved a:03jijiki Looks like this file is ok after upgrading to stretch [12:53:52] 10Operations, 10Wikimedia-SVG-rendering, 10Upstream: Update librsvg to ≥2.42.3 - https://phabricator.wikimedia.org/T193352 (10jijiki) [13:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190226T1300) [13:09:24] 10Operations, 10Thumbor, 10serviceops: Use image testing for Thumbor upgrades - https://phabricator.wikimedia.org/T217133 (10jijiki) p:05Triage→03Low [13:09:49] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Use image testing for Thumbor upgrades - https://phabricator.wikimedia.org/T217133 (10jijiki) [13:11:19] sorry for the wikibugs spam [13:11:38] upgrading thumbor was a very old task with many many relations between them [13:11:51] and many other old open tasks as well [13:19:18] (03PS1) 10Marostegui: db-eqiad.php: Depool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493020 [13:21:23] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1122 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/493020 (owner: 10Marostegui) [13:22:24] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493020 (owner: 10Marostegui) [13:22:38] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1122 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493020 (owner: 10Marostegui) [13:23:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1122 (duration: 00m 46s) [13:23:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:00] (03PS8) 10Hashar: scan and process templates in parallel [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/484578 [13:31:10] (03PS1) 10Vgutierrez: CI test, do not merge [software/acme-chief] - 10https://gerrit.wikimedia.org/r/493022 (https://phabricator.wikimedia.org/T207389) [13:31:44] (03PS2) 10Hashar: cli.defaults was altered by read_config() [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487848 [13:35:52] bah I forgot the train [13:37:17] jouncebot: next [13:37:17] In 0 hour(s) and 22 minute(s): MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190226T1400) [13:37:41] !log cutting deployment branch 1.33.0-wmf.19 [13:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:37:49] (03Abandoned) 10Vgutierrez: CI test, do not merge [software/acme-chief] - 10https://gerrit.wikimedia.org/r/493022 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [13:39:58] (03PS1) 10Sbisson: Enable the GrowthExperiments Homepage in lab [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493023 (https://phabricator.wikimedia.org/T215982) [13:40:24] jouncebot: now [13:40:25] For the next 0 hour(s) and 19 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190226T1300) [13:44:37] (03PS2) 10Vgutierrez: fix eqiad lvs cross-vlan A records [dns] - 
10https://gerrit.wikimedia.org/r/451614 [13:48:50] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493025 [13:49:04] (03CR) 10Filippo Giunchedi: [C: 03+1] "Looks good, see inline for last regexp-nits, after that I think good to merge (no need to re-review)" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/492734 (owner: 10Alexandros Kosiaris) [13:58:17] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493025 (owner: 10Marostegui) [13:58:22] (03CR) 10Filippo Giunchedi: [C: 03+1] nagios_common::commands: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/491460 (owner: 10Muehlenhoff) [13:58:44] (03CR) 10Vgutierrez: [C: 03+1] "This CR reduces the number of warnings from 10954 to 10884, specifically reduces the W001|MISSING_IP_FOR_NAME_AND_PTR from 4838 to 4768" [dns] - 10https://gerrit.wikimedia.org/r/451614 (owner: 10Vgutierrez) [13:59:27] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493025 (owner: 10Marostegui) [13:59:46] (03CR) 10Filippo Giunchedi: "FYI a similar change has been merged today, I think this can be abandoned now" [puppet] - 10https://gerrit.wikimedia.org/r/490797 (owner: 10Paladox) [14:00:04] hashar: That opportune time is upon us again. Time for a MediaWiki train - European version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190226T1400). 
[14:00:27] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1122 (duration: 00m 45s) [14:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:30] elukey: impressive :) [14:10:04] (03PS6) 10Fsero: Enabling docker registry swift replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [14:11:56] (03PS1) 10Vgutierrez: deployment-prep: Get rid of deprecated certcentral references [puppet] - 10https://gerrit.wikimedia.org/r/493027 (https://phabricator.wikimedia.org/T207389) [14:14:38] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493025 (owner: 10Marostegui) [14:15:19] (03CR) 10Ottomata: "Waiting is fine with me, but I don't think it has to wait at all! The deployed code is already using the EventServices config and not usi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492770 (owner: 10Ppchelko) [14:17:42] (03CR) 10Vgutierrez: [C: 03+2] deployment-prep: Get rid of deprecated certcentral references [puppet] - 10https://gerrit.wikimedia.org/r/493027 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [14:18:19] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stat1007 for kharlan - https://phabricator.wikimedia.org/T216258 (10Ottomata) Interesting, it might be good for me to at least comment on questions like this. Am I the owner of that group? Probably one of either me, Luca, or Nur... 
[14:20:47] !log Applied 1.33.0-wmf.19 security patches | T206673 [14:20:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:50] T206673: 1.33.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T206673 [14:24:38] (03PS1) 10Hashar: Group0 to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493028 (https://phabricator.wikimedia.org/T206673) [14:24:40] (03CR) 10Ottomata: [C: 03+1] "One nit, +1 tho thank youuuuu" (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/492978 (owner: 10Elukey) [14:34:25] 10Operations, 10monitoring, 10Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10Gilles) What does that mean, exactly? Losing 84 days worth of data that's already 15 months old? [14:36:27] (03CR) 10Elukey: Remove all inheritance occurrences (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/492978 (owner: 10Elukey) [14:36:44] !log hashar@deploy1001 Pruned MediaWiki: 1.33.0-wmf.19 (duration: 04m 42s) [14:36:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:05] what [14:37:23] ahah [14:37:25] wrong command :- [14:37:34] * hashar starts again from scratch [14:39:49] 10Operations, 10Acme-chief, 10Traffic, 10Goal: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (10Vgutierrez) [14:40:41] (03PS11) 10Elukey: Remove all inheritance occurrences [puppet/cdh] - 10https://gerrit.wikimedia.org/r/492978 [14:41:20] 10Operations, 10Acme-chief, 10Traffic, 10Goal: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (10Krenair) Possible other subtasks: {T207294} {T203423}? 
[14:42:56] (03PS2) 10CDanis: webserver_misc_apps: unbundle grafana [puppet] - 10https://gerrit.wikimedia.org/r/491780 [14:46:17] 10Operations, 10Acme-chief, 10Icinga, 10monitoring: Create icinga checks for certcentral - https://phabricator.wikimedia.org/T207294 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [14:47:22] (03PS1) 10Ottomata: Add eventgate-analytics-0.0.5.tgz and update index [deployment-charts] - 10https://gerrit.wikimedia.org/r/493032 [14:48:05] (03CR) 10Vgutierrez: [C: 03+2] fix eqiad lvs cross-vlan A records [dns] - 10https://gerrit.wikimedia.org/r/451614 (owner: 10Vgutierrez) [14:48:43] (03PS7) 10Fsero: Enabling docker registry swift replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [14:48:50] (03PS1) 10Gilles: Enable Priority Hints origin trial on ruwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493036 (https://phabricator.wikimedia.org/T216499) [14:48:57] 10Operations, 10monitoring, 10Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10fgiunchedi) >>! In T187987#4984611, @Gilles wrote: > What does that mean, exactly? Losing 84 days worth of data that's already 15 months old? That's... [14:49:57] 10Operations, 10monitoring, 10Patch-For-Review: Serve >= 50% of production Prometheus systems with Prometheus v2 - https://phabricator.wikimedia.org/T187987 (10Gilles) Sounds fine to me! 
[14:50:38] (03CR) 10Elukey: ">" (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/492978 (owner: 10Elukey) [14:51:08] (03CR) 10Elukey: "> >" (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/492978 (owner: 10Elukey) [14:51:25] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add eventgate-analytics-0.0.5.tgz and update index [deployment-charts] - 10https://gerrit.wikimedia.org/r/493032 (owner: 10Ottomata) [14:53:10] (03CR) 10jerkins-bot: [V: 04-1] Enabling docker registry swift replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [14:53:45] Nikerabbit: yes really great! If we manage to reduce the size of the translate-groups key i think that we'll be done done done :) [14:54:01] (03PS3) 10Andrew Bogott: labstores: make failures on these hosts page more [puppet] - 10https://gerrit.wikimedia.org/r/492761 (https://phabricator.wikimedia.org/T217068) [14:56:22] (03PS8) 10Fsero: Enabling docker registry swift replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [14:58:04] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [14:58:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:10] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [14:58:10] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:09] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:10] !log otto@deploy1001 scap-helm eventgate-analytics 
cluster staging completed [15:02:10] !log otto@deploy1001 scap-helm eventgate-analytics finished [15:02:10] !log hashar@deploy1001 Started scap: testwiki to php-1.33.0-wmf.19 and rebuild l10n cache # T206673 [15:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:16] T206673: 1.33.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T206673 [15:02:52] elukey: yeah I'm hoping that me or someone else will have a look at that in the coming weeks [15:02:53] (03CR) 10jerkins-bot: [V: 04-1] Enabling docker registry swift replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [15:03:40] (03CR) 10Filippo Giunchedi: [C: 03+1] hiera: upgrade prometheus-node-exporter to 0.17 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/490689 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [15:05:10] Nikerabbit: thanks! Lemme know if I can help! 
[15:05:26] (03PS3) 10Cwhite: hiera: upgrade prometheus-node-exporter to 0.17 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/490689 (https://phabricator.wikimedia.org/T213708) [15:08:20] (03CR) 10Cwhite: [C: 03+2] hiera: upgrade prometheus-node-exporter to 0.17 in codfw [puppet] - 10https://gerrit.wikimedia.org/r/490689 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [15:09:37] (03PS9) 10Fsero: Enabling docker registry swift replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [15:13:36] PROBLEM - DPKG on mx2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:13:48] !log poweroff ms-be2030 - T204567 [15:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:51] T204567: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 [15:14:00] PROBLEM - puppet last run on mw2249 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:14:30] PROBLEM - puppet last run on elastic2026 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:14:39] 10Operations, 10serviceops, 10Performance-Team (Radar), 10User-Elukey, 10User-jijiki: Test different growth factors for memcached (prep step for upgrade to newer versions) - https://phabricator.wikimedia.org/T217020 (10elukey) [15:14:54] PROBLEM - puppet last run on rdb2004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:00] PROBLEM - puppet last run on db2068 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:06] PROBLEM - puppet last run on mw2191 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:06] PROBLEM - puppet last run on pc2008 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:08] PROBLEM - puppet last run on cp2006 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:14] PROBLEM - puppet last run on wdqs2002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:28] PROBLEM - puppet last run on ms-be2042 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:42] PROBLEM - puppet last run on acrab is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:42] PROBLEM - puppet last run on db2046 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:42] PROBLEM - puppet last run on wtp2020 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:56] PROBLEM - puppet last run on db2091 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:56] PROBLEM - puppet last run on db2093 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:56] PROBLEM - puppet last run on ms-be2041 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:58] PROBLEM - puppet last run on mw2278 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:58] PROBLEM - puppet last run on mw2282 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:58] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:58] PROBLEM - puppet last run on mw2254 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:58] PROBLEM - puppet last run on mw2273 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:15:58] PROBLEM - puppet last run on mw2159 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:16:10] PROBLEM - puppet last run on db2066 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:16:14] PROBLEM - puppet last run on ganeti2003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:16:14] PROBLEM - puppet last run on webperf2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:16:16] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:16:16] PROBLEM - puppet last run on mw2227 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:16:16] PROBLEM - puppet last run on elastic2025 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:16:22] godog: prometheus node exporter is broken :) [15:16:24] PROBLEM - puppet last run on ms-be2039 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:16:27] I suspect prometheus-node-exporter 0.17 is not well :) [15:16:28] PROBLEM - puppet last run on db2078 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:16:28] PROBLEM - puppet last run on wtp2011 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:16:45] shdubsh: ^^^ :) prometheus node exporter is broken :) [15:16:50] PROBLEM - puppet last run on elastic2034 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:16:56] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:17:06] PROBLEM - puppet last run on ganeti2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:17:08] PROBLEM - puppet last run on db2079 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:17:08] PROBLEM - puppet last run on ores2001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:17:10] PROBLEM - puppet last run on mw2235 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:17:15] I’ll kill ircecho for the moment [15:17:16] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: Puppet has 2 failures. 
Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:17:25] rollback! :D [15:17:42] checking if a second puppet run makes it work [15:17:42] ouch, sorry about the spam [15:17:45] herron: you just killed it, as I found the reason, why the quiet did not work :D [15:17:48] !log stopped ircecho to squelch puppet run alerts [15:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:55] :) [15:18:00] I will remove my quiet, so it will work again, when you run ircecho again ;) [15:18:19] ok thanks [15:18:20] seems a second puppet run clears the failure, when I tested on ores2001 [15:18:28] Notice: /Stage[main]/Prometheus::Node_exporter/Service[prometheus-node-exporter-ipmitool-sensor.timer]/enable: enable changed 'true' to 'mask' [15:18:31] Notice: /Stage[main]/Prometheus::Node_exporter/Service[prometheus-node-exporter-smartmon.timer]/ensure: ensure changed 'running' to 'stopped' [15:18:34] Notice: Applied catalog in 10.89 seconds [15:18:42] bblack: can confirm on mw2206 [15:18:56] may just need to cumin a double-agent-run across codfw? [15:19:19] I think so too, will run that on failed hosts [15:19:24] ok [15:19:47] https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed if needed [15:20:18] seems like an extra require or two is in order... 
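Note on the exchange above: one plausible reading of "a second puppet run clears the failure" and "an extra require or two is in order" is an ordering problem, where the timer units were managed before the package that ships them was in place. The following is a minimal toy sketch of that idea, not Puppet code and not how Puppet is implemented. The resource names are copied or adapted from the log, `Package[prometheus-node-exporter]` is an assumed name, and `apply_run` and `ordered` are hypothetical helpers written only for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical resource names modeled on the log; this is not real Puppet code.
PACKAGE = "Package[prometheus-node-exporter]"
TIMER = "Service[prometheus-node-exporter-smartmon.timer]"


def apply_run(order, installed):
    """Apply resources in the given order and return the ones that failed.

    `installed` is the set of packages already on the host; it is updated
    in place so that a later run sees what an earlier run installed.
    """
    failed = []
    for res in order:
        if res == PACKAGE:
            installed.add(PACKAGE)
        elif res == TIMER and PACKAGE not in installed:
            # Toy assumption: the unit is shipped by the package,
            # so it cannot be managed before the package exists.
            failed.append(res)
    return failed


def ordered(resources, requires):
    """Sort resources so that every declared requirement comes first."""
    graph = {res: requires.get(res, ()) for res in resources}
    return list(TopologicalSorter(graph).static_order())


host = set()
# No ordering declared: the timer happens to be handled before the package.
first = apply_run([TIMER, PACKAGE], host)
# Second run on the same host: the package is present now, so nothing fails.
second = apply_run([TIMER, PACKAGE], host)
# Fresh host with an explicit "timer requires package" edge.
fixed = apply_run(ordered([TIMER, PACKAGE], {TIMER: [PACKAGE]}), set())
```

In this toy model the first run reports the timer as failed, the rerun is clean, and declaring the requirement avoids the failure on the first run. That matches the shape of what was observed on ores2001 and mw2206, but the actual root cause is whatever the follow-up patch addresses.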
[15:21:03] !log force puppet run on failed agents in codfw [15:21:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:58] (03PS1) 10Cwhite: prometheus: require package before disabling services [puppet] - 10https://gerrit.wikimedia.org/r/493047 (https://phabricator.wikimedia.org/T213708) [15:22:40] (03PS2) 10Cwhite: prometheus: require package before masking services [puppet] - 10https://gerrit.wikimedia.org/r/493047 (https://phabricator.wikimedia.org/T213708) [15:24:12] (03CR) 10Alexandros Kosiaris: Switch mathoid_requests_duration to a histogram (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/492734 (owner: 10Alexandros Kosiaris) [15:24:20] (03PS6) 10Alexandros Kosiaris: Switch mathoid_requests_duration to a histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/492734 [15:25:36] PROBLEM - puppet last run on mw2187 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 5 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:25:48] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Switch mathoid_requests_duration to a histogram [deployment-charts] - 10https://gerrit.wikimedia.org/r/492734 (owner: 10Alexandros Kosiaris) [15:25:56] PROBLEM - puppet last run on restbase2013 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:25:56] PROBLEM - puppet last run on wtp2018 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Service[prometheus-node-exporter-ipmitool-sensor.timer],Service[prometheus-node-exporter-smartmon.timer] [15:26:25] stopped ircecho again, puppet restarted it [15:26:25] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: require package before masking services [puppet] - 10https://gerrit.wikimedia.org/r/493047 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [15:26:36] (03PS1) 10Mathew.onipe: cirrus: fallback option to localssl using acme subject [puppet] - 10https://gerrit.wikimedia.org/r/493048 (https://phabricator.wikimedia.org/T214921) [15:26:40] (03CR) 10Cwhite: [C: 03+2] prometheus: require package before masking services [puppet] - 10https://gerrit.wikimedia.org/r/493047 (https://phabricator.wikimedia.org/T213708) (owner: 10Cwhite) [15:27:56] (03CR) 10jerkins-bot: [V: 04-1] cirrus: fallback option to localssl using acme subject [puppet] - 10https://gerrit.wikimedia.org/r/493048 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [15:28:14] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Followup for a2b0fdbfc60a5d1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/491989 (owner: 10Alexandros Kosiaris) [15:30:45] (03PS2) 10Mathew.onipe: cirrus: fallback option to localssl using acme subject [puppet] - 10https://gerrit.wikimedia.org/r/493048 (https://phabricator.wikimedia.org/T214921) [15:30:51] (03CR) 10Ottomata: "Ah, ok if we override this even if zookeeper_hosts is set, then leave it in as a parameter as is." 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/492978 (owner: 10Elukey) [15:32:42] sync-apaches: 60% (ok: 158; fail: 0; left: 104) [15:32:45] that takes a while :/ [15:34:18] (03PS12) 10Elukey: Remove all inheritance occurrences [puppet/cdh] - 10https://gerrit.wikimedia.org/r/492978 [15:34:52] (03PS1) 10BBlack: k8s pod zone_validator cleanup [dns] - 10https://gerrit.wikimedia.org/r/493050 [15:35:47] (03PS1) 10Alexandros Kosiaris: Bump mathoid chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/493051 [15:36:28] (03PS1) 10Fsero: added new hieradata for docker-registry-ha [labs/private] - 10https://gerrit.wikimedia.org/r/493052 [15:37:03] (03CR) 10Fsero: [V: 03+2 C: 03+2] added new hieradata for docker-registry-ha [labs/private] - 10https://gerrit.wikimedia.org/r/493052 (owner: 10Fsero) [15:37:10] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Bump mathoid chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/493051 (owner: 10Alexandros Kosiaris) [15:38:43] volans: akosiaris: https://gerrit.wikimedia.org/r/493050 (kills 61% of the current warning count actually, but either of you may have significant nits about it) [15:39:08] (03CR) 10Elukey: [V: 03+2 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/14866/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/492978 (owner: 10Elukey) [15:40:35] (03PS1) 10Elukey: Update the cdh module to its latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/493053 [15:41:01] (actually I'm pretty sure the "skip mgmt for k8s pods" part of that patch is wrong in a couple of ways, even though it seems to work on the surface, heh) [15:42:13] (03PS1) 10Gilles: Oversample navtiming on ruwiki and eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493055 (https://phabricator.wikimedia.org/T187299) [15:42:43] (03CR) 10Elukey: [C: 03+2] Update the cdh module to its latest SHA [puppet] - 10https://gerrit.wikimedia.org/r/493053 (owner: 10Elukey) [15:42:57] (03CR) 10Volans: k8s pod 
zone_validator cleanup (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/493050 (owner: 10BBlack) [15:43:13] bleh it's still not fixing all the k8s pod entries anyways, only some of them [15:43:20] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-staging-values.yaml staging stable/mathoid [namespace: mathoid, clusters: staging] [15:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:27] !log akosiaris@deploy1001 scap-helm mathoid cluster staging completed [15:43:27] !log akosiaris@deploy1001 scap-helm mathoid finished [15:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:37] (03CR) 10Fsero: "https://puppet-compiler.wmflabs.org/compiler1001/14868/registry1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [15:45:23] (03CR) 10Ppchelko: [C: 03+1] "Indeed. I've got lost in all our refactorings." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/492770 (owner: 10Ppchelko) [15:47:00] !log akosiaris@deploy1001 scap-helm mathoid upgrade -f mathoid-values.yaml production stable/mathoid [namespace: mathoid, clusters: eqiad,codfw] [15:47:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:05] !log akosiaris@deploy1001 scap-helm mathoid cluster eqiad completed [15:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:12] !log akosiaris@deploy1001 scap-helm mathoid cluster codfw completed [15:47:12] !log akosiaris@deploy1001 scap-helm mathoid finished [15:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:02] (03CR) 10Vgutierrez: [C: 04-1] "current nodes using profile::elasticsearch::cirrus class fail on compilation time: https://puppet-compiler.wmflabs.org/compiler1001/14870/" [puppet] - 10https://gerrit.wikimedia.org/r/493048 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [15:54:17] (03PS10) 10Fsero: Enabling docker registry swift replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [15:54:53] (03CR) 10Jcrespo: "BTW, jenkins will not be happy with this commit, because transfer.py has test errors and other files do not lint either." 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/491818 (https://phabricator.wikimedia.org/T210292) (owner: 10Jcrespo) [15:54:55] (03PS1) 10Fsero: added new hieradata for docker-registry-ha, swift accounts [labs/private] - 10https://gerrit.wikimedia.org/r/493058 [15:55:31] (03CR) 10Fsero: [V: 03+2 C: 03+2] added new hieradata for docker-registry-ha, swift accounts [labs/private] - 10https://gerrit.wikimedia.org/r/493058 (owner: 10Fsero) [15:56:16] (03PS2) 10BBlack: k8s pod zone_validator cleanup [dns] - 10https://gerrit.wikimedia.org/r/493050 [15:57:29] (03CR) 10Alexandros Kosiaris: k8s pod zone_validator cleanup (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/493050 (owner: 10BBlack) [16:00:06] (03CR) 10BBlack: k8s pod zone_validator cleanup (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/493050 (owner: 10BBlack) [16:00:27] !log hashar@deploy1001 Finished scap: testwiki to php-1.33.0-wmf.19 and rebuild l10n cache # T206673 (duration: 58m 17s) [16:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:31] T206673: 1.33.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T206673 [16:00:32] !log re-enabling ircecho [16:00:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:58] (03PS1) 10Vgutierrez: get rid of certcentral secrets [labs/private] - 10https://gerrit.wikimedia.org/r/493059 (https://phabricator.wikimedia.org/T207389) [16:01:28] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] get rid of certcentral secrets [labs/private] - 10https://gerrit.wikimedia.org/r/493059 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [16:01:30] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [16:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:32] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [16:01:32] 
!log otto@deploy1001 scap-helm eventgate-analytics finished [16:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:09] herron shdubsh thanks folks! [16:03:26] np! [16:05:00] RECOVERY - puppet last run on icinga2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [16:05:14] (03CR) 10Fsero: "PCC is not so happy regarding swift nodes, the container-realm-sync fails https://puppet-compiler.wmflabs.org/compiler1002/14869/ms-fe1005" [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [16:07:53] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging stable/eventgate-analytics --set main_app.version=v1.0.0-rc2 [namespace: eventgate-analytics, clusters: staging] [16:07:54] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [16:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:54] !log otto@deploy1001 scap-helm eventgate-analytics finished [16:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:46] !log elasticsearch stopped on logstash100[456] T213898 [16:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:49] T213898: Replace and expand Elasticsearch storage in eqiad and upgrade the cluster from Debian jessie to stretch - https://phabricator.wikimedia.org/T213898 [16:09:59] ok time to deploy [16:10:11] (03CR) 10Hashar: [C: 03+2] Group0 to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493028 (https://phabricator.wikimedia.org/T206673) (owner: 10Hashar) [16:10:33] 10Operations, 10ops-codfw: ms-be2030 spontaneous reboot - https://phabricator.wikimedia.org/T204567 (10Papaul) 05Open→03Resolved Firmware upgrade 
complete. Resolving this task for now. [16:10:53] (03PS9) 10Herron: logstash: remove elasticsearch role from logstash100[456] [puppet] - 10https://gerrit.wikimedia.org/r/492695 (https://phabricator.wikimedia.org/T213898) [16:11:15] (03Merged) 10jenkins-bot: Group0 to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493028 (https://phabricator.wikimedia.org/T206673) (owner: 10Hashar) [16:12:32] (03PS3) 10Mathew.onipe: cirrus: fallback option to localssl using acme subject [puppet] - 10https://gerrit.wikimedia.org/r/493048 (https://phabricator.wikimedia.org/T214921) [16:12:46] (03CR) 10Herron: [C: 03+2] logstash: remove elasticsearch role from logstash100[456] [puppet] - 10https://gerrit.wikimedia.org/r/492695 (https://phabricator.wikimedia.org/T213898) (owner: 10Herron) [16:13:04] (03CR) 10jenkins-bot: Group0 to 1.33.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493028 (https://phabricator.wikimedia.org/T206673) (owner: 10Hashar) [16:14:18] !log hashar@deploy1001 rebuilt and synchronized wikiversions files: group0 to 1.33.0-wmf.19 # T206673 [16:14:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:22] T206673: 1.33.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T206673 [16:14:52] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:17:02] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:18:57] (03PS4) 10Mathew.onipe: cirrus: fallback option to localssl using acme subject [puppet] - 10https://gerrit.wikimedia.org/r/493048 (https://phabricator.wikimedia.org/T214921) [16:21:08] (03PS11) 10Fsero: Enabling docker registry swift cross dc replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [16:22:19] (03CR) 10Mathew.onipe: "> Patch Set 4: Verified+2" [puppet] - 10https://gerrit.wikimedia.org/r/493048 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [16:22:37] (03PS1) 10Effie Mouzeli: Apply -R 200 to memcached on mc1028 [puppet] - 10https://gerrit.wikimedia.org/r/493060 (https://phabricator.wikimedia.org/T208844) [16:23:20] (03PS5) 10Mathew.onipe: cirrus: fallback option to use localssl via acme subject [puppet] - 10https://gerrit.wikimedia.org/r/493048 (https://phabricator.wikimedia.org/T214921) [16:23:33] (03CR) 10Effie Mouzeli: [C: 03+2] Apply -R 200 to memcached on mc1028 [puppet] - 10https://gerrit.wikimedia.org/r/493060 (https://phabricator.wikimedia.org/T208844) (owner: 10Effie Mouzeli) [16:24:37] !log Restarting memcached on mc1028 - T208844 [16:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:40] T208844: Apply -R 200 to all the memcached mw object cache instances running in eqiad/codfw - https://phabricator.wikimedia.org/T208844 [16:26:18] (03CR) 10Giuseppe Lavagetto: Improve logging of errors, remove spurious print statements (034 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/492647 (owner: 10Giuseppe Lavagetto) [16:27:42] godog: mutante: going to back up /var/lib/grafana on krypton, merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/491780/, and then manually apt-get remove grafana on krypton [16:28:18] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: Make swift containers for docker registry cross replicated. 
- https://phabricator.wikimedia.org/T214289 (10fsero) It seems there are some issues on the swift side regarding container-real-synchronization. I'll hold this for now and... [16:28:42] cdanis: +1 sounds good to me [16:29:21] (03PS1) 10GTirloni: Show currently building image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/493062 [16:29:41] (03CR) 10CDanis: [C: 03+2] webserver_misc_apps: unbundle grafana [puppet] - 10https://gerrit.wikimedia.org/r/491780 (owner: 10CDanis) [16:29:50] (03PS3) 10CDanis: webserver_misc_apps: unbundle grafana [puppet] - 10https://gerrit.wikimedia.org/r/491780 [16:30:32] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fsero) [16:30:34] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: Make swift containers for docker registry cross replicated. - https://phabricator.wikimedia.org/T214289 (10fsero) 05Open→03Stalled [16:30:39] (03PS1) 10Mathew.onipe: relforge: switch relforge to use localssl [puppet] - 10https://gerrit.wikimedia.org/r/493063 (https://phabricator.wikimedia.org/T214921) [16:30:41] (03CR) 10GTirloni: [C: 03+2] Show currently building image [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/493062 (owner: 10GTirloni) [16:31:30] I think 1.33.0-wmf.19 is fine on group0 Will monitor again tonight [16:31:31] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: Make swift containers for docker registry cross replicated. 
- https://phabricator.wikimedia.org/T214289 (10CDanis) I'm happy to help in the future, although it will also be a learning exercise for me :) [16:31:39] a couple issues with EventBus but might have been one off errors [16:31:43] (I filed tasks) [16:34:57] !log otto@deploy1001 scap-helm eventgate-analytics install -n production eventgate-analytics-eqiad-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [16:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:14] !log otto@deploy1001 scap-helm eventgate-analytics install -n production -f eventgate-analytics-eqiad-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [16:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:18] !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [16:35:18] !log otto@deploy1001 scap-helm eventgate-analytics finished [16:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:39] !log otto@deploy1001 scap-helm eventgate-analytics install -n production -f eventgate-analytics-codfw-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [16:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:44] !log otto@deploy1001 scap-helm eventgate-analytics cluster codfw completed [16:35:44] !log otto@deploy1001 scap-helm eventgate-analytics finished [16:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:01] (03PS1) 10Paladox: Merge branch 'stable-2.15' into wmf/stable-2.15 [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/493066 [16:38:10] (03PS2) 10Paladox: Merge branch 'stable-2.15' into wmf/stable-2.15
[software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/493066 [16:38:13] (03PS1) 10Paladox: Update LFS [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/493067 [16:38:44] !log cdanis@krypton sudo apt-get remove grafana [16:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:21] (03PS5) 10Ottomata: Set up DNS for eventgate-analytics [dns] - 10https://gerrit.wikimedia.org/r/491860 (https://phabricator.wikimedia.org/T211247) [16:43:51] (03CR) 10Ottomata: [C: 03+2] "TODOs will be resolved in next patch after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/491861/ is merged." [dns] - 10https://gerrit.wikimedia.org/r/491860 (https://phabricator.wikimedia.org/T211247) (owner: 10Ottomata) [16:47:30] (03CR) 10Paladox: [V: 03+2 C: 03+2] "Builds locally." [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/493066 (owner: 10Paladox) [16:47:40] (03CR) 10Paladox: [V: 03+2 C: 03+2] "Builds locally." [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/493067 (owner: 10Paladox) [16:48:48] 10Operations, 10ops-eqiad, 10netops: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10RobH) I neglected to update this task with the email I sent out yesterday. > This task is being tracked on: https://phabricator.wikimedia.org/T212348 > > TL;DR: If you got this email, look... [16:52:59] (03CR) 10Sbisson: [C: 03+2] Enable the GrowthExperiments Homepage in lab [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493023 (https://phabricator.wikimedia.org/T215982) (owner: 10Sbisson) [16:53:16] (03PS1) 10Volans: icinga: add both hosts to certificate SNI [puppet] - 10https://gerrit.wikimedia.org/r/493071 [16:53:58] (03PS4) 10Jcrespo: WMFMariaDB refactoring and adding tests [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/449185 [16:54:08] (03CR) 10Vgutierrez: [C: 03+2] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/493071 (owner: 10Volans) [16:54:10] (03Merged) 10jenkins-bot: Enable the GrowthExperiments Homepage in lab [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493023 (https://phabricator.wikimedia.org/T215982) (owner: 10Sbisson) [16:54:22] (03CR) 10jerkins-bot: [V: 04-1] WMFMariaDB refactoring and adding tests [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/449185 (owner: 10Jcrespo) [16:54:45] (03PS2) 10Volans: icinga: add both hosts to certificate SNI [puppet] - 10https://gerrit.wikimedia.org/r/493071 [16:55:07] (03PS2) 10DCausse: [cirrus] decrease regex timeouts by 25% and drop timeout hack [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492989 (https://phabricator.wikimedia.org/T216860) [16:56:20] jouncebot: now [16:56:20] No deployments scheduled for the next 0 hour(s) and 3 minute(s) [16:56:23] jouncebot: next [16:56:24] In 0 hour(s) and 3 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190226T1700) [16:58:50] (03CR) 10jenkins-bot: Enable the GrowthExperiments Homepage in lab [mediawiki-config] - 10https://gerrit.wikimedia.org/r/493023 (https://phabricator.wikimedia.org/T215982) (owner: 10Sbisson) [16:59:36] 10Operations, 10ops-eqiad, 10netops: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10RobH) So, the list in my email is too long, and some of those hosts were previously moved by Chris in advance of my email (likely well in advance, awhile ago, via independent projects.) The act... [17:00:04] godog and _joe_: I, the Bot under the Fountain, allow thee, The Deployer, to do Puppet SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190226T1700). [17:00:04] No GERRIT patches in the queue for this window AFAICS. 
[17:09:13] (03PS1) 10Herron: logstash: enable alerting on new hosts [puppet] - 10https://gerrit.wikimedia.org/r/493075 [17:09:27] (03PS1) 10Fsero: added new hieradata for docker-registry-ha, replication key [labs/private] - 10https://gerrit.wikimedia.org/r/493076 [17:09:53] (03CR) 10Fsero: [V: 03+2 C: 03+2] added new hieradata for docker-registry-ha, replication key [labs/private] - 10https://gerrit.wikimedia.org/r/493076 (owner: 10Fsero) [17:10:20] (03PS6) 10Mathew.onipe: cirrus: fallback option to use localssl via acme subject [puppet] - 10https://gerrit.wikimedia.org/r/493048 (https://phabricator.wikimedia.org/T214921) [17:10:22] (03PS2) 10Mathew.onipe: relforge: switch relforge to use localssl [puppet] - 10https://gerrit.wikimedia.org/r/493063 (https://phabricator.wikimedia.org/T214921) [17:10:24] (03CR) 10Herron: [C: 03+2] logstash: enable alerting on new hosts [puppet] - 10https://gerrit.wikimedia.org/r/493075 (owner: 10Herron) [17:10:29] (03PS2) 10Herron: logstash: enable alerting on new hosts [puppet] - 10https://gerrit.wikimedia.org/r/493075 [17:14:22] (03CR) 10CRusnov: [C: 03+2] ganeti: Change ownership of rapi users file to match required ownership [puppet] - 10https://gerrit.wikimedia.org/r/492203 (https://phabricator.wikimedia.org/T215229) (owner: 10CRusnov) [17:14:32] (03PS4) 10CRusnov: ganeti: Change ownership of rapi users file to match required ownership [puppet] - 10https://gerrit.wikimedia.org/r/492203 (https://phabricator.wikimedia.org/T215229) [17:14:36] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: Make swift containers for docker registry cross replicated. - https://phabricator.wikimedia.org/T214289 (10fsero) With the help of @CDanis now PCC looks happy, @fgiunchedi is good for merge if you think so too. 
[17:14:51] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fsero) [17:14:58] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: Make swift containers for docker registry cross replicated. - https://phabricator.wikimedia.org/T214289 (10fsero) 05Stalled→03Open [17:15:13] (03CR) 10Fsero: "https://puppet-compiler.wmflabs.org/compiler1002/14876/ PCC looks happy" [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [17:18:33] (03CR) 10Alexandros Kosiaris: [C: 03+1] k8s pod zone_validator cleanup [dns] - 10https://gerrit.wikimedia.org/r/493050 (owner: 10BBlack) [17:21:58] (03CR) 10Volans: [C: 04-1] "Nice, but I think there is still something to tweak, see inline." (033 comments) [dns] - 10https://gerrit.wikimedia.org/r/493050 (owner: 10BBlack) [17:25:59] volans: yeah agreed, I was about to tackle that part but then got lost in meeting-land (still there!) [17:26:21] eheheh no prob :) [17:27:05] volans: I don't yet grok part of the last of your comments though. how is the per-line comment vs per-name match fundamentally different? [17:28:12] or maybe another way to slice that: if I just made is_ganeti into is_virtual and had jinja put a "; k8s pod" on the k8s lines and extended the comment match, would that still be borked? [17:28:52] actually I might be at fault here, I was mixing my memories with the actual _validate_ganeti_comments() method [17:29:13] * volans re-checking the logic [17:30:59] bblack: I'm basically wondering if the whole elif len(ganeti)... is already covered by the _validate_ganeti_comments() [17:32:22] I need to dig a bit more but in the middle of something else, I'll come back to it in a bit, sorry [17:33:44] np!
[17:37:28] (03PS1) 10Bstorm: dumpsdistribution: Make pages go off for disk space [puppet] - 10https://gerrit.wikimedia.org/r/493083 (https://phabricator.wikimedia.org/T217068) [17:38:30] (03CR) 10Ottomata: Add eventbus analytics logging alongside with kafka logging. (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163) (owner: 10Ppchelko) [17:38:53] (03CR) 10Andrew Bogott: [C: 03+1] dumpsdistribution: Make pages go off for disk space [puppet] - 10https://gerrit.wikimedia.org/r/493083 (https://phabricator.wikimedia.org/T217068) (owner: 10Bstorm) [17:40:52] (03PS4) 10Ottomata: [WIP] Set up eventgate-analytics.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/491861 (https://phabricator.wikimedia.org/T211247) [17:42:14] (03PS12) 10Ppchelko: Add eventbus analytics logging alongside with kafka logging. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163) [17:42:26] (03PS1) 10CRusnov: Add dummy netbox tokens [labs/private] - 10https://gerrit.wikimedia.org/r/493084 [17:49:44] (03PS1) 10GTirloni: docker: Run docker-gc daily [puppet] - 10https://gerrit.wikimedia.org/r/493085 (https://phabricator.wikimedia.org/T217159) [17:58:10] (03PS3) 10Mathew.onipe: relforge: switch relforge to use localssl [puppet] - 10https://gerrit.wikimedia.org/r/493063 (https://phabricator.wikimedia.org/T214921) [18:00:04] cscott, arlolra, subbu, halfak, and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190226T1800). 
[18:00:15] Nothing for ORES today [18:06:41] !log arlolra@deploy1001 Started deploy [parsoid/deploy@ae76aa2]: Updating Parsoid to e82347d [18:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:04] 10Operations, 10ops-eqiad, 10netops: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10RobH) Ok, @ayonsi double checked my ports and migration plan on the gsheet, and I've started to make the needed port changes (he setup ganeti1008 for me as its a special case.) ` robh@asw2-a-e... [18:12:06] (03PS7) 10Mathew.onipe: cirrus: fallback option to use localssl via acme subject [puppet] - 10https://gerrit.wikimedia.org/r/493048 (https://phabricator.wikimedia.org/T214921) [18:12:08] (03PS4) 10Mathew.onipe: relforge: switch relforge to use localssl [puppet] - 10https://gerrit.wikimedia.org/r/493063 (https://phabricator.wikimedia.org/T214921) [18:17:05] 10Operations, 10ops-eqiad, 10netops: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (10RobH) So I neglected to remove those above from disabled group, did so in next update and removed all the others that were also in disabled but need to be used for this: ` robh@asw2-a-eqiad# s... [18:17:44] !log arlolra@deploy1001 Finished deploy [parsoid/deploy@ae76aa2]: Updating Parsoid to e82347d (duration: 11m 03s) [18:17:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:10] (03PS13) 10Ottomata: Add eventbus analytics logging alongside with kafka logging. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/490668 (https://phabricator.wikimedia.org/T216163) (owner: 10Ppchelko) [18:23:45] (03CR) 10DCausse: [C: 03+1] [cirrus] autocomplete: enable subphrase matching for officewiki (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/492032 (https://phabricator.wikimedia.org/T150153) (owner: 10EBernhardson) [18:24:33] !log Updated Parsoid to e82347d (T204608, T214099, T217093) [18:24:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:39] T217093: Cannot read property 'docId' of undefined - https://phabricator.wikimedia.org/T217093 [18:24:40] T214099: Stress test Parsoid's HTTP API - https://phabricator.wikimedia.org/T214099 [18:24:40] T204608: Use a bag-on-the-side implementation, rather than an internal .dataobject for node data - https://phabricator.wikimedia.org/T204608 [18:30:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10Andrew) @ayounsi I assume you're talking about this? ` labs-instances1-b-eqiad: ipv4: 10.68.16.0/21 ipv6: 2620:0:861:202::/64 ` If so, that range is d... [18:30:10] (03CR) 10Mathew.onipe: "PCC seems ok. 
https://puppet-compiler.wmflabs.org/compiler1002/14879/relforge1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/493063 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [18:30:29] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (10Andrew) a:05Andrew→03ayounsi [18:38:22] (03PS1) 10Andrew Bogott: nova fullstack tests: use Stretch for test VMs [puppet] - 10https://gerrit.wikimedia.org/r/493092 (https://phabricator.wikimedia.org/T215778) [18:39:34] (03CR) 10Andrew Bogott: [C: 03+2] nova fullstack tests: use Stretch for test VMs [puppet] - 10https://gerrit.wikimedia.org/r/493092 (https://phabricator.wikimedia.org/T215778) (owner: 10Andrew Bogott) [18:45:11] (03PS8) 10Mathew.onipe: cirrus: fallback option to use localssl via acme subject [puppet] - 10https://gerrit.wikimedia.org/r/493048 (https://phabricator.wikimedia.org/T214921) [18:45:13] (03PS5) 10Mathew.onipe: relforge: switch relforge to use localssl [puppet] - 10https://gerrit.wikimedia.org/r/493063 (https://phabricator.wikimedia.org/T214921) [18:45:32] (03PS3) 10BBlack: k8s pod zone_validator cleanup [dns] - 10https://gerrit.wikimedia.org/r/493050 [18:45:34] (03PS1) 10BBlack: Remove dead ulsfo cp servers [dns] - 10https://gerrit.wikimedia.org/r/493094 (https://phabricator.wikimedia.org/T167377) [18:45:36] (03PS1) 10BBlack: Remove redundant ganeti comments check [dns] - 10https://gerrit.wikimedia.org/r/493095 [18:49:57] (03CR) 10Mathew.onipe: "PCC is still Ok: https://puppet-compiler.wmflabs.org/compiler1002/14880/" [puppet] - 10https://gerrit.wikimedia.org/r/493063 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [18:55:10] hi [18:55:12] [XHWLfwpAME4AAGqelYYAAABD] 2019-02-26 18:54:56: Fatal exception of type "TypeError" [18:55:18] when doing ?action=purge [18:55:19] you broke it [18:55:42] i iz borked da wiki, i runs nao [18:55:48] 2019-02-26 18:54:56 [XHWLfwpAME4AAGqelYYAAABD] mw1243 
mediawikiwiki 1.33.0-wmf.19 exception ERROR: [XHWLfwpAME4AAGqelYYAAABD] /w/index.php?title=Gerrit/New_repositories/Requests&action=purge TypeError from line 43 of /srv/mediawiki/php-1.33.0-wmf.19/extensions/EventBus/includes/EventBusHooks.php: Argument 1 passed to EventBusHooks::sendResourceChangedEvent() must be an instance of LinkTarget, instance of Title given, called in / [18:55:48] srv/mediawiki/php-1.33.0-wmf.19/extensions/EventBus/includes/EventBusHooks.php on line 324 {"exception_id":"XHWLfwpAME4AAGqelYYAAABD","exception_url":"/w/index.php?title=Gerrit/New_repositories/Requests&action=purge","caught_by":"mwe_handler"} {} [18:56:13] Reedy: let me know if we need a task [18:56:34] Pchelolo: ^^ [18:56:36] hauskatze: Please file a task [18:56:47] Reedy: sure in a momentito [18:56:51] ^^ Reedy i think should be fixed by https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/EventBus/+/493077/ [18:56:58] aha [18:57:06] task already filed https://phabricator.wikimedia.org/T217145 [18:57:22] was merged in eventbus master, needs to be merged into release branch i guess [18:57:42] * Reedy merges [18:57:51] danke, we weren't sure if we should do that your releng [18:57:55] etc. [18:58:22] * hauskatze stops task creation [18:59:09] (03CR) 10Herron: [C: 03+1] rsyslog: always use mmjsonparse when shipping to kafka [puppet] - 10https://gerrit.wikimedia.org/r/492632 (https://phabricator.wikimedia.org/T213189) (owner: 10Filippo Giunchedi) [18:59:40] nice ^ [18:59:42] :) [19:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190226T1900) [19:04:12] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:11:32] RECOVERY - Request latencies on neon is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:12:02] !log reedy@deploy1001 Synchronized php-1.33.0-wmf.19/extensions/EventBus/: T217145 (duration: 00m 54s) [19:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:06] T217145: Catchable fatal error: Argument 1 passed to EventBusHooks::sendResourceChangedEvent() must be an instance of LinkTarget, Title given in /srv/mediawiki/php-1.33.0-wmf.19/extensions/EventBus/includes/EventBusHooks.php on line 324 - https://phabricator.wikimedia.org/T217145 [19:12:41] hauskatze: Try again [19:12:57] ok [19:13:19] works :) [19:13:22] ty [19:13:25] sweet. thanks ottomata [19:13:31] huh.. great. [19:13:31] faster bugfix ever [19:13:36] have a cookie, both [19:13:55] sorry everyone.. [19:23:42] PROBLEM - graphite.wikimedia.org api on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:28] PROBLEM - graphite.wikimedia.org render on graphite1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:46] RECOVERY - graphite.wikimedia.org api on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.091 second response time [19:25:32] RECOVERY - graphite.wikimedia.org render on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 1637 bytes in 0.091 second response time [19:29:35] (03PS1) 10Herron: logstash: shrink es cluster back to 3 nodes, remove retired hosts [puppet] - 10https://gerrit.wikimedia.org/r/493098 (https://phabricator.wikimedia.org/T213898) [19:30:09] (03PS3) 10Herron: lists: drop connection if remote tries to send HELO [puppet] - 10https://gerrit.wikimedia.org/r/490417 (https://phabricator.wikimedia.org/T215251) [19:31:24] (03CR) 10Herron: [C: 03+2] lists: drop connection if remote tries to send HELO [puppet] - 10https://gerrit.wikimedia.org/r/490417 (https://phabricator.wikimedia.org/T215251) (owner: 10Herron) [19:36:10] (03Abandoned) 10Herron: WIP: logstash: ingest udp_localhost messages by severity [puppet] - 
https://gerrit.wikimedia.org/r/490198 (owner: Herron) [19:36:26] (Abandoned) Herron: WIP: logstash: split rsyslog udp_localhost kafka topics by channel [puppet] - https://gerrit.wikimedia.org/r/490193 (owner: Herron) [19:39:23] (PS1) BBlack: [WIP] Mark Non-WMF IPs for zone_validator [dns] - https://gerrit.wikimedia.org/r/493101 [19:42:08] (CR) BBlack: ""Non-WMF" might not quite be the right term (cf ns1.corp). Also, we could just infer this set based on the available reverse zones, but I" [dns] - https://gerrit.wikimedia.org/r/493101 (owner: BBlack) [19:49:44] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:49:53] Operations, ops-eqiad, DC-Ops, decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (ayounsi) Open→Resolved Ok, thanks, deleting the interface range from that switch only: `lang=diff [edit interfaces] - interface-range vlan-cloud-instances1-b-eqiad { -... [19:50:31] Operations, ops-eqiad, DC-Ops, decommission: Decommission old and unused/spare servers in eqiad - https://phabricator.wikimedia.org/T187473 (ayounsi) [19:50:33] Operations, ops-eqiad, DC-Ops, decommission: decom promethium/WMF3571 - https://phabricator.wikimedia.org/T191362 (ayounsi) Resolved→Open a:ayounsi→RobH [19:50:56] (PS2) Giuseppe Lavagetto: Improve logging of errors, remove spurious print statements [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/492647 [19:50:58] (PS2) Giuseppe Lavagetto: Do not try to pull/push if no registry is defined. 
[docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/492648 [19:51:00] (PS2) Giuseppe Lavagetto: Add an update action [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/487793 [19:51:02] (PS1) Herron: logstash: add kafka logging role to logstash101[012] [puppet] - https://gerrit.wikimedia.org/r/493102 (https://phabricator.wikimedia.org/T213898) [19:52:07] (CR) jerkins-bot: [V: -1] Add an update action [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/487793 (owner: Giuseppe Lavagetto) [19:52:12] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [19:55:53] (PS2) Herron: logstash: add kafka logging role to logstash101[012] [puppet] - https://gerrit.wikimedia.org/r/493102 (https://phabricator.wikimedia.org/T213898) [20:00:05] Deploy window MediaWiki train - Americas version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190226T2000) [20:01:49] (PS1) Bmansurov: Labs: Enable reader demographics survey [mediawiki-config] - https://gerrit.wikimedia.org/r/493103 (https://phabricator.wikimedia.org/T217171) [20:02:00] Operations, ops-eqiad, netops: Move servers off asw2-a5-eqiad - https://phabricator.wikimedia.org/T212348 (ayounsi) >>! In T212348#4985425, @RobH wrote: > I thought it was odd some of them that I ran the command to remove from disabled, turned out to not be in use, but not disabled, like xe-7/0/37.... 
[20:03:27] (CR) jerkins-bot: [V: -1] Labs: Enable reader demographics survey [mediawiki-config] - https://gerrit.wikimedia.org/r/493103 (https://phabricator.wikimedia.org/T217171) (owner: Bmansurov) [20:12:40] (PS2) Bmansurov: Labs: Enable reader demographics survey [mediawiki-config] - https://gerrit.wikimedia.org/r/493103 (https://phabricator.wikimedia.org/T217171) [20:13:08] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:14:15] Operations, Maps, Reading-Infrastructure-Team-Backlog, Core Platform Team Backlog (Watching / External), Services (watching): Create Debian packages for Node.js 8 upgrade for Maps - https://phabricator.wikimedia.org/T216521 (Mholloway) This branch will successfully build mapnik from source on... [20:14:24] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:37:21] (CR) Herron: "https://puppet-compiler.wmflabs.org/compiler1002/14883/" [puppet] - https://gerrit.wikimedia.org/r/493102 (https://phabricator.wikimedia.org/T213898) (owner: Herron) [21:06:08] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 53 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [21:06:26] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 40 probes of 393 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [21:11:22] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 3 probes of 430 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [21:11:40] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 
is OK: OK - failed 18 probes of 393 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [21:14:33] Operations, Maps, Reading-Infrastructure-Team-Backlog, Core Platform Team Backlog (Watching / External), Services (watching): Create Debian packages for Node.js 8 upgrade for Maps - https://phabricator.wikimedia.org/T216521 (MSantos) >>! In T216521#4986216, @Mholloway wrote: > This branch wil... [21:27:10] (PS1) Alexandros Kosiaris: deployment_server: Remove deprecated deployment-charts clone [puppet] - https://gerrit.wikimedia.org/r/493111 [21:27:12] (PS1) Alexandros Kosiaris: Run helm repo update more often [puppet] - https://gerrit.wikimedia.org/r/493112 [21:27:14] (PS1) Alexandros Kosiaris: Pull every minute from git the charts [puppet] - https://gerrit.wikimedia.org/r/493113 [21:45:31] (PS7) Herron: WIP: rsyslog: change udp_localhost_compat to define, add mwlog_compat [puppet] - https://gerrit.wikimedia.org/r/492390 [21:46:54] (CR) jerkins-bot: [V: -1] WIP: rsyslog: change udp_localhost_compat to define, add mwlog_compat [puppet] - https://gerrit.wikimedia.org/r/492390 (owner: Herron) [22:00:37] (PS1) Alexandros Kosiaris: Bump eventgate-analytics to 0.0.6 [deployment-charts] - https://gerrit.wikimedia.org/r/493117 [22:04:12] akosiaris: just curious why the bump ? [22:05:17] ottomata: the statsd prometheus mapping [22:05:21] btw [22:06:09] akosiaris@deploy1001:~$ curl -i 10.64.64.191:8192/robots.txt [22:06:10] ... [22:06:16] User-agent: * [22:06:16] Disallow: / [22:06:22] please tell me you are joking [22:06:41] or at least that this is the result of some copy paste error [22:07:11] ottomata: cause I just saw it as well in https://gerrit.wikimedia.org/r/#/c/mediawiki/services/mathoid/+/493026/ [22:07:22] ah right, cool [22:07:32] user agent...? [22:08:01] never touched robots.txt, must come from service template [22:08:02] ... 
[22:08:06] for some reason the robots.txt file instead of being returned as a text/plain [22:08:14] is being returned as HTTP headers [22:08:16] mobrovac: ^ [22:08:31] seems like https://gerrit.wikimedia.org/r/#/c/mediawiki/services/mathoid/+/493026/ is more widespread than I initially thought [22:08:50] https://github.com/wikimedia/service-template-node/blob/master/routes/root.js#L16-L27 [22:09:12] lol [22:09:32] oh my... https://github.com/wikimedia/service-template-node/commit/67d02b6975dff827f2647957703c2854f7a3a3ce [22:10:23] yeah that needs a revert [22:13:06] !log delete local pref for peering sessions in eqsin - T204281 [22:13:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:13:11] T204281: Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 [22:23:30] Operations, netops, Performance-Team (Radar): Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (ayounsi) eqsin, the private-peer term has been removed a while back to do traffic engineering specific to this site. `lang=diff [edit policy-options policy-statement BGP_c... [22:24:26] Operations, Jade, TechCom, Core Platform Team Backlog (Watching / External), and 4 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (Halfak) [22:27:47] Operations, netops, Performance-Team (Radar): Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (ayounsi) ~80Mbps traffic shift to transit too. 
[22:28:38] Operations, Jade, TechCom, Core Platform Team Backlog (Watching / External), and 4 others: Deploy JADE extension to production - https://phabricator.wikimedia.org/T183381 (Halfak) [22:33:12] Operations, Acme-chief, Traffic, Goal: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (Krenair) [22:33:15] Operations, Acme-chief, Traffic: certcentral: Provide script for certificate revocation - https://phabricator.wikimedia.org/T203423 (Krenair) [22:35:24] (CR) Alexandros Kosiaris: [V: +2 C: +2] Bump eventgate-analytics to 0.0.6 [deployment-charts] - https://gerrit.wikimedia.org/r/493117 (owner: Alexandros Kosiaris) [22:36:00] (PS8) Herron: WIP: rsyslog: change udp_localhost_compat to define, add mwlog_compat [puppet] - https://gerrit.wikimedia.org/r/492390 [22:40:45] (CR) Herron: "please see comments inline!" (3 comments) [puppet] - https://gerrit.wikimedia.org/r/492390 (owner: Herron) [22:42:51] !log akosiaris@deploy1001 scap-helm eventgate-analytics upgrade -f eventgate-analytics-codfw-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [22:42:51] !log akosiaris@deploy1001 scap-helm eventgate-analytics upgrade -f eventgate-analytics-eqiad-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [22:42:52] !log akosiaris@deploy1001 scap-helm eventgate-analytics upgrade -f eventgate-analytics-staging-values.yaml stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [22:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:06] !log akosiaris@deploy1001 scap-helm eventgate-analytics upgrade -f eventgate-analytics-codfw-values.yaml production 
stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [22:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:12] !log akosiaris@deploy1001 scap-helm eventgate-analytics cluster codfw completed [22:43:12] !log akosiaris@deploy1001 scap-helm eventgate-analytics finished [22:43:12] !log akosiaris@deploy1001 scap-helm eventgate-analytics upgrade -f eventgate-analytics-eqiad-values.yaml production stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [22:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:17] !log akosiaris@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [22:43:17] !log akosiaris@deploy1001 scap-helm eventgate-analytics finished [22:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:26] !log akosiaris@deploy1001 scap-helm eventgate-analytics upgrade -f eventgate-analytics-staging-values.yaml staging stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [22:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:28] !log akosiaris@deploy1001 scap-helm eventgate-analytics cluster staging completed [22:43:28] !log akosiaris@deploy1001 scap-helm eventgate-analytics finished [22:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:50] Operations, netops: ulsfo <-> router1.corp BGP sessions down - https://phabricator.wikimedia.org/T217207 (ayounsi) p:Triage→High [23:10:58] (CR) Krinkle: [C: +1] Oversample navtiming on ruwiki and eswiki (1 
comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/493055 (https://phabricator.wikimedia.org/T187299) (owner: Gilles) [23:16:10] !log depooled wdqs1006 to see if it's catch up [23:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:02] PROBLEM - puppet last run on cloudvirt1016 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:37:22] !log T217203 running mwscript ~/fixStuckGlobalRename.php --wiki=metawiki --logwiki=metawiki 'Citycarclubfi' 'Urbaanimies' [23:37:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:25] T217203: Please unblock stuck global renames - https://phabricator.wikimedia.org/T217203 [23:39:11] !log T217203 running mwscript ~/fixStuckGlobalRename.php --wiki=metawiki --logwiki=metawiki 'LaurenceKingPublishing' 'Fiona at Laurence King Publishing' [23:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:42:10] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:43:16] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 75330 bytes in 0.282 second response time [23:48:08] RECOVERY - puppet last run on cloudvirt1016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:50:08] (PS13) Paladox: Introduce gr-wikimedia-prettify-ci-comments [software/gerrit/plugins/wikimedia] - https://gerrit.wikimedia.org/r/489483 (https://phabricator.wikimedia.org/T215658)