[00:00:04] RoanKattouw ostriches Krenair MaxSem: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160310T0000). [00:00:04] Krenair dr0ptp4kt yurik: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [00:00:14] yep [00:00:16] hey [00:00:26] (03PS1) 10Krinkle: multiversion: Remove logic for branch pointers in /w/static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276383 (https://phabricator.wikimedia.org/T99096) [00:00:41] who's that gonna be? [00:00:53] greg-g: yo yo yo [00:00:58] greg-g: argh [00:00:59] sorry! [00:00:59] (03PS2) 10Krinkle: multiversion: Remove logic for branch pointers in /w/static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276383 (https://phabricator.wikimedia.org/T99096) [00:01:02] jouncebot: pick [00:01:04] i didn't mean to name you greg-g. [00:01:08] haha [00:01:10] greg-g: i had your name in the window! [00:01:27] OK, I'll deploy [00:02:05] * RoanKattouw raises hand [00:02:05] I got distracted, sorry, but I'll do it [00:02:24] go ahead then [00:02:48] Wait so who's deploying? [00:02:55] 6Operations, 10Mail, 10Wikipedia-Store: why is shop@ -> board@ ? - https://phabricator.wikimedia.org/T127503#2105303 (10Dzahn) Hi All, i have just removed shop@ on the exim side. Thank you Byron, Daniel [00:03:09] 6Operations, 10Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#2105307 (10Dzahn) [00:03:11] 6Operations, 10Mail, 10Wikipedia-Store: why is shop@ -> board@ ? 
- https://phabricator.wikimedia.org/T127503#2105305 (10Dzahn) 5Open>3Resolved a:3Dzahn [00:03:50] says: i picked Roan [00:04:02] dr0ptp4kt: ohai :) [00:04:10] greg-g: ohai [00:04:27] i got so excited to be monitoring a swat i couldn't help myself [00:05:26] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Varnish support for shutting users out of a DC - https://phabricator.wikimedia.org/T129424#2105320 (10BBlack) [00:06:13] Resolved IRL: it's me [00:06:33] yurik: Does https://gerrit.wikimedia.org/r/#/c/276375/ need to go to wmf16 or also wmf15? [00:06:33] dr0ptp4kt: Ditto for https://gerrit.wikimedia.org/r/#/c/276266/ : wmf16 or wmf15 or both? [00:06:58] jhobs, ^^? [00:06:58] jouncebot pick support sounds cool. it could select the most-recently-active nick in the channel, of those actually present (if anyone meets the criteria) :) [00:07:13] bblack: I like it :) [00:07:40] then we'll see the channel get eerily quiet about 30 minutes before any swat [00:07:47] :) [00:07:47] (03CR) 10Catrope: [C: 032] Properly remove SVN Admins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275824 (https://phabricator.wikimedia.org/T105676) (owner: 10Alex Monk) [00:08:36] (03Merged) 10jenkins-bot: Properly remove SVN Admins [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275824 (https://phabricator.wikimedia.org/T105676) (owner: 10Alex Monk) [00:09:18] it could try to be "fair" and pick the one who has done less deploys in a timeframe [00:09:51] or just randomized [00:10:19] mutante: but then it'd have to know who does the deploys... I guess by looking at sal entries? [00:10:31] 6Operations, 6Labs, 10Mail, 10Tool-Labs: remove toolserver mail aliases - https://phabricator.wikimedia.org/T127543#2105335 (10Dzahn) removed ``` -# Not actually an OTRS queue -ts-admins: ts-admins@toolserver.org -zedler-admins: ts-admins@toolserver.org ``` [00:10:35] or... 
we could beat the computers and make decisions ourselves :) [00:11:17] yeah but computers won Go today, so clearly we're now inferior [00:11:29] nah :) needs some kind of "rock paper scissor" game in the channel to determine it [00:11:53] yea, Alpha-Go rules now [00:12:17] RoanKattouw, it should be on both, because the bug report was for the current wikipedia, which is on 15 [00:12:42] 6Operations, 6Labs, 10Mail, 10Tool-Labs: remove toolserver mail aliases - https://phabricator.wikimedia.org/T127543#2105340 (10Dzahn) 5Open>3Resolved a:3Dzahn [00:12:44] 6Operations, 10Mail: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144#2105342 (10Dzahn) [00:13:03] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Remove svnadmins group (duration: 00m 41s) [00:13:04] RoanKattouw: both [00:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:13:18] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [00:13:56] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [00:15:07] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [00:15:24] yurik: definitely both [00:15:37] jhobs, yep, already figured, thx [00:16:52] sorry that was me on unmerged [00:16:56] fixed [00:17:10] (03PS1) 10Dzahn: remove login.m.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/276385 (https://phabricator.wikimedia.org/T123431) [00:17:14] RoanKattouw, thanks [00:17:26] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [00:18:15] Dereckson: You around for your SWAT? [00:18:36] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [00:19:36] Aye. 
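The "jouncebot: pick" idea tossed around above — most-recently-active nick, the one who has done the fewest deploys in a timeframe, or just randomized — could be sketched roughly like this. This is a hypothetical helper, not actual jouncebot code; the nicks and deploy counts are invented sample data:

```python
import random

def pick_deployer(present, deploy_counts, strategy="fewest"):
    """Pick a deployer from the nicks currently present in the channel.

    strategy="fewest" favours whoever has the fewest recent deploys
    (the "fair" option suggested above); strategy="random" rolls dice.
    """
    if not present:
        raise ValueError("nobody around to deploy")
    if strategy == "random":
        return random.choice(sorted(present))
    # Fair mode: fewest logged deploys wins; sorting first makes ties
    # break alphabetically, so the result is deterministic.
    return min(sorted(present), key=lambda nick: deploy_counts.get(nick, 0))

# Invented sample data: SAL-style deploy counts for the last 30 days.
counts = {"RoanKattouw": 12, "MaxSem": 7, "Krenair": 3}
print(pick_deployer({"RoanKattouw", "MaxSem", "Krenair"}, counts))
```

As noted in the channel, the "fair" mode would need a data source for who actually deploys — e.g. parsing SAL entries — which this sketch stubs out as a plain dict.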
[00:20:35] (03CR) 10Catrope: [C: 032] Set NS_PROJECT to Vikilüğət on az.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276382 (https://phabricator.wikimedia.org/T128296) (owner: 10Dereckson) [00:21:01] (03Merged) 10jenkins-bot: Set NS_PROJECT to Vikilüğət on az.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276382 (https://phabricator.wikimedia.org/T128296) (owner: 10Dereckson) [00:23:08] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Set NS_PROJECT on azwiktionary (duration: 00m 29s) [00:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:23:25] Testing. [00:23:59] RoanKattouw: 276382 works. [00:24:21] Awesome, thnaks [00:24:33] Thanks for the deploy. [00:26:45] OK, ZeroBanner and MobileFrontend cherry-picks made and sitting in the Jenkins queue [00:27:38] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [00:30:58] !log Upgrade of HHVM package to 3.12.1+dfsg-1 complete on all eqiad hosts save terbium [00:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:39:09] (03PS2) 10Krinkle: Move /w/static to /static (keeping symlink) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276377 [00:39:10] !log catrope@tin Synchronized php-1.27.0-wmf.16/extensions/MobileFrontend: SWAT (duration: 00m 36s) [00:39:11] (03PS1) 10Krinkle: Update file-system references from /w/static to /static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276392 [00:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:39:38] !log catrope@tin Synchronized php-1.27.0-wmf.16/extensions/ZeroBanner: SWAT (duration: 00m 28s) [00:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:44:48] !log catrope@tin Synchronized php-1.27.0-wmf.15/extensions/MobileFrontend: SWAT (duration: 00m 32s) [00:44:54] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:45:17] !log catrope@tin Synchronized php-1.27.0-wmf.15/extensions/ZeroBanner: SWAT (duration: 00m 28s) [00:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:46:59] dr0ptp4kt, yurik: That's your patches deployed ---^^ [00:47:02] Please verify [00:47:04] RoanKattouw: thx [00:54:33] RoanKattouw, all clear [00:54:36] thx [00:54:36] 6Operations, 10MobileFrontend, 10Traffic, 3Reading-Web-Sprint-67-If, Then, Else...?, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2105424 (10Jdlrobson) @D... [00:56:12] RoanKattouw: looks ok [00:56:25] RoanKattouw: thx [00:59:37] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0] [01:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160310T0100). Please do the needful. [01:03:17] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [01:05:42] 6Operations, 10DNS, 6Discovery, 10Maps, 10Traffic: Redirect wikimaps.org to maps.wikimedia.org - https://phabricator.wikimedia.org/T129428#2105485 (10MaxSem) Per IRC discussion we don't really need/want it? [01:06:54] 6Operations, 10DNS, 6Discovery, 10Maps, 10Traffic: Redirect wikimaps.org to maps.wikimedia.org - https://phabricator.wikimedia.org/T129428#2105487 (10Yurik) 5Open>3declined Better not have it yet :) [01:07:30] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2212 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2105489 (10Papaul) 5Open>3Resolved Icinga shows that the system has been up for a day . I am resolving this task . 
[01:16:24] (03PS1) 10Jforrester: Default to Flow for new talk pages on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276395 [01:21:37] 6Operations, 10MobileFrontend, 10Traffic, 3Reading-Web-Sprint-67-If, Then, Else...?, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2105539 (10MaxSem) I sti... [01:21:51] jdlrobson, ^ [01:48:17] 6Operations, 6Project-Admins, 3DevRel-March-2016: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2105582 (10ori) >>! In T119944#2099111, @Aklapper wrote: >>>! In T119944#2098833, @faidon wrote: >>>>! In T119944#2098829, @Aklapper wrote: >>> A #devops tag was cr... [01:50:58] (03CR) 1020after4: [C: 031] multiversion: Remove logic for branch pointers in /w/static [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276383 (https://phabricator.wikimedia.org/T99096) (owner: 10Krinkle) [01:53:24] (03CR) 10Catrope: [C: 031] Default to Flow for new talk pages on MediaWiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276395 (owner: 10Jforrester) [02:03:03] (03CR) 10Peachey88: "Where has this been discussed or posted to on mw wiki?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/276395 (owner: 10Jforrester) [02:25:49] (03PS1) 10EBernhardson: Create a portal-master dir in beta to serve master branch of portals [puppet] - 10https://gerrit.wikimedia.org/r/276397 [02:27:22] (03CR) 10jenkins-bot: [V: 04-1] Create a portal-master dir in beta to serve master branch of portals [puppet] - 10https://gerrit.wikimedia.org/r/276397 (owner: 10EBernhardson) [02:31:36] (03PS2) 10EBernhardson: Create a portal-master dir in beta to serve master branch of portals [puppet] - 10https://gerrit.wikimedia.org/r/276397 [02:32:41] (03CR) 10jenkins-bot: [V: 04-1] Create a portal-master dir in beta to serve master branch of portals [puppet] - 10https://gerrit.wikimedia.org/r/276397 (owner: 10EBernhardson) [02:33:34] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.15) (duration: 14m 08s) [02:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:37:23] (03PS3) 10EBernhardson: Create a portal-master dir in beta to serve master branch of portals [puppet] - 10https://gerrit.wikimedia.org/r/276397 [02:40:55] (03PS4) 10EBernhardson: Create a portal-master dir in beta to serve master branch of portals [puppet] - 10https://gerrit.wikimedia.org/r/276397 [02:55:00] (03PS5) 10EBernhardson: Create a portal-master dir in beta to serve master branch of portals [puppet] - 10https://gerrit.wikimedia.org/r/276397 [02:59:16] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.16) (duration: 12m 33s) [02:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:08:54] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Mar 10 03:08:54 UTC 2016 (duration 9m 38s) [03:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:10:58] (03PS6) 10EBernhardson: Create a portal-master dir in beta to serve master branch of portals [puppet] - 10https://gerrit.wikimedia.org/r/276397 [03:24:39] (03PS7) 10EBernhardson: 
Create a portal-master dir in beta to serve master branch of portals [puppet] - 10https://gerrit.wikimedia.org/r/276397 [04:10:56] (03PS8) 10EBernhardson: Create a portal-master dir in beta to serve master branch of portals [puppet] - 10https://gerrit.wikimedia.org/r/276397 (https://phabricator.wikimedia.org/T129427) [04:13:50] (03CR) 10EBernhardson: "tested on beta cluster, this appears to be working as expected. It's not as simple as i would like though, we can use this for now and rep" [puppet] - 10https://gerrit.wikimedia.org/r/276397 (https://phabricator.wikimedia.org/T129427) (owner: 10EBernhardson) [04:22:01] 6Operations, 6Labs, 10wikitech.wikimedia.org: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2105742 (10Krenair) Also: * Updated the hostname on the new host from wikitech-static-jessie to wikitech-static - the previous name no longer resolves and sudo was not happy about t... [04:43:27] PROBLEM - cassandra CQL 10.64.0.220:9042 on restbase1001 is CRITICAL: Connection refused [05:06:37] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - failed to fetch timestamp from wikitech-static [05:26:43] (03PS1) 10KartikMistry: Enable Machine Translation for some languages [puppet] - 10https://gerrit.wikimedia.org/r/276405 (https://phabricator.wikimedia.org/T129329) [05:33:47] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0] [05:38:36] (03CR) 10KartikMistry: [C: 04-1] "It should be non-default." [puppet] - 10https://gerrit.wikimedia.org/r/276405 (https://phabricator.wikimedia.org/T129329) (owner: 10KartikMistry) [05:40:11] (03CR) 10Santhosh: [C: 04-1] "This makes all these languages having Yandex as default MT. 
That is not the plan" [puppet] - 10https://gerrit.wikimedia.org/r/276405 (https://phabricator.wikimedia.org/T129329) (owner: 10KartikMistry) [05:40:38] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [05:43:35] andrewbogott: can we permanently kill that alert? [05:43:45] it has been flapping for weeks [05:45:22] I’m not sure. I think chase has things in progress to address the problem, but I don’t know if the alert is helping in the meantime. [05:53:18] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100% [05:53:42] 6Operations, 6Discovery, 10Maps, 10Tilerator, and 2 others: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#2105869 (10Yurik) [05:54:37] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 7.23 ms [06:29:49] 6Operations, 10MobileFrontend, 10Traffic, 3Reading-Web-Sprint-67-If, Then, Else...?, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2105907 (10Jdlrobson) Ca... 
[06:30:17] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:47] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:59] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:07] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:36] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:07] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 3 failures [06:32:38] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:47] PROBLEM - puppet last run on mw2077 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:57] PROBLEM - puppet last run on mw2095 is CRITICAL: CRITICAL: Puppet has 1 failures [06:46:29] 6Operations, 10MobileFrontend, 10Traffic, 3Reading-Web-Sprint-67-If, Then, Else...?, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2105913 (10MaxSem) Cache... 
[06:55:06] PROBLEM - Host bast2001 is DOWN: PING CRITICAL - Packet loss = 100% [06:55:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 204, down: 1, dormant: 0, excluded: 0, unused: 0; xe-5/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave] [06:55:57] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:56:28] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:56:37] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:56:46] RECOVERY - puppet last run on mw1112 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:56:47] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:57:16] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:47] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:57] RECOVERY - puppet last run on mw2095 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:36] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:05:46] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 206, down: 0, dormant: 0, excluded: 0, unused: 0 [07:26:37] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: puppet fail [07:43:45] (03PS2) 10Muehlenhoff: Enable ferm on graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/276152 [07:44:48] (03PS1) 10Pmlineditor: Added filemover and flood user group to bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276410
(https://phabricator.wikimedia.org/T129087) [07:54:28] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:56:35] (03PS2) 10KartikMistry: Enable non-default Machine Translation for some languages [puppet] - 10https://gerrit.wikimedia.org/r/276405 (https://phabricator.wikimedia.org/T129329) [08:00:52] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/276152 (owner: 10Muehlenhoff) [08:05:00] 6Operations, 10Beta-Cluster-Infrastructure, 6Services, 13Patch-For-Review: Move Node.JS services to Jessie and Node 4 - https://phabricator.wikimedia.org/T124989#2105997 (10mobrovac) [08:05:25] 6Operations, 10Beta-Cluster-Infrastructure, 6Services, 13Patch-For-Review: Move Node.JS services to Jessie and Node 4 - https://phabricator.wikimedia.org/T124989#1971690 (10mobrovac) [08:10:07] (03CR) 10Jcrespo: "While xtrabackup differentials are ok, have you considered binlogs? They are easier to apply (maybe not faster). 
Check the recovery proces" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [08:15:23] (03PS2) 10Ema: Add h2_spdy_stats.stp [puppet] - 10https://gerrit.wikimedia.org/r/276233 (https://phabricator.wikimedia.org/T96848) [08:15:36] (03CR) 10Ema: [C: 032 V: 032] Add h2_spdy_stats.stp [puppet] - 10https://gerrit.wikimedia.org/r/276233 (https://phabricator.wikimedia.org/T96848) (owner: 10Ema) [08:20:17] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0] [08:20:50] (03PS1) 10Muehlenhoff: Update to 4.4.5 [debs/linux44] - 10https://gerrit.wikimedia.org/r/276411 [08:22:10] (03CR) 10Muehlenhoff: [C: 032 V: 032] Update to 4.4.5 [debs/linux44] - 10https://gerrit.wikimedia.org/r/276411 (owner: 10Muehlenhoff) [08:27:17] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [08:28:18] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2106020 (10Gilles) >>! In T66214#2105088, @brion wrote: > Currently this requires either an extra HTTPS round... [08:28:35] (03PS1) 10ArielGlenn: filter out summary lines from wget for wikitech dumps copy [puppet] - 10https://gerrit.wikimedia.org/r/276412 [08:29:57] (03CR) 10ArielGlenn: [C: 032] filter out summary lines from wget for wikitech dumps copy [puppet] - 10https://gerrit.wikimedia.org/r/276412 (owner: 10ArielGlenn) [08:30:04] 6Operations, 10DBA: Investigate/decom db2001-db2008 - https://phabricator.wikimedia.org/T125827#2106021 (10jcrespo) > Do you mean they don't need replacement because we already have enough capacity and don't need these old systems? Yes. I would like to decommission the first 8 for parts (mainly disks)- their... 
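Jcrespo's suggestion above — binlogs instead of xtrabackup differentials, because "they are easier to apply" — amounts to point-in-time recovery: restore a base snapshot, then replay the logged writes in order up to a target time. A toy Python model of that replay step, with entirely made-up rows and timestamps (real recovery would feed mysqlbinlog output to the server, not dicts):

```python
from datetime import datetime

def replay_binlog(snapshot, binlog, stop_at):
    """Apply logged (timestamp, key, value) writes to a base snapshot,
    in timestamp order, stopping at the recovery target time."""
    data = dict(snapshot)
    for ts, key, value in sorted(binlog):
        if ts > stop_at:
            break  # point-in-time cutoff
        data[key] = value
    return data

# Invented example: snapshot taken at 10:00, writes logged afterwards.
snapshot = {"row1": "a"}
binlog = [
    (datetime(2016, 3, 10, 10, 5), "row1", "b"),
    (datetime(2016, 3, 10, 10, 20), "row2", "x"),
    (datetime(2016, 3, 10, 10, 40), "row1", "c"),  # past the target, skipped
]
print(replay_binlog(snapshot, binlog, datetime(2016, 3, 10, 10, 30)))
```

The trade-off raised in the review is visible even in the toy: replay is simple and sequential, but it is not necessarily faster than applying a prepared differential backup.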
[08:34:25] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2106026 (10Gilles) >>! In T66214#2105215, @brion wrote: >similar to how not-yet-cached thumbnail URLs are rou... [08:41:10] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave] [08:42:19] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave] [08:42:55] (03PS1) 10Muehlenhoff: Add ferm rules for local carbon relay [puppet] - 10https://gerrit.wikimedia.org/r/276413 [08:47:23] 6Operations, 10Salt: on labcontrol1001, /var/cache/salt has too many files! - https://phabricator.wikimedia.org/T129224#2106035 (10ArielGlenn) Thu Mar 10 08:45:59 UTC 2016 I have run a salt test.ping. I see jobs in the cache from Mar 9 15:01 and nothing earlier. Tonight I'll run another such command and se...
[08:50:58] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [08:51:38] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 [08:54:00] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.082 second response time [08:54:19] RECOVERY - HHVM rendering on mw1107 is OK: HTTP OK: HTTP/1.1 200 OK - 68631 bytes in 0.281 second response time [08:55:19] RECOVERY - HHVM rendering on mw1119 is OK: HTTP OK: HTTP/1.1 200 OK - 68641 bytes in 2.859 second response time [08:55:20] RECOVERY - Apache HTTP on mw1119 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.171 second response time [08:55:49] !log apache2 and hhvm restarted on mw1107, mw1122 and mw1119 [08:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:12:54] (03CR) 10Luke081515: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276410 (https://phabricator.wikimedia.org/T129087) (owner: 10Pmlineditor) [09:15:15] (03PS1) 10Volans: Rebalance external storage topology in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276416 (https://phabricator.wikimedia.org/T127330) [09:15:44] (03PS2) 10Luke081515: Added filemover and flood user group to bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276410 (https://phabricator.wikimedia.org/T129087) (owner: 10Pmlineditor) [09:15:52] (03CR) 10Luke081515: [C: 031] Added filemover and flood user group to bnwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276410 (https://phabricator.wikimedia.org/T129087) (owner: 10Pmlineditor) [09:18:13] 6Operations, 6Analytics-Kanban, 10Traffic, 13Patch-For-Review: varnishkafka integration with Varnish 4 for analytics - https://phabricator.wikimedia.org/T124278#2106071 (10elukey) Update: 1) completed the porting of the new tags. 
After a chat with Brandon and Ema we decided to use only the "client" tags... [09:24:42] (03PS1) 10Muehlenhoff: Add ferm rules for saltmaster [puppet] - 10https://gerrit.wikimedia.org/r/276419 [09:29:49] PROBLEM - HHVM rendering on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.011 second response time [09:31:19] (03CR) 10Jcrespo: [C: 031] "Aaron, Krinkle: sadly, there is no compromise between local masters and non local masters." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo) [09:31:19] PROBLEM - Apache HTTP on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.010 second response time [09:33:09] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.984 second response time [09:33:29] RECOVERY - HHVM rendering on mw1107 is OK: HTTP OK: HTTP/1.1 200 OK - 68656 bytes in 2.808 second response time [09:34:11] (03PS1) 10Muehlenhoff: Add ferm rules for salt master/labs [puppet] - 10https://gerrit.wikimedia.org/r/276420 [09:35:58] (03CR) 10Jcrespo: "And this is why having a long conversation about this is not really needed (and why I need to take the pooling/depooling outside of mediaw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo) [09:46:20] (03PS1) 10Muehlenhoff: Enable base::firewall on graphite1002 [puppet] - 10https://gerrit.wikimedia.org/r/276422 [09:47:54] (03CR) 10Filippo Giunchedi: [C: 031] Enable base::firewall on graphite1002 [puppet] - 10https://gerrit.wikimedia.org/r/276422 (owner: 10Muehlenhoff) [09:48:00] RECOVERY - HHVM rendering on mw1122 is OK: HTTP OK: HTTP/1.1 200 OK - 68654 bytes in 0.301 second response time [09:48:09] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.046 second response time [09:48:09] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall on graphite1002 [puppet] - 
10https://gerrit.wikimedia.org/r/276422 (owner: 10Muehlenhoff) [09:48:36] (03CR) 10Volans: [C: 032] Rebalance external storage topology in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276416 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [09:49:00] (03Merged) 10jenkins-bot: Rebalance external storage topology in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276416 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [09:50:16] !log restbase deploy start of 26bd4aa28 on restbase1001 [09:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:50:52] !log volans@tin Synchronized wmf-config/db-codfw.php: Rebalance external storage servers in codfw T127330 (duration: 00m 34s) [09:50:53] T127330: Migration from es2001-es2010 to es2011-es2019 - https://phabricator.wikimedia.org/T127330 [09:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:55:03] (03CR) 10Aaron Schulz: [C: 031] Prepare db-codfw.php for a live deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo) [09:55:49] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2106089 (10Tgr) >>! In T66214#2106020, @Gilles wrote: > What data is fetched and used before knowing what thu... [09:57:55] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2212 offline - troubleshoot - https://phabricator.wikimedia.org/T129196#2106098 (10jcrespo) 5Resolved>3Open a:5Papaul>3None This needs proper testing before pooling it again. 
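For context on the ferm changes being merged in this stretch of the log ("Add ferm rules for local carbon relay", "Enable base::firewall on graphite1002"): in the operations/puppet tree such a rule is typically a small `ferm::service` resource. A rough sketch only — the resource title, port, and source range below are illustrative guesses, not the contents of the actual patches:

```puppet
# Hypothetical ferm rule letting a local carbon relay accept
# plaintext line-protocol metrics from internal networks.
ferm::service { 'carbon-local-relay':
    proto  => 'tcp',
    port   => '2003',
    srange => '$DOMAIN_NETWORKS',
}
```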
[10:00:37] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2212 had several downtimes recently - test before repool - https://phabricator.wikimedia.org/T129196#2106104 (10jcrespo) [10:01:30] 6Operations, 10ops-codfw, 13Patch-For-Review: mw2212 had several downtimes recently - test before repool - https://phabricator.wikimedia.org/T129196#2097723 (10jcrespo) Remember to add it to mediawiki installation and run common sync before pooling it! ( I have acked the dsh alert) [10:03:42] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [10:06:47] 6Operations, 10DBA: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T129452#2106136 (10Volans) [10:07:17] 6Operations, 10DBA: External Storage on codfw (es2005-2010) is consuming 100-90GB of disk space per server and per month and it has 370GB available - https://phabricator.wikimedia.org/T119056#2106156 (10Volans) [10:07:55] 6Operations, 10DBA: External Storage on codfw (es2005-2010) is consuming 100-90GB of disk space per server and per month and it has 370GB available - https://phabricator.wikimedia.org/T119056#1816735 (10Volans) 5Open>3Resolved es2001-es2010 are out of MediaWiki config. Scheduling them for decommissioning i... 
[10:09:16] !log decommissioning restbase1002.eqiad.wmnet : T125842 [10:09:17] T125842: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842 [10:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:09:36] (03PS9) 10Gehel: Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) [10:10:42] ACKNOWLEDGEMENT - cassandra CQL 10.64.0.220:9042 on restbase1001 is CRITICAL: Connection refused Filippo Giunchedi decommissioned [10:12:40] (03CR) 10Filippo Giunchedi: [C: 031] Introduce the SC[AB] clusters in codfw [puppet] - 10https://gerrit.wikimedia.org/r/276153 (https://phabricator.wikimedia.org/T129234) (owner: 10Mobrovac) [10:13:48] (03CR) 10Filippo Giunchedi: [C: 031] Add ferm rules for local carbon relay [puppet] - 10https://gerrit.wikimedia.org/r/276413 (owner: 10Muehlenhoff) [10:16:51] (03PS1) 10Gehel: Make r8s module use base::expose_puppet_certs [puppet] - 10https://gerrit.wikimedia.org/r/276427 (https://phabricator.wikimedia.org/T124444) [10:18:17] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2106182 (10Tgr) >>! In T66214#2106026, @Gilles wrote: > Which means that a default implementation in mediawik... [10:18:34] (03CR) 10Aklapper: "ArielGlenn: Who could make a decision (-1, +1, +2, abandon?) here, to stop having this sixteen month old patch rotting here? 
Or should thi" [software] - 10https://gerrit.wikimedia.org/r/174408 (owner: 10ArielGlenn) [10:18:59] (03PS2) 10Muehlenhoff: Add ferm rules for local carbon relay [puppet] - 10https://gerrit.wikimedia.org/r/276413 [10:19:16] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for local carbon relay [puppet] - 10https://gerrit.wikimedia.org/r/276413 (owner: 10Muehlenhoff) [10:27:48] (03PS3) 10ArielGlenn: audit ssh key use on production cluster [software] - 10https://gerrit.wikimedia.org/r/174408 [10:28:56] (03CR) 10ArielGlenn: "Added WIP, the audit related code is stalled for right now til the last salt stuff is off my plate." [software] - 10https://gerrit.wikimedia.org/r/174408 (owner: 10ArielGlenn) [10:29:53] 6Operations, 6Commons, 10media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#2106201 (10fgiunchedi) @tgr, thanks that makes sense. I was looking at the blocker chain of imagescaling in beta (e.g. https://phabricator.wikimedia.org/T84950#1711999) and T64835 is at the bo... [10:31:51] !log setting up the rest of the cross-datacenter master-master connections pending in wmf databases [10:31:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:33:25] (03PS4) 10Ema: Port varnishlog to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/274946 (https://phabricator.wikimedia.org/T128788) [10:36:28] 6Operations, 6Commons, 10media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#2106210 (10Tgr) T84950 is a mixed bag of things, most not broken out into a separate task. The second and third bullet point of that list is probably what's relevant here. I would like to work... 
[10:39:51] (03PS1) 10Muehlenhoff: Fix copy&paste error in carbon ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/276429 [10:39:57] !log restbase deploy end of 26bd4aa28 [10:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:40:38] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix copy&paste error in carbon ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/276429 (owner: 10Muehlenhoff) [10:41:00] PROBLEM - MariaDB Slave IO: es3 on es1019 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1045, Errmsg: error connecting to master repl@es2018.codfw.wmnet:3306 - retry-time: 60 retries: 86400 message: Access denied for user repl@10.64.48.116 (using password: YES) [10:41:15] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [10:41:29] that is me, there is some permission error on es2018 [10:41:34] :) [10:41:37] noticed [10:41:40] no user impact [10:42:06] I am checking it [10:43:06] ah [10:43:08] I know [10:43:26] forgot the TLS, and replication requires it [10:43:42] so it went as it should [10:43:45] yes the grant is require SSL [10:45:16] fixed [10:45:53] half, if not all of the problems we have would be fixed if things could be applied instantly, and not over 1 year's time [10:46:29] RECOVERY - MariaDB Slave IO: es3 on es1019 is OK: OK slave_io_state Slave_IO_Running: Yes [10:47:50] (03PS1) 10Muehlenhoff: Enable base::firewall on graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/276432 [10:48:48] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "no problems in 3 weeks, removing remnats of cxserver config from sca" [puppet] - 10https://gerrit.wikimedia.org/r/270911 (owner: 10Alexandros Kosiaris) [10:48:54] (03PS3) 10Alexandros Kosiaris: cxserver: Remove from SCA nodes in conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/270911 [10:49:04] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] cxserver: Remove from SCA nodes in conftool-data [puppet] - 
10https://gerrit.wikimedia.org/r/270911 (owner: 10Alexandros Kosiaris) [10:49:14] what do you mean by that? [10:49:57] (03PS3) 10Alexandros Kosiaris: cxserver: Remove from services conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/270912 [10:50:02] we cannot depool all our servers and apply puppet :-) [10:50:04] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] cxserver: Remove from services conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/270912 (owner: 10Alexandros Kosiaris) [10:50:58] (03PS3) 10Alexandros Kosiaris: cxserver: Remove LVS IP from SCA [puppet] - 10https://gerrit.wikimedia.org/r/270913 [10:51:18] that's why I prefer more stateless things [10:51:21] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] cxserver: Remove LVS IP from SCA [puppet] - 10https://gerrit.wikimedia.org/r/270913 (owner: 10Alexandros Kosiaris) [10:51:24] changes to replication require playing tetris if they are incompatible with each other, such as TLS [10:51:26] granted, you can't do it there either, but it's still a lottt quicker typically ;-) [10:51:44] on non-stateless servers, I agree [10:51:58] but most of mediawiki state is in MariaDB :-) [10:52:43] that is, I am always the blocker when things require restarts - I have to do it one by one (or one shard at a time) [10:53:04] (03PS1) 10ArielGlenn: add wikidataclient and flow dblists for dumps config [puppet] - 10https://gerrit.wikimedia.org/r/276434 [10:53:06] (03PS1) 10Alexandros Kosiaris: Remove cxserver role from SCA [puppet] - 10https://gerrit.wikimedia.org/r/276435 [10:53:11] things will be waaaaaay easier when we could depool whole datacenters at a time [10:53:14] the sate has to be somewhere ;) [10:53:20] s/sate/state/ [10:53:46] jynus: we understand that, no worries ;) [10:54:11] (03PS2) 10ArielGlenn: add wikidataclient and flow dblists for dumps config [puppet] - 10https://gerrit.wikimedia.org/r/276434 [10:55:00] hey, allow me to complain (to no one) just for the sake of it!
:-) Also, it makes my job more interesting [10:55:32] (03CR) 10ArielGlenn: [C: 032] add wikidataclient and flow dblists for dumps config [puppet] - 10https://gerrit.wikimedia.org/r/276434 (owner: 10ArielGlenn) [10:55:32] pets, no cattle! [10:55:35] (03PS2) 10Alexandros Kosiaris: Remove cxserver role from SCA [puppet] - 10https://gerrit.wikimedia.org/r/276435 [10:55:42] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Remove cxserver role from SCA [puppet] - 10https://gerrit.wikimedia.org/r/276435 (owner: 10Alexandros Kosiaris) [11:01:33] (03PS1) 10ArielGlenn: dumps needs to use the noflow.dblist file for config [puppet] - 10https://gerrit.wikimedia.org/r/276436 [11:02:16] (03PS2) 10ArielGlenn: dumps needs to use the noflow.dblist file for config [puppet] - 10https://gerrit.wikimedia.org/r/276436 [11:03:36] (03CR) 10ArielGlenn: [C: 032] dumps needs to use the noflow.dblist file for config [puppet] - 10https://gerrit.wikimedia.org/r/276436 (owner: 10ArielGlenn) [11:08:30] (03PS1) 10ArielGlenn: fix typo in wiki dumps config [puppet] - 10https://gerrit.wikimedia.org/r/276437 [11:09:52] (03CR) 10ArielGlenn: [C: 032] fix typo in wiki dumps config [puppet] - 10https://gerrit.wikimedia.org/r/276437 (owner: 10ArielGlenn) [11:15:32] (03PS5) 10Ema: Port varnishlog to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/274946 (https://phabricator.wikimedia.org/T128788) [11:18:07] 6Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, and 3 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#2106324 (10Addshore) For easy access from the ticket the dashboard for the above patchset is at https://graf... [11:27:37] 6Operations, 6Discovery, 10hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2106339 (10Gehel) I'm not convinced by raid10 for elasticsearch. 
Elasticsearch itself provides redundancy (shard replicated on multiple nodes, multiple masters, ...). I usua... [11:35:01] (03PS1) 10Elukey: First draft for the Varnish 4 porting. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) [11:39:40] (03PS2) 10Alexandros Kosiaris: lvs: SC[AB] services lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/276199 (https://phabricator.wikimedia.org/T129234) [11:39:42] (03PS3) 10Alexandros Kosiaris: Introduce the SC[AB] clusters in codfw [puppet] - 10https://gerrit.wikimedia.org/r/276153 (https://phabricator.wikimedia.org/T129234) (owner: 10Mobrovac) [11:40:27] PROBLEM - HHVM rendering on mw1122 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.006 second response time [11:41:25] PROBLEM - Apache HTTP on mw1122 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.018 second response time [11:41:42] 6Operations, 10Monitoring, 13Patch-For-Review, 7Tracking: consolidate graphite metrics monitoring frontends into grafana - https://phabricator.wikimedia.org/T125644#2106358 (10fgiunchedi) see also https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org#Save_dashboards_in_puppet for instructions [11:47:16] (03CR) 10Elukey: "Whitespace/tabs issue while applying the diff, the patch is incomplete. Working on it." [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [11:57:01] (03PS2) 10Elukey: First draft for the Varnish 4 porting. 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) [12:00:09] (03PS1) 10Filippo Giunchedi: grafana: import swift [puppet] - 10https://gerrit.wikimedia.org/r/276440 (https://phabricator.wikimedia.org/T125644) [12:01:06] 6Operations, 6Performance-Team, 7Performance: HHVM 3.12 has a race-condition when starting up - https://phabricator.wikimedia.org/T129467#2106462 (10Joe) [12:01:38] <_joe_> !log restarted hhvm on mw1122, stuck at startup [12:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:02:02] <_joe_> whoever has to restart HHVM this week, please look at this ticket ^^ and maybe add info [12:02:46] (03PS2) 10Filippo Giunchedi: grafana: import swift [puppet] - 10https://gerrit.wikimedia.org/r/276440 (https://phabricator.wikimedia.org/T125644) [12:02:48] (03PS5) 10Filippo Giunchedi: grafana: import varnish-http-errors [puppet] - 10https://gerrit.wikimedia.org/r/268085 (https://phabricator.wikimedia.org/T125644) [12:03:47] 6Operations, 10Traffic, 13Patch-For-Review: Port varnishlog.py to new VSL API - https://phabricator.wikimedia.org/T128788#2106494 (10ema) varnishreqstats seems to be relatively easy to port: https://phabricator.wikimedia.org/P2736 The only part I'm not really sure about is the fact that ReqEnd is now gone.... [12:06:02] (03PS6) 10Filippo Giunchedi: grafana: import varnish-http-errors [puppet] - 10https://gerrit.wikimedia.org/r/268085 (https://phabricator.wikimedia.org/T125644) [12:06:09] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] grafana: import varnish-http-errors [puppet] - 10https://gerrit.wikimedia.org/r/268085 (https://phabricator.wikimedia.org/T125644) (owner: 10Filippo Giunchedi) [12:09:22] (03CR) 10Elukey: [C: 04-1] First draft for the Varnish 4 porting. 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [12:10:59] (03PS3) 10Filippo Giunchedi: grafana: import swift [puppet] - 10https://gerrit.wikimedia.org/r/276440 (https://phabricator.wikimedia.org/T125644) [12:11:06] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] grafana: import swift [puppet] - 10https://gerrit.wikimedia.org/r/276440 (https://phabricator.wikimedia.org/T125644) (owner: 10Filippo Giunchedi) [12:11:44] 7Puppet, 5Continuous-Integration-Scaling, 13Patch-For-Review: Hiera is not properly configured on Nodepool instances - https://phabricator.wikimedia.org/T129092#2106522 (10hashar) 5Open>3Resolved I have used the updated /utils/hiera_lookup on a Nodepool instance and it seems to work. ``` lang=yaml $ cat... [12:33:23] (03PS3) 10Elukey: First draft for the Varnish 4 porting. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) [12:33:36] (03PS4) 10Aklapper: [WIP] audit ssh key use on production cluster [software] - 10https://gerrit.wikimedia.org/r/174408 (owner: 10ArielGlenn) [12:42:10] 7Blocked-on-Operations, 10Datasets-Archiving, 10Dumps-Generation, 10Flow, 3Collaboration-Team-Current: Publish recurring Flow dumps at http://dumps.wikimedia.org/ - https://phabricator.wikimedia.org/T119511#2106578 (10ArielGlenn) I'm trying to run the flow maintenance script from the command line on an a... [12:42:30] 6Operations, 6Discovery, 10hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2106580 (10dcausse) @chasemp do you remember why we used a different RAID setup for codfw? If I remember correctly it was mostly because we did not find the perfect SSD size... 
[12:44:46] PROBLEM - HHVM rendering on mw1252 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.023 second response time [12:45:55] PROBLEM - Apache HTTP on mw1252 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.007 second response time [12:47:26] PROBLEM - Apache HTTP on mw1091 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.015 second response time [12:47:45] PROBLEM - HHVM rendering on mw1091 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [12:50:45] PROBLEM - HHVM rendering on mw1249 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.009 second response time [12:51:55] PROBLEM - Apache HTTP on mw1249 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.006 second response time [12:53:45] PROBLEM - HHVM rendering on mw1217 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.026 second response time [12:53:46] PROBLEM - Apache HTTP on mw1217 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.006 second response time [12:55:21] _joe_: ^ ? [12:56:11] !log restarting hhvm on mw1217 [12:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:56:36] PROBLEM - HHVM rendering on mw1242 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.017 second response time [12:57:16] RECOVERY - HHVM rendering on mw1217 is OK: HTTP OK: HTTP/1.1 200 OK - 68652 bytes in 0.097 second response time [12:57:16] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.050 second response time [12:57:45] PROBLEM - Apache HTTP on mw1242 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.006 second response time [12:58:21] I don't think mw1122 ever recovered from its restart either earlier? 
[12:59:42] !log restarting hhvm on mw1091 mw1249 mw1252 mw1242 ... [12:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:59:55] RECOVERY - Apache HTTP on mw1091 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.046 second response time [13:00:06] PROBLEM - Apache HTTP on mw1243 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.011 second response time [13:00:06] RECOVERY - HHVM rendering on mw1091 is OK: HTTP OK: HTTP/1.1 200 OK - 68653 bytes in 0.121 second response time [13:00:17] PROBLEM - HHVM rendering on mw1243 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [13:00:45] (03PS10) 10Gehel: Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) [13:00:55] RECOVERY - Apache HTTP on mw1249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.038 second response time [13:01:25] RECOVERY - Apache HTTP on mw1242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.055 second response time [13:01:26] RECOVERY - HHVM rendering on mw1249 is OK: HTTP OK: HTTP/1.1 200 OK - 68652 bytes in 0.085 second response time [13:01:28] !log restarting hhvm on mw1243 [13:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:01:56] RECOVERY - Apache HTTP on mw1252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.032 second response time [13:01:57] RECOVERY - Apache HTTP on mw1243 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.514 second response time [13:02:07] RECOVERY - HHVM rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 68652 bytes in 0.086 second response time [13:02:07] RECOVERY - HHVM rendering on mw1243 is OK: HTTP OK: HTTP/1.1 200 OK - 68654 bytes in 0.324 second response time [13:02:10] (03CR) 10jenkins-bot: [V: 04-1] Factorized code exposing Puppet SSL 
certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [13:02:26] PROBLEM - Apache HTTP on mw1258 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.007 second response time [13:02:35] RECOVERY - HHVM rendering on mw1252 is OK: HTTP OK: HTTP/1.1 200 OK - 68652 bytes in 0.078 second response time [13:02:50] !log restarting hhvm on mw1258 ... [13:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:02:56] (03PS11) 10Gehel: Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) [13:02:56] PROBLEM - HHVM rendering on mw1258 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.017 second response time [13:04:15] RECOVERY - Apache HTTP on mw1258 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.025 second response time [13:04:17] (03CR) 10jenkins-bot: [V: 04-1] Factorized code exposing Puppet SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [13:04:45] RECOVERY - HHVM rendering on mw1258 is OK: HTTP OK: HTTP/1.1 200 OK - 68652 bytes in 0.073 second response time [13:05:25] PROBLEM - HHVM rendering on mw1251 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.022 second response time [13:06:24] !log restarting hhvm on mw1251 ... [13:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:06:46] PROBLEM - Apache HTTP on mw1251 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.028 second response time [13:07:15] RECOVERY - HHVM rendering on mw1251 is OK: HTTP OK: HTTP/1.1 200 OK - 68654 bytes in 0.228 second response time [13:08:35] (03PS2) 10BBlack: Enable varnish caching for related pages. 
[puppet] - 10https://gerrit.wikimedia.org/r/276254 (https://phabricator.wikimedia.org/T125983) (owner: 10Ppchelko) [13:08:36] RECOVERY - Apache HTTP on mw1251 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.030 second response time [13:08:43] (03CR) 10BBlack: [C: 032 V: 032] Enable varnish caching for related pages. [puppet] - 10https://gerrit.wikimedia.org/r/276254 (https://phabricator.wikimedia.org/T125983) (owner: 10Ppchelko) [13:08:47] PROBLEM - Apache HTTP on mw1248 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.007 second response time [13:09:46] PROBLEM - HHVM rendering on mw1248 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.044 second response time [13:11:16] PROBLEM - HHVM rendering on mw1151 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [13:11:50] (03PS4) 10Elukey: First draft for the Varnish 4 porting. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) [13:12:16] PROBLEM - Apache HTTP on mw1151 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.009 second response time [13:12:41] !log restarting hhvm on mw1248, mw1151 [13:12:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:13:23] (03PS5) 10Elukey: First draft for the Varnish 4 porting. 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) [13:13:26] RECOVERY - HHVM rendering on mw1248 is OK: HTTP OK: HTTP/1.1 200 OK - 68654 bytes in 0.296 second response time [13:14:05] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.045 second response time [13:14:06] PROBLEM - HHVM rendering on mw1112 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.003 second response time [13:14:16] RECOVERY - Apache HTTP on mw1248 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.045 second response time [13:14:16] PROBLEM - Apache HTTP on mw1112 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.005 second response time [13:14:47] RECOVERY - HHVM rendering on mw1151 is OK: HTTP OK: HTTP/1.1 200 OK - 68654 bytes in 0.158 second response time [13:15:22] !log restarting hhvm on mw1112 [13:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:22] !log mw1122: restarted hhvm, killed stale procs from Jan24 looking like: /usr/bin/hhvm --php -c /etc/hhvm/fcgi.ini -r echo ini_get("hhvm.jit_warmup_requests")?:11; [13:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:26] PROBLEM - HHVM rendering on mw1240 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.005 second response time [13:17:45] RECOVERY - HHVM rendering on mw1112 is OK: HTTP OK: HTTP/1.1 200 OK - 68654 bytes in 0.130 second response time [13:17:55] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.080 second response time [13:18:05] PROBLEM - Apache HTTP on mw1240 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.013 second response time [13:18:34] * bblack feels like he might be better off running a script that sshes to a random rendering machine
and restarts hhvm every 30 seconds... [13:19:09] !log mw1240: restarted hhvm [13:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:19:17] RECOVERY - HHVM rendering on mw1240 is OK: HTTP OK: HTTP/1.1 200 OK - 68654 bytes in 0.786 second response time [13:19:55] RECOVERY - Apache HTTP on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.040 second response time [13:20:32] !log mw1188: restarted hhvm (before alert hit IRC, was already pending in icinga) [13:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:22:14] 6Operations, 6Performance-Team, 7Performance: HHVM 3.12 has a race-condition when starting up - https://phabricator.wikimedia.org/T129467#2106462 (10BBlack) Rendering hhvms keep falling over like crazy. This is all reaction to icinga alerts on IRC basically: ``` 13:20 bblack: mw1188: restarted hhvm (before... [13:22:15] PROBLEM - Apache HTTP on mw1210 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.009 second response time [13:23:55] PROBLEM - HHVM rendering on mw1210 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.005 second response time [13:25:06] "rendering" is confusing BTW, given we have an imagescaler pool which backs "rendering.svc.eqiad.wmnet" :P [13:25:16] PROBLEM - Apache HTTP on mw1241 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.009 second response time [13:25:25] PROBLEM - HHVM rendering on mw1241 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [13:25:40] !log mw1241: restarted hhvm [13:25:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:25:45] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/274382 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [13:26:35] !log mw1210: restarted hhvm [13:26:39] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:26:58] clearly basically all of them are going to fail. I don't know if a pre-emptive restart would help, of the ones not already restarted [13:27:06] RECOVERY - Apache HTTP on mw1241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.026 second response time [13:27:07] RECOVERY - HHVM rendering on mw1241 is OK: HTTP OK: HTTP/1.1 200 OK - 68652 bytes in 0.071 second response time [13:27:22] <_joe_> I think a rollback to 3.6.5 is needed [13:27:26] RECOVERY - HHVM rendering on mw1210 is OK: HTTP OK: HTTP/1.1 200 OK - 68653 bytes in 0.114 second response time [13:27:29] (03PS1) 10Elukey: Remove rdb1001 from the Redis Job Queues for maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276452 (https://phabricator.wikimedia.org/T123675) [13:27:30] <_joe_> If I'm needed, let me know [13:27:36] RECOVERY - Apache HTTP on mw1210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.034 second response time [13:28:08] _joe_: they keep dying like clockwork, but so far no single host has died twice [13:28:28] <_joe_> bblack: or.i upgraded the cluster to HHVM 3.12 [13:28:40] <_joe_> I think we might want to rollback [13:29:34] (03Abandoned) 10Gehel: Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/273254 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [13:29:39] <_joe_> but that is partly a problem now that everything but terbium has been upgraded [13:30:41] so far counting from the one you saw earlier (1122), we've had icinga alert -> restart fixed it for 16/179 mw* [13:31:15] (well my 179 is counting appservers and api_appservers from etcd in eqiad) [13:31:50] 1122 specifically doesn't seem to want to come back, and had some crazy stuck JIT thing. I'll try one more restart there... 
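[Editor's note] The script bblack jokes about above at 13:18:34, and the manual pattern running through this whole log (health probe returns 503, operator restarts hhvm on that host), can be sketched roughly as follows. Everything here is illustrative, not the production tooling: host names and probe results are made up, and in production the probing was done by icinga and the restarts by hand.

```python
# Sketch of the "restart hhvm wherever the health probe 503s" loop discussed
# in this log. Host names and probe results below are hypothetical.

def hosts_needing_restart(status_by_host):
    """Return, sorted, the hosts whose last HTTP health probe returned 503."""
    return sorted(host for host, code in status_by_host.items() if code == 503)

if __name__ == "__main__":
    probes = {"mw1122": 503, "mw1151": 200, "mw1248": 503, "mw1240": 200}
    for host in hosts_needing_restart(probes):
        # A real loop would do something like: ssh <host> 'service hhvm restart'
        print("would restart hhvm on", host)
```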
[13:32:07] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.599 second response time [13:32:11] !log restarted mw1122 hhvm [13:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:33:06] RECOVERY - HHVM rendering on mw1122 is OK: HTTP OK: HTTP/1.1 200 OK - 68681 bytes in 0.164 second response time [13:33:45] <_joe_> bblack: it worked it seems [13:33:55] <_joe_> but clearly hhvm is crashing too often [13:46:49] (03PS3) 10Alexandros Kosiaris: lvs: SC[AB] services lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/276199 (https://phabricator.wikimedia.org/T129234) [13:46:51] (03PS4) 10Alexandros Kosiaris: Introduce the SC[AB] clusters in codfw [puppet] - 10https://gerrit.wikimedia.org/r/276153 (https://phabricator.wikimedia.org/T129234) (owner: 10Mobrovac) [13:57:25] it's kind of fascinating that it has stopped now [13:57:33] (the hhvm deaths) [13:57:40] at ~10% of them [13:57:59] was a random 10% subset set aside for some kind of testing and doing something different than the others? [13:58:05] <_joe_> bblack: It's kind of expected [13:58:24] <_joe_> bblack: nope, as I said before, tonight or.i upgraded the fleet to 3.12 [13:58:33] <_joe_> that race condition happens randomly [13:58:40] <_joe_> when restarting from crashes [13:59:03] <_joe_> that's what I'm getting, but I'm at a conference and I'm not really in the ideal environment to investigate [13:59:16] PROBLEM - Apache HTTP on mw1122 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.005 second response time [13:59:47] <_joe_> this is down again, wow [14:00:00] <_joe_> bblack: can I take a look? 
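[Editor's note] The two counts quoted above agree with each other: 16 restarts out of the 179 eqiad appservers counted at 13:30, and the "~10% of them" estimate at 13:57. A trivial arithmetic check:

```python
# Quick arithmetic on the counts quoted above: 16 of the ~179 eqiad
# appservers needed an hhvm restart, matching the "~10% of them" remark.
n_hosts = 179
n_restarted = 16
rate = n_restarted / n_hosts
print(round(rate * 100, 1))  # prints 8.9
```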
[14:00:16] PROBLEM - HHVM rendering on mw1122 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.014 second response time [14:00:43] go for it [14:01:14] (03PS4) 10Alexandros Kosiaris: lvs: SC[AB] services lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/276199 (https://phabricator.wikimedia.org/T129234) [14:01:16] (03PS5) 10Alexandros Kosiaris: Introduce the SC[AB] clusters in codfw [puppet] - 10https://gerrit.wikimedia.org/r/276153 (https://phabricator.wikimedia.org/T129234) (owner: 10Mobrovac) [14:02:30] (03PS3) 10Gehel: Expose elasticsearch through HTTP [puppet] - 10https://gerrit.wikimedia.org/r/274711 (https://phabricator.wikimedia.org/T124444) [14:03:02] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2106748 (10faidon) >>! In T66214#2105088, @brion wrote: > Quick note from IRC regarding the thumb-URL needs f... 
[14:05:25] (03PS3) 10Muehlenhoff: Add systemd unit for logstash [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) [14:08:00] !log increasing outbound stream throughput on restbase1002.eqiad.wmnet to 200mbps : T125842 [14:08:01] T125842: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842 [14:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:08:10] 6Operations, 10Continuous-Integration-Config, 10Dumps-Generation, 13Patch-For-Review, 7WorkType-Maintenance: operations/dumps repo should pass flake8 - https://phabricator.wikimedia.org/T114249#2106758 (10hashar) 5Open>3declined Not much interest [14:08:31] (03PS6) 10Ema: Port varnishlog to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/274946 (https://phabricator.wikimedia.org/T128788) [14:10:07] 6Operations, 10Continuous-Integration-Config, 10Dumps-Generation, 13Patch-For-Review, 7WorkType-Maintenance: operations/dumps repo should pass flake8 - https://phabricator.wikimedia.org/T114249#2106764 (10ArielGlenn) Don't despair. I have still on my roadmap to lint the strictly dump-related scripts (no... 
[14:11:23] (03CR) 10Ottomata: [C: 031] "COoOOol" [puppet] - 10https://gerrit.wikimedia.org/r/274946 (https://phabricator.wikimedia.org/T128788) (owner: 10Ema) [14:13:56] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.091 second response time [14:14:55] RECOVERY - HHVM rendering on mw1122 is OK: HTTP OK: HTTP/1.1 200 OK - 68681 bytes in 0.356 second response time [14:15:39] <_joe_> !log restarted multiple times hhvm on mw112; it didn't help, so removed manually the warmup upstart task and the old sqlite cache; also disabling puppet on the machine [14:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:15:48] 6Operations, 10Continuous-Integration-Infrastructure, 10netops: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#2106820 (10hashar) [14:24:35] PROBLEM - Apache HTTP on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.003 second response time [14:24:36] PROBLEM - HHVM rendering on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [14:29:02] !log mw1107: restarted hhvm [14:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:30:06] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.781 second response time [14:30:15] RECOVERY - HHVM rendering on mw1107 is OK: HTTP OK: HTTP/1.1 200 OK - 68658 bytes in 2.922 second response time [14:30:26] (03CR) 10Elukey: Port varnishlog to new VSL API (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/274946 (https://phabricator.wikimedia.org/T128788) (owner: 10Ema) [14:30:37] (03PS1) 10Alexandros Kosiaris: Fix maps-test200{1,2,3,4} role classes [puppet] - 10https://gerrit.wikimedia.org/r/276465 [14:32:29] 6Operations, 10Beta-Cluster-Infrastructure, 6Services, 13Patch-For-Review: Move Node.JS services 
to Jessie and Node 4 - https://phabricator.wikimedia.org/T124989#2106900 (10hashar) [14:36:15] (03CR) 10Ottomata: "COOlOoOOL!" (033 comments) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [14:37:18] ---^ best review message ever [14:38:25] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (47306 200000s) [14:40:52] 6Operations, 10DBA, 13Patch-For-Review: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#1705156 (10jcrespo) [14:41:13] 6Operations, 10DBA, 13Patch-For-Review: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#1705156 (10jcrespo) Adding graphite, too as a TODO. [14:45:22] (03CR) 10Elukey: First draft for the Varnish 4 porting. (033 comments) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) (owner: 10Elukey) [14:48:38] (03PS6) 10Elukey: First draft for the Varnish 4 porting. 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) [14:48:45] (03PS3) 10Filippo Giunchedi: prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) [14:50:11] (03CR) 10jenkins-bot: [V: 04-1] prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) (owner: 10Filippo Giunchedi) [14:56:41] (03PS2) 10Alexandros Kosiaris: Fix maps-test200{1,2,3,4} role classes [puppet] - 10https://gerrit.wikimedia.org/r/276465 [14:56:56] (03PS3) 10Alexandros Kosiaris: Fix maps-test200{1,2,3,4} role classes [puppet] - 10https://gerrit.wikimedia.org/r/276465 [14:57:02] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Fix maps-test200{1,2,3,4} role classes [puppet] - 10https://gerrit.wikimedia.org/r/276465 (owner: 10Alexandros Kosiaris) [15:03:34] (03PS1) 10Alexandros Kosiaris: maps: Also move hiera config files to match maps::server [puppet] - 10https://gerrit.wikimedia.org/r/276471 [15:04:59] (03PS2) 10Alexandros Kosiaris: maps: Also move hiera config files to match maps::server [puppet] - 10https://gerrit.wikimedia.org/r/276471 [15:06:31] 6Operations, 6Performance-Team, 7Performance: HHVM 3.12 has a race-condition when starting up - https://phabricator.wikimedia.org/T129467#2107062 (10Joe) A wild guess: our hhvm-warmup job might have something to do with this - on mw1122 I found a hanging hhvm process created by the ``` /usr/bin/hhvm --php... 
[15:06:55] (03CR) 10Gehel: Add systemd unit for logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) (owner: 10Muehlenhoff) [15:09:04] (03PS3) 10Alexandros Kosiaris: maps: Also move hiera config files to match maps::server [puppet] - 10https://gerrit.wikimedia.org/r/276471 [15:09:59] (03CR) 10Alexandros Kosiaris: [C: 032] "https://puppet-compiler.wmflabs.org/2007/ says what I expect to see, so merging" [puppet] - 10https://gerrit.wikimedia.org/r/276153 (https://phabricator.wikimedia.org/T129234) (owner: 10Mobrovac) [15:10:06] (03PS6) 10Alexandros Kosiaris: Introduce the SC[AB] clusters in codfw [puppet] - 10https://gerrit.wikimedia.org/r/276153 (https://phabricator.wikimedia.org/T129234) (owner: 10Mobrovac) [15:10:52] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2107104 (10Gehel) Patch deployed on beta cluster (again). This time it does not include any dependency to k8s, so I should be able to push... 
[15:11:17] PROBLEM - Apache HTTP on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.022 second response time [15:11:35] PROBLEM - HHVM rendering on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.011 second response time [15:11:39] (03PS2) 10Muehlenhoff: Enable base::firewall on graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/276432 [15:12:15] !log restart hhvm on mw1107 [15:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:12:55] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable base::firewall on graphite1001 [puppet] - 10https://gerrit.wikimedia.org/r/276432 (owner: 10Muehlenhoff) [15:12:57] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce the SC[AB] clusters in codfw [puppet] - 10https://gerrit.wikimedia.org/r/276153 (https://phabricator.wikimedia.org/T129234) (owner: 10Mobrovac) [15:13:06] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.800 second response time [15:13:14] (03PS7) 10Alexandros Kosiaris: Introduce the SC[AB] clusters in codfw [puppet] - 10https://gerrit.wikimedia.org/r/276153 (https://phabricator.wikimedia.org/T129234) (owner: 10Mobrovac) [15:13:17] RECOVERY - HHVM rendering on mw1107 is OK: HTTP OK: HTTP/1.1 200 OK - 68959 bytes in 2.502 second response time [15:13:20] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Introduce the SC[AB] clusters in codfw [puppet] - 10https://gerrit.wikimedia.org/r/276153 (https://phabricator.wikimedia.org/T129234) (owner: 10Mobrovac) [15:17:44] (03PS4) 10Alexandros Kosiaris: maps: Also move hiera config files to match maps::server [puppet] - 10https://gerrit.wikimedia.org/r/276471 [15:17:57] (03PS1) 10BBlack: Varnish: stream all pass traffic [puppet] - 10https://gerrit.wikimedia.org/r/276475 [15:22:24] (03PS7) 10Ema: Port varnishlog to new VSL API [puppet] - 10https://gerrit.wikimedia.org/r/274946 (https://phabricator.wikimedia.org/T128788) 
[15:23:05] (03PS1) 10Filippo Giunchedi: statsdlb: set notrack and allow from $ALL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/276477 [15:23:09] (03CR) 10Muehlenhoff: "Fixed a couple of issues, with these fixed, logstash started correctly in a manual test on deployment-logstash2, will submit a revised pat" [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) (owner: 10Muehlenhoff) [15:23:58] (03CR) 10Muehlenhoff: [C: 032 V: 032] statsdlb: set notrack and allow from $ALL_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/276477 (owner: 10Filippo Giunchedi) [15:24:22] (03CR) 10Gehel: [C: 031] Don't create new log files for cirrus-suggest with logrotate [puppet] - 10https://gerrit.wikimedia.org/r/268215 (owner: 10EBernhardson) [15:25:46] PROBLEM - puppet last run on scb2001 is CRITICAL: CRITICAL: Puppet has 5 failures [15:26:55] PROBLEM - puppet last run on sca2002 is CRITICAL: CRITICAL: Puppet has 2 failures [15:27:56] PROBLEM - Last backup of the tools filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-tools was exit-code [15:29:04] ^it's still running [15:30:26] PROBLEM - puppet last run on scb2002 is CRITICAL: CRITICAL: Puppet has 5 failures [15:31:11] (03CR) 10Anomie: Enable completion suggester as default on all but top 12 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275605 (https://phabricator.wikimedia.org/T128775) (owner: 10EBernhardson) [15:32:50] (03CR) 10Ema: [C: 031] Varnish: stream all pass traffic [puppet] - 10https://gerrit.wikimedia.org/r/276475 (owner: 10BBlack) [15:33:05] twentyafterfour: hey [15:35:18] PROBLEM - cxserver endpoints health on scb2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=8080): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:35:58] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: 
Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=19000): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:36:08] PROBLEM - Check size of conntrack table on graphite1001 is CRITICAL: CRITICAL: nf_conntrack is 100 % full [15:36:17] PROBLEM - mathoid endpoints health on scb2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=10042): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:36:27] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.132, port=1970): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:36:27] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=8888): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:36:38] PROBLEM - cxserver endpoints health on scb2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.132, port=8080): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:36:59] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.132, port=19000): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:37:15] paravoid: hi [15:37:18] PROBLEM - apertium apy on sca2002 is CRITICAL: Connection refused [15:37:18] PROBLEM - mathoid endpoints health on scb2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.132, port=10042): Max 
retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:37:38] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.32.132, port=8888): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:37:47] akosiaris/mobrovac: are these to be expected? [15:37:57] PROBLEM - zotero on sca2002 is CRITICAL: Connection refused [15:38:33] twentyafterfour: hi! we're getting these: "Subject: Cron /srv/phab/tools/public_task_dump.py NOTICE: rtppl not found" every so often [15:38:36] every minute or so [15:38:48] PROBLEM - puppet last run on sca2001 is CRITICAL: CRITICAL: Puppet has 2 failures [15:38:52] scratch that, multiple times a minute [15:38:53] hmm [15:38:58] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.48.43, port=1970): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:39:02] wow [15:39:03] paravoid: yes, sc[ab]2001 hosts are coming online for the very first time.. [15:39:09] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [15:39:16] akosiaris: yeah, figured :) [15:39:18] akosiaris: known? [15:39:25] chasemp would know more about that than me [15:39:25] kart_: yes [15:39:45] I don't know what rtppl is [15:40:12] can you find out? [15:40:41] akosiaris: thanks [15:40:44] it's a static file that had rt user identities from the migration many moons ago...that is probably not there, that dump runs every night and somewhere it must be looking for it [15:40:58] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
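[editor's aside] The "Check size of conntrack table" alert above compares the kernel's live connection-tracking entry count against its maximum; the `statsdlb: set notrack` patch merged a few minutes earlier sidesteps the problem by keeping high-volume statsd UDP flows out of that table entirely. A minimal sketch of such a percentage check (the warning/critical thresholds here are assumptions, not the actual Icinga plugin's):

```python
def conntrack_usage(count, maximum):
    """Return conntrack table usage as a percentage."""
    return 100.0 * count / maximum


def check_conntrack(count, maximum, warn=80.0, crit=90.0):
    """Classify usage Nagios-style as OK / WARNING / CRITICAL."""
    pct = conntrack_usage(count, maximum)
    if pct >= crit:
        return "CRITICAL"
    if pct >= warn:
        return "WARNING"
    return "OK"
```

On a live Linux host the two inputs come from /proc/sys/net/netfilter/nf_conntrack_count and /proc/sys/net/netfilter/nf_conntrack_max; a reading of 100% matches the CRITICAL "nf_conntrack is 100 % full" message logged above.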
[15:41:01] this started happening after the latest upgrade [15:41:18] I'm looking into it [15:41:31] thanks! [15:41:44] twentyafterfour: it's mentioned in the readme of https://phabricator.wikimedia.org/diffusion/PHTO/ [15:41:47] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [15:41:52] something to do with RT migration? [15:41:57] RECOVERY - puppet last run on scb2002 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [15:42:00] RECOVERY - puppet last run on scb2001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:42:00] public_task_dump.py doesn't reference rtppl [15:42:18] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy [15:42:38] ah, recoveries [15:42:39] RECOVERY - mathoid endpoints health on scb2001 is OK: All endpoints are healthy [15:42:58] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy [15:42:58] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [15:43:08] RECOVERY - Check size of conntrack table on graphite1001 is OK: OK: nf_conntrack is 30 % full [15:43:17] RECOVERY - mathoid endpoints health on scb2002 is OK: All endpoints are healthy [15:43:33] akosiaris: \o/ [15:43:45] and all of them are running! [15:43:50] yaaaay [15:44:49] paravoid: are the notices gone now? [15:45:05] I just commented out the line that complains [15:45:40] did you try to run it manually? 
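[editor's aside] Since public_task_dump.py itself doesn't reference rtppl, the notice must come from something the cron job invokes or reads. One way to find the culprit is to search the tree for the string; this helper is a hypothetical sketch (equivalent to `grep -r rtppl /srv/phab/tools`), not part of the actual repository:

```python
import os


def files_mentioning(root, needle):
    """Walk `root` and return sorted paths of files whose text contains `needle`."""
    hits = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    if needle in fh.read():
                        hits.append(path)
            except OSError:
                continue  # unreadable file: skip it
    return sorted(hits)
```

e.g. `files_mentioning("/srv/phab/tools", "rtppl")` would list every file under the tools checkout that still mentions the old RT-migration identity dump.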
[15:46:08] RECOVERY - apertium apy on sca2002 is OK: HTTP OK: HTTP/1.1 200 OK - 4999 bytes in 0.077 second response time [15:46:28] RECOVERY - puppet last run on sca2002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:47:14] paravoid: no [15:47:38] RECOVERY - puppet last run on sca2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:47:50] I don't understand why it'd be happening multiple times per minute either it's a daily cron job [15:48:13] mobrovac: almost... I think the last alerts in icinga will need another 4-5 minutes to clear [15:48:35] dunno what to tell you [15:48:38] also why is this running as root? [15:48:46] all of these [15:49:09] that I don't know either [15:49:19] chase probably knows, is my guess [15:50:33] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2107220 (10EBernhardson) Sounds good. In the weekly codfw meeting we agree to swap elasticsearch traffic over to codfw next week. We didn't... [15:50:43] (03PS4) 10Muehlenhoff: Add systemd unit for logstash [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) [15:51:51] (03CR) 10Faidon Liambotis: [C: 032] Gerrit manifest cleanup [puppet] - 10https://gerrit.wikimedia.org/r/275911 (owner: 10Chad) [15:51:59] RECOVERY - zotero on sca2002 is OK: HTTP OK: HTTP/1.0 200 OK - 62 bytes in 0.085 second response time [15:54:38] (03PS1) 10Muehlenhoff: Add ferm rules for carbon-c-relay for labs graphite [puppet] - 10https://gerrit.wikimedia.org/r/276482 [15:54:44] (03PS1) 10Papaul: Add mgmt DNS entries for rdb200[5-6] Bug:T129178 [dns] - 10https://gerrit.wikimedia.org/r/276483 (https://phabricator.wikimedia.org/T129178) [15:57:24] (03CR) 10Jforrester: "> Where has this been discussed or posted to on mw wiki?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/276395 (owner: 10Jforrester) [15:57:51] as far as I can tell the public_task_dump.py doesn't need root [15:58:20] (03CR) 10Alexandros Kosiaris: [C: 04-1] "needs manual rebase. File now is hieradata/role/common/sca.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/276405 (https://phabricator.wikimedia.org/T129329) (owner: 10KartikMistry) [15:58:25] I can kill it if it's still complaining [15:58:31] and rerun it manually [15:59:39] ACKNOWLEDGEMENT - citoid endpoints health on scb2001 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) alexandros kosiaris LVS config still pending for dependent services [15:59:39] ACKNOWLEDGEMENT - cxserver endpoints health on scb2001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200): /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) is CRITICAL: Test Machine translate an HTML fragment using Apertium. returned the unexpected status 500 (expecting: 200) alex [15:59:39] ACKNOWLEDGEMENT - citoid endpoints health on scb2002 is CRITICAL: /api (Zotero alive) is CRITICAL: Test Zotero alive returned the unexpected status 404 (expecting: 200) alexandros kosiaris LVS config still pending for dependent services [15:59:39] ACKNOWLEDGEMENT - cxserver endpoints health on scb2002 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) is CRITICAL: Test Fetch enwiki Oxygen page returned the unexpected status 404 (expecting: 200): /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) is CRITICAL: Test Machine translate an HTML fragment using Apertium. 
returned the unexpected status 500 (expecting: 200) alex [16:00:04] anomie ostriches thcipriani marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160310T1600). Please do the needful. [16:00:04] dcausse James_F anomie benestar: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:15] Adsum. [16:00:32] * anomie is here, but would prefer not to do the actual SWATting today [16:00:38] I can SWAT. [16:00:50] o/ [16:01:45] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2107277 (10Gehel) I have a slight preference for Monday, Tuesday or Thursday. I'm having lunch with friends on Wednesday and Friday, so I'm... [16:01:49] wowza. Lots of weird jobqueue errors: type":"JobQueueError","file":"/srv/mediawiki/php-1.27.0-wmf.15/includes/jobqueue/JobQueueFederated.php","line":495,"message":"No queue partitions available." [16:02:02] ^ is that known? [16:02:09] thcipriani: would you let me swat? [16:02:21] Shit. [16:02:38] elukey: Related to what we were talking about? ^ [16:02:57] jzerebecki: absolutely :) [16:03:42] akosiaris: scb.yaml :) [16:03:44] thcipriani: Job queue backlog spiked earlier and we're trying to debug. [16:04:26] ostriches: yeah, lots of stuff in logstash fatalmonitor seems relevant [16:04:36] This hasn't improved and I got distracted from figuring out why.
[16:04:45] important fact: yesterday I de-pooled/re-pooled rdb1003 to reimage it with Debian [16:04:48] (or is possibly a side-effect of the root cause) [16:04:58] 6Operations, 10Monitoring, 7Tracking: Improve access to and control over incident and metrics monitoring infrastructure - https://phabricator.wikimedia.org/T124179#1948329 (10Eevans) Not sure if this is in-scope here, but as part of T103124, we had hoped to separate Icinga notifications for the RESTBase stag... [16:05:23] kart_: er, yes, sorry [16:06:30] woah, just seen https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors [16:06:32] thcipriani: I'd rather not swat... [16:06:49] ostriches: fine with me jzerebecki: ^ [16:07:39] :( [16:08:40] (03CR) 10Faidon Liambotis: [C: 04-1] "This doesn't rebase cleanly." [puppet] - 10https://gerrit.wikimedia.org/r/276405 (https://phabricator.wikimedia.org/T129329) (owner: 10KartikMistry) [16:08:54] paravoid: fixing. [16:09:10] thx [16:09:30] (03PS3) 10KartikMistry: Enable non-default Machine Translation for some languages [puppet] - 10https://gerrit.wikimedia.org/r/276405 (https://phabricator.wikimedia.org/T129329) [16:09:55] I'm pretty sure it's refreshLinks. [16:10:02] (03PS1) 10Dereckson: Enable SandboxLink on sr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276485 (https://phabricator.wikimedia.org/T129485) [16:10:05] Commonswiki isn't helping, it's roughly 1/6 of that backlog [16:12:42] Hi. When you're in logstash, this task has an exception reported, with no stacktrace attached to it: https://phabricator.wikimedia.org/T129359 [16:12:54] <_joe_> elukey: which hosts are reporting that error? jobrunners only? [16:13:13] Yeah [16:13:48] Err, maybe not [16:13:53] <_joe_> did you restart rdb1001? [16:14:03] <_joe_> and, when did those start?
[16:14:09] <_joe_> that's what I'd look at [16:14:16] https://logstash.wikimedia.org/#dashboard/temp/AVNhTnlAO3D718AOkUXz [16:14:44] _joe_: Roughly ~10h ago [16:14:47] (03PS5) 10Alexandros Kosiaris: maps: Also move hiera config files to match maps::server [puppet] - 10https://gerrit.wikimedia.org/r/276471 [16:15:35] <_joe_> does that correlate with any deploy? [16:15:54] fwiw I can see spikes of abandoned cirrus jobs at 4am UTC [16:16:03] <_joe_> 10 hours ago is 6 am utc... [16:16:26] I see the errors starting at 23:00 UTC yesterday [16:16:43] So, just after the train? [16:17:14] I saw 10h is when the job queue backlog started spiking. [16:17:21] yes exactly, it is more than 10hrs.. [16:17:45] 21:06 logmsgbot: twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.16 [16:18:12] If the errors didn't start for another two hours, that probably isn't it. [16:18:28] I concur [16:19:02] _joe_ mostly jobrunners [16:19:40] I'm curious if switching /just/ commonswiki back to wmf.15 would help. The backlog is *mostly* from it. And this is clearly "growing and more obvious" type of problem as opposed to "it started failing the second we switched" [16:20:28] ostriches: If it's slow to grow it'd also be slow to determine if doing that had fixed anything. :-( [16:20:41] true dat [16:21:32] do these errors show up in https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors ? I was watching that before, during and after the train and I didn't notice anything... I don't even know how to monitor job queues during a deployment [16:22:25] twentyafterfour: They're there /now/ and increasing in volume. [16:22:30] But they started slow, afaict. [16:22:37] It wasn't a "BOOM YOU'RE BROKE!"
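[editor's aside] The "No queue partitions available" error being debugged here comes from MediaWiki's JobQueueFederated, which spreads jobs across several Redis partition servers (the rdb* hosts mentioned above) and only throws once every partition has been marked down. The actual code is PHP; this Python model is illustrative only, showing why a single depool like the rdb1003 reimage wouldn't by itself trigger the error:

```python
class JobQueueError(Exception):
    pass


class FederatedQueue:
    """Hand each job to the first live partition; fail once none are left."""

    def __init__(self, partitions):
        # partition name -> up/down flag, e.g. the rdb* Redis hosts
        self.up = dict.fromkeys(partitions, True)

    def depool(self, name):
        self.up[name] = False

    def repool(self, name):
        self.up[name] = True

    def push(self, job):
        live = [name for name, ok in self.up.items() if ok]
        if not live:
            raise JobQueueError("No queue partitions available.")
        return live[0]  # the partition that accepted the job
```

In this model, depooling one host just shifts traffic to the survivors; the error surfacing on the jobrunners implies every partition was being treated as unavailable, which is why the spike pointed at something broader than the reimage.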
[16:23:23] Dereckson: got the backtrace for that task [16:23:27] (03PS2) 10Alexandros Kosiaris: Sync up eqiad/codfw LVS IP assignments for services [dns] - 10https://gerrit.wikimedia.org/r/276196 (https://phabricator.wikimedia.org/T129234) [16:24:54] greg-g: thanks for the trace. [16:25:39] 6Operations, 6Discovery, 10hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2107423 (10RobH) [16:28:11] (03PS6) 10Alexandros Kosiaris: maps: Also move hiera config files to match maps::server [puppet] - 10https://gerrit.wikimedia.org/r/276471 [16:28:40] 6Operations, 6Discovery, 10hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2107441 (10RobH) @gehel: SSD and HDD failure are still some of the highest[1] failure rate hardware in the datacenter. If we raid the OS disk (as these will be hot swap dis... [16:29:42] 6Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review: Dynamic backend selection via X-Wikimedia-Debug header - https://phabricator.wikimedia.org/T129000#2107445 (10bd808) [16:29:44] 6Operations, 10Wikimedia-General-or-Unknown, 15User-bd808: Update Wikimedia Debug extensions for Chrome and Firefox for configurable backend selection - https://phabricator.wikimedia.org/T129283#2107443 (10bd808) 5Open>3Resolved Wikimedia Debug Header 0.5.0 for Firefox is now available from https://addon... 
[16:30:05] (03CR) 10BBlack: "I think this is good, but I'd like to find a window without other breakage or experiments in the way, so we can see any effect more-clearl" [puppet] - 10https://gerrit.wikimedia.org/r/276475 (owner: 10BBlack) [16:34:02] PROBLEM - HHVM rendering on mw1122 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.009 second response time [16:34:13] PROBLEM - Apache HTTP on mw1122 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.014 second response time [16:34:38] (03PS1) 10Ema: Do not include dynamic directors in VCL test files [puppet] - 10https://gerrit.wikimedia.org/r/276493 [16:36:29] !log mw1122: restarted hhvm [16:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:37:33] RECOVERY - HHVM rendering on mw1122 is OK: HTTP OK: HTTP/1.1 200 OK - 69360 bytes in 0.214 second response time [16:37:43] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.062 second response time [16:43:06] 7Puppet, 6Commons: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2107565 (10Dereckson) p:5Triage>3Normal [16:46:04] 7Puppet, 6Commons, 7I18n: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2107574 (10Dereckson) [16:48:23] (03CR) 10BBlack: [C: 031] Do not include dynamic directors in VCL test files [puppet] - 10https://gerrit.wikimedia.org/r/276493 (owner: 10Ema) [16:50:03] So… no SWAT? 
[16:50:43] (03PS1) 10Dereckson: Add Gujarati fonts to mediawiki::packages::fonts [puppet] - 10https://gerrit.wikimedia.org/r/276501 (https://phabricator.wikimedia.org/T129500) [16:51:37] (03PS7) 10Alexandros Kosiaris: maps: Also move hiera config files to match maps::server [puppet] - 10https://gerrit.wikimedia.org/r/276471 [16:52:11] (03CR) 10Ema: [C: 032 V: 032] Do not include dynamic directors in VCL test files [puppet] - 10https://gerrit.wikimedia.org/r/276493 (owner: 10Ema) [16:54:24] James_F: looks like it (i'm in a meeting but it looks like investigation is still on-going) [16:54:27] (03PS8) 10Alexandros Kosiaris: maps: Also move hiera config files to match maps::server [puppet] - 10https://gerrit.wikimedia.org/r/276471 [16:54:30] * James_F nods. [16:54:32] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] maps: Also move hiera config files to match maps::server [puppet] - 10https://gerrit.wikimedia.org/r/276471 (owner: 10Alexandros Kosiaris) [16:55:45] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2107631 (10Gehel) Looking at some tcpdump traces to see if there is something to improve. My SSL-fu is quite rusty, so I might be doing thi... [16:56:17] (03CR) 10Mattflaschen: [C: 04-1] "$wgFlowOccupyNamespaces does not exist anymore (sorry, this shouldn't even be in here). 
It needs to be done with $wgNamespaceContentModel" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276395 (owner: 10Jforrester) [16:58:33] (03PS1) 10Alexandros Kosiaris: Remove cxserver from conftool-data for sca in codfw [puppet] - 10https://gerrit.wikimedia.org/r/276507 [16:58:52] 6Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 13Patch-For-Review: Make CI run Varnish VCL tests - https://phabricator.wikimedia.org/T128188#2107637 (10ema) [16:58:59] (03PS2) 10Alexandros Kosiaris: Remove cxserver from conftool-data for sca in codfw [puppet] - 10https://gerrit.wikimedia.org/r/276507 [16:59:12] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Remove cxserver from conftool-data for sca in codfw [puppet] - 10https://gerrit.wikimedia.org/r/276507 (owner: 10Alexandros Kosiaris) [16:59:18] 7Puppet, 6Commons, 10Wikimedia-SVG-rendering, 7I18n, 13Patch-For-Review: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2107655 (10Dereckson) [17:00:04] paravoid chasemp: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160310T1700). Please do the needful. [17:00:04] mdholloway ostriches kart_: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [17:00:04] godog: Dear anthropoid, the time has come. Please deploy codfw-switchover: swift sync replication (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160310T1700). [17:00:04] godog godog godog godog: A patch you scheduled for codfw-switchover: swift sync replication is about to be deployed. Please be available during the process. 
hahahaha [17:01:17] godog: godog godog godog [17:02:16] (03PS4) 10Rush: Gerrit manifest cleanup [puppet] - 10https://gerrit.wikimedia.org/r/275911 (owner: 10Chad) [17:02:17] * godog goes [17:02:39] 6Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 13Patch-For-Review: Make CI run Varnish VCL tests - https://phabricator.wikimedia.org/T128188#2107662 (10ema) Please note that there is no dependency on varnish 4 for this, we have already started adding VTC tests: https://gerrit.wikimedia.or... [17:02:40] godog: say jouncebot 3 times :) [17:02:51] hahaha [17:03:16] mdholloway|brb: about? [17:03:32] sorry, i'm here :) [17:03:44] not sure if puppetswat will be happening, either [17:03:49] why not? [17:03:51] it's happening :) [17:04:04] ori AaronSchulz I'd like to go ahead with the change, what's the status on the job queue backlog? [17:04:15] paravoid: We generally don't deploy with production in a broken state… [17:04:29] oh ok, greg-g, should we hold off? [17:04:34] godog: all evidence points to it being organic [17:04:35] deploy what [17:04:39] I thought I'd added a de-dup for nicks in jouncebot. Apparently not [17:04:42] edit to popular template, that sort of thing. [17:04:45] paravoid: if swat was cancelled because of an issue then until that issue is resolved we shouldn't do puppetswat either, correct? [17:04:56] * greg-g isn't sure of status of the issue [17:05:00] * greg-g was in a meeting [17:05:07] PROBLEM - puppet last run on maps-test2001 is CRITICAL: CRITICAL: Puppet has 1 failures [17:05:08] ori: ah, thanks [17:05:15] maybe per patch call depending on complexity ...
https://gerrit.wikimedia.org/r/#/c/275911/ seems harmless to interfere [17:05:26] https://gerrit.wikimedia.org/r/#/c/275853/ I have no idea [17:05:29] my patch is neither complex nor urgent [17:05:31] 6Operations, 10MobileFrontend, 10Traffic, 3Reading-Web-Sprint-67-If, Then, Else...?, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2107669 (10Jdlrobson) >>... [17:05:42] (03CR) 10Mattflaschen: "I forgot about the mapping from $wgFlowOccupyNamespaces to $wgNamespaceContentModels ." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276395 (owner: 10Jforrester) [17:05:48] ori: ok, thanks [17:06:08] PROBLEM - DPKG on maps-test2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:06:41] greg-g: let's not overreact - it's not like a gerrit puppet manifest cleanup is linked in any way with the job queue [17:08:27] (03PS4) 10Rush: Enable non-default Machine Translation for some languages [puppet] - 10https://gerrit.wikimedia.org/r/276405 (https://phabricator.wikimedia.org/T129329) (owner: 10KartikMistry) [17:08:42] paravoid: by that rule then half of the swat patches could have gone out :/ [17:08:42] AaronSchulz: I had a comment in https://gerrit.wikimedia.org/r/#/c/276071/1/wmf-config/filebackend-production.php FYI but other than that looks good to me to be merged now, atm looks like the job queue backlogged but not in trouble (?)
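[editor's aside] The quadruple "godog godog godog godog" ping a few lines up is the missing jouncebot nick de-duplication being joked about. jouncebot is a real Python bot, but this order-preserving de-dup helper is a hypothetical sketch of the fix, not its actual code:

```python
def dedup_nicks(nicks):
    """Drop repeated nicks while keeping first-seen order."""
    seen = set()
    unique = []
    for nick in nicks:
        if nick not in seen:
            seen.add(nick)
            unique.append(nick)
    return unique
```

With this in place, a deployer who scheduled four patches would be pinged once instead of four times.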
[17:08:45] chasemp: (I merged the gerrit one and i'm applying it) [17:08:52] cool saw that [17:09:36] AaronSchulz: argh I didn't realize you weren't on the other channel [17:09:59] AaronSchulz: I'll paste findings in a pastebin, so far looks like an organic spike in refreshLinks jobs [17:10:45] (03PS1) 10Mattflaschen: Remove vestiges of the old Occupy feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276508 [17:11:51] 6Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 13Patch-For-Review: Make CI run Varnish VCL tests - https://phabricator.wikimedia.org/T128188#2107677 (10hashar) @ema and @hashar going to pair on this and seek to get a proof of concept. [17:12:08] PROBLEM - HHVM rendering on mw1247 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.013 second response time [17:12:18] (03CR) 10Mattflaschen: "Follow-up is https://gerrit.wikimedia.org/r/276508 . Please deploy that at the same time as this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276395 (owner: 10Jforrester) [17:12:19] PROBLEM - Apache HTTP on mw1247 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.039 second response time [17:13:44] jligvehfclu [17:13:52] 6Operations, 10Salt: on labcontrol1001, /var/cache/salt has too many files! - https://phabricator.wikimedia.org/T129224#2107687 (10ArielGlenn) Just checked the job cache again: root@labcontrol1001:/var/cache/salt/master# ls -lt jobs/ total 28 drwxr-xr-x 3 root root 4096 Mar 10 15:02 d6 drwxr-xr-x 3 root root... 
[17:13:57] RECOVERY - HHVM rendering on mw1247 is OK: HTTP OK: HTTP/1.1 200 OK - 69356 bytes in 0.221 second response time [17:14:08] RECOVERY - Apache HTTP on mw1247 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.027 second response time [17:14:10] (03PS4) 10EBernhardson: Enable completion suggester as default on all but top 12 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275605 (https://phabricator.wikimedia.org/T128775) [17:15:08] PROBLEM - Apache HTTP on mw1082 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.006 second response time [17:15:14] paravoid: thx for the merge [17:15:19] PROBLEM - HHVM rendering on mw1082 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.014 second response time [17:15:24] i'll take a look at 1082 [17:16:19] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2107694 (10brion) >>! In T66214#2106748, @faidon wrote: > This isn't currently possible with the existing tec... [17:16:23] 6Operations, 10Salt: on labcontrol1001, /var/cache/salt has too many files! 
- https://phabricator.wikimedia.org/T129224#2107695 (10ArielGlenn) p:5Triage>3Normal [17:16:48] PROBLEM - Apache HTTP on mw1207 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.013 second response time [17:17:00] ok going to merge https://gerrit.wikimedia.org/r/#/c/276071 shortly if there are no objections [17:17:38] PROBLEM - HHVM rendering on mw1207 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [17:17:58] PROBLEM - Apache HTTP on mw1245 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.012 second response time [17:18:38] PROBLEM - HHVM rendering on mw1245 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.013 second response time [17:18:38] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.058 second response time [17:18:49] RECOVERY - HHVM rendering on mw1082 is OK: HTTP OK: HTTP/1.1 200 OK - 69356 bytes in 0.138 second response time [17:19:25] paravoid: are we holding off the scb related changes then? [17:19:32] !log cleared hhbc bytecode repo on mw1082 and mw1245 on suspicion that old translations were reused [17:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:19:48] RECOVERY - Apache HTTP on mw1245 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.041 second response time [17:20:14] * kart_ is waiting.. 
[17:20:19] RECOVERY - HHVM rendering on mw1245 is OK: HTTP OK: HTTP/1.1 200 OK - 69354 bytes in 0.085 second response time [17:20:58] PROBLEM - HHVM rendering on mw1078 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.023 second response time [17:21:08] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 69356 bytes in 0.397 second response time [17:21:38] PROBLEM - Apache HTTP on mw1078 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.005 second response time [17:21:47] 7Puppet, 10Beta-Cluster-Infrastructure, 6Discovery, 10Wikimedia-Portals, 13Patch-For-Review: beta-mediawiki-config-update-eqiad failing with merge conflict in portals - https://phabricator.wikimedia.org/T129427#2107728 (10greg) [17:22:06] 7Puppet, 10Continuous-Integration-Infrastructure: mediawiki jobs fail intermittently with "mw-teardown-mysql.sh: Can't revoke all privileges" - https://phabricator.wikimedia.org/T126699#2107731 (10greg) [17:22:08] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.041 second response time [17:22:45] greg-g: I'm going to drop a comment then on https://gerrit.wikimedia.org/r/#/c/275853/ and https://gerrit.wikimedia.org/r/#/c/275853/ that we held off due to fire, I'm not sure if they muddy the waters but there seems little rush to risk it [17:22:47] RECOVERY - HHVM rendering on mw1078 is OK: HTTP OK: HTTP/1.1 200 OK - 69356 bytes in 0.402 second response time [17:23:27] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.083 second response time [17:23:28] chasemp: same link twice [17:23:34] ha [17:23:50] but, yeah, ya'lls call, I suppose [17:23:51] greg-g: https://gerrit.wikimedia.org/r/#/c/276405/ [17:23:59] PROBLEM - Apache HTTP on mw1241 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.010 second response time [17:24:13] * greg-g nods [17:24:52] chasemp: 276405 is
not happening, right? [17:24:57] PROBLEM - HHVM rendering on mw1241 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [17:25:29] kart_: as I understand it there is some ongoing investigation in prod and so I'm holding off atm [17:25:49] chasemp: ok. I can schedule it on Monday then. [17:26:05] chasemp: as it need to communicate to VPs [17:26:06] I think I'm holding off with https://gerrit.wikimedia.org/r/#/c/276071 and friends too [17:26:41] kart_: sure I understand, I'll make a comment on the changeset [17:26:41] why? [17:26:55] chasemp: thanks. [17:27:19] PROBLEM - Apache HTTP on mw1253 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.005 second response time [17:27:39] RECOVERY - Apache HTTP on mw1241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.031 second response time [17:27:48] PROBLEM - HHVM rendering on mw1253 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.005 second response time [17:28:18] paravoid: I'm not sure what the status is wrt hhvm, I'm not seeing Aaron online too and I had a comment on the previous PS [17:28:29] RECOVERY - HHVM rendering on mw1241 is OK: HTTP OK: HTTP/1.1 200 OK - 69317 bytes in 0.078 second response time [17:29:08] RECOVERY - Apache HTTP on mw1253 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.051 second response time [17:29:37] ori: ^^^ [17:29:38] RECOVERY - HHVM rendering on mw1253 is OK: HTTP OK: HTTP/1.1 200 OK - 69317 bytes in 0.083 second response time [17:29:43] (03CR) 10Rush: "held over during puppet swat as there is an ongoing investigation for issues in prod, apologies" [puppet] - 10https://gerrit.wikimedia.org/r/276405 (https://phabricator.wikimedia.org/T129329) (owner: 10KartikMistry) [17:29:52] ori: the swift codfw switchover window started half an hour ago [17:29:54] (03CR) 10Rush: "held over during puppet swat as there is an ongoing investigation for issues in 
prod, apologies" [puppet] - 10https://gerrit.wikimedia.org/r/275853 (https://phabricator.wikimedia.org/T128237) (owner: 10Mholloway) [17:30:13] ori: so a) shall we proceed in light of the hhvm issues?, b) do you know where aaron is? [17:30:28] PROBLEM - HHVM rendering on mw1240 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [17:31:53] a) yes; b) no [17:32:48] PROBLEM - HHVM rendering on mw1258 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.037 second response time [17:33:07] OK, the rate at which they're locking up just spiked, yes? [17:33:21] as in, in the past 20 minutes [17:34:15] chasemp: thanks! [17:34:18] PROBLEM - Apache HTTP on mw1240 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [17:34:19] * kart_ goes to bed [17:34:42] paravoid: ok, I'm to err on the safe side but otoh doing this next week isn't going to be better either [17:35:04] we have a full schedule of switchovers next week, I wouldn't wait [17:35:07] I'd proceed [17:35:08] PROBLEM - puppet last run on maps-test2003 is CRITICAL: CRITICAL: Puppet has 1 failures [17:35:08] PROBLEM - puppet last run on maps-test2004 is CRITICAL: CRITICAL: Puppet has 1 failures [17:35:08] PROBLEM - puppet last run on maps-test2002 is CRITICAL: CRITICAL: Puppet has 1 failures [17:35:40] paravoid: yeah that's true, I'll merge it now [17:35:46] (03PS3) 10Filippo Giunchedi: Set cross-DC swift writes to be sync for originals for switchover testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276071 (https://phabricator.wikimedia.org/T129089) (owner: 10Aaron Schulz) [17:35:48] PROBLEM - Apache HTTP on mw1258 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.009 second response time [17:36:01] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Set cross-DC swift writes to be sync for originals for switchover testing [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/276071 (https://phabricator.wikimedia.org/T129089) (owner: 10Aaron Schulz) [17:36:18] PROBLEM - DPKG on maps-test2004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:36:18] PROBLEM - DPKG on maps-test2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:36:18] PROBLEM - DPKG on maps-test2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:36:46] AaronSchulz: FYI: https://phabricator.wikimedia.org/T128624#2107578 [17:36:47] PROBLEM - HHVM rendering on mw1174 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.007 second response time [17:37:03] (03PS1) 10Papaul: Add production DNS for rdb200[5-6] Bug:T129178 [dns] - 10https://gerrit.wikimedia.org/r/276515 (https://phabricator.wikimedia.org/T129178) [17:37:09] PROBLEM - Apache HTTP on mw1174 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.010 second response time [17:37:24] I'm with mysql, but ping me if support is needed for hhvm or swift [17:38:19] !log filippo@tin Synchronized wmf-config/filebackend-production.php: swift codfw sync replication T129089 (duration: 00m 39s) [17:38:20] T129089: switch upload varnish backends to codfw ahead of full switch - https://phabricator.wikimedia.org/T129089 [17:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:38:25] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2107850 (10faidon) >>! In T124444#2107277, @Gehel wrote: > I have a slight preference for Monday, Tuesday or Thursday. I'm having lunch wit... 
[17:40:27] PROBLEM - HHVM rendering on mw1239 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.006 second response time [17:40:37] PROBLEM - HHVM rendering on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.026 second response time [17:40:38] PROBLEM - Apache HTTP on mw1239 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [17:41:31] godog: AaronSchulz should be here [17:41:34] now, I mean [17:42:07] ori: thanks! LGTM so far though, no errors in the logs [17:42:09] RECOVERY - HHVM rendering on mw1239 is OK: HTTP OK: HTTP/1.1 200 OK - 69325 bytes in 0.085 second response time [17:42:14] I'll proceed with the varnish changes shortly [17:42:18] PROBLEM - Apache HTTP on mw1175 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.029 second response time [17:42:18] RECOVERY - Apache HTTP on mw1239 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.033 second response time [17:42:27] RECOVERY - HHVM rendering on mw1107 is OK: HTTP OK: HTTP/1.1 200 OK - 69321 bytes in 2.856 second response time [17:43:07] PROBLEM - HHVM rendering on mw1175 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [17:43:24] (03PS2) 10Filippo Giunchedi: varnish: route upload cache backends to codfw [puppet] - 10https://gerrit.wikimedia.org/r/276223 (https://phabricator.wikimedia.org/T129089) [17:43:32] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] varnish: route upload cache backends to codfw [puppet] - 10https://gerrit.wikimedia.org/r/276223 (https://phabricator.wikimedia.org/T129089) (owner: 10Filippo Giunchedi) [17:44:02] Can someone take a look at the jobqueue?
We got 4 million jobs actually [17:44:08] RECOVERY - Apache HTTP on mw1175 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.589 second response time [17:44:10] I will create a task [17:44:18] !log running puppet in batches on cache_upload in eqiad after https://gerrit.wikimedia.org/r/#/c/276223/ [17:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:44:23] Luke081515: it is/was being investigated, but I don't know of a task, so thank you [17:44:49] RECOVERY - HHVM rendering on mw1175 is OK: HTTP OK: HTTP/1.1 200 OK - 69318 bytes in 0.111 second response time [17:45:47] PROBLEM - HHVM rendering on mw1211 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [17:45:58] PROBLEM - Apache HTTP on mw1211 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.012 second response time [17:46:37] greg-g: Sure that a task exists? I can't find a operations open task with high or UBN prio [17:46:58] maybe I should reopen my last task about that [17:47:01] Luke081515: I meant/said I don't know of one, so, thank you (for creating one) :) [17:47:05] 7Blocked-on-Operations, 10Datasets-Archiving, 10Dumps-Generation, 10Flow, 3Collaboration-Team-Current: Publish recurring Flow dumps at http://dumps.wikimedia.org/ - https://phabricator.wikimedia.org/T119511#2107911 (10Mattflaschen) >>! In T119511#2106578, @ArielGlenn wrote: > I'm trying to run the flow m... [17:47:20] ok ;) [17:47:37] 6Operations, 10Wikimedia-General-or-Unknown: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#2107926 (10Luke081515) 5Resolved>3Open Currently we have 4 million jobs, and the queue is still growing... 
[17:47:46] 6Operations, 10Wikimedia-General-or-Unknown: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#2107930 (10Luke081515) a:5ori>3None [17:47:55] mh the vcl reload failed, looking [17:48:07] PROBLEM - Apache HTTP on mw1122 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.003 second response time [17:48:41] Luke081515, you should open a new ticket, it is probably not related [17:49:07] jynus: ok [17:49:08] PROBLEM - HHVM rendering on mw1122 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.016 second response time [17:49:18] PROBLEM - Apache HTTP on mw1256 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.005 second response time [17:49:18] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Puppet has 1 failures [17:49:28] PROBLEM - HHVM rendering on mw1256 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.011 second response time [17:49:30] reference it just in case :-) [17:49:56] 6Operations, 10MediaWiki-JobQueue: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2107949 (10Luke081515) [17:49:58] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: Puppet has 1 failures [17:50:08] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Puppet has 1 failures [17:50:11] 6Operations, 10MediaWiki-JobQueue: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2107962 (10Luke081515) [17:50:19] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures [17:50:21] 7Blocked-on-Operations, 10Datasets-Archiving, 10Dumps-Generation, 10Flow, 3Collaboration-Team-Current: Publish recurring Flow dumps at http://dumps.wikimedia.org/ - https://phabricator.wikimedia.org/T119511#2107968 (10ArielGlenn) sure will. 
[17:50:28] PROBLEM - puppet last run on cp1074 is CRITICAL: CRITICAL: Puppet has 1 failures [17:50:43] 6Operations, 10Wikimedia-General-or-Unknown: Job queue is growing and growing - https://phabricator.wikimedia.org/T124194#1948627 (10Luke081515) 5Open>3Resolved a:3ori (created T129517 for the new case) [17:50:47] done [17:50:48] PROBLEM - puppet last run on cp2011 is CRITICAL: CRITICAL: Puppet has 1 failures [17:50:49] PROBLEM - puppet last run on cp1072 is CRITICAL: CRITICAL: Puppet has 1 failures [17:50:49] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: Puppet has 1 failures [17:50:53] that's me ^ [17:50:58] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet has 1 failures [17:50:58] PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: Puppet has 1 failures [17:50:59] 6Operations, 10MediaWiki-JobQueue: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2107949 (10Luke081515) p:5Triage>3Unbreak! [17:51:08] RECOVERY - Apache HTTP on mw1256 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.031 second response time [17:51:08] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [17:51:09] Backend be_ms_fe redefined 'wikimedia-common_upload-backend.inc.vcl' Line 174 Pos 9 [17:51:10] Luke081515, thanks- that way we do not disturb unrelated people, etc. [17:51:17] PROBLEM - puppet last run on cp1073 is CRITICAL: CRITICAL: Puppet has 1 failures [17:51:18] RECOVERY - HHVM rendering on mw1256 is OK: HTTP OK: HTTP/1.1 200 OK - 69346 bytes in 0.083 second response time [17:51:38] bblack paravoid ema ^ seen this before?
[17:51:47] PROBLEM - Apache HTTP on mw1179 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [17:51:48] PROBLEM - puppet last run on cp1049 is CRITICAL: CRITICAL: Puppet has 1 failures [17:51:48] PROBLEM - puppet last run on cp2014 is CRITICAL: CRITICAL: Puppet has 1 failures [17:51:58] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: Puppet has 1 failures [17:52:17] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures [17:52:27] PROBLEM - HHVM rendering on mw1179 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.010 second response time [17:52:38] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: Puppet has 1 failures [17:52:48] 6Operations, 10MediaWiki-JobQueue: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2107992 (10greg) Subscribing people who I believe were investigating this earlier. [17:53:23] jynus: I think this is not the same reason now, the queue is growing faster than last time [17:53:27] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 1 failures [17:53:27] ACKNOWLEDGEMENT - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi troubles reloading VCL, redefined backend be_ms_fe [17:53:27] ACKNOWLEDGEMENT - puppet last run on cp1049 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi troubles reloading VCL, redefined backend be_ms_fe [17:53:27] ACKNOWLEDGEMENT - puppet last run on cp1050 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi troubles reloading VCL, redefined backend be_ms_fe [17:53:27] ACKNOWLEDGEMENT - puppet last run on cp1062 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi troubles reloading VCL, redefined backend be_ms_fe [17:53:27] ACKNOWLEDGEMENT - puppet last run on cp1063 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi troubles reloading VCL, redefined backend
be_ms_fe [17:53:27] ACKNOWLEDGEMENT - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi troubles reloading VCL, redefined backend be_ms_fe [17:53:28] ACKNOWLEDGEMENT - puppet last run on cp1071 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi troubles reloading VCL, redefined backend be_ms_fe [17:53:28] ACKNOWLEDGEMENT - puppet last run on cp1072 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi troubles reloading VCL, redefined backend be_ms_fe [17:53:30] ACKNOWLEDGEMENT - puppet last run on cp1073 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi troubles reloading VCL, redefined backend be_ms_fe [17:53:30] ACKNOWLEDGEMENT - puppet last run on cp1074 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi troubles reloading VCL, redefined backend be_ms_fe [17:53:30] ACKNOWLEDGEMENT - puppet last run on cp1099 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi troubles reloading VCL, redefined backend be_ms_fe [17:53:38] o.O [17:54:15] Luke081515, do not worry, ongoing just maintenance [17:54:17] RECOVERY - HHVM rendering on mw1179 is OK: HTTP OK: HTTP/1.1 200 OK - 69354 bytes in 0.099 second response time [17:54:47] PROBLEM - HHVM rendering on mw1216 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.003 second response time [17:54:57] (03PS1) 10CSteipp: Enforce password policies on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276518 (https://phabricator.wikimedia.org/T119100) [17:54:58] PROBLEM - Apache HTTP on mw1216 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.006 second response time [17:56:08] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: Puppet has 1 failures [17:56:18] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Puppet has 1 failures [17:56:38] RECOVERY - HHVM rendering on mw1216 is OK: HTTP OK: HTTP/1.1 200 OK - 69348 bytes in 0.268 second response time 
[17:56:48] RECOVERY - Apache HTTP on mw1216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.047 second response time [17:57:07] PROBLEM - HHVM rendering on mw1176 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.019 second response time [17:57:38] PROBLEM - Apache HTTP on mw1176 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.007 second response time [17:58:09] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Puppet has 1 failures [17:58:37] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 1 failures [17:58:58] RECOVERY - HHVM rendering on mw1176 is OK: HTTP OK: HTTP/1.1 200 OK - 69348 bytes in 0.348 second response time [17:59:29] RECOVERY - Apache HTTP on mw1176 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.070 second response time [17:59:29] RECOVERY - Apache HTTP on mw1240 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.561 second response time [18:00:04] yurik gwicke cscott arlolra subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160310T1800). Please do the needful. 
[18:00:07] PROBLEM - Apache HTTP on mw1186 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.016 second response time [18:00:28] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 1 failures [18:00:37] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [18:00:38] PROBLEM - HHVM rendering on mw1186 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.014 second response time [18:00:56] nothing to deploy for parsoid [18:01:17] RECOVERY - HHVM rendering on mw1240 is OK: HTTP OK: HTTP/1.1 200 OK - 69346 bytes in 0.078 second response time [18:01:28] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Puppet has 1 failures [18:03:58] PROBLEM - Apache HTTP on mw1244 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [18:04:09] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 1 failures [18:04:17] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 1 failures [18:04:39] PROBLEM - HHVM rendering on mw1244 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.010 second response time [18:04:49] godog: status? 
[18:05:44] paravoid: there's a duplicate "backend be_ms_fe" entry for the upload vcl that prevents it from loading, I believe it is because both backends start with "ms-fe" [18:06:13] (03CR) 10Brian Wolff: "This change may be causing https://phabricator.wikimedia.org/T129359 (unconfirmed)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/272922 (owner: 10Aaron Schulz) [18:06:25] paravoid: I think it can probably be fixed by removing the eqiad backend entry from hieradata/common/cache/upload.yaml [18:06:38] PROBLEM - Apache HTTP on mw1090 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.004 second response time [18:07:19] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 1 failures [18:07:36] ok? [18:07:38] PROBLEM - HHVM rendering on mw1090 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.010 second response time [18:08:17] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures [18:08:27] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Puppet has 1 failures [18:08:28] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.069 second response time [18:08:46] paravoid: but the sync replication seems to be working, I'm reading the template again to confirm if we could remove the entry and be ok, but it seems safer to me to revert the varnish change and keep sync replication on [18:08:48] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Puppet has 1 failures [18:08:51] so, the HHVM lock-up is (in all the cases I can see) correlated with these messages in /var/log/hhvm/error.log: [18:08:54] Mar 10 18:04:12 mw1090 hhvm: LightProcess::proc_open failed due to exception: Failed in afdt::recvRaw: Connection reset by peer [18:08:54] Mar 10 18:04:12 mw1090 hhvm: #012Warning: fork failed - Connection reset by peer in /srv/mediawiki/php-1.27.0-wmf.15/includes/GlobalFunctions.php on line 2490 [18:09:12] so I'm
going to add debug logging to wfShellExec() [18:09:24] currently it logs at the end, but not if there is a hard crash partway through [18:09:27] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Puppet has 1 failures [18:09:28] RECOVERY - HHVM rendering on mw1090 is OK: HTTP OK: HTTP/1.1 200 OK - 69344 bytes in 0.128 second response time [18:09:48] PROBLEM - Apache HTTP on mw1246 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.022 second response time [18:10:16] godog: let's just fix varnish? [18:10:28] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Puppet has 1 failures [18:10:37] PROBLEM - HHVM rendering on mw1246 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.009 second response time [18:11:28] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Puppet has 1 failures [18:13:04] (03PS1) 10Filippo Giunchedi: varnish: leave only one upload backend for swift [puppet] - 10https://gerrit.wikimedia.org/r/276523 [18:13:06] paravoid: yeah I think ^ will fix it [18:13:07] PROBLEM - HHVM rendering on mw1090 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.010 second response time [18:13:13] !log ori@tin Synchronized php-1.27.0-wmf.15/includes/GlobalFunctions.php: wfShellExec() debug logging for T129467 (duration: 00m 28s) [18:13:14] T129467: HHVM 3.12 has a race-condition when starting up - https://phabricator.wikimedia.org/T129467 [18:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:13:28] how did it work before though? 
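[editor's note] ori's point above is that wfShellExec() only logs once a command has finished, so a hard crash mid-call leaves no trace of which command was running. The pattern he syncs (logging before the fork as well as after) can be sketched as follows. This is a minimal Python illustration, not the actual MediaWiki PHP code; the function name `shell_exec` is hypothetical.

```python
import logging
import subprocess

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("shellexec")

def shell_exec(cmd):
    """Run a command, logging *before* the fork as well as after.

    The pre-exec entry is the point of the change: if the worker
    hard-crashes partway through (as with the HHVM LightProcess fork
    failures above), the log still records which command was in flight.
    """
    log.debug("about to run: %r", cmd)  # survives a crash mid-call
    proc = subprocess.run(cmd, capture_output=True, text=True)
    log.debug("finished %r with exit code %d", cmd, proc.returncode)
    return proc.returncode, proc.stdout

rc, out = shell_exec(["echo", "ok"])
```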
[18:13:37] !log importing zhwiki missing records from production to labs [18:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:13:58] PROBLEM - Apache HTTP on mw1090 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.002 second response time [18:14:14] both backends were defined before you switched from eqiad to codfw, right? [18:14:17] paravoid: see modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb I think because of "be_seen.key?(backend)" [18:15:28] PROBLEM - HHVM rendering on mw1220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.021 second response time [18:15:47] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.052 second response time [18:15:53] so that if there are two apps swift / swift-thumbs if they both use the backend with the same host it works fine from varnish POV, not otherwise [18:15:58] PROBLEM - Apache HTTP on mw1220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.015 second response time [18:16:08] uhm, ok [18:16:38] RECOVERY - HHVM rendering on mw1090 is OK: HTTP OK: HTTP/1.1 200 OK - 69344 bytes in 0.181 second response time [18:16:40] sorry, my attention is a bit diverted [18:17:03] paravoid: and thinking again that might not even fix it because it'll still generate both heh [18:17:06] no worries [18:17:49] 6Operations, 10Salt: on labcontrol1001, /var/cache/salt has too many files! - https://phabricator.wikimedia.org/T129224#2108146 (10Andrew) This is good news and bad news, both at once. 
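[editor's note] The failure mode godog and paravoid trace above ("be_seen.key?(backend)") can be boiled down to a toy reproduction: the ERB template de-duplicates backends by full hostname, but the VCL backend *name* is effectively derived from the first hostname label, so the eqiad and codfw swift frontends both render as `backend be_ms_fe` and Varnish rejects the VCL ("Backend be_ms_fe redefined"). A Python sketch of that logic follows; it is not the real template, and the full hostnames are assumptions.

```python
def backend_name(hostname):
    # Name derived from the first hostname label only, as the template
    # effectively did before the fix: "ms-fe.svc.eqiad.wmnet" and
    # "ms-fe.svc.codfw.wmnet" both become "be_ms_fe".
    return "be_" + hostname.split(".")[0].replace("-", "_")

def render_backends(hosts):
    seen = set()   # the template's be_seen: de-dupe on full hostname
    defs = []
    for host in hosts:
        if host in seen:
            continue  # same host reused by two apps: intentional, fine
        seen.add(host)
        defs.append(backend_name(host))
    return defs

# Two *different* hosts whose derived names collide -> duplicate
# "backend be_ms_fe" definitions, which Varnish refuses to load.
defs = render_backends(["ms-fe.svc.eqiad.wmnet", "ms-fe.svc.codfw.wmnet"])
```

The same-host case (two apps such as swift and swift_thumbs pointing at one frontend) is caught by the hostname de-dupe; it is only the cross-DC case, where different hosts share a first label, that slips through.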
[18:17:52] but I don't see bblack or ema around that are more familiar with this than I am, I think reverting will fix it though [18:18:53] 6Operations, 10MediaWiki-JobQueue: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2107949 (10JanZerebecki) Not sure if errors like these are related: `Redis exception connecting to "rdb1005.eqiad.wmnet:6381"` with exception message: `read error on connection`. They appear in logst... [18:19:00] !log ori@tin Synchronized php-1.27.0-wmf.15/includes/GlobalFunctions.php: wfShellExec() debug logging for T129467 (take 2) (duration: 00m 26s) [18:19:01] T129467: HHVM 3.12 has a race-condition when starting up - https://phabricator.wikimedia.org/T129467 [18:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:19:04] ok I'll revert as the hostname assumption seems baked into the template [18:19:08] PROBLEM - Apache HTTP on mw1180 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.005 second response time [18:19:57] PROBLEM - HHVM rendering on mw1180 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.018 second response time [18:20:44] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2108160 (10Gehel) >>! In T124444#2107850, @faidon wrote: > I've added a tentative date of Thursday the 17th under [[ https://wikitech.wikim... 
[18:20:57] RECOVERY - Apache HTTP on mw1180 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.063 second response time [18:21:28] PROBLEM - Apache HTTP on mw1242 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.009 second response time [18:21:37] PROBLEM - HHVM rendering on mw1242 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.010 second response time [18:21:47] RECOVERY - HHVM rendering on mw1180 is OK: HTTP OK: HTTP/1.1 200 OK - 69347 bytes in 0.098 second response time [18:21:57] (03PS1) 10Filippo Giunchedi: Revert "varnish: route upload cache backends to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/276524 [18:22:31] godog: what's up? [18:22:37] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "varnish: route upload cache backends to codfw" [puppet] - 10https://gerrit.wikimedia.org/r/276524 (owner: 10Filippo Giunchedi) [18:22:56] bblack: paravoid: there's a duplicate "backend be_ms_fe" entry for the upload vcl that prevents it from loading, I believe it is because both backends start with "ms-fe" paravoid: I think it can probably be fixed by removing the eqiad backend entry from hieradata/common/cache/upload.yaml [18:23:14] well yeah but that's a bug heh [18:23:17] bblack: when merging https://gerrit.wikimedia.org/r/#/c/276223/ duplicate backend be_ms_fe backend comes up [18:23:22] yeah what ori said [18:23:32] we have dual defines for the other two [18:23:37] s/two/too/ [18:23:42] let me poke at it a little [18:23:52] bblack: I've merged but not puppet-merged https://gerrit.wikimedia.org/r/#/c/276524/ yet btw [18:24:04] bblack: ok, puppet is disabled across cache_upload [18:24:52] 18:14 paravoid: see modules/varnish/templates/vcl/wikimedia-common.inc.vcl.erb I think because of "be_seen.key?(backend)" [18:24:59] ok yeah so what makes the case unique is we have two backend defs using the same hostname, yes [18:25:08] RECOVERY - Apache HTTP on mw1242 is OK: HTTP OK: HTTP/1.1 301
Moved Permanently - 614 bytes in 0.025 second response time [18:25:08] RECOVERY - HHVM rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 69347 bytes in 0.074 second response time [18:25:22] it de-duplicates the same hostname, but not the same backend name... [18:25:49] bblack: yeah I thought https://gerrit.wikimedia.org/r/#/c/276523/1 would fix it but likely not, there will still be two backend entries with the same name but different .host [18:25:54] which we're already using elsewhere in a way, but not in the same way as thumbs vs thumbs_swift [18:25:57] PROBLEM - Apache HTTP on mw1243 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.016 second response time [18:26:04] 6Operations, 10Traffic, 6WMF-Communications, 7HTTPS, 7Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2108187 (10Florian) [18:26:09] PROBLEM - HHVM rendering on mw1243 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.020 second response time [18:26:35] (03PS1) 10Ottomata: Update cdh submodule with oozie SLA config patch [puppet] - 10https://gerrit.wikimedia.org/r/276526 [18:27:03] bblack: mh perhaps adding the app name to the backend? not sure what other things that changes [18:27:39] RECOVERY - Apache HTTP on mw1243 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.030 second response time [18:28:06] godog: I donno yet, I'm still staring and thinking. there's a right answer in there somewhere. 
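[editor's note] The "right answer" bblack is looking for here lands later in the log as gerrit 276529 ("common VCL: use more of the hostname for backend naming"). The idea can be sketched as: derive the backend name from more hostname labels so hosts that share a first label but live in different DCs get distinct VCL names. A hedged Python sketch; how many labels the real template keeps is an assumption.

```python
def backend_name(hostname, labels=3):
    # Keep several hostname labels, not just the first one, so
    # "ms-fe.svc.eqiad.wmnet" and "ms-fe.svc.codfw.wmnet" no longer
    # collapse to the same VCL backend name. The label count (3) is an
    # assumption for illustration, not the real template's choice.
    kept = hostname.split(".")[:labels]
    return "be_" + "_".join(kept).replace("-", "_")

eqiad = backend_name("ms-fe.svc.eqiad.wmnet")  # distinct per DC
codfw = backend_name("ms-fe.svc.codfw.wmnet")
```

With distinct names per host, the template's hostname de-dupe still collapses intentional reuse (two apps on one host) while genuinely different hosts can no longer redefine each other's backends.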
[18:28:07] RECOVERY - HHVM rendering on mw1243 is OK: HTTP OK: HTTP/1.1 200 OK - 69347 bytes in 0.083 second response time [18:28:36] (a right way that works for all cases, which there are a lot of) [18:28:44] 6Operations, 6Discovery, 10Kartotherian, 10Maps, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#2108206 (10mark) We'd need to order a bit earlier than May 16th, as otherwise we risk them not arriving in time and hitting next FY budget. Let's aim for... [18:28:58] (03CR) 10Ottomata: [C: 032] Update cdh submodule with oozie SLA config patch [puppet] - 10https://gerrit.wikimedia.org/r/276526 (owner: 10Ottomata) [18:29:34] gwicke cscott arlolra subbu greg-g i'm about to push kartotherian [18:29:35] godog: am merging [18:29:36] s'ok? [18:29:44] k [18:30:00] ottomata: heh sorry about that, yeah good to merge, thanks [18:30:01] Filippo Giunchedi: Revert "varnish: route upload cache backends to codfw" (f2431fc) [18:30:02] k [18:30:19] so the conflicting case is this: when we do cache->cache, we define directors (same as app names in this context) as e.g. 
cache_eqiad + cache_eqiad_random, which share a host def intentionally [18:30:39] so in some cases, we're sharing + de-duping intentionally [18:31:18] the odd thing about swift that makes it a problem is the dual "swift" and "swift_thumbs" being the same backend just for independent dc-switching, which is kind of a hack anyways [18:31:51] the easiest way to fix it would be to do it from outside of VCL: actually define a second set of hostnames in DNS for swift_thumbs [18:32:16] chasemp: hiyyyaaa [18:33:02] alternatively, we could add the dcname to the hostname [18:33:13] bblack: oh ok, yeah that would work, possibly for the duration of the switchover and we could go back to ms-fe for both afterwards [18:33:28] (03PS11) 10Krinkle: Prepare db-codfw.php for a live deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo) [18:33:35] I *think* adding the dcname to the templating that constructs the name be_${foo} would work, but I'm not sure yet [18:33:38] bblack: the dcname to the hostname only for swift or in general? it is true that it works only because we don't share hostnames in dcs [18:33:42] in general [18:33:47] (03CR) 10Krinkle: [C: 032] Prepare db-codfw.php for a live deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo) [18:33:56] (03PS12) 10Krinkle: Prepare db-codfw.php for a live deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (https://phabricator.wikimedia.org/T124697) (owner: 10Jcrespo) [18:33:57] yay [18:34:06] in the cache<->cache case, it would just s/be_cp1046/be_cp1046_eqiad/ and such [18:34:09] (03CR) 10Krinkle: [C: 032] Prepare db-codfw.php for a live deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (https://phabricator.wikimedia.org/T124697) (owner: 10Jcrespo) [18:34:49] let me CI-check a change for that right quick. 
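A tiny sketch of the renaming bblack floats above (folding the datacenter into the generated `be_${foo}` backend name, e.g. s/be_cp1046/be_cp1046_eqiad/, so that backends sharing a short hostname across directors or dcs no longer collide after de-duplication). The helper and host list here are hypothetical stand-ins; the real change lives in the VCL/puppet templating, not Python.

```python
# Sketch of the dcname-in-backend-name scheme discussed above.
# Including the datacenter in the generated name keeps backends with
# the same short hostname (or one host reused by two directors, as
# with swift vs. swift_thumbs) distinct after de-duplication.
# Hostnames and dc values below are illustrative, not real inventory.

def backend_name(hostname, dc):
    """Build a VCL-safe backend name like 'be_cp1046_eqiad'."""
    short = hostname.split('.')[0]              # drop the domain part
    return 'be_{}_{}'.format(short.replace('-', '_'), dc)

hosts = [
    ('cp1046.eqiad.wmnet', 'eqiad'),
    ('cp2008.codfw.wmnet', 'codfw'),
    ('cp1046.codfw.wmnet', 'codfw'),            # same short name, other dc
]
names = [backend_name(h, dc) for h, dc in hosts]
assert len(names) == len(set(names))            # no collisions across dcs
print(names)
```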
it's better than DNS ugliness [18:35:03] (03Merged) 10jenkins-bot: Prepare db-codfw.php for a live deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (https://phabricator.wikimedia.org/T124697) (owner: 10Jcrespo) [18:35:55] bblack: ok! thanks [18:36:22] hey ottomata what's up? [18:38:35] hey soOoOoOo [18:38:41] system user group problem for ya [18:39:15] so far, all of our prod hadoop jobs have run as the 'hdfs' system user, which has lots of permissions, so no problem [18:39:42] recently, dan and marcel have adapted a tool that was made for running on the stat boxes (as the 'stats' user) to use hadoop [18:39:49] previously it just queried mysql regularly and generated reports [18:40:11] they would much prefer if this didn't have to run as the 'hdfs' user, so that we won't have to deal with other file permissions, and refactoring some puppet stuff [18:40:14] (03PS1) 10BBlack: common VCL: use more of the hostname for backend naming [puppet] - 10https://gerrit.wikimedia.org/r/276529 [18:40:26] it [18:40:54] I think I get it so far [18:40:59] what's the hold up? [18:41:01] (it'd probably be better not to run prod hadoop jobs as the hdfs user too, but it was the only non-human user that had an account and good perms in hdfs, so we just used it) [18:41:02] (03Abandoned) 10Chad: Phab: clean up role, remove ::config and ::main abstraction [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [18:41:16] Krinkle, the other pending task is: what do you need me to do to guarantee non-issues if I wanted to do some load testing? Can you schedule a time to discuss the issues? [18:41:19] so, we'd like to have a non-human user that could run hadoop jobs [18:41:25] but [18:41:34] file access in hadoop is controlled via a couple (really just one) admin group [18:41:38] analytics-privatedata-users [18:41:59] so, we could: A) somehow put the stats user in that admin group [18:42:02] (is that possible?)
[18:42:02] or [18:42:17] B) make up a user managed via admin.yaml that is non-human and put it in that group [18:42:31] ah [18:42:44] it seems A is hard, because the admin module removes users from groups that are not in the admin module, yes? [18:43:10] it does yes but we also have drawn a line in the sand previously that says admin.yaml is humans only [18:43:12] !log deployed latest kartotherian [18:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:43:16] so either way is a little tricky [18:43:20] I'm thinking [18:43:46] if we added 'stats' to the list of ignored users [18:43:47] would it work? [18:43:54] if we hackily puppetized group addition? [18:44:12] actually, investigating this, i noticed that every run of puppet now removes 'wikidev' from the stats group list [18:44:25] i think somewhere in puppet stats is put in wikidev, and then the admin module (or something) removes it [18:44:33] well that's interesting in itself [18:45:51] ottomata: ok so back down this rabbit hole [18:46:10] there is a group called statistics-admins that actually manages the stats group on host posix_name: stats [18:46:13] OOo chasemp ya sorry, uHhh brb and I have a meeting starting, but keep typing and i'll read [18:46:23] !log sync wikipedia-commons-gwtoolset-metadata with swiftrepl eqiad -> codfw T129359 [18:46:25] T129359: GWToolset gives error: An unknown error occurred in storage backend "local-swift-codfw - https://phabricator.wikimedia.org/T129359 [18:46:32] oh of course, my first attempt won't work right because that work has to be mirrored in the dynamic directors Go template too [18:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:46:41] oh yeah that's right! :) [18:46:47] we put users into the stats group [18:46:51] that works pretty well.
[18:46:53] ok brb [18:47:28] ottomata: ok I'll think on it and circle back around when you are avail as I'm not sure exactly [18:47:53] iirc something like this is why the posix: field was born for analytics [18:48:22] bblack: sigh [18:49:06] 6Operations, 10ops-eqiad: Rack and Initial setup db1074-79 - https://phabricator.wikimedia.org/T128753#2108367 (10jcrespo) @jcrespo check potential issue with firewall for carbon installation (dhcp) on labs-support vlan. [18:51:02] godog: I don't think I can fix it both correctly and quickly, other than the DNS hack (which is to define ms-fe-thumbs hostnames identical to ms-fe ones, and use those for swift_thumbs) [18:51:28] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [18:51:48] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [18:52:05] (03CR) 10Krinkle: "@Jcrespo : I'm not doubting that authority in the least. Let me clarify." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (https://phabricator.wikimedia.org/T124697) (owner: 10Jcrespo) [18:52:08] (03CR) 10Dzahn: [C: 032] Add mgmt DNS entries for rdb200[5-6] Bug:T129178 [dns] - 10https://gerrit.wikimedia.org/r/276483 (https://phabricator.wikimedia.org/T129178) (owner: 10Papaul) [18:52:30] bblack: ok, we're ~1h out of the original window, I'm fine with proceeding with the dns change and trying again, though I have to go in one hour [18:52:58] jynus: deploying now [18:53:01] the other option is doing it tomorrow of course [18:53:07] ok, working on patches now [18:53:17] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [18:53:33]
should be relatively safe [18:53:57] !log krinkle@tin Synchronized wmf-config/db-codfw.php: I47954e21 (duration: 00m 30s) [18:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:54:33] Krinkle, I wasn't saying "I am the one who decides", I was saying "If I f* it up, I will own it" [18:54:48] 6Operations, 10MediaWiki-Uploading, 6Multimedia, 10Traffic, 10Wikimedia-Video: Uploading 1.2GB ogv results in 503 - https://phabricator.wikimedia.org/T128358#2071563 (10Bawolff) >>! In T128358#2098525, @zhuyifei1999 wrote: > A possible workaround is to use async chunked uploading, but pywikibot does not... [18:55:04] jynus: Well, either way, I'd like you to decide. [18:55:21] bblack: how much time would you have left today to keep an eye on it if things go wrong btw? [18:55:27] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [18:55:36] FWIW, I have the HHVM issue isolated / contained [18:55:41] if that's a consideration [18:55:43] godog: long enough I think, hours [18:55:58] ori, yay! [18:55:58] (03PS1) 10BBlack: Add ms-fe-thumbs.svc hostnames mirroring ms-fe.svc [dns] - 10https://gerrit.wikimedia.org/r/276531 [18:56:01] ok! thanks [18:56:26] jynus: As for non-issues, now that codfw requests use their own masters, we'll need to be careful about anything that goes from codfw to eqiad as it is essentially poisoned from mediawiki's perspective. It will perform selects from a master that is in fact lagged. [18:56:27] (03CR) 10Filippo Giunchedi: [C: 031] Add ms-fe-thumbs.svc hostnames mirroring ms-fe.svc [dns] - 10https://gerrit.wikimedia.org/r/276531 (owner: 10BBlack) [18:57:01] Krinkle, true, but it is non-active [18:57:10] jynus: What do you mean by that? [18:57:18] anything on codfw is non-canonical [18:57:21] right now [18:57:21] (03CR) 10Dzahn: [C: 04-1] "10.192.32.132 but 133 1H IN PTR rdb2005.codfw.wmnet."
[dns] - 10https://gerrit.wikimedia.org/r/276515 (https://phabricator.wikimedia.org/T129178) (owner: 10Papaul) [18:57:38] mediawiki and db, I mean [18:57:47] (03PS1) 10BBlack: Add ms-fe-thumbs.svc hostnames mirroring ms-fe.svc [puppet] - 10https://gerrit.wikimedia.org/r/276532 [18:57:49] jynus: Sure, but if you make a request to an app server over http and it makes a logic decision and sends data to eqiad, eqiad doesn't know that. [18:58:01] We don't have a firewall between them. It can and will talk to eqiad [18:58:07] (03PS1) 10Chad: demux.py: don't import os, unused [puppet] - 10https://gerrit.wikimedia.org/r/276533 [18:58:11] I understand the risks, and as I said, we can reevaluate at any time [18:58:13] non-canonical is only relevant insofar as eqiad won't use codfw [18:58:20] (03CR) 10BBlack: [C: 032] Add ms-fe-thumbs.svc hostnames mirroring ms-fe.svc [dns] - 10https://gerrit.wikimedia.org/r/276531 (owner: 10BBlack) [18:58:38] the other option also has risks [18:58:58] It makes sense, it's just new to MW and as such we need to take precautions and audit stuff [18:59:03] (03CR) 10BBlack: [C: 032] Add ms-fe-thumbs.svc hostnames mirroring ms-fe.svc [puppet] - 10https://gerrit.wikimedia.org/r/276532 (owner: 10BBlack) [18:59:05] I'm all for that [18:59:09] and it's a blocker for multi-dc as well [18:59:11] 100% agree [18:59:32] in fact, my weight for that is that I do not trust mediawiki enough [18:59:40] it would've been "nice" if we could do without that this time around, but if we have to do it, we can and we will :) [18:59:43] so read-only mode at all levels [19:00:07] I really need to do load-testing [19:00:20] which leads me to the other question [19:00:25] what can I break? [19:00:29] jynus: OK. so before I audit too many things, let's do it bit by bit based on what you need. [19:00:44] how do you do this load testing? I'm mostly unfamiliar with this [19:00:45] I need to do GET requests massively [19:01:01] do you have a favourite url you use?
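The mass-GET testing jynus is describing can start as a plain sequential loop. A minimal sketch, with the fetcher injected so the loop itself can be dry-run here; a live run would swap in something like `urllib.request.urlopen` against a safe target (a test wiki page), which is an assumption on my part, not the actual tooling used.

```python
# Sketch of a "start with 1 url" load test: issue GETs in a loop and
# record per-request latency. The fetch function is a parameter so
# this dry-runs without generating real traffic; the example URL is
# only a placeholder target.
import time

def hammer(fetch, urls, rounds):
    """Issue GETs sequentially and collect per-request latencies."""
    latencies = []
    for _ in range(rounds):
        for url in urls:
            t0 = time.monotonic()
            fetch(url)
            latencies.append(time.monotonic() - t0)
    return {
        'requests': len(latencies),
        'avg_s': sum(latencies) / len(latencies),
        'max_s': max(latencies),
    }

# Dry run with a no-op fetcher; a real run would target something
# harmless first, e.g. a Special:BlankPage on a test wiki.
stats = hammer(lambda url: None,
               ['https://test2.wikipedia.org/wiki/Special:BlankPage'], 10)
print(stats)
```

Growing the `urls` list from one URL to the most common request paths (and eventually a replayed apache dump, minus POSTs) follows the escalation described in the conversation.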
or set of urls rather [19:01:23] (03PS1) 10Filippo Giunchedi: varnish: route upload cache backends to codfw [puppet] - 10https://gerrit.wikimedia.org/r/276536 (https://phabricator.wikimedia.org/T129089) [19:01:26] godog: should work now [19:01:33] I tested cp1071 [19:01:36] I don't, I was going to start with the stupidest option, 1 url [19:01:48] RECOVERY - puppet last run on cp1071 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [19:01:59] then progressively do several, based on the most common url requests [19:02:02] (03CR) 10BBlack: [C: 031] varnish: route upload cache backends to codfw [puppet] - 10https://gerrit.wikimedia.org/r/276536 (https://phabricator.wikimedia.org/T129089) (owner: 10Filippo Giunchedi) [19:02:52] as a last resort, taking an apache dump and resending it to codfw, without the posts [19:02:56] OK, so I suck at lower-level linux stuff. I'm sure there's a way we can monitor what an app server connects to for a certain time window, right? [19:03:04] (03CR) 10Dzahn: "this probably needs to be added to the deployment calendar to get merged" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275815 (https://phabricator.wikimedia.org/T127176) (owner: 10WMDE-leszek) [19:03:13] Like, start it in a tab, then do a certain request on an eqiad app server and see what it does [19:03:15] we can tcpdump and extract the urls from that [19:03:19] (03PS2) 10Papaul: Add production DNS for rdb200[5-6] Bug:T129178 [dns] - 10https://gerrit.wikimedia.org/r/276515 (https://phabricator.wikimedia.org/T129178) [19:03:25] IPs/hostnames would suffice [19:03:27] bblack: sweet, thanks!
I'll go ahead with 276536 [19:03:27] (03CR) 10Andrew Bogott: [C: 04-1] "It's possible I'm not understanding this, but just in case:" [puppet] - 10https://gerrit.wikimedia.org/r/276420 (owner: 10Muehlenhoff) [19:03:40] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] varnish: route upload cache backends to codfw [puppet] - 10https://gerrit.wikimedia.org/r/276536 (https://phabricator.wikimedia.org/T129089) (owner: 10Filippo Giunchedi) [19:04:03] jynus: I'd use testwiki or test2wiki and Special:BlankPage as the most basic and safest starting point [19:04:17] my question is [19:04:37] what are your main concerns, so I can try to avoid/identify them? [19:04:37] You should definitely use a beefier url as well (like a parsed page with commons db involvement and stuff) [19:04:53] but not until we identify what an app server connects to to do that [19:05:03] services not deployed? caches? mediawiki-db writes? [19:05:05] so we can do it on eqiad and verify if each of those have codfw equivalents or are read-only [19:05:24] I'm not sure to be honest. Things I know of either have codfw versions or are read-only. [19:05:35] parser cache works as intended, as I had to do maintenance on those [19:05:43] (03CR) 10Dzahn: "can you check if "fonts-gujr-extra" exists under that name in the different distro versions (Ubuntu vs. Debian)" [puppet] - 10https://gerrit.wikimedia.org/r/276501 (https://phabricator.wikimedia.org/T129500) (owner: 10Dereckson) [19:05:51] Even then it'd be good to start with 1 request and also verify on the codfw sides that those are in fact used. [19:06:00] shouldn't take long, just a sanity check [19:06:15] so bad caches cannot be invalidated, but mediawiki will check that it is not the latest version [19:06:27] jynus: I can do it any time today. Let's set aside an hour asap and do this? [19:06:34] bblack: yup, applied on cp1071 and checking [19:06:46] you mean talking or testing? [19:06:50] (03CR) 10Dzahn: [C: 031] "oh, sorry, you already said so on the ticket.
ok!" [puppet] - 10https://gerrit.wikimedia.org/r/276501 (https://phabricator.wikimedia.org/T129500) (owner: 10Dereckson) [19:06:51] both [19:07:00] I won't stick around for the full testing [19:07:11] it is getting late for me, send updates to this ticket: [19:07:16] Just to sanity check a few things from mw perspective and then you can go nuts on it [19:07:27] but I can postpone until both are present [19:08:06] (03CR) 10Dzahn: "https://packages.debian.org/jessie/fonts-gujr-extra" [puppet] - 10https://gerrit.wikimedia.org/r/276501 (https://phabricator.wikimedia.org/T129500) (owner: 10Dereckson) [19:08:41] Krinkle, https://phabricator.wikimedia.org/T124697 [19:09:19] (03PS3) 10Dzahn: Add production DNS for rdb200[5-6] Bug:T129178 [dns] - 10https://gerrit.wikimedia.org/r/276515 (https://phabricator.wikimedia.org/T129178) (owner: 10Papaul) [19:09:34] 6Operations, 6Performance-Team, 7Availability, 7Epic, and 3 others: Cleanup active-DC based MW config code and make it more robust and easy to change - https://phabricator.wikimedia.org/T114273#2108460 (10jcrespo) [19:09:39] 6Operations, 10DBA, 10MediaWiki-Configuration, 6Release-Engineering-Team, and 3 others: codfw is in read only according to mediawiki - https://phabricator.wikimedia.org/T124795#2108455 (10jcrespo) 5Open>3Resolved Fixed with @Krinkle's merge. [19:10:11] (03CR) 10Dzahn: [C: 032] Add production DNS for rdb200[5-6] Bug:T129178 [dns] - 10https://gerrit.wikimedia.org/r/276515 (https://phabricator.wikimedia.org/T129178) (owner: 10Papaul) [19:10:29] ^let's coordinate on that ticket for anything before doing it (you do not need my permission if you want to do something already), but I want to do some measurements of db performance under stress [19:11:15] 6Operations, 10Wikimedia-SVG-rendering: Install Noto CJK (Source Han Sans) font family for SVG rendering - https://phabricator.wikimedia.org/T123223#1923929 (10Dereckson) @PhiLiP So yes, you're welcome to contribute code.
There are two things to do: # As you've already indicated, we need to import the packa... [19:11:40] 6Operations, 10DBA, 13Patch-For-Review, 7Performance, and 2 others: Stress-test mediawiki application servers at codfw (specially to figure out db weights configuration) and basic buffer warming - https://phabricator.wikimedia.org/T124697#2108480 (10jcrespo) The first phase will be mass-request: testwiki o... [19:13:52] 6Operations, 10DBA, 13Patch-For-Review, 7Performance, and 2 others: Stress-test mediawiki application servers at codfw (specially to figure out db weights configuration) and basic buffer warming - https://phabricator.wikimedia.org/T124697#2108490 (10Krinkle) I want to do the following before running large... [19:14:01] BTW, you are too fearful, technically we have been making mediawiki requests to the main page for ages (watchdog). But I agree to be careful [19:14:20] jynus: Yeah, good point. [19:14:28] And we also expose it over X-Wikimedia-Debug now [19:14:30] for 2 servers [19:14:46] But I'd like to verify this before we do it wide-scale and intentionally [19:14:53] it was worse- codfw has been on semi-read-semi-write-only for years [19:15:05] now the config is good for the first time! [19:15:14] The fact that there were years for worse situations to exist is also worse. [19:15:21] Meaning, we're not using it. [19:15:21] :-) [19:15:37] small steps [19:15:56] for connections, we can set up a firewall and block or log (or both) outgoing connections [19:16:16] godog: ok so far? [19:16:17] or use tcpdump [19:16:38] Are all of the JobQueue exceptions right now just fallout from earlier? Or are things still broken? [19:16:39] bblack: yeah I think so, enabling puppet on the rest shortly [19:16:43] ok [19:17:27] bblack: FWIW I'm looking at https://grafana.wikimedia.org/dashboard/db/swift to see swift perspective [19:17:40] jynus: does tcpdump include simple udp? [19:17:51] or netstat?
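The tcpdump approach jynus mentions (capture for a window, then boil the output down to the set of hosts the app server talked to) can be sketched as a small post-processor. The parsing below runs on canned sample lines with made-up addresses; a live run would feed it the output of something like `tcpdump -nn -c 10000` instead.

```python
# Reduce tcpdump -nn output to the unique set of destination IPs seen
# during a capture window. Sample lines and addresses are invented for
# illustration.
import re

ADDR = re.compile(r'> (\d+\.\d+\.\d+\.\d+)\.\d+:')

def destinations(tcpdump_lines):
    """Return the sorted unique destination IPs in tcpdump -nn output."""
    seen = set()
    for line in tcpdump_lines:
        m = ADDR.search(line)
        if m:
            seen.add(m.group(1))
    return sorted(seen)

sample = [
    '12:00:01.000 IP 10.0.0.5.33412 > 10.64.0.12.3306: Flags [S]',
    '12:00:01.001 IP 10.0.0.5.33413 > 10.64.0.12.3306: Flags [S]',
    '12:00:01.002 IP 10.0.0.5.44120 > 10.64.32.9.11211: UDP',
]
print(destinations(sample))
```

Since tcpdump sees UDP as well as TCP, this also covers the "does tcpdump include simple udp?" question above: yes, given a capture filter that doesn't exclude it.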
[19:17:55] !log set timeline and math data containers as readable in swift codfw [19:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:18:09] Basically just want to gather all IPs/hosts used during a capture window [19:18:11] Krinkle, yes [19:18:33] netstat has the issue that it only captures a snapshot, not the whole story [19:19:29] if one is root, we can capture everything, but that would be for validation, not stress testing [19:19:42] Yeah [19:19:43] I [19:19:47] I'm root on app servers [19:20:07] (or rather, I can, I don't use it by default obviously) [19:20:18] I am warning you because there are a lot of packets being captured, even on an idle server [19:21:36] I warn you that I will be disconnecting soon, feel free to do any testing without me and updating the ticket [19:22:26] the only thing I asked is not to do mysql writes on codfw masters (technically impossible, except for roots) [19:22:38] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:22:44] impossible for mediawiki itself [19:23:28] RECOVERY - puppet last run on cp1050 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [19:23:28] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:23:32] !log enable puppet in cache_cluster in eqiad, followed by the rest [19:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:23:38] that would be cache_upload [19:24:08] RECOVERY - puppet last run on cp1099 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [19:24:18] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:24:34] lag should not be an issue, except due to multi-tier slaves- that is a known issue, and we cannot do anything about it now [19:24:47] RECOVERY - puppet last
run on cp1074 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [19:25:08] RECOVERY - puppet last run on cp1072 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [19:25:38] RECOVERY - puppet last run on cp1073 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [19:25:58] RECOVERY - puppet last run on cp1049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:26:28] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:27:02] jynus: Sure, no worries. [19:29:17] PROBLEM - HHVM rendering on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [19:29:58] PROBLEM - Apache HTTP on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.033 second response time [19:30:08] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: Puppet has 1 failures [19:30:58] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 1 failures [19:30:59] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: Puppet has 1 failures [19:31:19] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 1 failures [19:31:39] mhh didn't expect that, checking if it is transient [19:31:48] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [19:31:57] RECOVERY - puppet last run on cp2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:32:08] RECOVERY - puppet last run on cp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:32:16] yeah, recovering by itself [19:32:47] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Puppet has 1 failures [19:32:49] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 
failures [19:32:57] RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:33:07] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 1 failures [19:33:08] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:33:38] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [19:34:00] Someone can take a look at https://phabricator.wikimedia.org/T129517#2108153? Seems like one of the redis queues is not working [19:34:28] RECOVERY - puppet last run on cp2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:34:31] bblack: I'll move to https://gerrit.wikimedia.org/r/#/c/276224/ and disable puppet again just in case [19:34:36] 7Puppet, 10Continuous-Integration-Infrastructure: mediawiki jobs fail intermittently with "mw-teardown-mysql.sh: Can't revoke all privileges" - https://phabricator.wikimedia.org/T126699#2108576 (10JanZerebecki) I worded that incorrectly, that is not what I meant. I have not looked at it enough to know if it is... 
[19:34:37] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:34:47] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Puppet has 1 failures [19:34:48] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [19:34:49] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:34:58] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:34:58] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:35:07] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 1 failures [19:35:09] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:35:09] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:35:48] 6Operations, 10DNS, 10Traffic, 10Wikimedia-Apache-configuration, and 4 others: Move oldwikisource on www.wikisource.org to mul.wikisource.org - https://phabricator.wikimedia.org/T64717#2108583 (10Dzahn) since www.wikisource.org and mul.wikisource.org both already exist in DNS and that seemed to be the only... [19:36:01] (03CR) 10Hashar: [C: 031] "I haven't really looked at the Apache conf. But the CI / beta autoupdate parts are good to me."
[puppet] - 10https://gerrit.wikimedia.org/r/276397 (https://phabricator.wikimedia.org/T129427) (owner: 10EBernhardson) [19:36:07] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:36:08] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: Puppet has 1 failures [19:36:21] 6Operations, 10Traffic, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-Requests, and 3 others: Move oldwikisource on www.wikisource.org to mul.wikisource.org - https://phabricator.wikimedia.org/T64717#2108584 (10Dzahn) [19:36:28] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:36:37] godog: let it go and re-puppet them [19:36:48] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:36:48] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [19:36:49] it's a confd race condition, puppet x2 fixes it [19:37:05] (I think) [19:37:08] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [19:37:31] bblack: ok, yeah my first salt puppet run is still going [19:37:38] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Puppet has 1 failures [19:37:38] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [19:37:41] (03PS1) 10Dereckson: Set logo and site name on gu.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276547 (https://phabricator.wikimedia.org/T122407) [19:37:47] PROBLEM - puppet last run on cp2020 is CRITICAL: CRITICAL: Puppet has 1 failures [19:37:48] I'm still undecided if puppet being so slow ATM is an advantage or a disadvantage [19:37:49] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:37:54] it has
some inertia to it [19:37:59] RECOVERY - puppet last run on cp2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:38:10] :) [19:38:17] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [19:38:33] usually when I make these kinds of changes, I salt puppet in large batches, up to ~19 hosts at a time [19:38:33] (03PS1) 10RobH: replacing bast2001 with install2001 in smokeping targets [puppet] - 10https://gerrit.wikimedia.org/r/276548 [19:38:45] which for upload will do the world in like 2 batches [19:39:15] oh ok, I'll try with 18 or so [19:39:28] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:39:29] RECOVERY - puppet last run on cp2020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:40:43] (03PS2) 10Filippo Giunchedi: varnish: route codfw as 'direct' for upload caches [puppet] - 10https://gerrit.wikimedia.org/r/276224 (https://phabricator.wikimedia.org/T129089) [19:40:50] (03CR) 10RobH: [C: 032] replacing bast2001 with install2001 in smokeping targets [puppet] - 10https://gerrit.wikimedia.org/r/276548 (owner: 10RobH) [19:40:55] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] varnish: route codfw as 'direct' for upload caches [puppet] - 10https://gerrit.wikimedia.org/r/276224 (https://phabricator.wikimedia.org/T129089) (owner: 10Filippo Giunchedi) [19:41:28] (03PS3) 10Filippo Giunchedi: varnish: route codfw as 'direct' for upload caches [puppet] - 10https://gerrit.wikimedia.org/r/276224 (https://phabricator.wikimedia.org/T129089) [19:41:35] (03CR) 10Filippo Giunchedi: [V: 032] varnish: route codfw as 'direct' for upload caches [puppet] - 10https://gerrit.wikimedia.org/r/276224 (https://phabricator.wikimedia.org/T129089) (owner: 10Filippo Giunchedi) [19:45:56] yuvipanda: there's a pull request for "watroles" that we would like 
https://phabricator.wikimedia.org/T128871#2089688 [19:46:13] "1" :) [19:46:26] bblack: LGTM, reenabling puppet everywhere and then moving on to https://gerrit.wikimedia.org/r/#/c/276225/ [19:46:57] mutante: merged! [19:47:02] yuvipanda: thanks ! [19:47:06] godog: you need to be really really really sure before that one, that the previous one is in full effect [19:47:32] (that no puppet run got skipped due to already-running or whatever, or salt failed to notice a host, or who knows what) [19:47:38] mutante: np [19:47:39] bblack: ok! I'm forcing a puppet run on cluster:cache_upload before [19:47:55] I'm just saying, there are rare flaws in a simple salt to a cluster [19:48:23] with that pair of changes, if change #1 isn't in full effect before change #2 starts hitting, we will get loops and everything will go crazy [19:48:27] PROBLEM - MariaDB Slave Lag: m3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 180036.68 seconds [19:48:42] verify after remote commands [19:48:50] with ssh you'd do the same [19:49:24] bblack: heh ok! I'll make sure puppet has run on upload cluster first [19:51:14] ok [19:51:27] chasemp: back! [19:51:46] basically the salted-puppet flaw is that the normal cron puppet could start before your change merges, and block the salted run that applies the merge as "already running". or salt could forget a host [19:52:22] and if even 1 single codfw cache is still pointing at eqiad when eqiad starts pointing at codfw, some large percentage of user requests will become infinite loops of codfw<->eqiad querying each other until they reach some terminal timeout [19:52:58] https://www.youtube.com/watch?v=jyaLZHiJJnE [19:53:08] ottomata: ok so you have some script that needs to be run as the stat user [19:53:14] was that where we left off?
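The verification bblack is asking for (don't trust that a batched salt/puppet run reached every host; a cron puppet run may have blocked it, or salt may have lost a minion) amounts to diffing the hosts that confirmed the change against the expected inventory. A sketch, with invented host names standing in for the codfw upload caches:

```python
# Verify a salted puppet run by comparing the set of hosts that
# confirmed the change (e.g. parsed from a salted grep on the changed
# file under /etc/varnish) against the expected host list.
# Host names here are hypothetical stand-ins.

def unconfirmed(expected_hosts, confirmations):
    """Hosts that never reported the change as applied."""
    return sorted(set(expected_hosts) - set(confirmations))

expected = ['cp2002', 'cp2005', 'cp2008']
confirmed = {'cp2002': 'ok', 'cp2008': 'ok'}   # cp2005 never answered

missing = unconfirmed(expected, confirmed)
assert missing == ['cp2005']   # this host still needs a puppet run
print(missing)
```

An empty `missing` list is the "change #1 is in full effect" condition that has to hold before change #2 starts hitting.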
[19:53:24] well, that would be the most convenient solution yes [19:53:40] but, if it was the stats user, then that user would have to be in the analytics-privatedata-users group [19:53:41] greg-g: did we ever figure out the problem that happened around swat time? (cirrus?) [19:53:43] bblack: is there a particular way you make sure that puppet has run, short of waiting for its normal course? [19:54:05] ottomata: why not a sudo -u stat foo.py situation? is this a cron or user run or ? [19:54:05] whatever user we use, that user needs to be able to access files that are group readable by that group [19:54:08] godog: well, you could do a salted grep on the changed file in /etc/varnish to confirm that the change is there [19:54:11] it's a cron [19:54:17] godog: if you want a fast solution I mean [19:54:35] it's a cron that launches a hive query [19:54:51] that reads files in hdfs, that are group readable by analytics-privatedata-users [19:55:10] can you point me to an example file so I can see? [19:55:20] of the cron? [19:55:21] uhh [19:55:22] ja [19:55:29] I just want to see what perms you mean [19:55:45] oh [19:55:50] oh [19:55:51] uhh [19:55:51] yeah [19:55:55] on stat1002 [19:57:06] hdfs dfs -ls /wmf/data/wmf/webrequest/webrequest_source=text/year=2016/month=3/day=9/hour=10 [19:57:09] godog: there's only 10x caches in codfw anyways, could also count the salt output [19:57:16] (10x upload caches I mean) [19:57:17] oh one more thing chasemp [19:57:19] whatever user it is [19:57:34] aude: I haven't heard a summary yet from people, they have moved on [19:57:34] it needs to exist on analytics1001 [19:57:40] 7Puppet, 10Continuous-Integration-Infrastructure: mediawiki jobs fail intermittently with "mw-teardown-mysql.sh: Can't revoke all privileges" - https://phabricator.wikimedia.org/T126699#2108850 (10hashar) A puppet patch landed a few days ago that moved all the manifest/role/ci.pp under the modules/roles/ci/ tr...
[19:58:06] we better get this into a task dude I'm losing track of nuance but I want to help [19:59:30] it sounds like you want to run a script as the stat user that can read files permed for analytics-privatedata-users [19:59:32] 6Operations, 10MediaWiki-JobQueue: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2108861 (10greg) @demon @elukey @jcrespo @ori: you all were investigating this issue earlier, can you please summarize what you found on this task? The issue was severe enough in the morning to block... [19:59:40] bblack: ok thanks, I'm verifying some 401s I've seen first [20:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160310T2000). Please do the needful. [20:00:14] greg-g: ok [20:00:32] if/when we figure things out, then we would like to deploy the fix for https://phabricator.wikimedia.org/T129450 [20:00:38] (03PS1) 10RobH: neglected to update the title of the codfw smokeping target [puppet] - 10https://gerrit.wikimedia.org/r/276550 [20:00:42] (people can't add/remove site links on wikidata) [20:00:52] not sure about add, but they definitely can't remove [20:01:21] (03CR) 10RobH: [C: 032] neglected to update the title of the codfw smokeping target [puppet] - 10https://gerrit.wikimedia.org/r/276550 (owner: 10RobH) [20:03:18] chasemp: that is correct [20:03:24] Are we good to go for the train? [20:04:18] twentyafterfour: I've been trying to get an answer from those who were investigating the jobqueue issue earlier, but apparently no one cares anymore [20:04:49] twentyafterfour: since no one is jumping up and down, I'm going to say "ef it" and go ahead. 
If they feel it is a blocking issue they should have A) told me and B) made a task [20:05:36] hdfs dfs -ls /wmf/data/wmf/webrequest/webrequest_source=text/year=2016/month=3/day=9/hour=0/ is a better example [20:05:50] greg-g: I don't think it was directly related to the new branch deploy or it would have shown up sooner than ~2 hours after the train yesterday [20:05:52] greg-g: i would need a task to help understand the issue [20:06:01] aude: https://phabricator.wikimedia.org/T129517 [20:06:04] ok [20:06:27] is all I have, since no one reported anything as they were investigating and I'm not going to try to tease apart irc logs from 3 different ongoing investigations this morning (maybe it was 2, hard to tell) [20:06:28] aude: I can deploy your patch before the train if it's ok with greg-g [20:06:46] twentyafterfour: would be helpful [20:07:26] ottomata: can we in the job itself use perms to read the files? sudo -g? [20:08:45] bblack: ok going ahead with last https://gerrit.wikimedia.org/r/#/c/276225/ [20:09:05] chasemp: https://phabricator.wikimedia.org/T129551 [20:09:07] hmmm [20:09:19] godog: ok [20:09:39] greg-g: the redis connection errors are coming from wikis which are on wmf.15 not wmf.16 [20:09:49] (03PS2) 10Filippo Giunchedi: varnish: route eqiad to codfw for upload caches [puppet] - 10https://gerrit.wikimedia.org/r/276225 (https://phabricator.wikimedia.org/T129089) [20:09:53] godog: it should make little diff in your eqiad/codfw stats or req counts, but make some requests a little more efficient for users, and validate that we can do such things at all [20:10:35] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] varnish: route eqiad to codfw for upload caches [puppet] - 10https://gerrit.wikimedia.org/r/276225 (https://phabricator.wikimedia.org/T129089) (owner: 10Filippo Giunchedi) [20:11:23] bblack: *nod* thanks!
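The permissions question ottomata and chasemp are circling above (run the cron as `sudo -u stat` / `sudo -g`, or some other user in analytics-privatedata-users) comes down to one check: the job's user can read the HDFS files only if it is in the group the files are group-readable by. A minimal sketch of that POSIX-style check, with illustrative group names (HDFS applies the same owner/group/other read-bit logic to its listings):

```python
# Model of the group-readability check behind the sudo -u / sudo -g
# discussion: a user who is not the file's owner can read it via the
# group-read bit only if they are in the file's group.
import stat

def group_readable_by(user_groups, file_group, file_mode):
    """True if a non-owner in `user_groups` can read the file via group perms."""
    return file_group in user_groups and bool(file_mode & stat.S_IRGRP)

# 0o640: owner rw, group r, other none -- the shape of the webrequest data.
print(group_readable_by({"wikidev"},
                        "analytics-privatedata-users", 0o640))
print(group_readable_by({"wikidev", "analytics-privatedata-users"},
                        "analytics-privatedata-users", 0o640))
```

Hence the constraint in the chat: whichever user the cron runs as, it must be added to analytics-privatedata-users (and must exist on analytics1001) for the hive query to read the files at all.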
disabled puppet again everywhere [20:11:50] twentyafterfour: well then, maybe we'll fix it :) [20:12:28] ori: _joe_ elukey last chance to speak up about the jobqueue errors/redis stuff from this morning, I haven't heard anything for a couple hours and you all moved on so I assume it's no longer an issue [20:12:43] greg-g: I'm in a meeting, will update later [20:12:52] my findings are all in the irc backlog [20:13:02] so someone else can summarize too [20:13:07] ori: along with 2 other investigations that were ongoing at the time, I can't parse that for you [20:13:22] ori: do you want to block the train or no? [20:13:25] no [20:13:28] thank you! [20:14:37] ori: and thanks for the offer of summarizing later, since it's not blocking it can definitely wait :) [20:14:41] * greg-g feels better now [20:16:01] 6Operations, 10MediaWiki-JobQueue: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2107949 (10mmodell) https://logstash.wikimedia.org/#dashboard/temp/AVNiKpq8O3D718AOM4Cg The errors have not subsided, but as you can see in the graph I pasted above ^ the error level before and afte... [20:21:04] 6Operations, 10ops-eqiad: Rack and setup snapshot1005-1007 - https://phabricator.wikimedia.org/T129553#2108980 (10Cmjohnson) [20:22:23] 6Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 13Patch-For-Review: Make CI run Varnish VCL tests - https://phabricator.wikimedia.org/T128188#2108997 (10hashar) The Nodepool instances now have the `varnish` package at whatever version is defined by apt conf. which should be 3.x since it is... [20:22:25] 6Operations, 10ops-eqiad: Rack and setup snapshot1005-1007 - https://phabricator.wikimedia.org/T129553#2108998 (10Cmjohnson) @ArielGlenn Please let me know which netboot.cfg you would prefer or if there is any h/w raid you want.
[20:22:51] 6Operations, 10MediaWiki-JobQueue: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2108999 (10mmodell) {F3605180} [20:23:35] :) [20:24:13] aude: do you just need me to deploy https://gerrit.wikimedia.org/r/#/c/276461/ ? [20:24:34] wait that's merged [20:24:40] so it's good to go with the train? [20:26:08] RECOVERY - MariaDB Slave Lag: m3 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 86015.28 seconds [20:29:05] ok I'm going ahead with the train. aude: let me know if you need me to deploy a patch after. [20:29:42] twentyafterfour: job queue is exploding isn't it ? [20:31:10] !log Ran <> [20:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:31:27] twentyafterfour: it's merged? [20:31:36] then ok [20:31:39] bawolff: :) [20:31:55] woo :) [20:32:15] (job queue no longer an issue bah ignore me) [20:32:16] Thanks [20:32:22] godog: does swift-repl handle container headers? [20:32:26] bblack: LGTM, codfw caches are fetching locally for originals and going to eqiad for thumbs, eqiad caches are not talking to ms-fe at all cc AaronSchulz [20:32:36] !log replacing failed disk ms1001 [20:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:33:25] AaronSchulz: it is supposed to yeah, I noticed some 401s on timeline before though [20:33:36] bawolff: some cache pollution though, adding fake params gives the right result though [20:33:38] godog: ok awesome [20:33:40] 6Operations, 10MediaWiki-JobQueue: Job queue is growing and growing - https://phabricator.wikimedia.org/T129517#2109042 (10hashar) p:5Unbreak!>3High ``` [20:13:22] ori: do you want to block the train or no? [20:13:25] no ``` Solved, summary to come up later. [20:33:52] Hopefully my test case would be the only thing in cache [20:35:11] 6Operations, 10ops-eqiad: ms1001 bad disk - https://phabricator.wikimedia.org/T129008#2092369 (10Cmjohnson) Replaced the disk, it's in rebuild now. 
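bawolff's cache-busting trick above ("adding fake params gives the right result") works because a throwaway query parameter changes the cache key, so the request bypasses the polluted cached copy. A minimal standard-library sketch of the idea (the parameter name is arbitrary, and whether a cache honors it depends on its keying rules):

```python
# Append a dummy query parameter so intermediary caches treat the URL
# as a new object and fetch a fresh copy past the polluted entry.
from urllib.parse import urlencode, urlsplit, urlunsplit

def bust_cache(url, param="nocache", value="1"):
    """Return `url` with an extra throwaway query parameter appended."""
    parts = urlsplit(url)
    extra = urlencode({param: value})
    query = parts.query + "&" + extra if parts.query else extra
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       query, parts.fragment))

print(bust_cache("https://upload.wikimedia.org/thumb/a/ab/Example.jpg"))
```

The flip side, as noted in the chat, is that each fake-param variant is its own cache entry, which is the "cache pollution" being tolerated while testing.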
[20:35:42] 6Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 13Patch-For-Review: Make CI run Varnish VCL tests - https://phabricator.wikimedia.org/T128188#2109049 (10BBlack) As long as they're jessie machines that pull from the wikimedia package repos, it'll be the right version. [20:35:55] 6Operations, 10ops-eqiad, 6Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2109051 (10Cmjohnson) @chasemp any updates on this disk? [20:36:54] 6Operations, 10ops-eqiad: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488#2109052 (10Cmjohnson) p:5Triage>3Low Changing Priority for now....@jcrespo update whenever we can make this happen [20:40:12] (03PS1) 10Ottomata: Run reportupdater on stat1002 as the hdfs user [puppet] - 10https://gerrit.wikimedia.org/r/276558 (https://phabricator.wikimedia.org/T129551) [20:43:36] 7Puppet, 10MediaWiki-Vagrant, 7Easy, 13Patch-For-Review: MediaWiki-Vagrant guest OS clock gets out of sync - https://phabricator.wikimedia.org/T116507#2109076 (10bd808) @Tgr can we call this one resolved? [20:44:15] (03PS2) 10Ottomata: Run reportupdater on stat1002 as the hdfs user [puppet] - 10https://gerrit.wikimedia.org/r/276558 (https://phabricator.wikimedia.org/T129551) [20:44:56] aude: merged into wmf.16 [20:45:02] which I'm about to deploy [20:45:18] 6Operations, 10ops-eqiad: ms1001 bad disk - https://phabricator.wikimedia.org/T129008#2109082 (10ArielGlenn) Yay it was hot swappable, the best! [20:46:45] 6Operations, 10Continuous-Integration-Infrastructure, 10Traffic, 13Patch-For-Review: Make CI run Varnish VCL tests - https://phabricator.wikimedia.org/T128188#2109087 (10hashar) The Nodepool instances are indeed Jessie and reuse apt configuration from operations/puppet.git. I can confirm 3.x is installed... 
[20:46:56] (03CR) 10Ottomata: [C: 032] Run reportupdater on stat1002 as the hdfs user [puppet] - 10https://gerrit.wikimedia.org/r/276558 (https://phabricator.wikimedia.org/T129551) (owner: 10Ottomata) [20:47:19] 6Operations, 10ops-eqiad: Rack and setup snapshot1005-1007 - https://phabricator.wikimedia.org/T129553#2109089 (10ArielGlenn) What is the disk setup like for these again? [20:50:51] 6Operations, 10Wikimedia-General-or-Unknown, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Switchover of the application servers to codfw - https://phabricator.wikimedia.org/T124671#2109095 (10Nemo_bis) [20:51:58] (03PS3) 10Yuvipanda: k8s: Pin a version of docker [puppet] - 10https://gerrit.wikimedia.org/r/276250 [20:52:08] (03CR) 10Yuvipanda: [V: 032] k8s: Pin a version of docker [puppet] - 10https://gerrit.wikimedia.org/r/276250 (owner: 10Yuvipanda) [20:56:53] (03PS1) 1020after4: all wikis to 1.27.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276560 [20:57:03] 6Operations, 10ops-codfw: labstore2003-labstore2004 onsite setup taks - https://phabricator.wikimedia.org/T128764#2109118 (10chasemp) @papaul any time for this? [20:57:49] (03CR) 1020after4: [C: 032] all wikis to 1.27.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276560 (owner: 1020after4) [20:58:25] (03Merged) 10jenkins-bot: all wikis to 1.27.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276560 (owner: 1020after4) [21:03:29] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.27.0-wmf.16 [21:03:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:08:50] greg-g: no apparent change in the redis/jobqueue error rate. [21:13:08] twentyafterfour: not deployed yet? [21:16:28] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[21:18:40] aude: I deployed [21:19:00] ^ that seems like a false alarm I don't see anything unmerged in mira? also mira isn't the active deploy server? [21:19:37] PROBLEM - Apache HTTP on mw1247 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.018 second response time [21:19:47] uhm [21:20:19] PROBLEM - HHVM rendering on mw1247 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.007 second response time [21:20:19] https://www.wikidata.org/w/extensions/Wikidata/extensions/Wikibase/view/resources/jquery/wikibase/jquery.wikibase.sitelinklistview.js doesn't look updated [21:20:28] did the submodule get updated? [21:20:36] * aude trying to bust the cache [21:20:53] twentyafterfour: mira doesn't respond to ssh anymore [21:21:44] i don't see the patch on tin? (or are we using mira?) [21:21:49] not the core submodule update [21:22:09] aude: someone else merged the patch so maybe it didn't get updated [21:22:13] hmm [21:22:27] * aude needs to find power... back in 5 minutes [21:22:28] hashar: mira responds for me [21:22:58] reedy@mira:~$ [21:24:40] but mira isn't the active server, tin is [21:25:42] I guess my setup is borked somehow [21:26:03] I am off anyway. Have a good train ride! [21:26:48] RECOVERY - Apache HTTP on mw1247 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.051 second response time [21:27:02] I don't see Wikibase in extensions/ on wmf.16...wth [21:27:18] oh it's a sub-submodule [21:27:37] RECOVERY - HHVM rendering on mw1247 is OK: HTTP OK: HTTP/1.1 200 OK - 66630 bytes in 0.072 second response time [21:29:47] twentyafterfour: it's Wikidata [21:30:25] aude: https://gerrit.wikimedia.org/r/#/c/276461/ got merged into Wikibase not Wikidata, how does that change propagate up to the Wikidata repo?
[21:30:32] I don't see a .gitmodules file in Wikidata [21:31:08] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/276473/ [21:31:36] this is the result of our build, which includes wikibase [21:31:45] and all the composer things and such [21:32:00] ah! That patch I wasn't aware of ... it didn't link back to the phabricator task [21:32:07] :/ [21:32:28] probably if that is merged, then the core submodule update is automatic [21:32:57] aude: yeah I'm merging that and I'll deploy it shortly [21:34:37] ok [21:34:39] thanks [21:34:54] jenkins will probably take a little while :( [21:39:26] twentyafterfour: did the morning swat changes eventually get deployed? I think anomie had one in there to squelch some log warnings. [21:39:49] bd808: no [21:40:01] I just did aude's change ... about to sync it [21:40:16] 6Operations, 10ops-eqiad, 6Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2109414 (10chasemp) >>! In T126946#2109051, @Cmjohnson wrote: > @chasemp any updates on this disk? I didn't have this on my radar Is the status: >>! In T126946#2035452, @Cmjohnson wrote: > Problem I a...
[21:40:34] I suppose we can get anomie's backports into evening swat [21:41:14] I can do more patches right now if there are any that need to be prioritized [21:41:33] twentyafterfour: https://gerrit.wikimedia.org/r/#/c/276467/ [21:41:48] that is, after scap is finished being a slow fat pig, taking its time to start sync-dir as usual [21:41:53] it's not a huge deal, just warning logs [21:42:13] heh, bd808: the logs are overwhelmed by jobrunner errors still ;) [21:42:36] !log twentyafterfour@tin Synchronized php-1.27.0-wmf.16/extensions/Wikidata: Deploy https://gerrit.wikimedia.org/r/#/c/276473/ (duration: 02m 10s) [21:42:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:42:50] checking [21:42:54] the sync-dir slowness is probably php lint for that one [21:43:39] looks good [21:43:40] bd808: it's the same long delay at the beginning, before it even tries to do anything [21:43:45] thanks :) [21:43:47] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [21:43:51] aude: no prob, thanks for testing [21:44:00] twentyafterfour: that's when lint runs locally for a sync-dir [21:44:08] hmm... [21:44:21] the cross master sync slowness got fixed [21:44:40] I mean it's not instant but it's not horrible now [21:44:51] sync-dir has been super slow since forever...
[21:44:57] (03PS1) 10Madhuvishy: eventlogging: Remove server-side udp to kafka forwarder [puppet] - 10https://gerrit.wikimedia.org/r/276615 [21:45:01] php-lint [21:45:12] sync-dir is mostly gross [21:45:25] (03PS3) 10Dzahn: remove VMs cygnus and technetium [puppet] - 10https://gerrit.wikimedia.org/r/275871 (https://phabricator.wikimedia.org/T118763) [21:45:41] deploying https://gerrit.wikimedia.org/r/#/c/276467/ [21:46:09] (03CR) 10Dzahn: [C: 032] "the users of these VMs said they are done with their tests" [puppet] - 10https://gerrit.wikimedia.org/r/275871 (https://phabricator.wikimedia.org/T118763) (owner: 10Dzahn) [21:46:12] once gate and submit finishes. [21:46:40] twentyafterfour: see https://phabricator.wikimedia.org/diffusion/MSCA/browse/master/scap/main.py;6f7d0e7d2c3c0b2d78dfee49b1cb85ac2950ea44$442 for the lint check that would be silent [21:46:43] and not fast [21:47:01] at least for a big dir like Wikidata [21:48:09] (03PS2) 10Dzahn: admin: remove keys of akumar,mnoushad [puppet] - 10https://gerrit.wikimedia.org/r/275873 (https://phabricator.wikimedia.org/T126012) [21:48:31] every file gets opened and read 2-3 times I think and `php -l` is not super speedy itself [21:48:36] (03CR) 10Dzahn: [C: 032] "the users said "consider this as a gentle reminder to disable the accounts that were created for this test"" [puppet] - 10https://gerrit.wikimedia.org/r/275873 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [21:50:14] bd808: yeah, indeed [21:50:26] that part needs a progress bar ;) [21:50:41] could be done with a bit of a rewrite [21:51:03] or even a start checkpoint in the log so you know it's doing _something_ [21:51:18] but all this is getting redone soon (scap3 yay!) 
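The sync-dir delay bd808 pins down above is a pre-sync lint pass over every file in the tree, slow for a big directory like Wikidata. A minimal sketch of that pattern (not scap's actual implementation; the checker is pluggable, with Python's own compiler standing in for `php -l` so the sketch stays self-contained and runnable):

```python
# Sketch of a pre-sync lint gate: run a syntax checker over every file
# and collect the failures; a non-empty list would abort the sync.
import py_compile
import tempfile
from pathlib import Path

def lint_tree(files, check):
    """Run `check` on each file; return the list of paths that failed."""
    failures = []
    for path in files:
        try:
            check(path)
        except Exception:  # the checker raises on a syntax error
            failures.append(path)
    return failures

# Demo with one valid and one broken file in a scratch directory.
workdir = Path(tempfile.mkdtemp())
good = workdir / "good.py"
bad = workdir / "bad.py"
good.write_text("x = 1\n")
bad.write_text("def broken(:\n")  # deliberate syntax error

failed = lint_tree([str(good), str(bad)],
                   lambda p: py_compile.compile(p, doraise=True))
print(failed)  # only the broken file
```

Since the whole pass runs before any progress output, a large tree looks like scap is hanging, which matches the "same long delay at the beginning" observation; emitting a start checkpoint (as suggested in the chat) is the cheap fix.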
so it can stay as-is for now [21:51:28] (03PS2) 10Dzahn: admin: set akumar, mnoushad to absent [puppet] - 10https://gerrit.wikimedia.org/r/275874 (https://phabricator.wikimedia.org/T126012) [21:53:03] (03CR) 10Dzahn: [C: 032] "[cygnus:~] $ sudo /usr/local/sbin/enforce-users-groups" [puppet] - 10https://gerrit.wikimedia.org/r/275874 (https://phabricator.wikimedia.org/T126012) (owner: 10Dzahn) [21:55:35] !log cygnus , killing akumar's processes [21:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:55:58] PROBLEM - puppet last run on mw1001 is CRITICAL: CRITICAL: puppet fail [21:58:37] 7Puppet, 6Commons, 10Wikimedia-SVG-rendering, 7I18n, 13Patch-For-Review: Add Gujarati fonts to Wikimedia servers - https://phabricator.wikimedia.org/T129500#2109555 (10Dereckson) [21:58:53] 6Operations, 10MobileFrontend, 10Traffic, 3Reading-Web-Sprint-67-If, Then, Else...?, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2063073 (10Redrose64) So... 
[21:59:07] (03PS1) 10RobH: certificate renewal for archiva.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/276619 [22:00:42] (03PS2) 10Madhuvishy: eventlogging: Remove server-side udp to kafka forwarder [puppet] - 10https://gerrit.wikimedia.org/r/276615 [22:02:18] (03PS2) 10RobH: certificate renewal for archiva.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/276619 [22:03:16] !log deployed patch for T129506 to wmf15 & 16 [22:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:04:34] (03CR) 10RobH: [C: 032] certificate renewal for archiva.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/276619 (owner: 10RobH) [22:05:59] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Puppet has 1 failures [22:07:28] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Puppet has 1 failures [22:07:38] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: Puppet has 1 failures [22:08:42] (03PS3) 10Madhuvishy: eventlogging: Remove server-side udp to kafka forwarder [puppet] - 10https://gerrit.wikimedia.org/r/276615 (https://phabricator.wikimedia.org/T129402) [22:12:10] RECOVERY - HTTPS on titanium is OK: SSL OK - Certificate archiva.wikimedia.org valid until 2017-05-08 15:16:02 +0000 (expires in 423 days) [22:18:36] 6Operations, 6Services, 3Mobile-Content-Service: Investigate server flapping after 3/7/2016 deploy - https://phabricator.wikimedia.org/T129237#2109689 (10bearND) The conclusion was that the node processes ran out of memory due to the previously mentioned patch. 
[22:19:37] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: Puppet has 1 failures [22:23:17] RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [22:27:59] 7Blocked-on-Operations, 6Operations, 6Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#2109725 (10Nirzar) 5Resolved>3Open [22:29:21] 7Blocked-on-Operations, 6Operations, 6Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#1949461 (10Nirzar) @akosiaris @Krenair I still don't have the LDAP access to production server. [22:39:58] (03PS1) 10ArielGlenn: add config entry for list of closed wikis to dumps [puppet] - 10https://gerrit.wikimedia.org/r/276632 [22:41:21] (03CR) 10ArielGlenn: [C: 032] add config entry for list of closed wikis to dumps [puppet] - 10https://gerrit.wikimedia.org/r/276632 (owner: 10ArielGlenn) [22:45:22] (03PS1) 10ArielGlenn: script to run a maintenance command on all wikis with varying output dirs [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/276633 [22:46:18] (03CR) 10ArielGlenn: "The plan is to merge wikiquery functionality with this, and use it for all the 'other' dumps that are 'do this on all wikis' but not as pa" [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/276633 (owner: 10ArielGlenn) [23:03:43] deploying https://gerrit.wikimedia.org/r/#/c/276467/ [23:03:52] (sorry I got sidetracked for a while) [23:06:48] !log twentyafterfour@tin Synchronized php-1.27.0-wmf.16/languages/LanguageConverter.php: deploying https://gerrit.wikimedia.org/r/#/c/276467/ (duration: 00m 31s) [23:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:05] anything else from morning swat that needs to get deployed before evening swat? 
[23:12:26] 7Blocked-on-Operations, 6Operations, 6Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#2110001 (10Krenair) 5Open>3Resolved you definitely appear to: ```krenair@bastion-01:~$ ldaplist -l group wmf | grep nirzar member: u... [23:14:03] 7Blocked-on-Operations, 6Operations, 6Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#2110005 (10Nirzar) @AlexMonk-WMF my LDAP is not "Nirzar" :( I don't know the password for this. my LDAP is NPangarkar (WMF) [23:16:56] 7Blocked-on-Operations, 6Operations, 6Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#2110029 (10Krenair) Please don't subscribe that account, it's not relevant here. ```krenair@bastion-01:~$ ldaplist -l passwd nirzar dn:... [23:19:03] 7Blocked-on-Operations, 6Operations, 6Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#2110033 (10Nirzar) @AlexMonk-WMF Okay, i will try all my passwords. Thank you for confirming this :) [23:20:07] 7Blocked-on-Operations, 6Operations, 6Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#2110036 (10Krenair) Please stop it. [23:25:16] 6Operations: /mnt/upload7 does not exist anywhere, yet it is referenced in multiple places in wmf-config - https://phabricator.wikimedia.org/T129586#2110079 (10Krenair) [23:25:37] 7Puppet, 10MediaWiki-Vagrant, 7Easy, 13Patch-For-Review: MediaWiki-Vagrant guest OS clock gets out of sync - https://phabricator.wikimedia.org/T116507#2110091 (10Tgr) 5Open>3Resolved a:3Tgr Haven't seen that error since the patch was merged. 
[23:37:43] (03PS1) 10Ori.livneh: job{cron,runner}: don't attempt JSON validation of config file [puppet] - 10https://gerrit.wikimedia.org/r/276653 (https://phabricator.wikimedia.org/T129517) [23:38:14] (03PS2) 10Ori.livneh: job{cron,runner}: don't attempt JSON validation of config file [puppet] - 10https://gerrit.wikimedia.org/r/276653 (https://phabricator.wikimedia.org/T129517) [23:38:23] (03CR) 10Ori.livneh: [C: 032 V: 032] job{cron,runner}: don't attempt JSON validation of config file [puppet] - 10https://gerrit.wikimedia.org/r/276653 (https://phabricator.wikimedia.org/T129517) (owner: 10Ori.livneh) [23:41:48] PROBLEM - cassandra CQL 10.64.0.221:9042 on restbase1002 is CRITICAL: Connection refused [23:43:26] urandom: this is 1002 finishing decom? [23:44:33] gwicke: yeah, i'll ack it [23:45:49] ACKNOWLEDGEMENT - cassandra CQL 10.64.0.221:9042 on restbase1002 is CRITICAL: Connection refused eevans This host has been decommissioned. [23:45:50] (03PS1) 10Ori.livneh: Fix-up for Icb34377497: remove $? check, too [puppet] - 10https://gerrit.wikimedia.org/r/276655 [23:46:04] (03PS2) 10Ori.livneh: Fix-up for Icb34377497: remove $? check, too [puppet] - 10https://gerrit.wikimedia.org/r/276655 [23:46:12] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for Icb34377497: remove $? check, too [puppet] - 10https://gerrit.wikimedia.org/r/276655 (owner: 10Ori.livneh) [23:50:21] (03CR) 10Krinkle: [C: 031] X-Wikimedia-Debug: profile if 'profiler' attribute set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276220 (owner: 10Ori.livneh) [23:51:55] 6Operations, 6Editing-Department, 6Parsing-Team, 6Services: Services team goals April - June 2016 (Q4 2015/16) - https://phabricator.wikimedia.org/T118871#2110185 (10GWicke) For the service discovery goal, we would need to clarify a) which solution we are shooting for, and b) which part of the overall tas... 
[23:51:58] PROBLEM - salt-minion processes on mw1011 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:52:27] PROBLEM - salt-minion processes on mw1162 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:52:27] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: Puppet has 2 failures [23:52:27] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: Puppet has 2 failures [23:52:28] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 2 failures [23:52:28] PROBLEM - puppet last run on mw1005 is CRITICAL: CRITICAL: Puppet has 2 failures [23:52:28] PROBLEM - puppet last run on mw1016 is CRITICAL: CRITICAL: Puppet has 2 failures [23:52:38] PROBLEM - salt-minion processes on mw1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:52:38] PROBLEM - salt-minion processes on mw1163 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:52:38] PROBLEM - salt-minion processes on mw1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:52:39] PROBLEM - salt-minion processes on mw1164 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:52:47] PROBLEM - puppet last run on mw1015 is CRITICAL: CRITICAL: Puppet has 2 failures [23:52:47] PROBLEM - puppet last run on mw1003 is CRITICAL: CRITICAL: Puppet has 2 failures [23:52:48] PROBLEM - puppet last run on mw1163 is CRITICAL: CRITICAL: Puppet has 2 failures [23:52:48] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet has 2 failures [23:52:49] PROBLEM - salt-minion processes on mw1009 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:52:58] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 2 failures [23:52:58] PROBLEM - puppet last run on 
mw1168 is CRITICAL: CRITICAL: Puppet has 2 failures [23:53:07] PROBLEM - salt-minion processes on mw1166 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:08] PROBLEM - salt-minion processes on mw1165 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:08] PROBLEM - salt-minion processes on mw1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:09] PROBLEM - salt-minion processes on mw1016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:09] PROBLEM - puppet last run on mw1167 is CRITICAL: CRITICAL: Puppet has 2 failures [23:53:09] PROBLEM - salt-minion processes on mw1168 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:09] PROBLEM - salt-minion processes on mw1012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:17] PROBLEM - salt-minion processes on mw1010 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:18] PROBLEM - salt-minion processes on mw1167 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:19] PROBLEM - puppet last run on mw1001 is CRITICAL: CRITICAL: Puppet has 2 failures [23:53:27] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: Puppet has 2 failures [23:53:28] !log Starting Cassandra cleanup op on restbase10{07,10,11}-{a,b}.eqiad.wmnet : T125832 [23:53:28] PROBLEM - puppet last run on mw1006 is CRITICAL: CRITICAL: Puppet has 2 failures [23:53:28] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 2 failures [23:53:29] T125832: Exception: No config value for key 'sprint.phragile-uri' trying to access sprint workboard - https://phabricator.wikimedia.org/T125832 [23:53:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:53:37] PROBLEM - salt-minion processes on mw1161 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:42] wrong issue... [23:53:47] PROBLEM - puppet last run on mw1010 is CRITICAL: CRITICAL: Puppet has 2 failures [23:53:47] PROBLEM - salt-minion processes on mw1014 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:47] PROBLEM - salt-minion processes on mw1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:47] !log Starting Cassandra cleanup op on restbase10{07,10,11}-{a,b}.eqiad.wmnet : T125842 [23:53:48] T125842: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842 [23:53:49] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: Puppet has 2 failures [23:53:49] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 2 failures [23:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:53:58] PROBLEM - puppet last run on mw1013 is CRITICAL: CRITICAL: Puppet has 2 failures [23:53:58] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet has 2 failures [23:53:58] PROBLEM - salt-minion processes on mw1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:58] PROBLEM - salt-minion processes on mw1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:59] PROBLEM - salt-minion processes on mw1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:59] PROBLEM - salt-minion processes on mw1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [23:53:59] PROBLEM - salt-minion processes on mw1002 is CRITICAL: PROCS CRITICAL: 0 processes with 
regex args ^/usr/bin/python /usr/bin/salt-minion [23:54:29] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 2 failures [23:54:39] RECOVERY - salt-minion processes on mw1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:54:48] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: Puppet has 2 failures [23:55:48] RECOVERY - salt-minion processes on mw1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:56:55] greg-g: Are we OK to do this morning's SWAT items yet? [23:57:11] (I.e., can I bung them in the afternoon SWAT?) [23:57:48] (03PS1) 10Ori.livneh: Ensure job{chron/runner} pre-start clause has exit code 0 on success [puppet] - 10https://gerrit.wikimedia.org/r/276658 [23:57:55] mw100* are me, I'll ack [23:58:07] (03PS2) 10Ori.livneh: Ensure job{chron/runner} pre-start clause has exit code 0 on success [puppet] - 10https://gerrit.wikimedia.org/r/276658 [23:58:11] (03CR) 10jenkins-bot: [V: 04-1] Ensure job{chron/runner} pre-start clause has exit code 0 on success [puppet] - 10https://gerrit.wikimedia.org/r/276658 (owner: 10Ori.livneh) [23:58:14] (03CR) 10Ori.livneh: [C: 032 V: 032] Ensure job{chron/runner} pre-start clause has exit code 0 on success [puppet] - 10https://gerrit.wikimedia.org/r/276658 (owner: 10Ori.livneh) [23:59:27] RECOVERY - salt-minion processes on mw1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion