[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T0000). Please do the needful.
[00:00:05] ebernhardson, bawolff, and James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:09] \o
[00:00:17] * James_F waves.
[00:00:22] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[00:00:49] hi
[00:02:59] Operations, Wikimedia-Apache-configuration: Apache slash expansion should not redirect from HTTPS to HTTP - https://phabricator.wikimedia.org/T95164#2801154 (Dzahn) a:Dzahn `monospaced text`
[00:03:11] (CR) Gergő Tisza: Log users elevated groups on login attempts (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/321938 (owner: Reedy)
[00:03:11] Hello, I can SWAT this evening.
[00:03:41] Operations, Continuous-Integration-Infrastructure, Wikimedia-Apache-configuration: Apache slash expansion should not redirect from HTTPS to HTTP - https://phabricator.wikimedia.org/T95164#2801161 (Dzahn)
[00:03:52] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:03:55] (CR) Krinkle: [C: 1] contint: move .htaccess content for doc/integration to puppet [puppet] - https://gerrit.wikimedia.org/r/322019 (https://phabricator.wikimedia.org/T150727) (owner: Dzahn)
[00:07:38] (PS2) Dereckson: Ban 100 most common passwords from ordinary accounts. [mediawiki-config] - https://gerrit.wikimedia.org/r/321991 (owner: Brian Wolff)
[00:09:05] bawolff_: there was a question about that on fr.wikipedia village pump, the list is cluster wide currently, right? Would it be possible in the future to build several lists by language?
[00:09:28] In theory, but we're not going to do that today :)
[00:09:40] well actually we want the same list for everyone
[00:09:47] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/321991 (owner: Brian Wolff)
[00:09:59] but it might be reasonable to add more i18nized version (current list is quite english specific)
[00:10:13] We want same list for everyone because you can log in at any wiki
[00:10:25] (Merged) jenkins-bot: Ban 100 most common passwords from ordinary accounts. [mediawiki-config] - https://gerrit.wikimedia.org/r/321991 (owner: Brian Wolff)
[00:11:23] matt_flaschen: ping?
[00:12:09] matt_flaschen: you've an undeployed commit merged in mediawiki-config: 0ae483bc73c016474eea2f1de25c48fae6007165 Add German Wiktionary in beta (2nd try)
[00:13:45] Dereckson, oh, sorry, I'll deploy it.
[00:14:08] ok thanks bd808, ill let you know when i finish it so that way you all can test it and stuff
[00:14:22] * ebernhardson thought there was some icinga thing that complained about undeployed mediawiki-config stuff
[00:14:32] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:16:10] (PS12) 20after4: Enable multiple config files in phabricator [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055)
[00:17:02] !log mattflaschen@tin Synchronized dblists/all-labs.dblist: Beta Cluster only (duration: 00m 54s)
[00:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:01] !log mattflaschen@tin Synchronized wikiversions-labs.json: Beta Cluster only (duration: 00m 53s)
[00:18:05] (CR) jenkins-bot: [V: -1] Enable multiple config files in phabricator [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[00:18:17] Dereckson, done. Sorry about that.
[00:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:18:56] bd808 one last thing what do i use for projects (in phab) for jouncebot
[00:19:51] (PS13) 20after4: Enable multiple config files in phabricator [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055)
[00:20:15] ebernhardson, can you take a look at https://gerrit.wikimedia.org/r/322022 . Pretty sure it's right.
[00:20:30] I checked it against SearchConfig.
[00:20:41] matt_flaschen: sure
[00:20:47] Zppix: probably just #Tool-Labs-tools-Other
[00:20:52] The other part is just a missing global.
[00:21:29] bd808 i also tagged operations okay?
[00:21:50] Zppix: ops doesn't care :) releng would
[00:21:57] Operations, Tool-Labs-tools-Other: Jouncebot: Add functionality to change Nick from Jouncebot_ to Jouncebot automatically - https://phabricator.wikimedia.org/T150916#2801201 (Zppix)
[00:22:31] ok done removed ops
[00:23:15] Thanks
[00:23:50] matt_flaschen no problem i just would have thought ops would want to know about this :P
[00:23:52] matt_flaschen: thanks
[00:23:58] oh nvm
[00:23:59] :P
[00:24:15] bawolff_: Ban 100 most common passwords from ordinary accounts. live on mw1099
[00:24:29] ok
[00:24:39] (CR) Dzahn: [C: 1] "looks all good to me, the IP's match the names" [puppet] - https://gerrit.wikimedia.org/r/321935 (https://phabricator.wikimedia.org/T150680) (owner: Filippo Giunchedi)
[00:25:19] ebernhardson: your dep is okay, we run on 1.29.0-wmf.2 and 1.29.0-wmf.3, both include 319485
[00:25:45] Dereckson: right, the code is all out, this test has been running a few days now (i should probably stop copy/pasting the whole thing ..)
[00:25:53] Dereckson: this is just increasing the % of traffic
[00:25:56] ok
[00:26:20] Yes, I wondered why new code for 25% → 50%
[00:26:33] (PS2) Dereckson: Increase cirrus interwiki loadtest to 50% [mediawiki-config] - https://gerrit.wikimedia.org/r/321925 (https://phabricator.wikimedia.org/T149740) (owner: EBernhardson)
[00:26:41] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/321925 (https://phabricator.wikimedia.org/T149740) (owner: EBernhardson)
[00:26:53] well, there could be new code if 25% found a problem, we fixed it but wanted it out before 50% :) but that's not the case here
[00:27:26] ugh this won't let me test because apparently my ip is banned
[00:27:41] bawolff_: use the office vpn?
[00:27:44] going to different wiki
[00:27:51] (Merged) jenkins-bot: Increase cirrus interwiki loadtest to 50% [mediawiki-config] - https://gerrit.wikimedia.org/r/321925 (https://phabricator.wikimedia.org/T149740) (owner: EBernhardson)
[00:27:59] (CR) Arseny1992: Log users elevated groups on login attempts (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/321938 (owner: Reedy)
[00:28:20] urandom: do you want to provision a new restbase server?
[00:29:20] Dereckson: confirmed, it works
[00:29:24] ebernhardson: 25% → 50% live on mw1099
[00:29:36] hey bd808 i dont have access to clone the repo or something i get auth errored
[00:30:17] Zppix: its just a normal gerrit repo.
anyone should be able to clone it
[00:30:20] (CR) Dzahn: "ok cool, you should +1 it yourself to indicate when beta testing is done" [puppet] - https://gerrit.wikimedia.org/r/319892 (https://phabricator.wikimedia.org/T150029) (owner: Reedy)
[00:31:05] Dereckson: all looks happy
[00:31:09] !log dereckson@tin Synchronized wmf-config/CommonSettings.php: Ban 100 most common passwords from ordinary accounts (duration: 00m 49s)
[00:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:08] Dereckson: However, frwiki people presumably still use passwords like 123456, so the english list is relevant to them too
[00:32:30] (PS9) Dzahn: Gerrit: Enable concurrent collector [puppet] - https://gerrit.wikimedia.org/r/316983 (https://phabricator.wikimedia.org/T148478) (owner: Paladox)
[00:32:35] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[00:33:18] bawolff_: and these lists contains generally some azerty specific keyword play too
[00:34:32] let me try it again
[00:34:34] @ bd808
[00:35:10] bawolff_: I wouldn't worry about testing it :P
[00:35:15] !log dereckson@tin Synchronized wmf-config/CirrusSearch-production.php: Increase cirrus interwiki loadtest to 50% (T149740) (duration: 00m 48s)
[00:35:19] (PS2) Dereckson: Beta Features: Update whitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/321992 (owner: Jforrester)
[00:35:33] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/321992 (owner: Jforrester)
[00:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:39] T149740: Run load tests of cross-project searching to verify its stability - https://phabricator.wikimedia.org/T149740
[00:36:20] (Merged) jenkins-bot: Beta Features: Update whitelist [mediawiki-config] - https://gerrit.wikimedia.org/r/321992 (owner: Jforrester)
[00:37:25] Dereckson: thanks, seems
sane
[00:37:58] !log depooling cp3039 for hw/bios work
[00:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:19] !log depooling cp3039 for hw/bios work - T150879
[00:38:33] mutante: ??
[00:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:38:37] T150879: Cannot connect to cp3039.mgmt.esams.wmnet:22 - https://phabricator.wikimedia.org/T150879
[00:39:48] urandom: i just saw https://gerrit.wikimedia.org/r/#/c/321935/ and it made me think you might want to be there when it gets merged
[00:40:25] no luck bd808 i even restarted ssh-agent
[00:40:35] mutante: oh, cool
[00:40:57] mutante: yeah, sure; doesn't hurt to make sure everything is OK
[00:41:17] (PS2) Dzahn: Provision restbase201[012], add restbase2010-a [puppet] - https://gerrit.wikimedia.org/r/321935 (https://phabricator.wikimedia.org/T150680) (owner: Filippo Giunchedi)
[00:41:29] (CR) Dzahn: [C: 2] Provision restbase201[012], add restbase2010-a [puppet] - https://gerrit.wikimedia.org/r/321935 (https://phabricator.wikimedia.org/T150680) (owner: Filippo Giunchedi)
[00:41:32] mutante: umm
[00:41:38] mutante: but i don't seem to have access
[00:41:52] it's prompting me for a password
[00:42:05] urandom: the change itself should create your access
[00:42:08] true
[00:42:18] because only there we are adding it to site.pp and the role
[00:42:18] auh
[00:42:23] which adds the admin users
[00:42:25] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[00:42:41] mutante: makes sense
[00:42:42] (PS1) Filippo Giunchedi: prometheus: discard varnish backend UUID during collection [puppet] - https://gerrit.wikimedia.org/r/322025 (https://phabricator.wikimedia.org/T150479)
[00:43:03] I'm around too btw mutante urandom
[00:43:33] our temporary canadian
[00:43:36] :)
[00:44:05] James_F: for the beta features changes, on Phabricator, when we browse commits, it has
good cross references. Here, it would have been interesting to note the commit removing cirrussearch-completionsuggester beta from the CirrusSearch code base
[00:44:10] * godog dresses up as RCMP
[00:44:37] James_F: so there would have been a line in this commit to advertise about the config change
[00:44:46] Dereckson: Sure, and it would have been nice for Discovery when they promoted it to fix it then, not have me clean up after them a couple of months later. :-)
[00:44:47] (this commit has been quoted in )
[00:44:53] Also.
[00:45:14] (CR) Filippo Giunchedi: "This isn't 100% related to the collection issues we've seen but will help with avoiding metric churn." [puppet] - https://gerrit.wikimedia.org/r/322025 (https://phabricator.wikimedia.org/T150479) (owner: Filippo Giunchedi)
[00:45:35] urandom: though we could split things and give you access first
[00:45:58] James_F: live on mw1099
[00:46:10] also prepare for a shower of ipsec alerts about cp3039
[00:46:15] or not
[00:46:18] * James_F tests.
[00:46:39] Dereckson: Both or just the config one?
[00:47:01] just the config one, but I'm pushing the other in a few instants
[00:47:07] Config change LGTM.
[00:48:12] godog: urandom: oops , 2010 is not included in restbase20[01][1-9]
[00:48:19] changing that
[00:48:23] (PS14) 20after4: Enable multiple config files in phabricator [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055)
[00:48:28] !log repool cp3039 - T150879
[00:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:51] T150879: Cannot connect to cp3039.mgmt.esams.wmnet:22 - https://phabricator.wikimedia.org/T150879
[00:48:54] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Beta Features: Update whitelist ([[Gerrit:321992]) (duration: 00m 49s)
[00:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:38] godog: urandom: for 2011 we need to add missing secret "restbase2011.kst" in private repo it looks
[00:50:05] (CR) 20after4: Enable multiple config files in phabricator (1 comment) [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[00:50:07] James_F: VE patch live on mw1099
[00:50:25] (CR) jenkins-bot: [V: -1] prometheus: discard varnish backend UUID during collection [puppet] - https://gerrit.wikimedia.org/r/322025 (https://phabricator.wikimedia.org/T150479) (owner: Filippo Giunchedi)
[00:50:29] * James_F nods.
[00:50:39] mutante: indeed, I'll create the secrets now, thanks!
[00:50:53] Dereckson: Yup, works well.
[00:51:05] PROBLEM - puppet last run on restbase2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:51:11] godog: :)
[00:52:46] James_F: syncing
[00:53:05] ACKNOWLEDGEMENT - puppet last run on restbase2011 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues daniel_zahn just being setup
[00:53:16] (CR) 20after4: "https://puppet-compiler.wmflabs.org/4599/" [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[00:53:24] !log dereckson@tin Synchronized php-1.29.0-wmf.3/extensions/VisualEditor/lib/ve/src/ui/: Make $returnFocusTo a no-op in WindowManager (T150556) (duration: 00m 49s)
[00:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:53:46] T150556: [Regression pre-wmf.3] Cursor jumps to the beginning of the article after adding Citation/Media/Template/Formula/Graph - https://phabricator.wikimedia.org/T150556
[00:55:02] (CR) Mattflaschen: "It's set up now. There were a couple false starts due to bugs with addWiki.php." [puppet] - https://gerrit.wikimedia.org/r/321817 (https://phabricator.wikimedia.org/T150764) (owner: Mattflaschen)
[00:55:07] Dereckson: Yay, prod is fixed. Thank you!
[00:55:09] (PS1) Dzahn: site/restbase: include restbase2010 in node regex [puppet] - https://gerrit.wikimedia.org/r/322029
[00:55:38] (PS2) Dzahn: site/restbase: include restbase2010 in node regex [puppet] - https://gerrit.wikimedia.org/r/322029 (https://phabricator.wikimedia.org/T150680)
[00:56:00] You're welcome.
[00:56:38] (CR) Dzahn: [C: 2] site/restbase: include restbase2010 in node regex [puppet] - https://gerrit.wikimedia.org/r/322029 (https://phabricator.wikimedia.org/T150680) (owner: Dzahn)
[00:57:07] (CR) Ppchelko: [C: 1] "All good then."
[puppet] - https://gerrit.wikimedia.org/r/321817 (https://phabricator.wikimedia.org/T150764) (owner: Mattflaschen)
[00:57:53] (PS3) Mattflaschen: Add dewiktionary to RESTBase on Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/321817 (https://phabricator.wikimedia.org/T150764)
[00:57:53] Operations, ops-esams, Traffic: Cannot connect to cp3039.mgmt.esams.wmnet:22 - https://phabricator.wikimedia.org/T150879#2801275 (BBlack) The dmesg from the very-delayed bootup is interesting. Normally we get through the bulk of the bootup messages in under 20s, but the process took closer to 3-4 mi...
[01:00:57] Pchelolo, I saw you +1'ed. Do you want to re-review now that it's rebased, or should I +2, or what?
[01:02:32] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T0100). Please do the needful.
[01:02:32] i'll use the phabricator update window
[01:02:32] to also rename iridium-vcs
[01:02:34] (PS2) Dzahn: rename iridium-vcs to phab1001-vcs [dns] - https://gerrit.wikimedia.org/r/317290 (https://phabricator.wikimedia.org/T143363)
[01:02:41] urandom: godog: restbase2010 is applying the role right now
[01:03:15] matt_flaschen: I don't have +2 on puppet, but actually I've just noticed that the patch is wrong (it's been wrong before for wiktionaries in beta) Lemme upload a new version
[01:03:16] Operations, ops-esams, Traffic: Cannot connect to cp3039.mgmt.esams.wmnet:22 - https://phabricator.wikimedia.org/T150879#2801276 (BBlack) Open>Resolved a:BBlack In any case, this evening I started with connecting to the serial console (all normal, responds to enter keypress with a...
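A minimal sketch of why the node regex quoted at 00:48:12 skips restbase2010. The pattern `restbase20[01][1-9]` is the shorthand typed in the channel; the actual regex in site.pp and the merged fix (Gerrit 322029) are not shown in this log, so `NEW_PATTERN` below is a hypothetical replacement for illustration only, and the `^`/`$` anchors are an assumption.

```python
import re

# Shorthand pattern as quoted in the channel; anchors added for the demo (assumption).
OLD_PATTERN = re.compile(r"^restbase20[01][1-9]$")

# Hypothetical fix, NOT taken from the real patch: accept 2001-2009 and 2010-2012.
NEW_PATTERN = re.compile(r"^restbase20(0[1-9]|1[0-2])$")


def matches(pattern, host):
    """Return True when the whole hostname matches the node pattern."""
    return pattern.match(host) is not None


for host in ("restbase2009", "restbase2010", "restbase2011", "restbase2012"):
    # The final character class [1-9] rejects any hostname ending in 0.
    print(host, "old:", matches(OLD_PATTERN, host), "new:", matches(NEW_PATTERN, host))
```

The last character class is what excludes the host: `[1-9]` cannot match the trailing `0` of restbase2010, while restbase2011 and restbase2012 pass.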
[01:04:02] (CR) Dzahn: [C: 2] rename iridium-vcs to phab1001-vcs [dns] - https://gerrit.wikimedia.org/r/317290 (https://phabricator.wikimedia.org/T143363) (owner: Dzahn)
[01:04:12] mutante: kk
[01:04:23] and boom goes the login
[01:04:34] yep, just created :)
[01:04:38] nice, thanks mutante !
[01:05:01] java.io.FileNotFoundException: /etc/cassandra-a/tls/server.key (No such file or directory)
[01:05:09] twentyafterfour: renamed vcs name
[01:05:09] (About the +2)
[01:05:43] godog: fwiw, cassandra-streams is in cassandra-tools-wmf (which i just installed)
[01:05:50] (CR) Dzahn: "[radon:~] $ host phab1001-vcs.eqiad.wmnet" [dns] - https://gerrit.wikimedia.org/r/317290 (https://phabricator.wikimedia.org/T143363) (owner: Dzahn)
[01:06:13] Operations, Phabricator, Patch-For-Review: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2801280 (Dzahn) iridium-vcs renamed: [radon:~] $ host phab1001-vcs.eqiad.wmnet phab1001-vcs.eqiad.wmnet has address 10.64.32.186 phab1001-vcs.eqiad.wmnet has IPv6...
[01:06:15] (PS4) Ppchelko: Add dewiktionary to RESTBase on Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/321817 (https://phabricator.wikimedia.org/T150764) (owner: Mattflaschen)
[01:06:29] (PS2) Filippo Giunchedi: prometheus: discard varnish backend UUID during collection [puppet] - https://gerrit.wikimedia.org/r/322025 (https://phabricator.wikimedia.org/T150479)
[01:06:52] urandom: ah! worth adding the dependency in puppet now
[01:06:59] matt_flaschen: Uploaded https://gerrit.wikimedia.org/r/#/c/321817/4/ - we've actually been using wrong project there.
[01:07:05] (CR) Ppchelko: [C: 1] Add dewiktionary to RESTBase on Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/321817 (https://phabricator.wikimedia.org/T150764) (owner: Mattflaschen)
[01:07:12] PROBLEM - puppet last run on restbase2010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/etc/cassandra-instances.d]
[01:07:13] ya
[01:07:44] Operations, Collection, OfflineContentGenerator, Reading-Community-Engagement, and 2 others: Remove deprecated features from book creator UI - https://phabricator.wikimedia.org/T150917#2801284 (JKatzWMF)
[01:08:56] !log renamed iridium-vcs.eqiad to phab1001-vcs.eqiad (phabricator ssh)
[01:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:10:24] mutante, godog: uh oh
[01:11:17] mutante: so... i thought puppet would setup scap, including TWCS, but... nope
[01:12:14] RECOVERY - puppet last run on restbase2010 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[01:12:40] oh, i don't know that
[01:12:46] but recovery looks good
[01:12:58] no, it's down atm
[01:14:53] ACKNOWLEDGEMENT - puppet last run on restbase2012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues daniel_zahn just got added
[01:15:05] urandom: mh I see the directory structure in place and the jar too on 2010
[01:15:08] (PS1) Dereckson: Add Wikitraits blog on fr.planet [puppet] - https://gerrit.wikimedia.org/r/322031
[01:15:12] yeah
[01:16:04] godog: i think it's OK now
[01:16:25] godog: it wasn't
[01:16:40] i did a scap deploy, but it didn't
[01:16:44] grrr
[01:16:49] i did a scap deploy, but it didn't look like it worked
[01:17:01] either it did, or something kicked-off/finished in the meantime
[01:17:13] (PS15) Dzahn: Enable multiple config files in phabricator [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[01:18:00] (PS1) Mattflaschen: Fix Wiktionary typo [mediawiki-config] - https://gerrit.wikimedia.org/r/322033
[01:19:08] (CR) Dzahn: [C: 2] "comments have been addressed, compiler run looks good" [puppet] - https://gerrit.wikimedia.org/r/321654 (https://phabricator.wikimedia.org/T146055) (owner:
20after4)
[01:19:27] (CR) Mattflaschen: [C: 2] "Obvious typo, only Beta Cluster." [mediawiki-config] - https://gerrit.wikimedia.org/r/322033 (owner: Mattflaschen)
[01:20:01] (Merged) jenkins-bot: Fix Wiktionary typo [mediawiki-config] - https://gerrit.wikimedia.org/r/322033 (owner: Mattflaschen)
[01:21:09] Operations, Deployment-Systems, Performance-Team, Release-Engineering-Team, HHVM: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#1401646 (Krinkle) >>! In T103886#2125417, @ori wrote: > https://github.com/facebook/hhvm/issue...
[01:21:51] urandom: *nod*
[01:23:05] !log temp disable puppet on iridium (maintenance)
[01:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:25] !log scheduled downtime for iridium and services (phab)
[01:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:25:12] PROBLEM - puppet last run on mc1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:25:21] phab downtime starting now
[01:25:23] (PS3) Filippo Giunchedi: prometheus: discard varnish backend UUID during collection [puppet] - https://gerrit.wikimedia.org/r/322025 (https://phabricator.wikimedia.org/T150479)
[01:25:28] should be back in just a moment
[01:25:48] damnit i was in middle of editing a task :P
[01:25:52] oh well
[01:26:00] (CR) Dzahn: "tumblr redirects https to http . tssk tss" [puppet] - https://gerrit.wikimedia.org/r/322031 (owner: Dereckson)
[01:26:08] phabricator down?
:S
[01:26:15] Maintenance (see SAL)
[01:26:34] Vulpix maintenance
[01:26:39] ok, thx
[01:26:41] yeah, I noticed that, a little puzzled not to see a lock
[01:27:02] (CR) Dzahn: [C: 2] Add Wikitraits blog on fr.planet [puppet] - https://gerrit.wikimedia.org/r/322031 (owner: Dereckson)
[01:27:07] (PS2) Dzahn: Add Wikitraits blog on fr.planet [puppet] - https://gerrit.wikimedia.org/r/322031 (owner: Dereckson)
[01:27:07] Krinkle perhaps a topic change is in order?
[01:27:23] Tumblr serves a lot of content and probably considers it 'mostly harmless'.
[01:27:29] or a mention in #wikimedia-dev :)
[01:27:29] Zppix it is routine maint, it is in the deployment page
[01:27:31] on wikitech
[01:27:36] phabricator is being updated
[01:27:51] and also they are deploying https://gerrit.wikimedia.org/r/321654 i presume
[01:28:10] jouncebot: now
[01:28:10] For the next 0 hour(s) and 31 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T0100)
[01:28:14] ^ :)
[01:28:28] who reads wikitech anyway?
:)
[01:28:29] paladox i dont even look there and i'm pretty sure not many others do either
[01:28:31] what paladox said, yes
[01:28:32] (CR) Filippo Giunchedi: [C: 2] prometheus: discard varnish backend UUID during collection [puppet] - https://gerrit.wikimedia.org/r/322025 (https://phabricator.wikimedia.org/T150479) (owner: Filippo Giunchedi)
[01:28:51] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.185, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[01:28:53] Oh yep sorry
[01:29:02] !log mattflaschen@tin Synchronized wmf-config/InitialiseSettings-labs.php: Beta Cluster only (duration: 00m 49s)
[01:29:09] (PS3) Dzahn: Add Wikitraits blog on fr.planet [puppet] - https://gerrit.wikimedia.org/r/322031 (owner: Dereckson)
[01:29:23] Vulpix jouncebot announced it earlier.
[01:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:34] Dereckson: yea, i noticed all tumblrs do that before.. i was trying to convert all to https..
[01:29:45] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T0100). Please do the needful.
[01:29:50] there is a ticket for the mixed protocols there
[01:30:10] A Troublesome Encounter!
[01:30:10] Woe! This request had its journey cut short by unexpected circumstances (Can Not Connect to MySQL).
[01:30:18] im getting that error on phab now
[01:30:24] ACKNOWLEDGEMENT - restbase endpoints health on restbase2010 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.185, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) daniel_zahn ongoing setup
[01:30:30] it works now
[01:30:38] anyway there is persistent chat now
[01:30:38] paladox: byproduct of the update I guess?
[01:30:44] whats going on with enwiki?
[01:30:46] phab has a new error page style?
[01:30:46] Yep, works now
[01:30:55] that was part of separating the credentials to login on the db
[01:31:01] who reads wikitech anyway? :)
[01:31:01] hi
[01:31:09] mutante, godog: fwiw: 2010-a is now happily bootstrapping
[01:31:16] urandom: :)
[01:31:19] I like the new chat box, it is just like fb
[01:31:22] huh, other new styles
[01:31:24] Zppix: what do you mean?
[01:31:31] PROBLEM - Restbase root url on restbase2010 is CRITICAL: connect to address 10.192.16.185 and port 7231: Connection refused
[01:31:35] paladox, well that'll earn it all sorts of opinions around here
[01:31:42] ACKNOWLEDGEMENT - restbase endpoints health on restbase2010 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.192.16.185, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) @ twentyafterfour
[01:31:47] Oh
[01:32:01] PROBLEM - cassandra-a CQL 10.192.16.186:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.186 and port 9042: Connection refused
[01:32:04] ACKNOWLEDGEMENT - Restbase root url on restbase2010 is CRITICAL: connect to address 10.192.16.185 and port 7231: Connection refused daniel_zahn ongoing bootstrap
[01:32:04] ACKNOWLEDGEMENT - cassandra-a CQL 10.192.16.186:9042 on restbase2010 is CRITICAL: connect to address 10.192.16.186 and port 9042: Connection refused
daniel_zahn ongoing bootstrap
[01:32:09] Operations, Beta-Cluster-Infrastructure, Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2801337 (Krenair) task description line 4
[01:35:35] (CR) Filippo Giunchedi: "FTR this didn't work, the regexp gets "\" escaped:" [puppet] - https://gerrit.wikimedia.org/r/322025 (https://phabricator.wikimedia.org/T150479) (owner: Filippo Giunchedi)
[01:35:41] (CR) Mattflaschen: [C: 1] Add dewiktionary to RESTBase on Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/321817 (https://phabricator.wikimedia.org/T150764) (owner: Mattflaschen)
[01:36:46] Oh wow a new sidebar
[01:36:49] for projects
[01:36:51] on phab
[01:37:54] Operations, ops-codfw: update/audit serial of EX4300-spare2-codfw - https://phabricator.wikimedia.org/T147592#2801370 (RobH) Open>Resolved
[01:39:11] PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[01:40:01] mutante ^^
[01:40:05] you need to update pybal
[01:40:10] twentyafterfour ^^
[01:40:19] it says iridium-vcs.eqiad.wmnet
[01:41:06] needs to be iridium-vcs to phab1001-vcs
[01:41:11] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy
[01:41:57] mutante twentyafterfour https://github.com/wikimedia/operations-puppet/blob/bb25b12c57e0d78d4a36272edbea6b229240bd0d/conftool-data/nodes/eqiad.yaml#L300
[01:41:57] (CR) Filippo Giunchedi: "Correction: it did work!"
[puppet] - https://gerrit.wikimedia.org/r/322025 (https://phabricator.wikimedia.org/T150479) (owner: Filippo Giunchedi)
[01:43:33] (Draft1) Paladox: Replace iridium-vcs with phab1001-vcs [puppet] - https://gerrit.wikimedia.org/r/322034
[01:43:37] (Draft2) Paladox: Replace iridium-vcs with phab1001-vcs [puppet] - https://gerrit.wikimedia.org/r/322034
[01:43:43] mutante twentyafterfour ^^ :)
[01:44:10] Operations, Beta-Cluster-Infrastructure, Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2801389 (Krenair) >>! In T150649#2800637, @fgiunchedi wrote: > The issue afaics is that swift on `deployment-ms-fe01` doesn't have the password for `mw:thumbor` in `/e...
[01:44:42] (CR) Filippo Giunchedi: [C: 1] Drop poolcounter role from helium [puppet] - https://gerrit.wikimedia.org/r/321902 (owner: Muehlenhoff)
[01:44:51] paladox: yes, thank you, hold on
[01:44:58] that is what we need though, yea :)
[01:44:58] you're welcome and ok
[01:45:01] :)
[01:47:59] (PS3) Dzahn: conftool/phabricator: replace iridium-vcs with phab1001-vcs [puppet] - https://gerrit.wikimedia.org/r/322034 (https://phabricator.wikimedia.org/T143363) (owner: Paladox)
[01:48:11] PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!
[01:48:21] (CR) Dzahn: [C: 2] conftool/phabricator: replace iridium-vcs with phab1001-vcs [puppet] - https://gerrit.wikimedia.org/r/322034 (https://phabricator.wikimedia.org/T143363) (owner: Paladox)
[01:48:38] mutante ^^ thanks :)
[01:48:56] (CR) Filippo Giunchedi: [C: 1] Introduce a system wide systemd check [puppet] - https://gerrit.wikimedia.org/r/320793 (https://phabricator.wikimedia.org/T134890) (owner: Alexandros Kosiaris)
[01:49:09] ACKNOWLEDGEMENT - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down! daniel_zahn rename in progress (T143363)
[01:49:09] ACKNOWLEDGEMENT - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server iridium-vcs.eqiad.wmnet because of too many down! daniel_zahn rename in progress (T143363)
[01:50:38] (CR) Dzahn: "follow-up https://gerrit.wikimedia.org/r/#/c/322034/ thanks paladox" [dns] - https://gerrit.wikimedia.org/r/317290 (https://phabricator.wikimedia.org/T143363) (owner: Dzahn)
[01:50:47] You're welcome :)
[01:53:11] RECOVERY - puppet last run on mc1018 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[01:55:45] ah!, i needed to run "conftool-merge"
[01:55:49] besides puppet-merge
[01:56:15] (docs say in the near future puppet-merge will do both)
[01:56:44] !log conftool-merge, created node phab1001-vcs.eqiad.wmnet for cluster phabricator/git-ssh, removed node iridium-vcs...
[01:57:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:57:18] awesome thank you mutante
[02:02:54] (PS1) Gerrit Patch Uploader: Adding nick change functionality automatically [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/322037
[02:02:56] (CR) Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/322037 (owner: Gerrit Patch Uploader)
[02:03:28] (PS2) Zppix: Adding nick change functionality automatically [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/322037 (https://phabricator.wikimedia.org/T150916) (owner: Gerrit Patch Uploader)
[02:03:56] puppetmaster1001:~] $ sudo confctl select dc=eqiad,name=iridium-vcs.eqiad.wmnet get
[02:03:59] [puppetmaster1001:~] $ sudo confctl select dc=eqiad,name=phab1001-vcs.eqiad.wmnet get
[02:04:02] {"phab1001-vcs.eqiad.wmnet": {"pooled": "no", "weight": 10}, "tags": "dc=eqiad,cluster=phabricator,service=git-ssh"}
[02:04:05] should be working now
[02:04:20] well, except let's pool it, heh
[02:04:59] !log dzahn@puppetmaster1001 conftool action : set/pooled=yes; selector: dc=eqiad,name=phab1001-vcs.eqiad.wmnet
[02:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:33] (PS3) Zppix: Adding nick change functionality automatically [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/322037 (https://phabricator.wikimedia.org/T150916) (owner: Gerrit Patch Uploader)
[02:07:20] runs puppet on icinga now to get that service check updated too
[02:10:46] (PS1) Filippo Giunchedi: add dummy material for restbase201[012] [labs/private] - https://gerrit.wikimedia.org/r/322038
[02:11:50] (CR) Filippo Giunchedi: [C: 2 V: 2] add dummy material for restbase201[012] [labs/private] - https://gerrit.wikimedia.org/r/322038 (owner: Filippo Giunchedi)
[02:21:56] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.2) (duration: 08m 08s)
[02:22:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:30] https://config-master.wikimedia.org/conftool/eqiad/git-ssh looks all good, just dont know why icinga isnt changed yet
[02:23:48] it's ok in /srv/pybal-config/ too
[02:37:57] (PS1) 20after4: PHABRICATOR_ENV for vcs service [puppet] - https://gerrit.wikimedia.org/r/322041 (https://phabricator.wikimedia.org/T146055)
[02:40:38] (PS2) Dzahn: phabricator: PHABRICATOR_ENV for vcs service [puppet] - https://gerrit.wikimedia.org/r/322041 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[02:40:42] (PS3) Dzahn: phabricator: PHABRICATOR_ENV for vcs service [puppet] - https://gerrit.wikimedia.org/r/322041 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[02:43:13] (CR) Dzahn: [C: 2] phabricator: PHABRICATOR_ENV for vcs service [puppet] - https://gerrit.wikimedia.org/r/322041 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[02:49:53] PROBLEM - puppet last run on maps1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:49:57] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.3) (duration: 11m 26s)
[02:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:55:39] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Nov 17 02:55:38 UTC 2016 (duration 5m 41s)
[02:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:56:10] (PS1) 20after4: make vcs user a member of phd group [puppet] - https://gerrit.wikimedia.org/r/322042 (https://phabricator.wikimedia.org/T146055)
[02:56:38] (CR) jenkins-bot: [V: -1] make vcs user a member of phd group [puppet] - https://gerrit.wikimedia.org/r/322042 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[02:56:51] (PS2) Dzahn: phabricator: make vcs user a member of phd group [puppet] - https://gerrit.wikimedia.org/r/322042 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[02:57:19] (CR) jenkins-bot: [V: -1] phabricator: make vcs user a member of phd group [puppet] - https://gerrit.wikimedia.org/r/322042 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[02:57:26] (PS3) 20after4: make vcs user a member of phd group [puppet] - https://gerrit.wikimedia.org/r/322042 (https://phabricator.wikimedia.org/T146055)
[02:58:07] (PS4) 20after4: phabricator: make vcs user a member of phd group [puppet] - https://gerrit.wikimedia.org/r/322042 (https://phabricator.wikimedia.org/T146055)
[02:58:13] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy
[02:59:04] yay
[02:59:19] (CR) 20after4: "recheck" [puppet] - https://gerrit.wikimedia.org/r/322042 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
[02:59:55] (CR) Dzahn: [C: 2] phabricator: make vcs user a member of phd group [puppet] - https://gerrit.wikimedia.org/r/322042 (https://phabricator.wikimedia.org/T146055) (owner: 20after4)
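(editor's note) The line pasted at [02:04:02] shows the node as "pooled": "no" before it was set to yes. A minimal Python sketch of reading that state from such a line follows. It assumes the output is one JSON object per line, shaped exactly like the pasted sample; that shape is inferred from this log only, not from conftool documentation, and `pooled_state` is a hypothetical helper name.

```python
import json

def pooled_state(confctl_line: str) -> dict:
    """Map node name -> 'pooled' value for one line shaped like the pasted sample.

    Assumption: node entries are the keys whose value is a dict containing
    'pooled'; other keys (such as 'tags') are skipped.
    """
    obj = json.loads(confctl_line)
    return {
        name: value["pooled"]
        for name, value in obj.items()
        if isinstance(value, dict) and "pooled" in value
    }

# Sample copied from the [02:04:02] message in this log.
sample = (
    '{"phab1001-vcs.eqiad.wmnet": {"pooled": "no", "weight": 10}, '
    '"tags": "dc=eqiad,cluster=phabricator,service=git-ssh"}'
)
print(pooled_state(sample))
```

A check like this would have flagged that the new node existed but was not yet pooled, which is what the "let's pool it" message addresses.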
[03:05:33] (Abandoned) Dzahn: Enable JVM heap log to debug gerrit slowing down [puppet] - https://gerrit.wikimedia.org/r/316622 (https://phabricator.wikimedia.org/T148478) (owner: Paladox)
[03:07:45] PROBLEM - PyBal backends health check on lvs1005 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down!
[03:08:15] PROBLEM - PyBal backends health check on lvs1002 is CRITICAL: PYBAL CRITICAL - git-ssh4_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down!: git-ssh6_22 - Could not depool server phab1001-vcs.eqiad.wmnet because of too many down!
[03:08:22] aha, but it does use the new name now
[03:08:35] twentyafterfour still working on it, correct
[03:09:04] I just finally got it working
[03:09:09] :)
[03:09:14] is icinga delayed a little?
[03:10:42] yes
[03:10:47] git ls-remote is working for me
[03:10:55] so the service seems to be up
[03:10:56] @lvs1002:/etc/pybal# /usr/local/lib/nagios/plugins/check_pybal --url http://localhost:9090/alerts
[03:10:59] PYBAL OK - All pools are healthy
[03:11:02] that is the command that icinga runs too
[03:11:15] RECOVERY - PyBal backends health check on lvs1002 is OK: PYBAL OK - All pools are healthy
[03:11:18] i expect recovery :)
[03:11:19] yay!
[03:11:19] heh
[03:11:45] RECOVERY - PyBal backends health check on lvs1005 is OK: PYBAL OK - All pools are healthy
[03:11:52] (PS4) Dzahn: Move config for git-ssh(phabricator) to hiera [puppet] - https://gerrit.wikimedia.org/r/318662 (https://phabricator.wikimedia.org/T143363) (owner: 20after4)
[03:12:17] twentyafterfour: ^ ok then... let's get that done quickly too
[03:12:25] sweet
[03:12:26] since we're already touching it, right
[03:12:33] yeah awesome
[03:13:28] (CR) Dzahn: [C: 2] Move config for git-ssh(phabricator) to hiera [puppet] - https://gerrit.wikimedia.org/r/318662 (https://phabricator.wikimedia.org/T143363) (owner: 20after4)
[03:14:51] made backup of ferm/iptables, running puppet
[03:15:35] -export PHABRICATOR_ENV=phd
[03:15:35] +export PHABRICATOR_ENV=vcs
[03:15:45] twentyafterfour: ^ did not really expect to see that
[03:15:50] on this run
[03:16:12] (PS1) 20after4: phabricator: fix one stray PHABRICATOR_ENV=vcs [puppet] - https://gerrit.wikimedia.org/r/322044
[03:16:17] -ListenAddress = [2620:0:861:ed1a::3:16]
[03:16:17] -ListenAddress = [2620:0:861:103:10:64:32:186]
[03:16:17] +ListenAddress = 2620:0:861:ed1a::3:16
[03:16:17] +ListenAddress = 2620:0:861:103:10:64:32:186
[03:16:20] did expect that :)
[03:16:34] (CR) 20after4: "this is a no-op but I shouldn't leave junk laying around." [puppet] - https://gerrit.wikimedia.org/r/322044 (owner: 20after4)
[03:16:34] well, the brackets ?
[03:17:12] ^ fixes the env=vcs thing
[03:17:22] but brackets is an error somehow
[03:17:42] does it need to add ENV=phd though?
[03:17:45] RECOVERY - puppet last run on maps1004 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[03:18:10] because puppet removed that
[03:18:11] no that script calls another which has the ENV setting
[03:18:24] ok
[03:18:34] (CR) Dzahn: [C: 2] phabricator: fix one stray PHABRICATOR_ENV=vcs [puppet] - https://gerrit.wikimedia.org/r/322044 (owner: 20after4)
[03:18:50] I'm confused about the ListenAddress []
[03:19:22] I guess I got the if / else mixed up in https://gerrit.wikimedia.org/r/#/c/318662/4/modules/phabricator/templates/ferm_rule-ssh_public.erb
[03:20:20] oh, i was just looking at https://gerrit.wikimedia.org/r/#/c/318662/4/hieradata/role/eqiad/phabricator/main.yaml
[03:21:38] you just have to put the [] around the v6 IP ?
[03:21:54] yes
[03:22:10] so that it knows what is a port, with the : already being used
[03:22:23] i mean, dont we just have to change it in the yaml then?
[03:22:44] well ferm rules don't like [] actually, but sshd does
[03:22:52] I wanted to refactor it so that it's consistent
[03:23:17] if you compare phabricator::vcs::listen_addresses
[03:23:40] thing is now i have to go in a few minutes.. but i can be back
[03:23:45] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 736.04 seconds
[03:24:00] phabricator::vcs::address::v6 and phabricator::vcs::listen_addresses are redundant
[03:24:10] mutante: it's ok
[03:24:36] seems to be working ok for now, we can do more cleanup tomorrow
[03:24:51] ok, great :)
[03:24:56] nice progress
[03:24:59] or rather I can do cleanup tonight and you can +2 tomorrow
[03:25:04] yeah thanks for helping!!!
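(editor's note) The bracket confusion above is about two consumers wanting different spellings of the same IPv6 address: per the discussion, the sshd side takes the address wrapped in [] (so a following :port is unambiguous) while ferm rules take it bare. A minimal Python sketch of deriving both forms from one list follows. The helper names `for_sshd` and `for_ferm` are hypothetical and not part of the puppet module; the two IPv6 addresses come from the diff at [03:16:17], and 192.0.2.10 is a documentation placeholder added for illustration.

```python
import ipaddress

def for_sshd(addr: str) -> str:
    """Return the address bracketed if it is IPv6, unchanged if it is IPv4.

    Assumption taken from the chat: the sshd-side config wants [] around
    IPv6 literals so that ':' separating a port cannot be misread.
    """
    ip = ipaddress.ip_address(addr.strip("[]"))
    return f"[{ip}]" if ip.version == 6 else str(ip)

def for_ferm(addr: str) -> str:
    """Return the bare address, dropping any surrounding brackets."""
    return str(ipaddress.ip_address(addr.strip("[]")))

# One source list, accepted with or without brackets.
listen_addresses = [
    "2620:0:861:ed1a::3:16",
    "[2620:0:861:103:10:64:32:186]",
    "192.0.2.10",  # placeholder IPv4, not from the log
]

for a in listen_addresses:
    print(f"sshd: {for_sshd(a):<32} ferm: {for_ferm(a)}")
```

Keeping a single list and formatting per consumer is one way to avoid the redundancy between phabricator::vcs::address::v6 and phabricator::vcs::listen_addresses noted in the chat; whether the actual refactor took this shape is not stated in the log.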
[03:25:09] very good progress
[03:25:10] yea, or later, just add me
[03:25:15] cool
[03:25:15] :) ok, cu around
[03:25:23] thank you, have a good evening
[03:25:28] you too
[03:44:45] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 264.99 seconds
[04:20:15] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:48:11] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[05:01:10] ahem, do we show SVG as raw text now? https://en.wikipedia.org/wiki/Vermonter_(train)#Route -- click the map and click it again to magnify
[05:13:55] --> https://phabricator.wikimedia.org/T150929
[05:15:18] https://upload.wikimedia.org/wikipedia/commons/f/fb/Amtrak_Vermonter.svg is Content-Type: text/plain for some reason.
[05:15:36] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:17:57] Operations, Security-Team: icinga notification if elevated writing to badpass.log - https://phabricator.wikimedia.org/T150300#2801598 (Tgr) @bd808 pointed to the Kibana watcher plugin: https://github.com/elasticfence/kaae
[05:43:37] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[07:11:56] Operations, ops-codfw, DBA: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2801687 (Marostegui) This is what I have seen - There is a big spike on disk writes just before the server died - The ILO logs after the reset show: ``` description=POST Error: 1792-Slot X Dr...
[07:43:52] Operations, ops-codfw, DBA: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2801712 (Marostegui) db2050 is located right on top of db2049 and its logs do not reveal any warning or any trace of overheat
[07:54:44] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:57:55] !log uploaded imagemagick 8:6.8.9.9-5+deb8u5+wmf1 to carbon (T141739)
[07:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:16] T141739: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739
[08:10:52] (CR) R4q3NWnUx2CEhVyr: "Any feedback ?" [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/318068 (owner: R4q3NWnUx2CEhVyr)
[08:11:39] (PS3) Muehlenhoff: Add debdeploy salt grain for labs::db::proxy [puppet] - https://gerrit.wikimedia.org/r/321846
[08:14:34] PROBLEM - puppet last run on mw1299 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:20:22] (PS2) Muehlenhoff: Drop poolcounter role from helium [puppet] - https://gerrit.wikimedia.org/r/321902
[08:21:37] Operations, ops-eqiad: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2801766 (Peachey88)
[08:22:45] (CR) Muehlenhoff: [C: 2] Drop poolcounter role from helium [puppet] - https://gerrit.wikimedia.org/r/321902 (owner: Muehlenhoff)
[08:23:44] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[08:29:40] Operations, MediaWiki-General-or-Unknown, Traffic, media-storage: Mediawiki thumbnail requests for 0px should result in http 400 not 500 - https://phabricator.wikimedia.org/T147784#2801775 (Gilles) Looking at https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html the definition of 400 seems more...
[08:29:50] Operations, ops-eqiad, DBA: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2778014 (Marostegui) @Cmjohnson did HP come back to you about this issue? Thanks!
[08:42:33] RECOVERY - puppet last run on mw1299 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[08:44:47] Operations, Performance-Team, Thumbor: Investigate whether we need a repeat failure guard and/or a poolcounter-like behavior in Thumbor - https://phabricator.wikimedia.org/T150745#2801785 (Gilles) Thumbor can probably talk to PoolCounter. As for the failure counter, it's based on a cache, a DC-local...
[08:45:31] Operations, Gerrit, Release-Engineering-Team, hardware-requests: Requesting 1 spare misc box for Gerrit in codfw - https://phabricator.wikimedia.org/T148187#2716861 (hashar) @demon can we stick with 400GBytes disk ? Not sure there is a point in buying 800 GBytes disk if we have 400G ones already...
[08:46:13] Operations, Performance-Team, Thumbor: Investigate SVG default language behavior on non-English wikis for Thumbor - https://phabricator.wikimedia.org/T150743#2801788 (Gilles) p:High>Normal
[08:50:53] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:52:13] RECOVERY - cassandra-a CQL 10.192.16.186:9042 on restbase2010 is OK: TCP OK - 0.036 second response time on 10.192.16.186 port 9042
[09:11:43] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:16:23] (CR) Muehlenhoff: "Haven't looked into the xpra setup yet, but already some comments on the other bits." (7 comments) [puppet] - https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) (owner: Mobrovac)
[09:18:53] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[09:19:11] !log rebooting mc1019->mc1036 (memcached/redis servers, not taking any traffic) for kernel upgrades
[09:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:54] PROBLEM - Host db2049 is DOWN: PING CRITICAL - Packet loss = 100%
[09:28:42] ^ that is me
[09:28:49] forgot to downtime it
[09:28:50] sorry
[09:29:17] (CR) Alexandros Kosiaris: [C: 1] "I am a bit skeptical about the namespace of the hiera variable, I think it should be in the base:: namespace, but we are still debating th" [puppet] - https://gerrit.wikimedia.org/r/256890 (https://phabricator.wikimedia.org/T120159) (owner: Yuvipanda)
[09:29:54] RECOVERY - Host db2049 is UP: PING OK - Packet loss = 0%, RTA = 36.80 ms
[09:30:04] !log Reboot db2049 for maintenance - https://phabricator.wikimedia.org/T150876
[09:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:04] (PS3) Jcrespo: Beta: auto-start mysql on beta so it comes back after reboot [puppet] - https://gerrit.wikimedia.org/r/319572
[09:31:13] Operations, OfflineContentGenerator, Reading-Community-Engagement, Reading-Web-Backlog, and 2 others: [EPIC] Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2801855 (Addshore)
[09:33:19] (CR) Jcrespo: "Alex, I think you will find this patch equally pleasing, by not having to start mysql manually anymore on beta." [puppet] - https://gerrit.wikimedia.org/r/319572 (owner: Jcrespo)
[09:33:27] (CR) Jcrespo: [C: 2] Beta: auto-start mysql on beta so it comes back after reboot [puppet] - https://gerrit.wikimedia.org/r/319572 (owner: Jcrespo)
[09:33:29] (Abandoned) Muehlenhoff: Tools proxy: Restrict to labs networks [puppet] - https://gerrit.wikimedia.org/r/312527 (owner: Muehlenhoff)
[09:37:42] Operations, ops-codfw, DBA: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2801859 (Marostegui) After the reboot the Cache message is gone.
[09:39:46] (PS3) Jcrespo: analytics-meta: Manage mariadb service through mariadb::service class [puppet] - https://gerrit.wikimedia.org/r/319570
[09:40:44] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[09:42:48] (PS4) Jcrespo: analytics-meta: Manage mariadb service through mariadb::service class [puppet] - https://gerrit.wikimedia.org/r/319570
[09:44:50] jynus: thanks a lot for --^
[09:45:18] well, I would need a +1 on that
[09:45:51] but it is all andrew's work
[09:46:17] Operations, Puppet, Patch-For-Review, RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2801883 (akosiaris) As far as I am concerned, hiera keys should be namespaced according to the class that tries to look them up. That is if class `pro...
[09:48:09] (Abandoned) Hashar: contint: allow .htaccess on doc.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/321651 (https://phabricator.wikimedia.org/T150727) (owner: Hashar)
[09:48:46] (PS4) Hashar: contint: move .htaccess content for doc/integration to puppet [puppet] - https://gerrit.wikimedia.org/r/322019 (https://phabricator.wikimedia.org/T149928) (owner: Dzahn)
[09:49:03] (CR) Alexandros Kosiaris: [C: 2] Add eventstreams.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/321940 (https://phabricator.wikimedia.org/T143925) (owner: Ottomata)
[09:49:08] (PS2) Alexandros Kosiaris: Add eventstreams.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/321940 (https://phabricator.wikimedia.org/T143925) (owner: Ottomata)
[09:49:10] (CR) Alexandros Kosiaris: [V: 2] Add eventstreams.svc.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/321940 (https://phabricator.wikimedia.org/T143925) (owner: Ottomata)
[09:49:26] jynus: https://puppet-compiler.wmflabs.org/4600/analytics1003.eqiad.wmnet/change.analytics1003.eqiad.wmnet.err
[09:52:37] (CR) Alexandros Kosiaris: [C: 1] "+1ed with the caveat the it should be using the production kafka cluster and not the analytics one in the future (which is not in this com" [puppet] - https://gerrit.wikimedia.org/r/320690 (https://phabricator.wikimedia.org/T143925) (owner: Ottomata)
[09:52:47] (PS5) Jcrespo: analytics-meta: Manage mariadb service through mariadb::service class [puppet] - https://gerrit.wikimedia.org/r/319570
[09:53:25] Operations, ops-codfw, DBA: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2801893 (Marostegui) I am running a burn test - I have started burning 3 CPUs and will leave it for a little while before starting with 3 more.
[09:55:08] (CR) Hashar: [C: 1] "Sounds all good. Thanks for the cleanup :)" [puppet] - https://gerrit.wikimedia.org/r/322019 (https://phabricator.wikimedia.org/T149928) (owner: Dzahn)
[09:56:15] Operations, ops-codfw, DBA: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2801894 (jcrespo) a:Marostegui Assigning it to you to credit you are working more on this.
[09:56:37] ^ sounds good :)
[09:56:44] (PS4) Jcrespo: mariadb: Remove /root/.my.cnf from all servers [puppet] - https://gerrit.wikimedia.org/r/321888 (https://phabricator.wikimedia.org/T150446)
[09:56:46] (PS1) Jcrespo: beta: Fix typo on Beta mysql service configuration [puppet] - https://gerrit.wikimedia.org/r/322067
[09:57:24] (PS2) Jcrespo: beta: Fix typo on Beta mysql service configuration [puppet] - https://gerrit.wikimedia.org/r/322067
[09:58:03] (CR) Elukey: [C: 1] "https://puppet-compiler.wmflabs.org/4601/analytics1003.eqiad.wmnet/ - LGTM" [puppet] - https://gerrit.wikimedia.org/r/319570 (owner: Jcrespo)
[09:58:34] jynus: merge whenevery you want, I'll double check an1003 afterwards
[09:58:41] *whenever
[09:59:10] I can handle mysql there, I just one someone from the service around
[09:59:12] just in case
[09:59:19] *want
[09:59:35] (CR) Jcrespo: [C: 2] beta: Fix typo on Beta mysql service configuration [puppet] - https://gerrit.wikimedia.org/r/322067 (owner: Jcrespo)
[10:00:02] sure :)
[10:00:42] (CR) Jcrespo: [C: 2] analytics-meta: Manage mariadb service through mariadb::service class [puppet] - https://gerrit.wikimedia.org/r/319570 (owner: Jcrespo)
[10:00:47] (PS6) Jcrespo: analytics-meta: Manage mariadb service through mariadb::service class [puppet] - https://gerrit.wikimedia.org/r/319570
[10:04:48] (CR) Volans: [C: 1] "LGTM, nitpick comment inline ;)" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/320793 (https://phabricator.wikimedia.org/T134890) (owner: Alexandros Kosiaris)
[10:05:39] elukey, it was a noop, I think
[10:05:59] I can delete S20mysql for proper testing
[10:06:12] (PS2) Muehlenhoff: elasticsearch::https: Restrict to domain networks [puppet] - https://gerrit.wikimedia.org/r/319875
[10:06:39] ^what do you think?
[10:07:22] !log temporarily disable puppet on elastic* for staged merge of ferm change
[10:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:09] jynus: what do you mean with "S20mysql" ?
[10:08:25] the rc link for service autostart
[10:08:33] ahhhh sorry
[10:08:36] sure
[10:08:44] in th'old init.d days
[10:08:48] (CR) Muehlenhoff: [C: 2] elasticsearch::https: Restrict to domain networks [puppet] - https://gerrit.wikimedia.org/r/319875 (owner: Muehlenhoff)
[10:09:01] what is init.d?
[10:09:03] :D
[10:09:14] this kids and its systemd
[10:09:19] *their
[10:09:31] (CR) Volans: "Daniel, I don't see the new revision after your latest comments" [puppet] - https://gerrit.wikimedia.org/r/320246 (https://phabricator.wikimedia.org/T150160) (owner: Dzahn)
[10:10:50] Operations, Gerrit, Release-Engineering-Team, hardware-requests: Requesting 1 spare misc box for Gerrit in codfw - https://phabricator.wikimedia.org/T148187#2801957 (demon) Yeah, 400 will be fine--we currently use about 30gb. I dunno why I said 500, that's more than enough and anything in the TB+...
[10:14:35] elukey, the init.d got recreated, but the rc link, didn't
[10:14:52] which is strange, because identical puppet code should have been run
[10:15:05] let me investigate
[10:16:31] (CR) Alexandros Kosiaris: Introduce a system wide systemd check (1 comment) [puppet] - https://gerrit.wikimedia.org/r/320793 (https://phabricator.wikimedia.org/T134890) (owner: Alexandros Kosiaris)
[10:16:45] (PS3) Alexandros Kosiaris: Introduce a system wide systemd check [puppet] - https://gerrit.wikimedia.org/r/320793 (https://phabricator.wikimedia.org/T134890)
[10:17:17] !log applying schema change on s5 (page) T69223
[10:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:38] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223
[10:19:54] PROBLEM - puppet last run on maps1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:20:16] !log upgrading imagemagick on mw1293 (T141739)
[10:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:32] T141739: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739
[10:22:36] !log cleanup on analytics1027 - Removed mysql-server-5.5 (not used) and ran apt autoremove (old kernels)
[10:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:12] Dereckson: wrt to multiple isset, would if isset (blah || blah ) work?
[10:32:40] however i've read that isset only returns true if all conditions are met
[10:47:22] !log installing trusty kernel updates
[10:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:51] RECOVERY - puppet last run on maps1004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[10:55:21] PROBLEM - DPKG on mw1260 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[10:56:21] RECOVERY - DPKG on mw1260 is OK: All packages OK
[10:58:11] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[11:20:31] PROBLEM - Check whether ferm is active by checking the default input chain on cp3017 is CRITICAL: Return code of 255 is out of bounds
[11:20:41] PROBLEM - salt-minion processes on cp3017 is CRITICAL: Return code of 255 is out of bounds
[11:20:41] PROBLEM - Check size of conntrack table on cp3017 is CRITICAL: Return code of 255 is out of bounds
[11:20:51] PROBLEM - Disk space on cp3017 is CRITICAL: Return code of 255 is out of bounds
[11:20:51] PROBLEM - DPKG on cp3017 is CRITICAL: Return code of 255 is out of bounds
[11:20:51] PROBLEM - configured eth on cp3017 is CRITICAL: Return code of 255 is out of bounds
[11:21:11] PROBLEM - MD RAID on cp3017 is CRITICAL: Return code of 255 is out of bounds
[11:21:16] that's me rebooting the spares ^
[11:21:21] PROBLEM - dhclient process on cp3017 is CRITICAL: Return code of 255 is out of bounds
[11:21:31] PROBLEM - puppet last run on cp3017 is CRITICAL: Return code of 255 is out of bounds
[11:23:23] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2802152 (hashar)
[11:23:26] Puppet, Beta-Cluster-Infrastructure: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2802150 (hashar) Resolved>Open deployment-apertium01 is still around and complaining. Maybe the deletion failed in wikitech/horizon?
[11:25:21] PROBLEM - MD RAID on cp3016 is CRITICAL: Return code of 255 is out of bounds
[11:25:21] PROBLEM - puppet last run on cp3016 is CRITICAL: Return code of 255 is out of bounds
[11:25:21] PROBLEM - Check size of conntrack table on cp3016 is CRITICAL: Return code of 255 is out of bounds
[11:25:21] PROBLEM - dhclient process on cp3016 is CRITICAL: Return code of 255 is out of bounds
[11:25:21] PROBLEM - Disk space on cp3016 is CRITICAL: Return code of 255 is out of bounds
[11:25:22] PROBLEM - DPKG on cp3016 is CRITICAL: Return code of 255 is out of bounds
[11:25:41] PROBLEM - configured eth on cp3016 is CRITICAL: Return code of 255 is out of bounds
[11:25:51] PROBLEM - salt-minion processes on cp3016 is CRITICAL: Return code of 255 is out of bounds
[11:25:51] PROBLEM - Check whether ferm is active by checking the default input chain on cp3016 is CRITICAL: Return code of 255 is out of bounds
[11:26:16] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2802174 (hashar)
[11:26:18] Puppet, Beta-Cluster-Infrastructure: deployment-apertium01 puppet failing due to missing packages on trusty - https://phabricator.wikimedia.org/T147210#2802172 (hashar) Open>Resolved I have terminated deployment-apertium01 using Horizon.
[11:26:24] PROBLEM - puppet last run on cp3015 is CRITICAL: Return code of 255 is out of bounds
[11:26:24] PROBLEM - Disk space on cp3015 is CRITICAL: Return code of 255 is out of bounds
[11:26:24] PROBLEM - salt-minion processes on cp3015 is CRITICAL: Return code of 255 is out of bounds
[11:26:44] PROBLEM - Check whether ferm is active by checking the default input chain on cp3015 is CRITICAL: Return code of 255 is out of bounds
[11:26:44] PROBLEM - Check size of conntrack table on cp3015 is CRITICAL: Return code of 255 is out of bounds
[11:26:54] PROBLEM - DPKG on cp3015 is CRITICAL: Return code of 255 is out of bounds
[11:26:54] PROBLEM - dhclient process on cp3015 is CRITICAL: Return code of 255 is out of bounds
[11:26:54] PROBLEM - MD RAID on cp3015 is CRITICAL: Return code of 255 is out of bounds
[11:27:05] PROBLEM - configured eth on cp3015 is CRITICAL: Return code of 255 is out of bounds
[11:27:24] PROBLEM - Check size of conntrack table on cp3019 is CRITICAL: Return code of 255 is out of bounds
[11:27:24] PROBLEM - configured eth on cp3019 is CRITICAL: Return code of 255 is out of bounds
[11:36:14] PROBLEM - Host cp3017 is DOWN: PING CRITICAL - Packet loss = 100%
[11:37:34] RECOVERY - Host cp3017 is UP: PING OK - Packet loss = 0%, RTA = 83.83 ms
[11:39:04] PROBLEM - Host cp3016 is DOWN: PING CRITICAL - Packet loss = 100%
[11:40:24] RECOVERY - Host cp3016 is UP: PING OK - Packet loss = 0%, RTA = 83.81 ms
[11:40:44] PROBLEM - Host cp3015 is DOWN: PING CRITICAL - Packet loss = 100%
[11:41:02] sorry for the spam
[11:41:40] instead of rebooting properly all these spares got reinstalled heh :(
[11:42:14] RECOVERY - Host cp3015 is UP: PING OK - Packet loss = 0%, RTA = 83.93 ms
[11:44:34] PROBLEM - Host cp3019 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:54] RECOVERY - Host cp3019 is UP: PING OK - Packet loss = 0%, RTA = 83.77 ms
[11:46:54] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:48:04] Operations, OfflineContentGenerator, Reading-Community-Engagement, Reading-Web-Backlog, and 3 others: [EPIC] Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#2802254 (mobrovac)
[11:57:45] PROBLEM - NTP on cp3015 is CRITICAL: NTP CRITICAL: No response from NTP server
[12:07:24] PROBLEM - NTP on cp3019 is CRITICAL: NTP CRITICAL: No response from NTP server
[12:10:54] PROBLEM - puppet last run on elastic1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:12:24] PROBLEM - NTP on cp3016 is CRITICAL: NTP CRITICAL: No response from NTP server
[12:12:34] PROBLEM - NTP on cp3017 is CRITICAL: NTP CRITICAL: No response from NTP server
[12:14:54] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[12:17:14] (CR) Mobrovac: PDF Render Service: Role and module (3 comments) [puppet] - https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) (owner: Mobrovac)
[12:22:45] (CR) Muehlenhoff: PDF Render Service: Role and module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) (owner: Mobrovac)
[12:23:04] PROBLEM - MegaRAID on cp3018 is CRITICAL: Return code of 255 is out of bounds
[12:24:46] (CR) Alexandros Kosiaris: PDF Render Service: Role and module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) (owner: Mobrovac)
[12:25:18] (PS1) Addshore: DNM config for ElectronPdfService on beta sites [mediawiki-config] - https://gerrit.wikimedia.org/r/322086 (https://phabricator.wikimedia.org/T150945)
[12:27:00] Operations, ops-eqiad, DC-Ops: Reclaim SSD from labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T116936#2802315 (hashar) stalled>Open After a year or so I am confident we will not use the SSD disks. I guess removing them means a server shutdown, that will cause CI to lack in...
[12:27:24] PROBLEM - NTP on cp3018 is CRITICAL: NTP CRITICAL: No response from NTP server
[12:29:52] (CR) Muehlenhoff: PDF Render Service: Role and module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/305256 (https://phabricator.wikimedia.org/T143129) (owner: Mobrovac)
[12:39:54] RECOVERY - puppet last run on elastic1044 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[12:42:49] (CR) Mobrovac: [C: 1] Add dewiktionary to RESTBase on Beta Cluster [puppet] - https://gerrit.wikimedia.org/r/321817 (https://phabricator.wikimedia.org/T150764) (owner: Mattflaschen)
[12:47:34] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:50:41] !log upgrading imagemagick on remaining image scalers (T141739)
[12:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:03] T141739: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739
[12:56:30] Operations, Commons, MediaWiki-File-management, Multimedia, Upstream: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#2802345 (MoritzMuehlenhoff) The backporte...
[13:15:34] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[13:31:16] Operations, Beta-Cluster-Infrastructure, Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2802382 (Gilles) Nope. Thumbor's config has those values: ``` SWIFT_HOST = 'http://deployment-ms-fe01.deployment-prep.eqiad.wmflabs' SWIFT_API_PATH = '/v1/AUTH_mw/' S...
[13:34:28] Operations, Beta-Cluster-Infrastructure, Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2802385 (Krenair) >>! In T150649#2796944, @Krenair wrote: > does prod also show u'www-authenticate': u'Swift realm="unknown"' ?
[13:38:05] I'll just drop the link http://securityaffairs.co/wordpress/53494/breaking-news/cve-2016-4484-linux.html
[13:40:19] Operations, Beta-Cluster-Infrastructure, Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2802424 (Gilles) I'm not sure what you're asking, Swift auth works in production and we don't log at debug level there. Not sure that the swiftclient library would log...
[13:46:34] Operations, Beta-Cluster-Infrastructure, Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2802442 (Gilles) I'll write some python code to mimic what Thumbor does and attempt to get the headers of the successful response in production.
[13:50:06] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "comments addressed, merging" [puppet] - 10https://gerrit.wikimedia.org/r/320793 (https://phabricator.wikimedia.org/T134890) (owner: 10Alexandros Kosiaris) [13:50:11] (03PS4) 10Alexandros Kosiaris: Introduce a system wide systemd check [puppet] - 10https://gerrit.wikimedia.org/r/320793 (https://phabricator.wikimedia.org/T134890) [13:50:13] (03CR) 10Alexandros Kosiaris: [V: 032] Introduce a system wide systemd check [puppet] - 10https://gerrit.wikimedia.org/r/320793 (https://phabricator.wikimedia.org/T134890) (owner: 10Alexandros Kosiaris) [13:50:19] (03CR) 10Marostegui: [C: 031] mariadb: Remove /root/.my.cnf from all servers [puppet] - 10https://gerrit.wikimedia.org/r/321888 (https://phabricator.wikimedia.org/T150446) (owner: 10Jcrespo) [13:51:49] 06Operations, 10Beta-Cluster-Infrastructure, 10Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2802452 (10Gilles) ``` DEBUG:swiftclient:REQ: curl -i http://ms-fe.svc.eqiad.wmnet/auth/v1.0 -X GET DEBUG:swiftclient:RESP STATUS: 200 OK DEBUG:swiftclient:RESP HEADERS:... [13:55:05] (03PS1) 10Reedy: Make and re-use list of all elevated WMF groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322094 (https://phabricator.wikimedia.org/T150951) [13:56:33] Krenair: ^ Want to sanity check that please [13:59:43] (03PS2) 10Reedy: Make and re-use list of all elevated WMF groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322094 (https://phabricator.wikimedia.org/T150951) [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T1400). Please do the needful. 
[14:00:27] nothing to SWAT [14:14:30] (03PS1) 10Muehlenhoff: Add recently assigned CVE ID to fix already merged via older 4.4.10 [debs/linux44] - 10https://gerrit.wikimedia.org/r/322097 [14:17:25] (03CR) 10Muehlenhoff: [C: 032] Add recently assigned CVE ID to fix already merged via older 4.4.10 [debs/linux44] - 10https://gerrit.wikimedia.org/r/322097 (owner: 10Muehlenhoff) [14:20:38] (03CR) 10MarcoAurelio: [C: 04-1] "trwikiquote flag is called "technican"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322094 (https://phabricator.wikimedia.org/T150951) (owner: 10Reedy) [14:21:18] (03CR) 10Reedy: "With a typo?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322094 (https://phabricator.wikimedia.org/T150951) (owner: 10Reedy) [14:23:59] addshore: i'm going to add something to swat :) [14:26:48] (03CR) 10jenkins-bot: [V: 04-1] Make and re-use list of all elevated WMF groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322094 (https://phabricator.wikimedia.org/T150951) (owner: 10Reedy) [14:28:53] (03CR) 10MarcoAurelio: "> With a typo?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322094 (https://phabricator.wikimedia.org/T150951) (owner: 10Reedy) [14:29:23] (03PS3) 10Reedy: Make and re-use list of all elevated WMF groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322094 (https://phabricator.wikimedia.org/T150951) [14:34:05] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, 07Upstream: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#2802576 (10MoritzMuehlenhoff) Reported to D... [14:35:37] (03CR) 10Jdlrobson: [C: 031] "looks like it needs another rebase though?!? 
o_O" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314748 (https://phabricator.wikimedia.org/T147092) (owner: 10Dereckson) [14:37:15] (03PS6) 10Jdlrobson: Switch MobileFrontend to extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314748 (https://phabricator.wikimedia.org/T147092) (owner: 10Dereckson) [14:38:46] aude: when's the next swat window out of interest? Dereckson are you around? [14:38:57] (i'm disorientated with my time zones :)) [14:39:02] (03PS2) 10Jdlrobson: Clean unused MobileFrontend variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315985 (owner: 10Dereckson) [14:39:11] (03CR) 10Jdlrobson: [C: 031] Clean unused MobileFrontend variables [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315985 (owner: 10Dereckson) [14:39:35] jouncebot: next [14:39:35] In 2 hour(s) and 20 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T1700) [14:39:39] oh [14:39:50] don't know how to get next swat [14:40:27] 2 hours after puppet swat, so 4 hours and 20 minutes from now [14:40:45] (03PS1) 10Muehlenhoff: Rename hiera file so that the debdeploy grain is assigned to the new role name [puppet] - 10https://gerrit.wikimedia.org/r/322105 [14:41:58] (03PS2) 10Muehlenhoff: Rename hiera file so that the debdeploy grain is assigned to the new role name [puppet] - 10https://gerrit.wikimedia.org/r/322105 [14:49:44] 06Operations, 06Performance-Team, 10Thumbor: Match cache headers between thumbor and mediawiki - https://phabricator.wikimedia.org/T150642#2802611 (10Gilles) [14:51:36] aude: no worries. Thanks for trying :) [14:51:44] Given Dereckson isn't around it can wait till Monday [14:51:57] (03CR) 10Muehlenhoff: [C: 032] Rename hiera file so that the debdeploy grain is assigned to the new role name [puppet] - 10https://gerrit.wikimedia.org/r/322105 (owner: 10Muehlenhoff) [14:52:44] jdlrobson: is it important? 
[14:52:53] i think next week is only important stuff for swat [14:53:36] (03PS1) 10Alexandros Kosiaris: monitoring: Rename a few variables [puppet] - 10https://gerrit.wikimedia.org/r/322107 [14:57:13] oh it's just config cleanup aude [14:57:24] ok [15:01:49] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [15:02:47] !log rebooting secondary (inactive) LVS hosts for kernel updates [15:02:49] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3338662 keys, up 17 days 6 hours - replication_delay is 0 [15:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:49] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed [15:09:07] (03PS1) 10Alexandros Kosiaris: docker: Fix monitoring description [puppet] - 10https://gerrit.wikimedia.org/r/322111 [15:10:22] !log installing libxslt security updates [15:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:49] 06Operations, 10ops-codfw, 10DBA: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2802648 (10jcrespo) p:05Triage>03Normal [15:17:01] (03PS1) 10Muehlenhoff: Assign debdeploy grains for analytics zookeeper cluster [puppet] - 10https://gerrit.wikimedia.org/r/322113 [15:17:49] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [15:22:26] (03CR) 10Muehlenhoff: [C: 032] Assign debdeploy grains for analytics zookeeper cluster [puppet] - 10https://gerrit.wikimedia.org/r/322113 (owner: 10Muehlenhoff) [15:23:49] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [15:24:48] !log applying schema change on s4 (page) T69223 [15:24:49] RECOVERY 
- Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3339269 keys, up 17 days 7 hours - replication_delay is 0 [15:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:08] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [15:31:07] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, 07Upstream: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#2802669 (10matmarex) [15:31:17] 06Operations, 06Commons, 06Multimedia: Deploy some fixed version of ImageMagick from apt.wikimedia.org - https://phabricator.wikimedia.org/T150432#2802667 (10matmarex) 05Open>03Resolved This has been done now (but logged at T141739). [15:33:32] 06Operations, 06Commons, 10MediaWiki-File-management, 06Multimedia, 07Upstream: Issues with displaying thumbnails for CMYK JPG images due to buggy version of ImageMagick (black horizontal stripes, black color missing) - https://phabricator.wikimedia.org/T141739#2802678 (10matmarex) 05Open>03Resolved... [15:34:37] 06Operations, 06Commons, 06Multimedia: Deploy some fixed version of ImageMagick from apt.wikimedia.org - https://phabricator.wikimedia.org/T150432#2785785 (10MoritzMuehlenhoff) The Thumbor servers have also been upgraded. 
[15:35:21] (03CR) 10Alexandros Kosiaris: [C: 032] docker: Fix monitoring description [puppet] - 10https://gerrit.wikimedia.org/r/322111 (owner: 10Alexandros Kosiaris) [15:35:26] (03PS2) 10Alexandros Kosiaris: docker: Fix monitoring description [puppet] - 10https://gerrit.wikimedia.org/r/322111 [15:35:28] (03CR) 10Alexandros Kosiaris: [V: 032] docker: Fix monitoring description [puppet] - 10https://gerrit.wikimedia.org/r/322111 (owner: 10Alexandros Kosiaris) [15:42:40] !log ori@tin Synchronized php-1.29.0-wmf.2/extensions/NavigationTiming/modules/ext.navigationTiming.js: I8e8ec96f: Dont report stats when page visibility changes during page load ; scap sync-file php-1.29.0-wmf.3/extensions/NavigationTiming/modules/ext.navigationTiming.js I8e8ec96f: Dont report stats when page visibility changes during page load (duration: 00m 51s) [15:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:01] damn it, unclosed quote [15:44:11] !log ori@tin Synchronized php-1.29.0-wmf.3/extensions/NavigationTiming/modules/ext.navigationTiming.js: I8e8ec96f: Don't report stats when page visibility changes during page load (duration: 00m 48s) [15:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:04] (03CR) 10Alexandros Kosiaris: "PCC @ https://puppet-compiler.wmflabs.org/4602/ says noop." [puppet] - 10https://gerrit.wikimedia.org/r/322107 (owner: 10Alexandros Kosiaris) [15:47:09] ori, shouldn't you &&'ed? 
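Note on the `;` versus `&&` question just asked: in a POSIX shell, `a; b` runs `b` regardless of how `a` exits, while `a && b` runs `b` only if `a` exits with status 0. A minimal sketch, assuming a POSIX `/bin/sh` is available; the `run` helper and the `second-ran` marker are made up for illustration, and `false` merely stands in for a failing first sync:

```python
import subprocess


def run(cmd: str) -> str:
    """Run cmd through the system shell and return whatever it wrote to stdout."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout


# `false` always exits non-zero, standing in for a first command that failed.
always = run("false; echo second-ran")             # ';' ignores the exit status
only_on_success = run("false && echo second-ran")  # '&&' short-circuits on failure

print(repr(always))
print(repr(only_on_success))
```

With `;` the second command still runs after a failure; with `&&` it is skipped, which is usually what you want when the second step only makes sense after the first succeeded.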
[15:47:38] I don't start my new position until Dec 5 so I'm tying up some loose ends [15:48:08] no, I mean "&&" instead of ";" [15:48:22] oh, heh -- I misread that as '&' [15:48:32] the two operations aren't really dependent on one another, but I suppose, yes [15:52:43] !log rebooting primary LVS hosts for kernel updates [15:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:41] 06Operations, 10ops-codfw, 10DBA: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2802731 (10Marostegui) I am burning 12 CPUs now. For the night I am planning to leave 24 of them and see what happens tomorrow morning. [16:07:50] Hi, Horst@cswiki was trying to turn on 2FA. He installed some app for 2FA, scanned the code, 2FA was turned on. And the codes no longer works. I know that this can be turned off only by somebody with shell access. I would like to know how he can request it. New ticket at phab? [16:08:00] Or can it be made here? [16:08:17] Is the time etc right on his phone? [16:08:26] Is he still logged in? [16:08:42] Did he not save the recovery codes? [16:09:13] Yes, the time is correct, no, he isn't logged in and the recovery codes does not work. [16:09:35] This sounds rather curious [16:09:43] Let's have a look at the DB [16:09:49] Ok. [16:10:36] Row is there, and looks sane [16:11:31] So this is something that's mostly undocumented atm... How to prove a user is who they say they are [16:11:44] 06Operations, 10ops-eqiad, 06Labs, 10Labs-Infrastructure: labstore1003 - RAID fail - https://phabricator.wikimedia.org/T149156#2802757 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson [16:12:05] Can they get on IRC? [16:12:13] I kinda want to see what recovery tokens they've been given [16:12:21] I can prove it and some others members of cswiki community can do the same. [16:12:29] Should I invite him to this channel? [16:12:56] Sure, if they can get here [16:13:39] Okay. I asked him for it. 
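Note on why the phone's clock was asked about above: authenticator apps generally implement TOTP (RFC 6238), where each code is an HMAC over the current 30-second time window, so a device with a wrong clock produces codes the server rejects. A minimal sketch of the algorithm; the `totp` helper and its defaults (SHA-1, 30-second step, 6 digits) are illustrative assumptions, not something the log confirms about the wiki's 2FA setup, and the key used below is the published RFC test key, not a real secret:

```python
import base64
import hashlib
import hmac
import struct


def totp(secret_b32: str, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP code (HMAC-SHA1 variant) for a given Unix time."""
    key = base64.b32decode(secret_b32.upper())
    # The moving factor is the number of whole time steps since the epoch.
    counter = unix_time // step
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte picks an offset.
    offset = digest[-1] & 0x0F
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


# RFC 6238 test key: the ASCII string "12345678901234567890", base32-encoded.
rfc_key = base64.b32encode(b"12345678901234567890").decode()
print(totp(rfc_key, 59, digits=8))
```

Two clocks that fall in the same 30-second window yield the same code; once they drift into different windows the codes differ, which is the failure the clock question is probing for.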
[16:14:13] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2802760 (10Gehel) Building the package on copper fails as some dependencies are still being... [16:14:33] !log restarting app server canaries to pick up libxslt update [16:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:48] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2802761 (10Gehel) @hashar : you probably have some experience in building .debs from maven... [16:16:32] Reedy, he is trying to connect here. [16:16:37] Thanks [16:17:27] 06Operations, 06Discovery, 10Elasticsearch, 06Discovery-Search (Current work), 13Patch-For-Review: Upgrade our logstash-gelf package to latest available upstream version - https://phabricator.wikimedia.org/T150408#2785082 (10MoritzMuehlenhoff) It might fail due to not using the web proxy? [16:17:51] 06Operations, 10ops-eqiad, 10hardware-requests: Return wmf4747/wmf4748/wmf4749/wmf4750 to spares - https://phabricator.wikimedia.org/T146171#2802767 (10Cmjohnson) disks are wiping...will be completed in 190mins [16:19:19] hey Horst__ [16:19:40] Hi [16:19:46] 06Operations, 10ops-eqiad: scb1003, scb1004 exhibit temperature problems - https://phabricator.wikimedia.org/T150882#2799962 (10Cmjohnson) I find it odd that so many servers are seeing these overheating issues. Thermal paste has worked in the past. I will need to purchase more thermal paste. Let's plan on d... [16:20:50] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2802774 (10Cmjohnson) @jcrespo the disk has been swapped please let me know if you need anything else. 
[16:20:55] Reedy, will you need me? [16:20:56] Horst__: Sent you a PM [16:21:31] Urbanecm: Shouldn't do... [16:22:00] Okay, I'll be back in few of minutes Reedy. [16:22:53] 06Operations, 10ops-eqiad, 10hardware-requests, 10netops, 13Patch-For-Review: Move labsdb1008 to production, rename it back to db1095, use it as a temporary sanitarium - https://phabricator.wikimedia.org/T149829#2802776 (10Cmjohnson) [16:22:59] 06Operations, 10ops-eqiad: relabel labsdb1008 to db1095, update racktables - https://phabricator.wikimedia.org/T150793#2802775 (10Cmjohnson) 05Open>03Resolved [16:26:14] 06Operations, 10ops-eqiad: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#2802779 (10Cmjohnson) @fgiunchedi I am going to need to take this server down and remove and re-assemble the controller. Let me know when I can do this. Thanks [16:26:56] the procedure is at https://wikitech.wikimedia.org/wiki/Password_reset#Reset_two_factor_authentication but requires https://wikitech.wikimedia.org/wiki/Password_reset/Confirming_identities [16:27:12] so i wouldn't call it undocumented [16:27:32] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2802784 (10Cmjohnson) @robh is there anything you can do with the vendor? We replaced, the system board, both disk back planes, ssds (several times). The only thing left is the raid co... [16:27:39] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2802785 (10jcrespo) Yes, I mentioned changing the thermal paste (unless you see some other reason to create thermal issues, such as a malfunctioning fan) and wiping the logical disk volume (not ph... 
[16:27:59] The technical side is easy [16:28:16] !log Deleted centralauth.oathauth_users row for Horst [16:28:33] The confirming identities is not unless confirmed [16:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:52] (03CR) 10BryanDavis: [C: 04-1] "A few small fixes needed." (034 comments) [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/322037 (https://phabricator.wikimedia.org/T150916) (owner: 10Gerrit Patch Uploader) [16:33:06] (03CR) 10BryanDavis: "check" [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/322037 (https://phabricator.wikimedia.org/T150916) (owner: 10Gerrit Patch Uploader) [16:34:11] HI Reedy [16:34:26] Hi [16:34:48] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2802801 (10Cmjohnson) @jcrespo sure, can I power down anytime? [16:35:15] 06Operations, 10ops-eqiad, 10DBA: Multiple hardware issues on db1073 - https://phabricator.wikimedia.org/T149728#2802802 (10jcrespo) Yes. [16:36:07] Key: EEKIF4FKBIJKN3FH [16:36:21] Hmm [16:36:25] Definitely wasn't in the row [16:36:41] Horst_: Can you try re-enabling it again, and see if it works? [16:38:07] Codes LLRAQKDXHT5DWKIP [16:40:13] !log upgrade nginx on prometheus and thumbor machines [16:40:19] UIF7XAIUACYVSJRA [16:40:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:39] (03PS2) 10Chad: static.php: Consolidate error headers in wmfStaticShowError() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320322 [16:41:29] If confirmed working, disable and reenable with a new secret that you do not share because this channel is logged. 
If not, this ought to be needing troubleshooting [16:42:05] (03CR) 10Chad: [C: 032] static.php: Consolidate error headers in wmfStaticShowError() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320322 (owner: 10Chad) [16:42:43] (03Merged) 10jenkins-bot: static.php: Consolidate error headers in wmfStaticShowError() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/320322 (owner: 10Chad) [16:43:50] 06Operations, 10ops-eqiad: eqiad: Rack and setup new restbase nodes - https://phabricator.wikimedia.org/T150964#2802840 (10Cmjohnson) [16:44:25] 06Operations, 10ops-eqiad: eqiad: Rack and setup new restbase nodes - https://phabricator.wikimedia.org/T150964#2802854 (10Cmjohnson) p:05Triage>03Normal [16:45:18] !log demon@tin Synchronized w/static.php: code duplication stuffs (duration: 00m 49s) [16:45:18] (03PS1) 10Rush: toollabs: refactor and establish norms [puppet] - 10https://gerrit.wikimedia.org/r/322127 [16:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:59] 06Operations, 10ops-eqiad: eqiad: Rack and setup new restbase nodes - https://phabricator.wikimedia.org/T150964#2802840 (10Cmjohnson) Received 3 servers destined to be restbase nodes T141005. Need names, locations, any pertinent details. 
[16:46:13] PROBLEM - pybal on lvs2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [16:46:23] PROBLEM - PyBal backends health check on lvs2002 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 [16:46:45] (03PS3) 10Chad: Standardize most of the docroots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321726 [16:47:06] 06Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 13Patch-For-Review: Decommission db1042 - https://phabricator.wikimedia.org/T149793#2802859 (10Cmjohnson) p:05Normal>03Low [16:47:22] 06Operations, 10ops-eqiad: decom palladium (datacenter) - https://phabricator.wikimedia.org/T149395#2802860 (10Cmjohnson) p:05Normal>03Low [16:47:40] 06Operations, 10ops-eqiad: Decommission strontium - https://phabricator.wikimedia.org/T142722#2802861 (10Cmjohnson) p:05Normal>03Low [16:47:43] (03CR) 10Chad: [C: 032] Standardize most of the docroots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321726 (owner: 10Chad) [16:48:15] (03Merged) 10jenkins-bot: Standardize most of the docroots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/321726 (owner: 10Chad) [16:48:16] 06Operations, 10ops-eqiad: Investigate strontium disk issues on 2016-08-05 - https://phabricator.wikimedia.org/T142187#2802862 (10Cmjohnson) 05Open>03Resolved Resolving this task. This server is to be decom'd [16:49:49] 06Operations, 06Security-Team: icinga notification if elevated writing to badpass.log - https://phabricator.wikimedia.org/T150300#2802855 (10bd808) >>! In T150300#2801598, @Tgr wrote: > @bd808 pointed to the Kibana watcher plugin: https://github.com/elasticfence/kaae If we wanted to try this plugin out, I thi... 
[16:51:02] !log demon@tin Synchronized docroot/: Unifying most docroots (duration: 00m 50s) [16:51:06] 06Operations, 10ops-eqiad: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#2802864 (10fgiunchedi) @Cmjohnson can be done at any time as long as a graceful `shutdown` is used, thanks! [16:51:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:58] !log change-prop deploying ed3711b [16:53:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:38] (03PS2) 10Rush: toollabs: refactor and establish norms [puppet] - 10https://gerrit.wikimedia.org/r/322127 [16:53:56] 06Operations, 10ops-eqiad, 10DBA: labsdb1009 boot issues (power supply and controller?) - https://phabricator.wikimedia.org/T150211#2802881 (10Cmjohnson) HP Support Case Opened. Case ID: 5315048494 Case title: Failed Power Supply Severity 3-Normal [16:58:12] jouncebot: next [16:58:12] In 0 hour(s) and 1 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T1700) [16:58:19] Yay, just made the cutoff :) [17:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T1700). [17:00:22] :) [17:01:41] I can SWAT [17:02:18] 06Operations, 10ops-eqiad, 10media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2802904 (10Cmjohnson) A new case was opened with HP to replace the raid card Case ID: 5315048752 Case title: Failed Raid Card Severity 3-Normal [17:02:26] (03PS2) 10Filippo Giunchedi: Add branch.autosetuprebase = always to my .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/321718 (owner: 10Chad) [17:02:50] !log powering down ms-be1016 to reseat the raid card. 
[17:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:41] (03CR) 10Filippo Giunchedi: [C: 032] Add branch.autosetuprebase = always to my .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/321718 (owner: 10Chad) [17:05:14] PROBLEM - Host ms-be1016 is DOWN: PING CRITICAL - Packet loss = 100% [17:05:41] (03CR) 10Andrew Bogott: [C: 031] "Looks to me like no-ops all around" [puppet] - 10https://gerrit.wikimedia.org/r/322127 (owner: 10Rush) [17:07:28] (03PS2) 10Filippo Giunchedi: scap: Remove deploy2graphite [puppet] - 10https://gerrit.wikimedia.org/r/321401 (owner: 10Chad) [17:09:17] ostriches: I'm leaving the gerrit/java8 last [17:09:30] the rest looks fine [17:09:37] Ok sounds fine [17:10:01] (03PS2) 10Gehel: Kartotherian: deploy application configuration with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/321374 (https://phabricator.wikimedia.org/T150021) [17:11:04] ooohh is that going today? awesome [17:11:08] waiting for CI [17:11:09] (gerrit) [17:11:14] !log restart apache2 on iridium to clear lagged queries refs T150965 [17:11:16] (03CR) 10Filippo Giunchedi: [C: 032] scap: Remove deploy2graphite [puppet] - 10https://gerrit.wikimedia.org/r/321401 (owner: 10Chad) [17:11:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:35] T150965: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965 [17:11:38] (03PS2) 10Filippo Giunchedi: Removing support for DOLOGMSGNOLOG [puppet] - 10https://gerrit.wikimedia.org/r/317848 (owner: 10Chad) [17:14:56] (03CR) 10Filippo Giunchedi: [C: 032] Removing support for DOLOGMSGNOLOG [puppet] - 10https://gerrit.wikimedia.org/r/317848 (owner: 10Chad) [17:15:23] PROBLEM - puppet last run on labtestweb2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/deploy2graphite] [17:15:56] apergos: Nah, just the installing of openjdk8 so migration is easier. [17:16:04] ok [17:16:10] first steps! [17:16:30] I started building gerrit w/ 8 last night, wasted almost 2 hours because the stupid build program decided to break OSX compat. [17:16:32] :) [17:16:52] ouch! [17:17:01] Well, OSX/BSD/anything-not-providing-realpath-out-of-the-box :p [17:17:13] PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [17:17:23] This is what happens when you write your wrapper scripts in bash :p [17:18:31] godog: Ouch, I left one off.... [17:18:42] https://gerrit.wikimedia.org/r/#/c/321916/ [17:18:48] Ok if I add? [17:19:23] RECOVERY - puppet last run on labtestweb2001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [17:20:01] ostriches: I'm not very familiar with the implications of that change and no consensus afaics, I wouldn't merge it [17:20:13] PROBLEM - puppet last run on lvs4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:21:19] godog: Doesn't really need a consensus built, I'm on a warpath. But your call :) [17:21:37] (compare, for example, https://gerrit.wikimedia.org/r/#/c/321726/) [17:22:33] RECOVERY - Host ms-be1016 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [17:23:51] 06Operations, 10ops-codfw, 10DBA: db2049 overheated and restarted - https://phabricator.wikimedia.org/T150876#2802986 (10Marostegui) I am burning 32 cores until tomorrow morning. [17:24:02] godog: are you referring to https://gerrit.wikimedia.org/r/#/c/321398/ ? I can merge that one, I reviewed it some days ago [17:24:32] 06Operations, 10ops-eqiad: ms-be1016 controller cache failure - https://phabricator.wikimedia.org/T150206#2802988 (10Cmjohnson) @fgiunchedi I removed and reassembled the raid card. Rebooted please take a look and lmk if you see anything unusual. 
[17:24:42] moritzm: ah no we were talking about https://gerrit.wikimedia.org/r/#/c/321916/, the gerrit one sounds fine I'm going to merge it next [17:25:10] ostriches: heh more like +1s than consensus, I'm a little wary when touching apache on puppet swat, maybe next time! [17:25:21] (03PS2) 10Filippo Giunchedi: Gerrit: Also install openjdk8 alongside 7, make it configurable [puppet] - 10https://gerrit.wikimedia.org/r/321398 (owner: 10Chad) [17:25:30] godog: ah, ok [17:25:35] godog: No worries at all :) [17:25:43] PROBLEM - MD RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:25:48] Those docroots are a freaking mess, I've been trying to unify them as much as possible :) [17:25:53] PROBLEM - very high load average likely xfs on ms-be1016 is CRITICAL: CRITICAL - load average: 221.12, 108.46, 42.77 [17:27:13] indeed, must feel like decluttering a basement [17:28:08] running pcc just in case on 321398 [17:28:27] Ah good idea, I skipped that [17:29:23] (03CR) 10Filippo Giunchedi: [C: 032] Gerrit: Also install openjdk8 alongside 7, make it configurable [puppet] - 10https://gerrit.wikimedia.org/r/321398 (owner: 10Chad) [17:29:33] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/4605" [puppet] - 10https://gerrit.wikimedia.org/r/321398 (owner: 10Chad) [17:30:47] ostriches: LGTM, puppet is running on cobalt [17:31:08] Thank youuuuuuuuu [17:31:43] ostriches: np! we should bounce gerrit to avoid surprises though, ok to do it now? [17:33:12] godog: good a time as any I suppose [17:34:11] ok! 
[17:34:22] !log bounce gerrit on cobalt after https://gerrit.wikimedia.org/r/321398 [17:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:20] {{done}} we're back [17:36:22] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2803028 (10GWicke) [17:37:21] cmjohnson: I see ms-be1016 with very high load average from icinga and can't ssh, is that you on the console? [17:37:59] ostriches looks like java 8 works with gerrit 2.12 [17:38:00] https://gerrit.git.wmflabs.org/r/#/q/status:open [17:38:01] :) [17:38:13] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config] [17:38:13] RECOVERY - Host lvs2002 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [17:40:17] !log phabricator: deploying hotfix for T150965 [17:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:41] T150965: phabricator close to saturate its database connections - https://phabricator.wikimedia.org/T150965 [17:41:17] (03PS1) 10Muehlenhoff: Add further retroactively assigned CVE IDs [debs/linux44] - 10https://gerrit.wikimedia.org/r/322131 [17:42:10] 06Operations, 06Security-Team: icinga notification if elevated writing to badpass.log - https://phabricator.wikimedia.org/T150300#2781237 (10fgiunchedi) We could (also) export the number of lines written to badpass to graphite and setup an icinga alert. The metric would be public though and so will the alert,... 
[17:43:10] (03CR) 10Muehlenhoff: [C: 032] Add further retroactively assigned CVE IDs [debs/linux44] - 10https://gerrit.wikimedia.org/r/322131 (owner: 10Muehlenhoff) [17:43:38] twentyafterfour searching is no longer working for me [17:43:48] searching for phabricator is not bringing up any projects [17:43:58] I get HTTP 500 [17:43:59] cmjohnson1: ms-be1016 has very load average and unsshable, is it you on the console? [17:44:00] godog: i should be off console now [17:44:04] ah thanks! [17:44:09] https://phabricator.wikimedia.org/search/ [17:44:11] ^^ shows 500 [17:44:33] twentyafterfour, revert [17:44:58] Oh main page is showing 500 too [17:45:12] ┻━┻ ︵ ¯\_(ツ)_/¯ ︵ ┻━┻ [17:45:17] yeah totally down =[ [17:45:25] cmjohnson1: still says console in use, need to reset the ilo in this case? [17:46:11] !log unbreak search [17:46:24] fixed [17:46:29] thanks [17:46:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:24] godog: should be good to go...for future reference stop /system1/oemhp_vsp1 [17:47:29] will reset vsp [17:49:09] cmjohnson1: awesome, thanks! [17:49:13] RECOVERY - puppet last run on lvs4001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:49:43] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:50:16] cmjohnson1: sigh, had to power reset, couldn't even login on console [17:50:51] odd...i had no issues [17:51:53] PROBLEM - Host ms-be1016 is DOWN: PING CRITICAL - Packet loss = 100% [17:52:48] twentyafterfour https://secure.phabricator.com/rPe053534c7e84b09e5f01ac3acb41352bb6a37e05 ? [17:52:53] Is that related to search? [17:53:03] PROBLEM - puppet last run on snapshot1005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:54:33] RECOVERY - Host ms-be1016 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [17:54:33] RECOVERY - MD RAID on ms-be1016 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [17:54:53] RECOVERY - very high load average likely xfs on ms-be1016 is OK: OK - load average: 39.91, 9.94, 3.33 [17:57:33] PROBLEM - swift-object-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:57:33] PROBLEM - swift-object-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:57:40] silenced [17:57:44] paladox: yeah that's the fix ;) [17:57:51] Oh :) [17:59:52] twentyafterfour should we revert https://phabricator.wikimedia.org/rPHAB9e9b2d958736c0fc39776a284890e3c78cd98095 and replace it with https://secure.phabricator.com/rPe053534c7e84b09e5f01ac3acb41352bb6a37e05 ? [18:00:04] yurik, gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T1800). Please do the needful. [18:00:25] paladox: no, I think the hotfix is fine, it's safer ;) [18:00:31] 06Operations, 06Security-Team: icinga notification if elevated writing to badpass.log - https://phabricator.wikimedia.org/T150300#2803110 (10bd808) >>! In T150300#2803049, @fgiunchedi wrote: > We could (also) export the number of lines written to badpass to graphite and setup an icinga alert. The metric would... 
[18:00:31] Oh [18:00:32] ok [18:00:33] :) [18:00:49] no parsoid deploy today [18:03:15] ACKNOWLEDGEMENT - HP RAID on db2035 is CRITICAL: CRITICAL: Slot 0: Failed: 1I:1:7 - OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12, Controller, Battery/Capacitor nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T150973 [18:03:18] 06Operations, 10ops-codfw: Degraded RAID on db2035 - https://phabricator.wikimedia.org/T150973#2803118 (10ops-monitoring-bot) [18:04:09] jynus: is COUNT (*) not working for mysql queries? [18:04:13] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [18:05:04] mafk: COUNT (*) is a very costly operation [18:05:19] it says 'syntax error' [18:05:42] COUNT (*) FROM user_groups WHERE ug_group = 'autopatrolled'; [18:05:53] maybe I have it badly-syntaxed too [18:05:54] prepend with SELECT [18:06:06] I did select count (*) and no success either [18:06:24] SELECT COUNT(*) FROM user_groups WHERE ug_group = 'autopatrolled'; is syntactically correct [18:06:28] where does it say it is the syntax error before ' '?
[18:06:52] try not adding a space between function and ( [18:06:57] but this is purely offtopic here [18:07:37] http://pastebin.com/vrkqg3P3 [18:07:48] sorry, I'm not sure what is the proper place [18:08:03] #wikimedia-labs is better [18:08:18] moving there [18:08:18] more users, more support, less noise here [18:09:00] I asked yesterday there and got silence as reply ;) [18:09:42] 06Operations, 10ops-codfw, 10DBA: db2035: RAID disk about to fail - https://phabricator.wikimedia.org/T150511#2803133 (10Papaul) a:05Papaul>03Marostegui @Marostegui disk replacement complete [18:13:25] (03PS1) 10Dereckson: Allow users to enable wikieditor-preview on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322136 [18:13:56] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2803153 (10Fjalapeno) @Anomie thanks… The explanation makes sense. @GWicke and I spoke at the Reading Services meeting and we are good to go - the co... 
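The COUNT question above comes down to two things: the statement needs a leading SELECT, and MySQL in its default SQL mode does not accept whitespace between a built-in function name such as COUNT and its opening parenthesis. A minimal sketch of the corrected query follows. It assumes an in-memory SQLite database as a stand-in for the real replica (SQLite is more lenient about that space, so it will not reproduce the error), and the three sample rows are invented for illustration only.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real MySQL replica.
# The rows below are made up purely for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_groups (ug_user INTEGER, ug_group TEXT)")
conn.executemany(
    "INSERT INTO user_groups VALUES (?, ?)",
    [(1, "autopatrolled"), (2, "sysop"), (3, "autopatrolled")],
)

# A complete statement: SELECT comes first, and COUNT is written
# with no space before the opening parenthesis.
query = "SELECT COUNT(*) FROM user_groups WHERE ug_group = ?"
(autopatrolled_count,) = conn.execute(query, ("autopatrolled",)).fetchone()
print(autopatrolled_count)
```

Passing the group name as a bound parameter rather than pasting it into the string is a design choice of this sketch, not something the channel discussed.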
[18:15:39] (03CR) 10Alex Monk: [C: 04-1] "We should determine the criteria for this list, then either write it down or (preferably) use it to determine which groups get oathauth-en" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322094 (https://phabricator.wikimedia.org/T150951) (owner: 10Reedy) [18:16:20] !log deploy more phabricator hotfixes [18:16:23] RECOVERY - swift-object-replicator on ms-be1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [18:16:24] RECOVERY - swift-object-server on ms-be1016 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [18:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:43] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [18:21:03] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [18:23:07] 06Operations, 10ops-codfw: db2042 disk predictive failure - https://phabricator.wikimedia.org/T150974#2803218 (10jcrespo) [18:24:14] twentyafterfour this https://phabricator.wikimedia.org/project/5/panel/new/custom.open-tasks/ shows blank to me? 
[18:25:02] paladox: yeah but it works ;) [18:25:12] Oh [18:25:17] it seems the problem patch was https://secure.phabricator.com/rP7cb44bcee6bf05470f64597133e0e7f425bcad0a [18:25:36] Oh i get it now [18:25:42] twentyafterfour change $field to fields [18:25:46] $fields [18:25:50] let me submit that [18:26:32] paladox: no, the fix I made is right [18:27:11] $fields is an array so you can't call $fields->getIsLockable() [18:27:54] Oh, but $field wont work since it would need to be put into the foreach [18:27:59] twentyafterfour ^^ [18:28:20] Since if you you scroll a bit up the diff it shows [18:28:22] foreach ($fields as $field) { [18:28:38] but the problem code is defined after that so $field does not exist [18:28:53] yeah but $field remains set after the foreach, except when there are no fields, then it's unset because the whole foreach essentially gets skipped in that case [18:29:28] php does not have separate scope beyond function scope, so something that's set inside the foreach doesn't get unset after the block ends [18:29:37] Oh [18:30:03] It seems that the problem code is the only one defined as $field [18:30:11] apart from the ones in the foreach that work [18:30:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:30:19] 06Operations, 06Security-Team: icinga notification if elevated writing to badpass.log - https://phabricator.wikimedia.org/T150300#2803284 (10fgiunchedi) @bd808 looks good! 
I guess the regular icinga/graphite check could be used in this case then [18:31:33] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:33:13] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:36:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:36:43] !log phabricator: deploy upstream fix for T150971 (upstream sha1: 7ebc47d906fe ) [18:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:05] T150971: "Undefined variable: field" exception trying to add a "Link to Open Tasks" to a project's menu items - https://phabricator.wikimedia.org/T150971 [18:38:13] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:39:23] twentyafterfour i just figured out, that when i click the submit button it links to the open tasks [18:39:33] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:39:34] but the form look blank so it looked like a problem [18:40:03] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2803304 (10GWicke) In the same conversation, @Pchelolo brought up a potential issue around image format conversion / selection. We do have a need to su... [18:44:43] (03PS1) 10Alex Monk: Interwiki map update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322141 (https://phabricator.wikimedia.org/T150926) [18:47:03] anyone around with knowledge of mediawiki/wikidata release cycle? [18:47:54] I am seeing some both good and bad patterns since 11/10 [18:48:05] MW or wikidata? Whats up? 
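The "Undefined variable: field" report (T150971) follows from the scoping behaviour twentyafterfour described: a loop variable stays set after the loop ends, but only if the loop body ran at least once. Python happens to behave the same way, so the pitfall can be sketched as an analogue. This is not the Phabricator code; the function names and the "lockable" key are hypothetical.

```python
def last_field_is_lockable(fields):
    # The loop variable is read after the loop, as in the problem patch.
    for field in fields:
        pass
    # With an empty list the loop body never runs, so `field` was never
    # assigned and this line raises UnboundLocalError.
    return field["lockable"]


def last_field_is_lockable_safe(fields):
    # Give the variable a value before the loop so the empty case is defined.
    field = None
    for field in fields:
        pass
    return field is not None and field["lockable"]


try:
    last_field_is_lockable([])
    empty_case_failed = False
except UnboundLocalError:
    empty_case_failed = True
```

With at least one element both versions return the same value; only the empty list separates them.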
[18:48:20] there is no "ongoing problem" [18:48:31] but a drastic data pattern change [18:49:08] I am looking at https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=8&fullscreen&from=1474220937794&to=1479408537794&var-dc=eqiad%20prometheus%2Fops&var-group=All&var-shard=s1&var-shard=s2&var-shard=s3&var-shard=s4&var-shard=s5&var-shard=s6&var-shard=s7&var-role=All [18:49:52] we are now reading half the rows that we used to [18:50:02] which is good, and could be an optimization [18:50:14] jouncebot, next [18:50:15] In 0 hour(s) and 9 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T1900) [18:50:15] In 0 hour(s) and 9 minute(s): Wikidata query service (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T1900) [18:50:27] both deployments at the same time? [18:50:39] on the other side, s5 (normally it is wikidata) is reading 10 times more rows than the rest of the wikis [18:51:07] I wonder if it could be related to a deployment? [18:52:05] Hmmm [18:52:08] change seems to happen around 2016-11-10 since 21h [18:52:21] let me search the deployment schedule [18:52:32] it is on SAL, too, right? 
[18:52:42] Any scap/sync should be ya [18:52:51] And the 10th would've been last thurs [18:52:59] So, group2, wikipedias [18:53:47] !log starting mobileapps deploy [18:54:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:35] to be fair, I have no reason to belive any problem on the deployment [18:54:40] in fact, maybe a huge win [18:55:07] jynus: Maybe a question for Lydia_WMDE :) [18:55:12] but maybe it is just exposing an existing wikidata issue [18:55:14] correct [18:55:44] (03PS4) 10BryanDavis: bigbrother: Rewrite as python script [puppet] - 10https://gerrit.wikimedia.org/r/309216 (https://phabricator.wikimedia.org/T144955) [18:57:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [18:57:41] 06Operations, 06Performance-Team, 10Thumbor: Thumbor 504s on several images Mediawiki renders succesfully - https://phabricator.wikimedia.org/T150746#2803360 (10fgiunchedi) Definitely we don't, did you have a chance to see if the mw/200 vs thumbor/504 was real or intermittent? I spot-checked the urls above a... [18:57:52] !log deployed mobileapps bf44547 [18:58:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T1900). [19:00:04] Addshore and Krenair: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [19:00:04] gehel: Dear anthropoid, the time has come. Please deploy Wikidata query service (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T1900). [19:00:04] SMalyshev and Jonas_WMDE: A patch you scheduled for Wikidata query service is about to be deployed. 
Please be available during the process. [19:00:40] Hi Krenair! [19:01:09] hey [19:02:44] addshore, so you first [19:02:54] addshore, want to deploy my patch too or shall I? [19:03:09] Krenair: feel free to go before me (mine will take longer) [19:03:15] ok [19:04:48] 06Operations, 10Beta-Cluster-Infrastructure, 10Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2803392 (10fgiunchedi) 05Open>03Resolved @krenair I don't think the realm/headers is related The swift proxy needs restarting once credentials are in place, I've do... [19:05:27] (03CR) 10Alex Monk: [C: 032] Interwiki map update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322141 (https://phabricator.wikimedia.org/T150926) (owner: 10Alex Monk) [19:06:01] (03Merged) 10jenkins-bot: Interwiki map update [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322141 (https://phabricator.wikimedia.org/T150926) (owner: 10Alex Monk) [19:06:13] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:06:20] 06Operations, 10Beta-Cluster-Infrastructure, 10Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2803398 (10Krenair) thanks. puppet doesn't handle that automatically? [19:07:45] !log krenair@tin Synchronized wmf-config/interwiki.php: https://gerrit.wikimedia.org/r/#/c/322141/ (duration: 00m 50s) [19:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:56] addshore, now it's your turn [19:09:04] Great, I'm going! 
[19:09:26] +2ed https://gerrit.wikimedia.org/r/#/c/322092/ [19:10:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [19:10:15] 06Operations, 10Beta-Cluster-Infrastructure, 10Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2803407 (10fgiunchedi) No it doesn't because in production that'd mean uncoordinated restarts of swift-proxy. It would probably be fine I think to restart since swift-pr... [19:11:13] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:12:32] (03PS1) 10Aklapper: Phab: Remove custom UI string translations not in use anymore [puppet] - 10https://gerrit.wikimedia.org/r/322144 [19:12:33] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:13:18] (03PS2) 10Aklapper: Phab: Remove custom UI string translations not in use anymore [puppet] - 10https://gerrit.wikimedia.org/r/322144 [19:13:27] 06Operations, 10Beta-Cluster-Infrastructure, 10Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2803410 (10Krenair) >>! In T150649#2803407, @fgiunchedi wrote: > No it doesn't because in production that'd mean uncoordinated restarts of swift-proxy. It would probably... 
[19:14:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:16:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [19:17:03] PROBLEM - MariaDB Slave Lag: s4 on db2037 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 591.35 seconds [19:17:23] 06Operations, 10Beta-Cluster-Infrastructure, 10Thumbor: Thumbor keeps losing Swift auth on beta - https://phabricator.wikimedia.org/T150649#2803414 (10fgiunchedi) I'd say the closest to such a list is https://wikitech.wikimedia.org/wiki/Service_restarts [19:20:48] pulled onto 1099 [19:21:33] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:22:13] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [19:23:08] syncing [19:25:10] !log addshore@tin Synchronized php-1.29.0-wmf.3/extensions/Wikidata: {{gerrit|322092}} T150948 Backporting fix for quantity precision issue. (duration: 02m 17s) [19:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:30] T150948: [Bug] quantity precision omits precision +-0 - https://phabricator.wikimedia.org/T150948 [19:25:41] well... that was quicker than I was expecting Krenair [19:26:07] SWAT all done! :) [19:26:16] audephone: aude ^^ all done [19:30:09] addshore: thanks :) [19:31:03] RECOVERY - MariaDB Slave Lag: s4 on db2037 is OK: OK slave_sql_lag Replication lag: 51.15 seconds [19:32:41] I think that was me? 
[19:32:58] (the ongoing schema change) [19:33:50] I will stand by, cancel if it happens again [19:34:11] 06Operations, 10hardware-requests: eqiad/codfw: swift frontend hardware refresh - https://phabricator.wikimedia.org/T148510#2803503 (10RobH) [19:35:01] 06Operations, 10hardware-requests: eqiad/codfw: swift frontend hardware refresh - https://phabricator.wikimedia.org/T148510#2725129 (10RobH) The servers have been selected and placed for delivery. (Just updating this public task.) Once they arrive onsite, a setup task will be created, and this task resolved. [19:37:52] 06Operations, 10ops-eqiad: eqiad: Rack and setup new restbase nodes - https://phabricator.wikimedia.org/T150964#2802840 (10fgiunchedi) We're expanding in the same rows as existing restbase systems, therefore: * restbase1016 row A * restbase1017 row C * restbase1018 row D [19:42:35] 06Operations, 06Performance-Team, 10Thumbor: Match cache headers between thumbor and mediawiki - https://phabricator.wikimedia.org/T150642#2803054 (10BBlack) I think there's some confusion above between different layers in your test outputs. When you hit `https://upload.wikimedia.org` you're hitting Varnish... [19:42:50] 06Operations, 10ops-codfw, 06DC-Ops, 10Traffic: lvs2002 Embedded Flash/SD-CARD iLO errors - https://phabricator.wikimedia.org/T126321#2803055 (10Papaul) I clear the log and will leave this task open for now. Before the system update {F4735526} {F4735528} After the system update {F4735607} {F4735609} [19:44:08] jynus: I believe that was indeed the date when the last version of wikidata was released [19:45:22] jynus: hey :) here now. i was highlighted before. still relevant? [19:46:14] (03CR) 10Reedy: "Well, everyone should get access to OATHAuth in the near future, so that's moot..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/322094 (https://phabricator.wikimedia.org/T150951) (owner: 10Reedy) [19:46:33] (03PS2) 10Filippo Giunchedi: prometheus: switch 'ops' prometheus to varbit encoding [puppet] - 10https://gerrit.wikimedia.org/r/321941 [19:46:39] Lydia_WMDE, I was looking at: [19:46:41] jynus> I am looking at https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=8&fullscreen&from=1474220937794&to=1479408537794&var-dc=eqiad%20prometheus%2Fops&var-group=All&var-shard=s1&var-shard=s2&var-shard=s3&var-shard=s4&var-shard=s5&var-shard=s6&var-shard=s7&var-role=All [19:47:49] jynus: ok not sure what is going on. I will send an email to the team [19:48:37] (03CR) 10Filippo Giunchedi: [C: 032] prometheus: switch 'ops' prometheus to varbit encoding [puppet] - 10https://gerrit.wikimedia.org/r/321941 (owner: 10Filippo Giunchedi) [19:49:26] (03CR) 10Alex Monk: "which specific rights?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322094 (https://phabricator.wikimedia.org/T150951) (owner: 10Reedy) [19:53:11] (03CR) 10Reedy: "Precisely. That's gonna be one of the few things we can do it dynamically based on... Which is basically the problem we're having; wanting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322094 (https://phabricator.wikimedia.org/T150951) (owner: 10Reedy) [19:59:21] addshore: Krenair: SWAT done? [19:59:30] yeah [19:59:33] Got a quick hotpatch to roll out to WikimediaEvents to ensure perf metrics are in sync with each other [19:59:39] ages ago [19:59:45] k, thanks [19:59:47] yup! [19:59:52] jouncebot, next [19:59:53] In 0 hour(s) and 0 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T2000) [20:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T2000). 
[20:00:59] Krinkle: ping me when your hotpatch is done and I'll run the train [20:03:45] (03PS5) 10BryanDavis: bigbrother: Rewrite as python script [puppet] - 10https://gerrit.wikimedia.org/r/309216 (https://phabricator.wikimedia.org/T144955) [20:03:47] !log krinkle@tin Synchronized php-1.29.0-wmf.3/extensions/WikimediaEvents/modules/ext.wikimediaEvents.visibilitychange.js: Ibd0935bef8f (duration: 00m 48s) [20:03:51] thcipriani: Done [20:04:01] Krinkle: thank you [20:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:34] (03PS1) 10Thcipriani: all wikis to 1.29.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322146 [20:06:36] (03CR) 10Thcipriani: [C: 032] all wikis to 1.29.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322146 (owner: 10Thcipriani) [20:07:33] (03Merged) 10jenkins-bot: all wikis to 1.29.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322146 (owner: 10Thcipriani) [20:07:57] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.29.0-wmf.3 [20:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:13] 06Operations, 06Multimedia, 06Performance-Team, 10RESTBase-API, and 4 others: Thumb API: Varnish / CDN questions - https://phabricator.wikimedia.org/T150673#2803660 (10Legoktm) [20:14:17] 06Operations, 10ops-eqiad: eqiad: Rack and setup new restbase nodes - https://phabricator.wikimedia.org/T150964#2803688 (10fgiunchedi) Also note to me: when adding these new systems note the fact that cassandra 'rack' assignment was wrong (since https://gerrit.wikimedia.org/r/#/c/191339/) in eqiad (rack 'b' wh... [20:25:05] 06Operations, 10Gerrit, 06Release-Engineering-Team, 10hardware-requests: Requesting 1 spare misc box for Gerrit in codfw - https://phabricator.wikimedia.org/T148187#2803734 (10RobH) >>! In T148187#2801786, @hashar wrote: > @demon can we stick with 400GBytes disk ? 
Not sure there is a point in buying 800 G... [20:26:22] !log applying schema change on s1 (page) T69223 [20:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:42] T69223: Schema change for page content language - https://phabricator.wikimedia.org/T69223 [20:28:22] 06Operations, 10Gerrit, 06Release-Engineering-Team, 10hardware-requests: Requesting 1 spare misc box for Gerrit in codfw - https://phabricator.wikimedia.org/T148187#2803739 (10RobH) Chatted with Chad in IRC. Basically means the 800GB quote (on sub task) may be overkill, and I'll get a quote for dual 400GB. [20:30:31] (03PS3) 10Rush: toollabs: refactor and establish norms [puppet] - 10https://gerrit.wikimedia.org/r/322127 [20:32:55] (03CR) 10Rush: [C: 032] toollabs: refactor and establish norms [puppet] - 10https://gerrit.wikimedia.org/r/322127 (owner: 10Rush) [20:34:13] (03PS1) 10Jcrespo: mariadb: Update package creation method [software] - 10https://gerrit.wikimedia.org/r/322147 (https://phabricator.wikimedia.org/T127811) [20:35:35] (03CR) 10jenkins-bot: [V: 04-1] mariadb: Update package creation method [software] - 10https://gerrit.wikimedia.org/r/322147 (https://phabricator.wikimedia.org/T127811) (owner: 10Jcrespo) [20:48:02] (03PS1) 10Rush: gridengine: refactor and establish norms [puppet] - 10https://gerrit.wikimedia.org/r/322149 [20:50:27] PROBLEM - puppet last run on kafka1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:55:03] !log update RESTBase to b9722ba7c - staging [20:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:59:39] 06Operations, 06Performance-Team, 10Thumbor: Match cache headers between thumbor and mediawiki - https://phabricator.wikimedia.org/T150642#2803857 (10Gilles) Right, I was just surprised that users don't get caching headers at all. I expected a low TTL, not nothing at all. I guess it's probably that way becau... 
[20:59:58] !log update RESTBase to b9722ba7c - canary on restbase1007 [21:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:07] !log update RESTBase to b9722ba7c [21:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:58] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 404 (expecting: 200) [21:12:58] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy [21:15:49] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is inactive [21:16:08] ^I see this [21:16:48] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active [21:19:28] RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [21:32:05] 06Operations, 06Discovery, 06Maps, 06WMF-Legal, 03Interactive-Sprint: Define tile usage policy - https://phabricator.wikimedia.org/T141815#2803954 (10Pnorman) > And I try to ask around MapQuest what traffic levels did they observe before throwing it in. Some statistics from the HOT layer were that when... 
[21:33:33] (03PS4) 10Zppix: Adding nick change functionality automatically [wikimedia/bots/jouncebot] - 10https://gerrit.wikimedia.org/r/322037 (https://phabricator.wikimedia.org/T150916) (owner: 10Gerrit Patch Uploader) [21:34:37] (03PS5) 10Dzahn: contint: move .htaccess content for doc/integration to puppet [puppet] - 10https://gerrit.wikimedia.org/r/322019 (https://phabricator.wikimedia.org/T149928) [21:35:02] (03PS1) 10Kaldari: Test cookie blocking on Test Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322154 (https://phabricator.wikimedia.org/T150991) [21:37:03] (03CR) 10Dzahn: [C: 032] contint: move .htaccess content for doc/integration to puppet [puppet] - 10https://gerrit.wikimedia.org/r/322019 (https://phabricator.wikimedia.org/T149928) (owner: 10Dzahn) [21:37:39] oh.. i didnt think that i dont have +2 on the other repo [21:37:43] jouncebot: now [21:37:43] For the next 0 hour(s) and 22 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161117T2000) [21:38:56] 06Operations, 06Discovery, 10hardware-requests, 06Discovery-Search (Current work): Estimate hardware requirements for ordering new servers for Elasticsearch - https://phabricator.wikimedia.org/T148559#2803979 (10Deskana) 05Open>03Resolved >>! In T148559#2751079, @Deskana wrote: > I believe that this ta... [21:48:09] Reedy, twentyafterfour, (or really, anyone), any idea who I can ask for help with https://phabricator.wikimedia.org/T150373 ? [21:49:06] andrewbogott: Which part? MW or the bot? [21:49:40] Reedy: well, either — I've been assuming this is a MW bug but I can certainly change the bot if it's to blame [21:50:35] The bot is running on multiple hosts at the same time (possibly in multiple threads per host). So it's certainly the case that there are multiple simultaneous logins. [21:50:36] andrewbogott: Turn the bot off? [21:50:39] * ostriches ducks [21:50:51] disable the bot temp to see if it fixes that andrewbogott? 
[21:51:01] ostriches: worse yet, the reason I care is I want to add /more/ functionality like that [21:51:19] Zppix: but… the problem is that my edits are failing [21:51:31] andrewbogott to what the bot? [21:52:11] Those edits are produced by hooks in openstack nova. They update the state pages about Labs vms [21:52:21] Blah. Or we could kill those stupid pages. [21:52:56] um… really? I mean, you aren't interested in the mw api being broken and I just shouldn't use it? [21:53:48] andrewbogott oh this is labs' vms nevermind i know jackshit about how labs does their vms and stuff [21:53:51] We get these CAS bugs now and again in production [21:54:13] AFAIK this is exacerbated by so many logins at the same time [21:54:21] Zppix: except the bug doesn't have anything to do with labs. This is a service running on production that uses mwclient to edit a page on wikitech [21:54:27] the context /really/ shouldn't matter [21:54:28] I hope [21:54:31] (03CR) 10Tim Landscheidt: "I rewrote bigbrother as a module to tools-manifest (because it essentially performs the same task as webservicemonitor; cf. I00cd7a90273e0" [puppet] - 10https://gerrit.wikimedia.org/r/309216 (https://phabricator.wikimedia.org/T144955) (owner: 10BryanDavis) [21:54:35] What Reedy said ^ [21:54:52] andrewbogott: And no, it's not labs-specific at all. [21:54:57] We see this in prod from time to time [21:54:58] So, options include... [21:55:03] Multiple accounts (ugh) [21:55:05] Staggering [21:55:17] Sharing logins [21:56:15] Could allow anonymous edits so don't have to login ;-) [21:56:36] heh [21:56:44] ostriches to what? [21:56:48] everything [21:56:59] Is this use case /inherently/ wrong? Does the API not support editing from multiple places with the same account? [21:57:03] andrewbogott: Can we do something about the times the cronjobs run at? 
[21:57:19] andrewbogott: it's not really the API [21:57:21] (03CR) 10Gehel: [C: 031] "LGTM, hourly.sh script is deployed on stat1002, we can merge this" [puppet] - 10https://gerrit.wikimedia.org/r/319252 (https://phabricator.wikimedia.org/T149722) (owner: 10MaxSem) [21:57:22] if we allowed anom edits to development things, wouldnt that hurt our ablity to communicate stuff [21:57:29] (03PS2) 10Gehel: Switch discovery-stats cronjob to a dedicated script [puppet] - 10https://gerrit.wikimedia.org/r/319252 (https://phabricator.wikimedia.org/T149722) (owner: 10MaxSem) [21:57:36] Reedy: what then? The api auth extension? [21:58:12] https://phabricator.wikimedia.org/T95839 [21:58:13] Personally, I think "stop updating those stupid state pages on wikitech, they're useless" is the best suggestion so far. [21:58:19] (and not just because it's my suggestion either) [21:58:19] Reedy: I can make the updates happen less often, but I can't marshal actions across n hosts (currently 13) to make sure they're perfectly staggered [21:58:41] !log doc.wm.org Apache config live-hack fixed, puppet patch coming up [21:58:47] ok y'all, I'm very sorry I asked [21:59:01] andrewbogott: It's annoying we don't have a better answer [21:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:19] Is there a way that the nodes could remain logged in? [21:59:21] But no, the general expectation is that MW isn't being logged into many times from many locations [21:59:25] It's the login stampede that sucks. [21:59:28] It's only software that we wrote and maintain. Obviously diagnosing and fixing the actual problem is out of the question. 
[21:59:29] ^ is another solution, yeah
[21:59:52] ostriches: As far as I know they do remain logged in… the session is cached per thread
[22:00:04] (CR) Gehel: [C: 2] Switch discovery-stats cronjob to a dedicated script [puppet] - https://gerrit.wikimedia.org/r/319252 (https://phabricator.wikimedia.org/T149722) (owner: MaxSem)
[22:00:06] Oh, hourly basis, missed that.
[22:00:08] Hmmm
[22:00:09] I mean, none of what you've suggested gets me anywhere other than "this will happen less often"
[22:00:33] (PS1) Dzahn: contint/doc.wm.org: add missing DirectoryIndex, remove old 2.2 syntax [puppet] - https://gerrit.wikimedia.org/r/322201
[22:00:44] andrewbogott: It's not actually a bug in MW, MW is doing what it's supposed to (even if the error message itself sucks)
[22:00:52] hmm
[22:01:16] ostriches: oh? That's not obvious to me, what is it doing?
[22:01:19] ostriches: Think a getInstanceForUpdate() call in updateUser would help?
[22:01:49] andrewbogott: Basically, you're trying to login from two places at once, and updating user_touched fails because of the race between the logins.
[22:02:13] multiple things can't have the same lock at the same time
[22:02:15] Thread A starts login, Thread B starts login, Thread A logs in and updates user_touched, Thread B explodes because user_touched is like WTF you're out of date!
[22:02:56] Hm, in theory I'm catching that and retrying. But then I just get throttled and /all/ logins failed after a few retries :(
[22:03:09] Ahhhh, now the throttle we could fix!
[22:03:15] We could raise the throttle limits for your bot
[22:03:26] Is it actually the throttle?
[22:03:27] * Get a new instance of this user that was loaded from the master via a locking read
[22:03:27] *
[22:03:27] * Use this instead of the main context User when updating that user. This avoids races
[22:03:27] * where that user was loaded from a replica DB or even the master but without proper locks.
[22:03:37] This... 
sounds vaguely like what we want
[22:04:07] (PS2) Dzahn: contint/doc.wm.org: add missing DirectoryIndex, remove old 2.2 syntax [puppet] - https://gerrit.wikimedia.org/r/322201 (https://phabricator.wikimedia.org/T149928)
[22:04:14] Reedy: Using an instance for updating when....updating a user....seems like the right approach ;-)
[22:04:20] (PS3) Dzahn: contint/doc.wm.org: add missing DirectoryIndex, remove old 2.2 syntax [puppet] - https://gerrit.wikimedia.org/r/322201 (https://phabricator.wikimedia.org/T149928)
[22:04:44] Imma make a patch
[22:04:48] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 17 failures. Last run 2 minutes ago with 17 failures. Failed resources (up to 3 shown): Service[ferm],Service[diamond],Service[prometheus-node-exporter],Package[ecryptfs-utils]
[22:04:50] Reedy: It'll mean it'll still fail, just in a slightly less opaque way
[22:05:02] Which is an improvement, at least
[22:05:03] In which case, andrewbogott's try/catch should save us
[22:05:07] here's my retry code — it's quite likely stupid
[22:05:08] https://phabricator.wikimedia.org/diffusion/OPUP/browse/HEAD/modules/openstack/files/liberty/nova/wikistatus/wikistatus.py;48f81347f67c81036aabe0e26abaf60271afa596$107
[22:05:11] I note Aaron applied this fix elsewhere
[22:05:48] Ah ok, so two things here we can do
[22:05:52] maybe I should make that 2-second sleep much longer...
[22:05:58] I note that function is out of date too
[22:06:01] 1) What Reedy's proposing, it'll make it explode less opaquely
[22:06:18] and 2) up the throttles on andrew's bot so his try/catch logic has a chance to work right
[22:06:27] (CR) Dzahn: [C: 2] contint/doc.wm.org: add missing DirectoryIndex, remove old 2.2 syntax [puppet] - https://gerrit.wikimedia.org/r/322201 (https://phabricator.wikimedia.org/T149928) (owner: Dzahn)
[22:06:31] what would you guess the default throttle is?
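[editor's note] The Thread A / Thread B race on user_touched and the retry-with-sleep idea discussed above can be illustrated with a small simulation. This is a sketch under stated assumptions, not MediaWiki's code and not the wikistatus.py retry logic: `UserRow`, `CasConflict`, `retry_with_backoff`, and `demo` are hypothetical names, and the compare-and-swap is modelled with a plain integer. Note that every retry is another login attempt, so a small attempt count and growing, jittered delays matter if a login throttle is counting them.

```python
import random
import time


class CasConflict(Exception):
    """Raised when the stored value changed since it was read."""


class UserRow:
    """Stand-in for a user row with a user_touched-style version field."""

    def __init__(self):
        self.touched = 0

    def compare_and_swap(self, expected, new):
        # Only write if nobody else has updated the row since we read it.
        if self.touched != expected:
            raise CasConflict(f"expected {expected}, found {self.touched}")
        self.touched = new


def retry_with_backoff(action, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call action(); on CasConflict wait with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return action()
        except CasConflict:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the conflict to the caller.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))


def demo():
    row = UserRow()
    stale = row.touched                                  # "Thread B" reads the row
    row.compare_and_swap(row.touched, row.touched + 1)   # "Thread A" wins the race
    first_read = iter([stale])

    def login_b():
        # First attempt uses the stale value; retries re-read the row.
        expected = next(first_read, None)
        if expected is None:
            expected = row.touched
        row.compare_and_swap(expected, expected + 1)
        return row.touched

    return retry_with_backoff(login_b, sleep=lambda seconds: None)
```

In the demo the first attempt by "Thread B" fails because the row moved on, and the retry succeeds after re-reading it. The `sleep` parameter is injectable only so the demo does not actually wait.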
[22:06:33] andrewbogott: Yeah, I'd bump it as long as the actual time-to-update doesn't matter much [22:06:43] Like, 3? [22:06:55] (03CR) 10Hashar: "That at least restore access for directories having index.php index.html and the root directory." [puppet] - 10https://gerrit.wikimedia.org/r/322201 (https://phabricator.wikimedia.org/T149928) (owner: 10Dzahn) [22:06:55] I mean, is the throttle likely to be single digits? [22:07:32] Failed logins in N time? Prolly yeah [22:07:35] Lemme check [22:07:46] thx [22:07:49] ostriches: I note it seems to try and saveSettings numerous times too [22:07:59] If it's really like 3 or 5 then that pretty much explains everything :) [22:08:09] Which can only exacerbate the situation [22:08:24] !log contint1001 - Apache fix puppetized, re-enabeld [22:08:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:48] I thought we $wgRateLimits had one for login... [22:08:50] Hmm.... [22:09:10] (03CR) 10Dzahn: "needed follow-up" [puppet] - 10https://gerrit.wikimedia.org/r/322019 (https://phabricator.wikimedia.org/T149928) (owner: 10Dzahn) [22:10:14] quick question in .gitconfig does the git review need be your gerrit login (on web) or does it need to be your shell username i forget [22:10:21] (im troubleshooting somethings with git) [22:10:54] Hmm, the content of doc.wm.org looks like it's regressed a lot (missing half the old content). [22:10:55] Hm, I'm also puzzled by [22:10:58] https://www.irccloud.com/pastebin/yOfffC4x/ [22:11:18] Like, shouldn't that already be doing the retry/wait? Maybe that only works for certain exceptions and not for this one? [22:11:21] James_F see -releng [22:11:35] Hmm, maybe we use ping limiter... 
[22:11:52] Zppix: Shell [22:12:34] btw, ostriches, for context: The reason I care about those pages is that within the Openstack paradigm, there's not otherwise a way for a user to know what's going on inside a project (or, coming soon, even know a project exists) without being a project member. [22:12:49] And I remain pretty tied to the idea that what's-on-labs should remain transparent [22:12:59] ostriches: i may end up needing reinstall git then (either that or gerrit doesnt like my account) is it possible to see if my gerrit account could be for some reason corrupted in some manner (if thats even possible) [22:13:11] Having in-sync status pages and reports seems like a good way to produce that transparency. [22:13:20] andrewbogott: I don't disagree with that ideal, I'm just not sure that editing a bunch of MW pages all the time is the best solution long term :) [22:13:36] But that's beyond the scope of today / your current problem :) [22:13:37] It's weird [22:13:51] Zppix: That would be a first (corruption) [22:14:22] Aha, it's back, I guess Daniel fixed it. [22:14:51] James_F: yes, the last merge there fixed it (mostly) [22:14:57] ostriches hmm does git review have a update cmd? [22:15:06] we moved the content of .htaccess files into main config.. and side-effects from 2.2->2.4 change [22:15:27] mutante: Fun. :-) [22:16:03] andrewbogott: The Ldap extension really needs rewriting from the ground up... [22:16:35] Zppix: Like self-updating? No. What version are you on though? [22:17:03] (also insert the general scolding against using git-review at all :p) [22:17:05] that doesn't shock me :( [22:17:23] ostriches i have no choice its easier for people whom suck a$$ at git [22:17:37] * ostriches shrugs :) [22:17:56] Zppix: Coincidentally, git-review also sucks at git! 
:p [22:17:58] It'll need doing for definite at some point [22:18:01] ok, in the case of my last test I can see it trying three logins in a row, all three fail with that same 'CAS update failed on user_touched ' [22:18:08] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:18:53] ostriches git-review version 1.25.0 [22:19:06] Ok, so the new/required version [22:19:08] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [22:19:27] ostriches maybe a git-review reinstall might be in order? [22:21:26] (03PS1) 10Dzahn: contint/doc.wm.org: drop dir.php from DirectoryIndex [puppet] - 10https://gerrit.wikimedia.org/r/322206 (https://phabricator.wikimedia.org/T149928) [22:21:32] predicts a comment about git-review :) [22:22:28] mutante i killed ostriches with my questions :P [22:22:29] Zppix: i have 1.25.0-2 from Debian repo [22:22:38] _not_ from pip [22:23:03] mutante you realise me not using pip is the same thing as you trying to hack into the NSA right [22:23:53] well, i have had bad experience with it, when you tell it to "pip remove" stuff, it is a liar and claims it does but did not really delete all files, then later they conflict with the files from distro and great "fun" is had [22:24:13] once i found that and deleted them all and used only distro.. my problems were gone [22:24:28] mutante i had that issue on my tools.account on labs [22:24:32] tools labs * [22:25:11] (03PS2) 10Dzahn: contint/doc.wm.org: drop dir.php from DirectoryIndex [puppet] - 10https://gerrit.wikimedia.org/r/322206 (https://phabricator.wikimedia.org/T149928) [22:25:41] it's the mixing of different package managers.. 
[22:25:46] that didnt end well [22:26:08] well, and pip should actually remove stuff on "remove" of course [22:26:34] (03CR) 10Dzahn: [C: 032] contint/doc.wm.org: drop dir.php from DirectoryIndex [puppet] - 10https://gerrit.wikimedia.org/r/322206 (https://phabricator.wikimedia.org/T149928) (owner: 10Dzahn) [22:27:16] mutante if i reinstall git-review will i have to reconfig all my cloned repos again [22:27:24] no [22:27:29] ok [22:27:29] that's what the .gitreview files are fo [22:27:30] r [22:27:33] Hm, I suppose I can't vary wgPasswordAttemptThrottle by account [22:29:03] i have to go afk to drive my car back from office, cu later [22:30:02] (03PS1) 10Andrew Bogott: Wikitech: Increase login throttle limits x4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322207 (https://phabricator.wikimedia.org/T150373) [22:31:17] andrewbogott: I don't think so [22:31:31] at most, you maybe could do that for *performing* account [22:31:39] not for target account, which is the interesting bit [22:33:08] PROBLEM - puppet last run on mc1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:33:46] (03PS1) 10Andrew Bogott: wikistatus: Further attempt to solve login races [puppet] - 10https://gerrit.wikimedia.org/r/322208 (https://phabricator.wikimedia.org/T139773) [22:35:29] (03CR) 10Chad: "Minor inline nit, otherwise lgtm. 20s sleep will probably help." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/322208 (https://phabricator.wikimedia.org/T139773) (owner: 10Andrew Bogott) [22:39:38] 06Operations, 10Phabricator, 13Patch-For-Review: networking: allow ssh between iridium and phab2001 - https://phabricator.wikimedia.org/T143363#2804228 (10mmodell) This should be resolved now, I think? 
[22:39:58] (03CR) 10Andrew Bogott: wikistatus: Further attempt to solve login races (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/322208 (https://phabricator.wikimedia.org/T139773) (owner: 10Andrew Bogott) [22:40:38] mutante found my problem my dumbass forget to turn on peagent [22:40:46] (03CR) 10Chad: [C: 031] wikistatus: Further attempt to solve login races (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/322208 (https://phabricator.wikimedia.org/T139773) (owner: 10Andrew Bogott) [22:41:55] Reedy: how long would you expect these sessions to last before I have to re-login? [22:42:19] It seems like every time I do something I have to log in again, I can't tell if that means the caching is broken or if the cache is just very short-lived [22:42:26] um… the session I mean [22:43:10] I presume the script isn't persistent? [22:43:12] do you save the cookies somewhere? [22:43:23] ^ That ^ [22:43:31] If you don't store the cookie, "remember me" won't do anything :) [22:45:08] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [22:45:19] andrewbogott im assuming if its gerrit because remember isnt clicked therefore no cookie no remember [22:45:40] we're not talking about gerrit [22:45:48] oh [22:45:55] ah now i see [22:46:01] i overlooked "sessions" [22:46:18] Reedy: I think it is persistent, unless there's a thread being spun off someplace in library code that I haven't found yet [22:46:31] OOI, what version of mwclient are you using? [22:46:51] 0.8.1-1 [22:47:47] I'd presume, by default it wouldn't be persistent [22:48:22] I don't think mwclient had persistent session storage [22:48:35] you can see bryan's and my attempt to cache the session here: https://gerrit.wikimedia.org/r/#/c/321169/ [22:49:08] For the login to be remembered.. 
it'd need to be a long runing script
[22:49:11] with long sleep()
[22:49:12] I guess
[22:49:23] that's only per script invocation
[22:49:27] what you're looking at is a hook, not a script
[22:49:42] so it lives within the nova-compute service itself which is long-lived
[22:49:43] Where's it run from?
[22:49:51] How does that run/instantiate it?
[22:50:10] with an entrypoint
[22:50:34] But the long answer is — they rewrite this stuff constantly so I don't know what it does /today/. I'm reading still.
[22:51:22] (CR) Andrew Bogott: [C: 2] wikistatus: Further attempt to solve login races [puppet] - https://gerrit.wikimedia.org/r/322208 (https://phabricator.wikimedia.org/T139773) (owner: Andrew Bogott)
[22:53:30] tbh...
[22:53:31] site = mwclient.Site(("https", host),
[22:53:31] retry_timeout=5)
[22:53:39] I think that'd be enough to loose any state
[22:54:48] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 2I:1:5, 2I:1:6, 2I:1:7, 2I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 1I:2:1, 1I:2:2, 1I:2:3, 1I:2:4, 1I:4:1, 1I:4:2, Controller
[22:55:58] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:57:57] icinga-wm was resurrected i see
[23:01:04] Operations, Mail, MediaWiki-Email, Security: DMARC: Users cannot send emails via a wiki's [[Special:EmailUser]] - https://phabricator.wikimedia.org/T66795#2804285 (Aklapper)
[23:01:07] chasemp: ^ CR if you could?
[23:01:08] bah
[23:01:08] RECOVERY - puppet last run on mc1034 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[23:01:13] ssh is borked again?
[23:01:19] let me switch networks
[23:03:29] (PS1) Yuvipanda: toollabs: Fix maintain-kubeusers crashing [puppet] - https://gerrit.wikimedia.org/r/322213
[23:03:44] chasemp: ^
[23:03:47] greg-g: I'd like to temporarily revert a patch relating to last month's wmf.18 regression to verify that we've found the root cause. 
(Which was mainly related to our metrics wrongly measuring page load times for background/hidden tabs, which are always longer.) It should have minimal or no impact on foreground page views. For background tabs, it will cause [23:03:47] page load times to be a little slower, but should not impact our graphite metrics this time since we've fixed it (hopefully). [23:04:04] chasemp: do you think you'll have time to look at it? if not no problem, I can carefully reason through it again and get it sorted [23:08:26] ok, it's pretty clear that our caching attempts don't work at all, so I'll work on that more tomorrow :( [23:09:08] (03Draft1) 10Paladox: Add REL1_28 to ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322217 [23:09:10] (03Draft2) 10Paladox: Add REL1_28 to ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322217 [23:09:21] ostriches im wondering could you review ^^ please? [23:09:39] (03CR) 10Filippo Giunchedi: [C: 04-1] "I guess there's no easy solution to this :(" [puppet] - 10https://gerrit.wikimedia.org/r/318145 (https://phabricator.wikimedia.org/T149098) (owner: 10Filippo Giunchedi) [23:11:05] (03PS2) 10Yuvipanda: toollabs: Fix maintain-kubeusers crashing [puppet] - 10https://gerrit.wikimedia.org/r/322213 (https://phabricator.wikimedia.org/T150946) [23:11:16] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Install a docker registry for production - https://phabricator.wikimedia.org/T148960#2804340 (10fgiunchedi) [23:11:19] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 13Patch-For-Review, 15User-Joe: Experiment with Swift as docker registry backend - https://phabricator.wikimedia.org/T149098#2804337 (10fgiunchedi) 05Open>03stalled p:05Triage>03Normal [23:12:11] (03CR) 10Chad: "Should also go in the array below." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/322217 (owner: 10Paladox) [23:13:16] (03PS3) 10Paladox: Add REL1_28 to ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322217 [23:13:25] (03CR) 10Paladox: "> Should also go in the array below." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/322217 (owner: 10Paladox) [23:13:26] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 13Patch-For-Review, 15User-Joe: Experiment with Swift as docker registry backend - https://phabricator.wikimedia.org/T149098#2804346 (10fgiunchedi) Stalling on the fact that it'd be nice to be using esams for this, though codfw will do... [23:21:56] greg-g: addshore: I'm going ahead with the metric observation test, it'll just be 5-10 minutes before the next swat window. [23:22:17] !log krinkle@tin Synchronized php-1.29.0-wmf.3/resources/src/mediawiki/mediawiki.js: Ie21f5c: temp revert cache-eval for metric observation (duration: 00m 49s) [23:22:24] ack [23:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:58] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [23:25:08] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
[23:34:28] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 2I:1:5, 2I:1:6, 2I:1:7, 2I:1:8, 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 1I:2:1, 1I:2:2, 1I:2:3, 1I:2:4, 1I:4:1, 1I:4:2, Controller [23:36:32] 06Operations, 05Prometheus-metrics-monitoring: Provide authenticated access to Prometheus native web interface - https://phabricator.wikimedia.org/T151009#2804471 (10fgiunchedi) [23:41:18] (03PS6) 10Filippo Giunchedi: role: add prometheus 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/321814 (https://phabricator.wikimedia.org/T150486) [23:42:14] !log krinkle@tin Synchronized php-1.29.0-wmf.3/resources/src/mediawiki/mediawiki.js: Ie21f5c: undo temp revert for metric observation (duration: 00m 54s) [23:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:47:48] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [23:48:48] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3357437 keys, up 17 days 15 hours - replication_delay is 0 [23:49:48] (03PS2) 10Filippo Giunchedi: [WIP] swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) [23:51:01] (03CR) 10jenkins-bot: [V: 04-1] [WIP] swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi) [23:52:25] addshore: Done [23:52:32] [= [23:52:57] (03PS3) 10Filippo Giunchedi: [WIP] swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) [23:54:46] There's nothing in to SWAT anyway [23:56:01] (03PS1) 10Addshore: WIP Add grafana_json_datasource [puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) [23:57:01] Reedy: woo! 
[23:57:02] (03CR) 10jenkins-bot: [V: 04-1] WIP Add grafana_json_datasource [puppet] - 10https://gerrit.wikimedia.org/r/322220 (https://phabricator.wikimedia.org/T147328) (owner: 10Addshore) [23:57:43] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM, 05MW-1.28-release-notes, 13Patch-For-Review: HHVM: segfault when serializing/unserializing large preprocessor cache items - https://phabricator.wikimedia.org/T73486#2804535 (10Aklapper) 05Open>03Resolved >>! In T73486#2644398, @hashar wrote: > @ts... [23:57:46] (03PS4) 10Filippo Giunchedi: [WIP] swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) [23:58:56] (03CR) 10jenkins-bot: [V: 04-1] [WIP] swift: terminate https with nginx [puppet] - 10https://gerrit.wikimedia.org/r/310549 (https://phabricator.wikimedia.org/T127455) (owner: 10Filippo Giunchedi)