[00:00:15] (03PS6) 10Yuvipanda: [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 [00:00:31] !log mattflaschen Synchronized php-1.26wmf4/includes/jobqueue/: Job queue changes for triggerOpportunisticLinksUpdate (duration: 00m 12s) [00:00:39] Logged the message, Master [00:01:24] matt_flaschen: for some reason I can't edit any pages on wikitech wiki. it just gives me a session error. tried logging in and out a few times, but no luck :( The gerrit patch is https://gerrit.wikimedia.org/r/#/c/202926/ [00:01:27] !log mattflaschen Synchronized php-1.26wmf4/includes/page/WikiPage.php: Job queue changes for triggerOpportunisticLinksUpdate (duration: 00m 11s) [00:01:32] Logged the message, Master [00:01:34] matt_flaschen, right [00:02:40] kaldari: if you repost the form, sometimes it works [00:03:01] tried that too [00:03:06] several times [00:03:11] and cleared my cookies [00:03:48] kaldari, I've had issues recently where wikitech logged me out either immediately, or after a couple minutes. [00:04:51] AaronSchulz, 1.26wmf4 is done. [00:05:37] matt_flaschen: logged you out, or gave you "session data error" messages? [00:05:52] greg-g, just logged out, IIRC. Wasn't today, last week IIRC. [00:06:00] I can confirm that re-submitting will get it to save, you just have to do it and get lucky [00:06:03] ah [00:06:50] greg-g: I can click save page all day and it never saves [00:07:44] greg-g: it's definitely not the usual session error weirdness [00:08:11] greg-g: can you save there? [00:08:20] (03CR) 10Mattflaschen: [C: 04-1] "Like I said on the ticket, there are technical reasons to either have 'absolutely all WMF wikis' or a dblist."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/139326 (https://phabricator.wikimedia.org/T97760) (owner: 10Withoutaname) [00:08:38] (03PS7) 10Yuvipanda: [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 [00:09:39] kaldari: I've been having the same issue (session data) for the last week, if I brute force saving (submit submit submit submit...) it works in the end. Logging out/back in doesn't change anything [00:09:47] 6operations, 6Phabricator, 5Patch-For-Review: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#1278056 (10mmodell) @Robh: I believe this should be working now, can you confirm? [00:10:19] from Friday: [00:10:19] 15:28 greg-g: wikitech's session data errors are transient, hitting save multiple times will eventually work [00:10:22] 15:26 greg-g: multiple independent reports of wikitech wiki having session data errors [00:10:29] https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:48] yuvipanda: do you know if a bug ever got filed for that? [00:10:54] huh [00:10:54] what [00:10:58] * yuvipanda reads backscroll [00:11:00] oh. [00:11:03] not afaik [00:11:49] who cares about wikitech again? :) [00:11:50] greg-g: just tried to save 10 times in a row and didn't have any luck. Also tried on other pages. [00:11:55] kaldari: :( [00:11:56] no idea [00:12:08] how important is it? [00:12:26] At least "High" [00:12:27] if you really need to edit wikitech or the world will end, ssh silver.wikimedia.org -t sudo apache2ctl restart [00:12:37] :( [00:12:42] that's not the point of wikitech [00:12:45] it's a work wiki [00:12:50] your ssh config will need to send that via a bastion [00:12:51] it's meant to work [00:13:29] who cares about wikitech?
the 3 people in the labs team who also care about 400 other things [00:13:30] * yuvipanda looks [00:14:06] yuvipanda: I know, I'm being annoying, sorry [00:14:07] greg-g: can you file the bug? [00:14:13] sure [00:14:15] thanks [00:14:29] yuvipanda: uh, what project? [00:14:30] reference T98084 [00:14:35] wikitech.wikimedia.org; operations [00:14:36] greg-g: there’s wikitech.wikimedia.org [00:14:43] matt_flaschen: ping? [00:14:56] I would stick operations on it as well, and maybe labs [00:14:59] Krenair: why is that resolved? [00:15:05] Dereckson, AaronSchulz's addition took a little while to merge. It's done now; about to deploy it. [00:15:06] > PHP Fatal error: Call to undefined function MediaWiki\\Logger\\Monolog\\is_resource() in /srv/mediawiki/php-1.26wmf4/includes/debug/logger/monolog/LegacyHandler.php on line 233 [00:15:10] bd808: ^ is that from you? [00:15:19] I found that on silver’s logs [00:15:21] Okay. [00:15:30] matt_flaschen greg-g: save finally worked! Extra config change added to deployment calendar :) [00:15:45] 6operations, 10wikitech.wikimedia.org: transient failures of wiki page saves - https://phabricator.wikimedia.org/T98084#1278098 (10greg) 5Resolved>3Open Re-Opening. This is still happening. @Kaldari can't save at all. [00:15:46] !log restarted apache on silver [00:15:52] Logged the message, Master [00:16:05] 6operations, 10wikitech.wikimedia.org: transient failures of wiki page saves - https://phabricator.wikimedia.org/T98084#1278101 (10Krenair) a:5Krenair>3None [00:16:11] yuvipanda: ugh. that's a php namespace caused bug [00:16:13] 6operations, 10wikitech.wikimedia.org: transient failures of wiki page saves - https://phabricator.wikimedia.org/T98084#1278103 (10greg) (please don't close until we can confirm this stays working for more than a day) [00:16:17] and it would be my fault, yes [00:16:34] (03CR) 10Legoktm: "recheck" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (owner: 10Yuvipanda) [00:16:36] bd808: ok. 
it’s not the cause of the session failure, though [00:16:43] (03PS2) 10Kaldari: Removing redundant ® from mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202926 (https://phabricator.wikimedia.org/T95007) [00:16:46] (03CR) 10jenkins-bot: [V: 04-1] Removing redundant ® from mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202926 (https://phabricator.wikimedia.org/T95007) (owner: 10Kaldari) [00:16:52] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (owner: 10Yuvipanda) [00:16:56] it lasted 3 and a half days [00:16:56] 6operations, 10wikitech.wikimedia.org: transient failures of wiki page saves - https://phabricator.wikimedia.org/T98084#1278104 (10greg) p:5Triage>3High [00:16:59] legoktm: thanks :) [00:17:06] Does sync-file sync a directory? I think I used that by mistake before, so re-running. [00:17:16] bd808, ^ [00:17:16] np :) [00:17:20] matt_flaschen: sync-dir does :P [00:17:23] :) [00:17:25] greg-g: nobody in the labs team is actually qualified to look into it and fix it either - none of us know much about how MW itself works... [00:17:34] (03PS3) 10Aaron Schulz: Bumped the $wgJobBackoffThrottling refreshLinks limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210246 (https://phabricator.wikimedia.org/T98621) [00:17:45] yuvipanda: yeah, feel free to call in for reinforcements [00:17:46] !log mattflaschen Synchronized php-1.26wmf4/includes/jobqueue/: Job queue changes for triggerOpportunisticLinksUpdate (duration: 00m 13s) [00:17:55] * yuvipanda wonders whom to poke now [00:18:06] yuvipanda: add it on the MW Core team escalation queue... oh wait [00:18:12] yeah [00:18:20] platform engineering [00:18:23] greg-g: whom do you suggest I poke? 
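The "repost the form until it works" workaround that greg-g and kaldari describe above is just a bounded retry loop. A minimal sketch of that idea; the function name, pause length, and the curl example are illustrative, not anything actually run on silver:

```shell
# Bounded retry: run a command up to N times, stopping at the first
# success -- the "submit submit submit submit..." workaround as code.
retry() {
  tries=$1; shift
  i=1
  while [ "$i" -le "$tries" ]; do
    "$@" && return 0   # save went through, stop retrying
    sleep 1            # brief pause between attempts (tune to taste)
    i=$((i + 1))
  done
  return 1             # every attempt failed, give up
}

# Hypothetical usage, mirroring "tried to save 10 times in a row":
#   retry 10 curl -sf -o /dev/null 'https://wikitech.wikimedia.org/wiki/Main_Page'
```

Of course this only papers over the transient failure; the rest of the session below is about finding the actual cause.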
[00:18:29] ori, https://gerrit.wikimedia.org/r/210246 [00:18:32] * yuvipanda goes back in time, pokes ‘platform engineering' [00:18:48] greg-g: I can poke individuals, and they can help out of the goodness of their hearts, but that seems… unideal. [00:18:54] !log mattflaschen Synchronized php-1.26wmf5/includes/jobqueue/: Job queue changes for triggerOpportunisticLinksUpdate (duration: 00m 12s) [00:18:57] Logged the message, Master [00:18:58] yuvipanda: it's the way it's always been, really [00:19:06] !log mattflaschen Synchronized php-1.26wmf5/includes/page/WikiPage.php: Job queue changes for triggerOpportunisticLinksUpdate (duration: 00m 12s) [00:19:09] Logged the message, Master [00:19:13] yuvipanda: lego is a good first start :) [00:19:14] greg-g: well, when there was a MW Core team I could poke them and not feel as guilty :) [00:19:17] yuvipanda: that error message you quoted is a php interpreter bug maybe. :/ [00:19:23] :| [00:19:25] bd808: oh…. that’s possible. [00:19:28] hi legoktm ! :) [00:19:31] session data again? [00:19:32] AaronSchulz, done in wmf5 [00:19:42] bd808: it’s the only thing running PHP in-cluster for web serving, I think [00:19:42] did anyone look at memcache? [00:19:42] legoktm: yes [00:20:03] legoktm: it seems to be running :) what else should I look at? [00:20:11] kaldari, ready? [00:20:13] hold on, logging into silver [00:20:17] legoktm: sweet, thanks [00:20:22] considering an apache restart makes it go away.... [00:20:29] matt_flaschen: one sec, the patch has a merge conflict... [00:20:34] I would assume probably not memcached's fault? [00:20:45] for silver? Maybe something wrong with the proxy in front of memcached? [00:20:52] ^ [00:21:01] there's a proxy in front of memcached? [00:21:10] what do we use? twproxy or something [00:21:14] matt_flaschen, good [00:21:21] nutcracker aka twemproxy [00:21:28] bd808: we got rid of it for silver, IIRC [00:21:42] it’s still running.
[00:21:43] hmm [00:21:50] kaldari, unrelatedly, shouldn't those be PNGs? [00:21:55] $wgMemc looks fine [00:22:07] legoktm: what port does it hit? [00:22:19] eh it’s still being used [00:22:23] nutcracker, tha tis [00:22:24] *that is [00:22:30] Okay, skipping kaldari for now. [00:22:32] Dereckson, ready? [00:22:47] wmf-config/CommonSettings.php:$wgSessionCacheType = 'sessions'; [00:22:49] wmf-config/wikitech.php:$wgSessionCacheType = 'memcached-pecl'; [00:22:58] yes [00:23:00] so... why do we use a separate system for wikitech for this? [00:23:28] no good reason [00:23:38] (03CR) 10Mattflaschen: [C: 032] Content namespaces configuration on he.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210021 (https://phabricator.wikimedia.org/T98709) (owner: 10Dereckson) [00:23:38] 986 connections in CLOSE_WAIT on nutcracker [00:23:42] not sure if that’s normal? [00:23:49] (03Merged) 10jenkins-bot: Content namespaces configuration on he.wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210021 (https://phabricator.wikimedia.org/T98709) (owner: 10Dereckson) [00:23:59] Krenair: because it was a separate special wiki from the beginning, we're slowly making it less so :) [00:24:16] greg-g, yes I know that, I was wondering if there is still a good reason for it [00:24:22] none on mw1197 [00:24:33] Krenair: probably not [00:24:34] so I’m wondering if it’s nutcracker [00:24:56] !log mattflaschen Synchronized wmf-config/InitialiseSettings.php: Deploy Hebrew Wikisource content namespace config changes (duration: 00m 14s) [00:25:01] Testing. [00:25:04] strace actually points to nutcracker still doing stuff tho [00:25:05] (03PS3) 10Kaldari: Removing redundant ® from mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202926 (https://phabricator.wikimedia.org/T95007) [00:25:21] matt_flaschen: all fixed now: https://gerrit.wikimedia.org/r/#/c/202926/2 [00:25:37] Works. 
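The "986 connections in CLOSE_WAIT on nutcracker" number above comes from filtering a socket listing by state. A hedged sketch of that check, written as a stdin filter so it can be sanity-checked offline; on a live host you would feed it `netstat -tan` output (note `ss -tan` spells the state `CLOSE-WAIT` and puts it in the first column instead):

```shell
# Count sockets whose state column says CLOSE_WAIT. In `netstat -tan`
# output the state is the last field, so a one-field awk test is enough.
count_close_wait() {
  awk '$NF == "CLOSE_WAIT" { n++ } END { print n + 0 }'
}

# Live usage (net-tools):
#   netstat -tan | count_close_wait
```

Comparing the same count on another appserver, as yuvipanda did ("I checked another prod mw host, had 0"), is what made the pile-up on silver stand out.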
[00:25:46] legoktm: Krenair memcached and nutcracker both reading / writing data [00:25:54] yeah, my test worked fine [00:26:08] yuvipanda: there were prod nutcracker problems in the outage at the end of January when the switch lost power. ori might remember the details [00:26:24] bd808: hmm, I remember seeing these CLOSE_WAIT things earlier [00:26:49] * yuvipanda considers restarting nutcracker [00:26:50] CLOSE_WAIT is a normal tcp socket thing [00:26:55] Dereckson, okay. May take 5 minutes to propagate to wgContentNamespaces. [00:27:01] bd808: yeah, but close to a thousand connections at CLOSE_WAIT? [00:27:05] just means that the other side hasn't acked the close [00:27:06] bd808: I checked another prod mw host, had 0 [00:27:12] bd808: in this case all the sides are on the same host [00:27:25] I thought we switched the config to use unix sockets instead of tcp actually [00:27:36] ["servers"]=> [00:27:36] array(1) { [00:27:36] [0]=> [00:27:36] string(15) "127.0.0.1:11212" [00:27:37] } [00:27:39] yup :P [00:27:41] (03CR) 10Mattflaschen: [C: 032] Add flood user group on ca.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209699 (https://phabricator.wikimedia.org/T98576) (owner: 10Dereckson) [00:27:43] bd808: ^ that's what it's configured to use [00:27:49] (03Merged) 10jenkins-bot: Add flood user group on ca.wikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209699 (https://phabricator.wikimedia.org/T98576) (owner: 10Dereckson) [00:28:01] legoktm: *nod* that's the nutcracker port as I recall [00:28:06] 'servers' => defined( 'HHVM_VERSION' ) [00:28:07] ? array( '/var/run/nutcracker/nutcracker.sock:0' ) [00:28:07] : array( '127.0.0.1:11212' ), [00:28:20] ah only for hhvm then [00:28:20] and wikitech is using php5 [00:28:28] !log mattflaschen Synchronized wmf-config/InitialiseSettings.php: Deploy Catalan Wikinews flood group (duration: 00m 13s) [00:28:34] Logged the message, Master [00:28:37] ^ Dereckson, please test.
[00:28:47] any objections to me restarting nutcracker? [00:29:10] how would restarting apache affect nutcracker? [00:29:20] yuvipanda: nope [00:29:25] oh, hmm. I wonder that’s what the CLOSE_WAIT is from [00:29:31] I don’t think nutcracker restart will have any effect. [00:29:32] 209699 works [00:29:58] (03CR) 10Mattflaschen: [C: 032] Add www.jacar.go.jp to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210039 (https://phabricator.wikimedia.org/T98733) (owner: 10Dereckson) [00:29:59] !log restarted nutcracker on silver [00:30:02] Logged the message, Master [00:30:05] (03Merged) 10jenkins-bot: Add www.jacar.go.jp to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210039 (https://phabricator.wikimedia.org/T98733) (owner: 10Dereckson) [00:30:24] (03PS1) 10BryanDavis: Cleanup base::remote-syslog [puppet] - 10https://gerrit.wikimedia.org/r/210253 [00:30:34] legoktm: so saves work on first try for me now [00:30:39] Krenair: greg-g ^ [00:30:58] !log mattflaschen Synchronized wmf-config/InitialiseSettings.php: Add www.jacar.go.jp to wgCopyUploadsDomains (duration: 00m 11s) [00:31:02] (03PS2) 10BryanDavis: Cleanup base::remote-syslog [puppet] - 10https://gerrit.wikimedia.org/r/210253 (https://phabricator.wikimedia.org/T98289) [00:31:05] wouldn't that have been the case after you restarted apache earlier though? [00:31:12] ^ Probably not feasible to test immediately. I'm sure we'll hear if it doesn't work. [00:31:18] Testing. [00:31:27] Krenair: it worked for kaldari for one save and then it didn’t (for me, at least) [00:31:30] Oh I've an upload form ready with a test picture. 
[00:31:34] now I tried two saves and they both worked [00:31:34] hmm [00:31:41] Dereckson, +1 [00:31:56] Yes, works, https://commons.wikimedia.org/wiki/File:K%C5%8Dkai_no_tatakai_ni_waga_Matsushima_no_suihei_shi_ni_nozonde_tekikan_no_sonpi_o_tou.jpg [00:32:29] (it seems so modern for a 1895 work) [00:32:33] (1894) [00:32:51] Dereckson, :( Not PD-100 since he died in November 1915. [00:33:04] Config works though. [00:33:16] Oh, well spotted, thanks :) [00:33:29] And thanks for the deploy. [00:33:34] Probably can just change to PD-70? Or is Japan PD-100? [00:33:41] 6operations, 10wikitech.wikimedia.org: transient failures of wiki page saves - https://phabricator.wikimedia.org/T98084#1278164 (10yuvipanda) Restarting apache fixed it intermittently, but restarting nutcracker (which had about a thousand connections in CLOSE_WAIT) seems to have fixed it better. I wonder if w... [00:34:24] {{PD-Old-70}} {{PD-1923}} I guess. [00:34:31] In the glorious future, with Commons on Wikidata, we shouldn't need to explicitly put and update those templates. Can be figured out from the author's death year. [00:34:41] If the metadata is in place. [00:34:47] kaldari, ready? [00:34:53] yep [00:35:13] yuvipanda: I think getting rid of nutcracker is a good idea...one less place to fail. Though it ends up becoming one more thing different from prod [00:35:24] matt_flaschen: you're a little bit optimistic on this one, public domain review is a little bit more complicated than only author death date, thanks to URAA for example. [00:35:32] legoktm: yeah [00:35:38] Dereckson, yeah, maybe. [00:35:48] greg-g: legoktm can you try editing as well and see if it succeeds? 
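Dereckson's caveat stands -- URAA and the US pre-1923 rule make real public-domain review harder -- but the "figured out from the author's death year" part that matt_flaschen wants from Commons-on-Wikidata is simple arithmetic. A toy sketch for the life-plus-70 case only (the function name is made up; this is not policy):

```shell
# In a plain life-plus-70 jurisdiction, copyright runs for 70 full
# calendar years after the author's death, so the work enters the
# public domain on 1 January of death_year + 71. A toy helper only;
# not a substitute for actual PD review (URAA etc.).
pd_old_70_year() {
  echo $(( $1 + 71 ))
}

# The artist above died in November 1915:
#   pd_old_70_year 1915   # → 1986, so {{PD-Old-70}} applied well before 2015
```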
[00:35:57] (03CR) 10BryanDavis: "cherry-picked to deployment-salt for testing" [puppet] - 10https://gerrit.wikimedia.org/r/210253 (https://phabricator.wikimedia.org/T98289) (owner: 10BryanDavis) [00:36:08] (03CR) 10Mattflaschen: [C: 032] Removing redundant ® from mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202926 (https://phabricator.wikimedia.org/T95007) (owner: 10Kaldari) [00:36:15] (03Merged) 10jenkins-bot: Removing redundant ® from mobile wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/202926 (https://phabricator.wikimedia.org/T95007) (owner: 10Kaldari) [00:38:50] !log mattflaschen Synchronized images/mobile/wikipedia-wordmark-en.png: Update Wikipedia word mark and related config (duration: 00m 13s) [00:38:55] Logged the message, Master [00:38:58] checking... [00:39:19] kaldari, not done yet. [00:39:22] !log mattflaschen Synchronized wmf-config/InitialiseSettings.php: Update Wikipedia word mark and related config (duration: 00m 11s) [00:39:25] Logged the message, Master [00:39:32] ^ kaldari, now it is. [00:39:34] (03PS3) 10BryanDavis: Cleanup base::remote-syslog [puppet] - 10https://gerrit.wikimedia.org/r/210253 (https://phabricator.wikimedia.org/T98289) [00:39:35] matt_flaschen: looks great. thanks! [00:40:12] Alright, I think that's everything. [00:40:25] Can someone mark them all done (if Wikitech's sessions will behave)? [00:41:39] (03PS1) 10Kaldari: Updating trademark symbols in mobile per Legal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210254 [00:44:19] legoktm: Krenair thanks for help debugging :) [00:44:22] I don’t know what comes next [00:44:34] greg-g: so the current crisis seems averted, I’m not sure what next. [00:44:39] we should re-evaluate why wikitech still exists [00:44:48] and whether we really need a separate wiki [00:45:23] legoktm: I think that’s a super long term question, tbh. Right now it’s serving multiple purposes: 1. LDAP account creation, 2. OpenStackManager, 3. 
Wiki for documentation [00:45:47] legoktm: (2) is being replaced slowly by horizon, but I suspect it’ll be another 6 months before it’s a full replacement. [00:45:57] for 3, we have meta and mw.o, [00:45:57] legoktm: and there’s good reason to suggest (3) be kept separate from rest of prod cluster [00:46:01] why? [00:46:09] what happens when the prod cluster goes down? [00:46:28] set up mediawiki-static or whatever? [00:46:51] if the prod cluster goes down due to a fatal or something, there's a good chance wikitech is going with it [00:46:53] legoktm: https://wikitech-static.wikimedia.org/wiki/Main_Page [00:47:12] legoktm: well, not necessarily. our last big outage was because a switch died, for instance. and wikitech wasn’t affected [00:47:12] yuvipanda: what am I looking at? [00:47:24] legoktm: wikitech-static, a static copy hosted on Linode or Rackspace [00:47:29] right [00:47:29] (IIRC) [00:47:35] so just set up a version of that for mw.o [00:48:01] rest of ops is not going to agree on letting wikitech run in-cluster. [00:48:08] what needs to happen of course, is for it to get much closer to prod [00:48:11] so trusty, HHVM, etc [00:49:08] ... it runs trusty [00:49:23] (03PS8) 10Yuvipanda: [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 [00:49:24] and a newer version of php too [00:49:29] uh [00:49:31] why not HHVM then? [00:49:38] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (owner: 10Yuvipanda) [00:50:06] I don't know, I'd quite like for that hhvm migration to be completed :) [00:50:32] +1 [00:50:36] is there a bug for that somewhere? 
[00:50:39] err [00:50:40] * yuvipanda looks [00:51:03] can’t find one [00:51:15] yes [00:51:30] search for tin or terbium [00:51:31] 6operations, 6Labs, 10wikitech.wikimedia.org: Move wikitech to HHVM - https://phabricator.wikimedia.org/T98813#1278203 (10yuvipanda) 3NEW [00:51:39] Krenair: ah, but this is specifically wikitech, I guess [00:53:01] 6operations, 6Labs, 10wikitech.wikimedia.org: Move wikitech to HHVM - https://phabricator.wikimedia.org/T98813#1278212 (10Krenair) See also T87036 - although silver runs trusty and has PHP 5.5 rather than 5.3, we should still migrate it to HHVM. [00:53:05] legoktm: I do agree that it’s doing way too much, and needs to be made much better than how it’s now. [00:53:19] (03PS9) 10Yuvipanda: [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 [00:53:35] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (owner: 10Yuvipanda) [00:55:21] 6operations: /tmp full on stat1002 - https://phabricator.wikimedia.org/T98773#1278215 (10Dzahn) a:3ori [00:55:30] 6operations: /tmp full on stat1002 - https://phabricator.wikimedia.org/T98773#1278217 (10Dzahn) 5Open>3Resolved /dev/mapper/tank-tmp 99G 60G 39G 61% /tmp [00:55:32] (03PS10) 10Yuvipanda: [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 [00:55:46] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (owner: 10Yuvipanda) [00:55:54] 6operations, 6Labs, 10wikitech.wikimedia.org: Move wikitech to HHVM - https://phabricator.wikimedia.org/T98813#1278220 (10Krenair) [00:55:56] 6operations, 10Beta-Cluster, 7HHVM: Convert work machines (tin, terbium) to Trusty and hhvm usage - https://phabricator.wikimedia.org/T87036#1278221 (10Krenair) [00:55:59] 6operations, 7HHVM, 7Tracking: Complete the use of HHVM over Zend PHP on the Wikimedia cluster (tracking) - 
https://phabricator.wikimedia.org/T86081#1278219 (10Krenair) [00:56:15] yuvipanda, take a look at that tracker [00:56:47] we're also missing snapshot hosts and imagescalers [00:56:49] Krenair: I suspect a big part of the problem is that nobody has the time to do it :( [00:57:20] springle: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=all&type=detail&servicestatustypes=16&hoststatustypes=3&serviceprops=2097162&nostatusheader [00:57:38] (03CR) 10BryanDavis: "cherry-picked and working as expected in beta cluster. Syslog events are flowing into the deployment-logstash1 host and not going to deplo" [puppet] - 10https://gerrit.wikimedia.org/r/210253 (https://phabricator.wikimedia.org/T98289) (owner: 10BryanDavis) [01:00:24] 6operations: tendril.wikimedia.org attempts to load external resources (fonts from google) - https://phabricator.wikimedia.org/T98710#1278224 (10Dzahn) looks like that Google font URL would just work over https as well https://fonts.googleapis.com/css?family=Droid+Serif:400,700|Droid+Sans:400,700 where is the... [01:03:18] 6operations: tendril.wikimedia.org attempts to load external resources (fonts from google) - https://phabricator.wikimedia.org/T98710#1278229 (10Dzahn) looks like this needs a pull request: 6 [remote "origin"] 7 fetch = +refs/heads/*:refs/remotes/origin/* 8 url = https://github.com/seanpringle/ten...
[01:05:08] 6operations, 10Graphoid, 10RESTBase, 10Traffic: Verify Varnish caching of the Graphoid content - https://phabricator.wikimedia.org/T98803#1277962 (10BBlack) [01:05:19] (03PS11) 10Yuvipanda: [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 [01:05:39] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (owner: 10Yuvipanda) [01:06:22] 6operations: tendril.wikimedia.org attempts to load external resources (fonts from google) - https://phabricator.wikimedia.org/T98710#1278253 (10Krenair) p:5Triage>3Lowest [01:07:24] mutante, so... we're cloning a git repository hosted on github directly into prod? [01:07:34] from someone's individual account? [01:07:36] 6operations, 10Graphoid, 10RESTBase, 10Traffic: Verify Varnish caching of the Graphoid content - https://phabricator.wikimedia.org/T98803#1278262 (10GWicke) What @bblack says. Performance on a different domain isn't as good either, so moving to /api/rest_v1/ should be better for perf in any case. [01:07:37] :/ [01:07:54] aha, https://phabricator.wikimedia.org/T98816#1278256 :) [01:07:57] * Krenair will add projects [01:08:04] Krenair: :) [01:08:09] (03PS12) 10Yuvipanda: [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 [01:08:09] 6operations, 10Wikimedia-Git-or-Gerrit: move tendril to gerrit repo and puppetize cloning - https://phabricator.wikimedia.org/T98816#1278263 (10Krenair) [01:08:26] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (owner: 10Yuvipanda) [01:09:13] Krenair: i kind of want to request repos via phab instead of wiki ... [01:09:35] Didn't we migrate that process? 
[01:09:39] (03PS13) 10Yuvipanda: [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 [01:09:48] (well, "someone") [01:09:52] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (owner: 10Yuvipanda) [01:10:03] 6operations: tendril.wikimedia.org attempts to load external resources (fonts from google) - https://phabricator.wikimedia.org/T98710#1278266 (10Dzahn) related: T98816 [01:10:17] apparently not: https://www.mediawiki.org/wiki/Git/New_repositories/Requests :/ [01:10:26] Krenair: i wasnt sure if we did [01:11:54] Krenair: what's "Repository-Ownership-Requests" and "Repository Admins" on phab [01:12:12] I think that's for phabricator-hosted repos [01:12:17] rather than gerrit-hosted [01:12:20] https://phabricator.wikimedia.org/tag/repository-admins/ [01:12:39] (03PS14) 10Yuvipanda: [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 [01:12:41] " all of our Git repositories" [01:12:42] right, Differential, Diffusion and Audit [01:12:46] all phabricator applications [01:12:53] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (owner: 10Yuvipanda) [01:13:06] hm,ok [01:13:15] https://phabricator.wikimedia.org/project/profile/1076/ on the other hand... 
[01:13:16] is for gerrit [01:13:27] but is about granting people access to existing repos [01:13:30] not creating new repos :( [01:13:49] why do i have to tell the form my wiki user name, i'm editing as it :) [01:14:35] Krenair: heh, well if we do "create new repos is wrong on phab but changing permissions on them is right" then that's odd :) [01:21:29] 6operations, 10Wikimedia-Git-or-Gerrit: move tendril to gerrit repo and puppetize cloning - https://phabricator.wikimedia.org/T98816#1278270 (10Dzahn) repo requested: https://www.mediawiki.org/w/index.php?title=Git%2FNew_repositories%2FRequests%2FEntries&type=revision&diff=1647756&oldid=1645573 [01:21:59] 6operations, 10Wikimedia-Git-or-Gerrit: move tendril to gerrit repo and puppetize cloning - https://phabricator.wikimedia.org/T98816#1278271 (10Dzahn) [01:22:46] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant ebernhardson shell account access to the elasticsearch cluster - https://phabricator.wikimedia.org/T98766#1278273 (10Dzahn) p:5Triage>3Normal [01:27:18] 6operations, 6WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1278279 (10Dzahn) the requesting user is member of legal. the teams that could potentially confirm a NDA has been signed seem to be legal or HR. who can we assign it to to confirm? [01:29:05] 6operations, 6WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1278281 (10Dzahn) a:3Slaporte [01:38:02] (03PS15) 10Yuvipanda: [WIP] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 [01:38:14] (03PS1) 10Krinkle: webperf: Remove JQMigrateUsage deprecate handler [puppet] - 10https://gerrit.wikimedia.org/r/210263 [01:40:21] (03PS16) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [01:49:37] anyone up/around who can help me with some mailman audit stuff? 
[01:51:15] someone who can su - mailman [01:56:07] mutante: ping [01:56:15] Deskana|Away: ping [02:25:27] PROBLEM - puppet last run on mw2169 is CRITICAL Puppet has 1 failures [02:30:43] !log l10nupdate Synchronized php-1.26wmf4/cache/l10n: (no message) (duration: 06m 33s) [02:30:53] Logged the message, Master [02:35:34] !log LocalisationUpdate completed (1.26wmf4) at 2015-05-12 02:34:30+00:00 [02:35:45] Logged the message, Master [02:36:12] (03PS1) 10Yuvipanda: [WIP] Add lighttpd webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210266 [02:37:48] (03PS2) 10Yuvipanda: [WIP] Add lighttpd webservice type [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210266 [02:38:16] PROBLEM - puppet last run on mw1139 is CRITICAL Puppet has 1 failures [02:43:06] RECOVERY - puppet last run on mw2169 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [02:54:16] RECOVERY - puppet last run on mw1139 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [02:57:04] !log l10nupdate Synchronized php-1.26wmf5/cache/l10n: (no message) (duration: 05m 47s) [02:57:12] Logged the message, Master [03:01:25] !log LocalisationUpdate completed (1.26wmf5) at 2015-05-12 03:00:22+00:00 [03:01:29] Logged the message, Master [03:26:21] cajoel: hey [03:26:35] did you get the help you needed? [03:43:18] Ori: still need. [03:43:27] Full list of wmfall [03:43:34] Via email please. [03:43:52] on its way [03:43:58] Mailman web interface does it in short pages. [03:44:35] I think the page size is adjustable in config. But I'm not sure. 
[03:44:38] Thx [03:51:13] cajoel: sent [04:32:30] ori: thanks [05:01:25] devunt: well, we had many extensions deployed only for English :[ [05:39:25] (03PS1) 10BryanDavis: logstash: Exclude api-feature-usage-sanitized from indexing [puppet] - 10https://gerrit.wikimedia.org/r/210277 (https://phabricator.wikimedia.org/T98750) [05:39:27] (03PS1) 10BryanDavis: logstash: Update syslog processing rules [puppet] - 10https://gerrit.wikimedia.org/r/210278 [05:46:19] 6operations, 6WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1278472 (10Qgil) >>! In T98722#1277991, @Dzahn wrote: > @qgil Didn't this come up before and we got a kind of blanket statement from legal/HR saying that we can assume anyone who is an employee also s... [06:05:29] !log killed 100+ 3-day unindexed research queries on dbstore1002, all repl streams lagging and /tmp unhappy [06:05:34] Logged the message, Master [06:15:24] !log pt-kill on 3600s running on dbstore1002 until repl streams recover [06:15:29] Logged the message, Master [06:16:57] <_joe_> Nemo_bis: are you going to be in Lyon? [06:17:06] <_joe_> springle: morning [06:17:22] hey _joe_ [06:18:00] <_joe_> springle: whenever I see you murder an analytics query, I feel like I'm back home :P [06:18:25] <_joe_> analytics clogging databases is the golden standard, ain't it? [06:22:20] well, to be fair, they only clog their own databases [06:24:23] 6operations, 10MediaWiki-DjVu, 10MediaWiki-General-or-Unknown, 6Multimedia, and 3 others: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1278568 (10aaron) >>! In T96360#1276379, @Anomie wrote: >>>! In T96360#1236196, @GWicke wrote: >> It is not c... 
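springle's dbstore1002 cleanup above (kill 100+ three-day queries by hand, then leave pt-kill running with a 3600s threshold) boils down to "find threads busy longer than N seconds". A hedged sketch that only parses tab-separated `SHOW PROCESSLIST` output and prints candidate thread ids; the actual killing is left to the operator, or to Percona's `pt-kill --busy-time 3600 --kill`, which does this job properly:

```shell
# Columns of `mysql -Be 'SHOW FULL PROCESSLIST'` (tab-separated, with a
# header row): Id User Host db Command Time State Info. Print the Id of
# every thread that has been executing a query longer than $1 seconds.
long_query_ids() {
  awk -F'\t' -v max="$1" \
    'NR > 1 && $5 == "Query" && ($6 + 0) > max { print $1 }'
}

# e.g.  mysql -Be 'SHOW FULL PROCESSLIST' | long_query_ids 3600
```

Piping the ids into `mysqladmin kill` (or eyeballing them first) mirrors the manual step; pt-kill adds matching on query text, which is how "unindexed research queries" would be targeted specifically.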
[06:29:07] PROBLEM - puppet last run on labcontrol2001 is CRITICAL puppet fail [06:30:47] PROBLEM - puppet last run on mw2016 is CRITICAL Puppet has 1 failures [06:31:37] <_joe_> springle: that' [06:31:46] <_joe_> s because you chose to make things boring [06:32:07] PROBLEM - puppet last run on wtp2008 is CRITICAL Puppet has 1 failures [06:32:16] PROBLEM - puppet last run on mw2173 is CRITICAL Puppet has 1 failures [06:32:21] heh [06:33:47] PROBLEM - puppet last run on mw2212 is CRITICAL Puppet has 1 failures [06:33:57] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 2 failures [06:33:57] PROBLEM - puppet last run on mw2003 is CRITICAL Puppet has 1 failures [06:34:07] PROBLEM - puppet last run on mw2036 is CRITICAL Puppet has 1 failures [06:35:37] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures [06:35:38] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures [06:36:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 1 below the confidence bounds [06:44:27] _joe_: yes I'll be in Lyon [06:46:47] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 1 below the confidence bounds [06:46:56] RECOVERY - puppet last run on wtp2008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:57] RECOVERY - puppet last run on mw2212 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:46:57] RECOVERY - puppet last run on mw2173 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:57] RECOVERY - puppet last run on labcontrol2001 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [06:46:57] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:47:07] RECOVERY - puppet last run on mw2003 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures 
[06:47:07] RECOVERY - puppet last run on mw2016 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:47:17] RECOVERY - puppet last run on mw2036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:47] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:47] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:51:10] <_joe_> Nemo_bis: cool! We can finally chat :) [06:51:46] PROBLEM - puppet last run on graphite2001 is CRITICAL Puppet last ran 16 hours ago [06:52:57] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [06:53:18] RECOVERY - puppet last run on graphite2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:53:32] 6operations, 10MediaWiki-DjVu, 10MediaWiki-General-or-Unknown, 6Multimedia, and 3 others: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1278589 (10Springle) jftr, S4 had ~2 hours of this again today. Mostly googlebot IPs. 
[06:54:27] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60622 bytes in 0.066 second response time [06:58:20] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue May 12 06:57:17 UTC 2015 (duration 57m 16s) [06:58:26] Logged the message, Master [07:06:26] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 1 below the confidence bounds [07:17:36] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [07:25:56] _joe_: yes :) [07:28:57] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60622 bytes in 4.310 second response time [07:31:34] (03CR) 10Gilles: [C: 031] webperf: Remove JQMigrateUsage deprecate handler [puppet] - 10https://gerrit.wikimedia.org/r/210263 (owner: 10Krinkle) [07:46:57] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [07:47:15] 6operations, 6WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1278624 (10Krenair) >>! In T98722#1277985, @Dzahn wrote: >>>! In T98722#1277350, @Krenair wrote: >> #WMF-NDA-Requests is really supposed to be for volunteers to use... > > Does it matter whether peop... [07:53:17] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60600 bytes in 0.470 second response time [07:53:52] 6operations, 6Analytics-Engineering: Honor DNT header for access logs & varnish logs - https://phabricator.wikimedia.org/T98831#1278628 (10Gilles) 3NEW [08:13:06] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [08:13:56] PROBLEM - gitblit process on antimony is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:13:57] PROBLEM - puppet last run on antimony is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:15:27] RECOVERY - gitblit process on antimony is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar gitblit.jar [08:15:28] RECOVERY - puppet last run on antimony is OK Puppet is currently enabled, last run 8 minutes ago with 0 failures [08:16:37] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [08:19:50] 6operations, 10MediaWiki-DjVu, 10MediaWiki-General-or-Unknown, 6Multimedia, and 3 others: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1278663 (10aaron) /srv/mediawiki/w/index.php;MediaWiki::run;MediaWiki::main;MediaWiki::performRequest;MediaWi... [08:24:27] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60611 bytes in 0.084 second response time [08:24:47] (03CR) 10Gilles: [C: 031] Bumped the $wgJobBackoffThrottling refreshLinks limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210246 (https://phabricator.wikimedia.org/T98621) (owner: 10Aaron Schulz) [08:31:14] (03CR) 10Gilles: [C: 031] Set $wgActivityUpdatesUseJobQueue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206862 (https://phabricator.wikimedia.org/T91284) (owner: 10Aaron Schulz) [08:50:28] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [09:01:02] (03PS2) 10Filippo Giunchedi: gdash: move graphite eqiad to its own directory [puppet] - 10https://gerrit.wikimedia.org/r/210061 [09:01:10] (03PS2) 10Filippo Giunchedi: gdash: add graphite codfw [puppet] - 10https://gerrit.wikimedia.org/r/210062 [09:08:39] 6operations: Degraded RAID-1 arrays on new logstash hosts: [UU__] - https://phabricator.wikimedia.org/T98620#1278715 (10faidon) p:5Triage>3Unbreak! [09:12:57] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60595 bytes in 0.815 second response time [09:13:38] I can't believe I forgot to update that. 
[09:13:39] ugh [09:14:08] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: move graphite eqiad to its own directory [puppet] - 10https://gerrit.wikimedia.org/r/210061 (owner: 10Filippo Giunchedi) [09:14:14] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: add graphite codfw [puppet] - 10https://gerrit.wikimedia.org/r/210062 (owner: 10Filippo Giunchedi) [09:26:47] 6operations, 10Graphoid, 10RESTBase, 10Traffic: Verify Varnish caching of the Graphoid content - https://phabricator.wikimedia.org/T98803#1278739 (10Yurik) * I don't think graphoid sets any caching headers - probably should change that. * Current hash is based on graph definition only, but I hope T98837 wi... [09:30:16] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 675 [09:35:16] RECOVERY - check_mysql on db1008 is OK: Uptime: 2234975 Threads: 1 Questions: 6906250 Slow queries: 14754 Opens: 37971 Flush tables: 2 Open tables: 64 Queries per second avg: 3.090 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:35:20] 6operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 5Patch-For-Review: enwiki's job is about 28m atm and increasing - https://phabricator.wikimedia.org/T98621#1278777 (10Nemo_bis) Well, for sure job runners are working harder now: {F163880} Queues on most wikis approach 0 or are in the thousands. s1... [09:52:37] PROBLEM - puppet last run on mw1149 is CRITICAL Puppet has 1 failures [09:53:53] !log restarted gitblit on antimony [09:53:59] Logged the message, Master [09:56:04] 6operations, 10Traffic, 7discovery-system: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1278859 (10fgiunchedi) an alternative approach for pybal would be to do the same thing as varnish: generate config files with confd and wait for p... 
[10:08:56] RECOVERY - puppet last run on mw1149 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [10:11:53] (03PS1) 10Filippo Giunchedi: gdash: graphite metricsDropped needs derivative [puppet] - 10https://gerrit.wikimedia.org/r/210294 [10:12:54] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: graphite metricsDropped needs derivative [puppet] - 10https://gerrit.wikimedia.org/r/210294 (owner: 10Filippo Giunchedi) [10:35:21] (03PS1) 10Filippo Giunchedi: statsite: calculate 75th percentile [puppet] - 10https://gerrit.wikimedia.org/r/210298 (https://phabricator.wikimedia.org/T88662) [10:36:36] (03PS2) 10Filippo Giunchedi: statsite: calculate 75th percentile [puppet] - 10https://gerrit.wikimedia.org/r/210298 (https://phabricator.wikimedia.org/T88662) [10:36:45] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] statsite: calculate 75th percentile [puppet] - 10https://gerrit.wikimedia.org/r/210298 (https://phabricator.wikimedia.org/T88662) (owner: 10Filippo Giunchedi) [10:37:06] PROBLEM - DPKG on mw1010 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:38:38] expecting a 'too many metrics created' alarm from graphite after https://gerrit.wikimedia.org/r/210298 [10:41:02] 6operations, 7Graphite, 5Patch-For-Review: revisit what percentiles are calculated by statsite - https://phabricator.wikimedia.org/T88662#1278957 (10fgiunchedi) 5stalled>3Resolved a:3fgiunchedi 75th percentile is back, metrics should be created shortly, resolving [10:43:08] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:43:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:43:18] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:43:46] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the 
critical threshold [20000.0] [10:43:46] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:43:57] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:44:07] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:44:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:44:17] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:44:28] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:44:36] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:44:48] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [10:48:07] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [10:48:07] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0] [10:48:16] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [10:48:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0] [10:48:36] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [10:48:47] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [10:48:57] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [10:49:08] RECOVERY - Varnishkafka 
Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [10:49:08] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [10:49:26] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% above the threshold [0.0] [10:49:26] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [10:49:37] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [10:56:28] RECOVERY - DPKG on mw1010 is OK: All packages OK [11:06:18] !log restarted apache on iridium to clear php opecode cache [11:06:24] Logged the message, Master [11:12:02] (03CR) 10Anomie: [C: 031] logstash: Exclude api-feature-usage-sanitized from indexing [puppet] - 10https://gerrit.wikimedia.org/r/210277 (https://phabricator.wikimedia.org/T98750) (owner: 10BryanDavis) [11:38:14] (03PS1) 10KartikMistry: Beta: CX: Enable en-es dictionary [puppet] - 10https://gerrit.wikimedia.org/r/210308 [11:47:06] (03PS1) 10KartikMistry: Beta: CX: Fix config syntax [puppet] - 10https://gerrit.wikimedia.org/r/210310 [11:47:14] akosiaris: Can break CX :) ^^ [11:52:42] springle: what's a reasonable way to check whether a maintenance script caused significant load for s1 DB? [11:53:07] I'm checking the querypages updates running at 1 UTC every 28th of the month with basic graphs like https://ganglia.wikimedia.org/latest/?r=custom&cs=01%2F28%2F2015+00%3A00&ce=01%2F28%2F2015+06%3A00&m=cpu_report&tab=ch&vn=&hide-hf=false&hreg%5B%5D=db10%2852%7C51%7C55%7C57%7C65%7C66%7C72%7C73%29 [11:53:31] I can't see absolutely anything going on there, which seems slightly suspicious. [11:57:41] (03PS1) 10ArielGlenn: add jcrespo to icinga access list, phab T98775 [puppet] - 10https://gerrit.wikimedia.org/r/210312 [11:58:27] Nemo_bis: that's unlikely to help you. Which script and which queries? 
Perhaps we can narrow it down [11:59:01] Nemo_bis: misc::maintenance::updatequerypages ? [11:59:44] (03CR) 10ArielGlenn: [C: 032] add jcrespo to icinga access list, phab T98775 [puppet] - 10https://gerrit.wikimedia.org/r/210312 (owner: 10ArielGlenn) [12:00:00] springle: yes, specifically updatequerypages::enwiki::cronjob() [12:00:01] most of that load, or rather the initial heavy SELECT traffic, would run on the vslow slave, which is db1051 for S1 [12:00:34] Yep [12:00:54] It's included in my graph, isn't it [12:01:14] also, the simple answer is: yes, most of those jobs cause significant load, which is why we have vslow slaves [12:01:40] kart_: good to deploy the last two reviews? [12:02:12] what's rhodium's status, I am guessing it's not a happy third puppetmaster yet? [12:02:39] springle: true :) how to assess whether it's fine to run some queries every X weeks rather than every 6 months? [12:04:37] Nemo_bis: test them on a depooled box? (or open a ticket to do that ;) [12:07:23] (03CR) 10Alexandros Kosiaris: [C: 032] Beta: CX: Fix config syntax [puppet] - 10https://gerrit.wikimedia.org/r/210310 (owner: 10KartikMistry) [12:09:11] nevermind akosiaris got there first :) [12:10:09] apergos: exactly [12:10:10] akosiaris: thanks [12:10:17] apergos: why do you ask though ? [12:10:28] I thought I have added comments in icinga [12:10:30] oh I just realized that puppet merge hadn't been updated, so that was a clue [12:10:43] no other reason [12:10:48] (03PS2) 10KartikMistry: Beta: CX: Enable en-es dictionary [puppet] - 10https://gerrit.wikimedia.org/r/210308 [12:13:24] hmm git review changed on jessie [12:13:36] for some reason on 2 multiple commits, it's displaying to me like 20 ... [12:15:09] akosiaris: Sometimes 'git fetch gerrit' helps [12:15:10] with that [12:15:57] RoanKattouw: the weird thing is, it did the correct thing at the end, that is submit just 2 patches..
[12:16:47] Nemo_bis: the only enwiki updatequerypages query still in logs is Mostlinkedtemplates, from Apr 28. that took a few hours [12:17:06] which I think would be fine to do more frequently. we should still check the others first [12:17:11] akosiaris: yeah, that means it doesn't know gerrit knows about the intermediate patches [12:17:18] akosiaris: git fetch gerrit or git fetch --all fixes that [12:17:23] akosiaris: alias review='git fetch gerrit; git review;' [12:17:50] * valhallasw infinitely recurses kart_ [12:18:11] I am wondering why git fetch origin did not fix it though [12:18:31] hmm depends on the remote I guess.. interesting [12:18:32] springle: could maybe make them monthly and then look at their effect all at once? [12:18:38] akosiaris: also git-buildpackage changed :) [12:19:49] kart_: yup [12:21:49] Nemo_bis: updatequerypages and updatequerypages::enwiki jobs are designed not to overlap. are any of ::enwiki jobs *more* useful or interesting, so we could move in small steps? [12:23:19] springle: perhaps, but hard to tell which; I could check usage statistics, if they still work [12:25:35] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0] [12:26:34] ^ that's me [12:28:25] Nemo_bis: is there a ticket? 
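The git-review fix discussed above ('git fetch gerrit' before 'git review') can be wrapped in a git alias so the fetch is never forgotten. This is only a sketch: it assumes the Gerrit remote is literally named "gerrit" (git-review's default; confirm with `git remote -v`), and the alias name `review-fresh` is made up here for illustration.

```shell
# Refresh the local view of the Gerrit remote's refs before running
# git-review, so it stops proposing already-merged intermediate commits.
# Assumes a remote named "gerrit"; the alias name is illustrative.
git config --global alias.review-fresh '!git fetch gerrit && git review'
# Then, inside a repo that has a "gerrit" remote:
#   git review-fresh
```

With stale remote-tracking refs, git-review compares against an old view of what Gerrit already has, which is why it listed ~20 commits instead of 2; fetching first narrows the set back down to the genuinely new patches.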
[12:28:37] (03PS1) 10ArielGlenn: add jcrespo to pager contact groups [puppet] - 10https://gerrit.wikimedia.org/r/210332 [12:28:58] springle: I'll file one now that you gave me some hints on what it should contain :) [12:29:05] thanks [12:29:27] we should just test each job, and arrange the cron timing to make better use of time [12:30:02] (03CR) 10ArielGlenn: [C: 032] add jcrespo to pager contact groups [puppet] - 10https://gerrit.wikimedia.org/r/210332 (owner: 10ArielGlenn) [12:30:19] so long as jobs don't risk overlap, vslow slaves might as well work harder [12:34:37] (03PS1) 10KartikMistry: New upstream release, fixed encoding issue [debs/contenttranslation/apertium-pt-gl] - 10https://gerrit.wikimedia.org/r/210333 [12:39:56] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [12:43:07] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1279181 (10ArielGlenn) added to pager contacts etc but with no phone number nor sms gateway info, cet (hope he is). added to mw channel but not the other as I don't have permissio... [12:47:09] it seems that en wp is still at 28 million jobs eh?
[12:48:17] maybe aaron's last changeset will have an impact https://gerrit.wikimedia.org/r/#/c/210246/3/wmf-config/CommonSettings.php [12:48:40] +50 % speed should help, yes :) [12:49:22] I wonder when swat is today [12:50:15] 8am window, that's in a little more than an hour [12:51:58] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [12:54:44] (03PS1) 10KartikMistry: Beta: CX: Fix some MT pairs [puppet] - 10https://gerrit.wikimedia.org/r/210338 [12:56:43] godog: yes, you can merge Beta patches for CX [12:57:20] 7Puppet, 6operations, 10Beta-Cluster, 5Patch-For-Review: Trebuchet on deployment-bastion: wrong group owner - https://phabricator.wikimedia.org/T97775#1279199 (10ArielGlenn) If we can get this done in a day or two, then yes, let's go ahead and have the deployment user and group be per repo. Otherwise I'll... [12:58:37] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1279204 (10jcrespo) a:5Dzahn>3jcrespo I will add contact info myself once I learn how. Channel is pending cloak assignment. [12:59:16] 6operations, 10Graphoid, 10RESTBase, 10Traffic: Verify Varnish caching of the Graphoid content - https://phabricator.wikimedia.org/T98803#1279206 (10BBlack) * Generally, the application's cache headers are what defines the objects' cache TTLs in varnish, unless we add custom rules for them (which we'd rath... 
[12:59:20] 6operations: stray ganglia-graph files left in /tmp - https://phabricator.wikimedia.org/T97637#1279207 (10ArielGlenn) [12:59:41] (03PS3) 10Filippo Giunchedi: Beta: CX: Enable en-es dictionary [puppet] - 10https://gerrit.wikimedia.org/r/210308 (owner: 10KartikMistry) [12:59:54] (03PS2) 10Filippo Giunchedi: Beta: CX: Fix some MT pairs [puppet] - 10https://gerrit.wikimedia.org/r/210338 (owner: 10KartikMistry) [13:00:02] (03CR) 10Filippo Giunchedi: [C: 032] Beta: CX: Fix some MT pairs [puppet] - 10https://gerrit.wikimedia.org/r/210338 (owner: 10KartikMistry) [13:00:11] (03CR) 10Filippo Giunchedi: [V: 032] Beta: CX: Fix some MT pairs [puppet] - 10https://gerrit.wikimedia.org/r/210338 (owner: 10KartikMistry) [13:00:17] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Beta: CX: Enable en-es dictionary [puppet] - 10https://gerrit.wikimedia.org/r/210308 (owner: 10KartikMistry) [13:00:32] sigh, some more spam incoming [13:00:39] (03PS4) 10Filippo Giunchedi: Beta: CX: Enable en-es dictionary [puppet] - 10https://gerrit.wikimedia.org/r/210308 (owner: 10KartikMistry) [13:00:47] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Beta: CX: Enable en-es dictionary [puppet] - 10https://gerrit.wikimedia.org/r/210308 (owner: 10KartikMistry) [13:02:14] kart_: merged [13:04:31] godog: cool. Thanks! [13:05:14] np [13:05:49] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [13:07:13] is the private puppet available to all wmf ops? [13:10:46] 6operations: stray ganglia-graph files left in /tmp - https://phabricator.wikimedia.org/T97637#1279220 (10ArielGlenn) seeing these generated today on uranium, much smaller but they should still get cleaned up. cron job? -rw------- 1 www-data www-data 127406 May 12 07:58 ganglia-graph.dYEfC3 -rw------- 1 www-d... [13:11:31] jynus: it is yeah [13:11:52] godog, palladium? [13:14:14] I am just trying to add my phone, etc. 
to icinga, and I can access the final node, but not the private puppet (or I am trying to access the wrong server) [13:23:25] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Force all Wikimedia cluster traffic to be over SSL for all users (logged-in and anon) - https://phabricator.wikimedia.org/T49832#1279253 (10BBlack) >>! In T49832#1274748, @Tony_Tan_98 wrote: > My concern is that vague comments will delay us a lot longer.... [13:24:38] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [13:34:12] 6operations: stray ganglia-graph files left in /tmp - https://phabricator.wikimedia.org/T97637#1279270 (10akosiaris) I 've been noticing those too. PHP is no longer complaining about memory limits and the rate of those files has gone down for sure, so the 2 changes above obviously made things better but still...... [13:37:23] 6operations, 6Services: apparmor for citoid - https://phabricator.wikimedia.org/T98851#1279290 (10akosiaris) 3NEW a:3akosiaris [13:38:04] 6operations, 6Services: apparmor for zotero - https://phabricator.wikimedia.org/T98852#1279297 (10akosiaris) 3NEW a:3akosiaris [13:39:18] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [13:43:55] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1279345 (10ArielGlenn) Any conclusions to draw from this discussion? Do we need to explicitly invite certain people to weigh in? 
[13:44:43] 6operations, 10Wikimedia-Git-or-Gerrit: move tendril to gerrit repo and puppetize cloning - https://phabricator.wikimedia.org/T98816#1279359 (10ArielGlenn) p:5Triage>3Normal [13:46:38] PROBLEM - High load average on labstore1001 is CRITICAL 66.67% of data above the critical threshold [24.0] [13:46:41] 6operations, 10ops-codfw: ms-be2007.codfw.wmnet: slot=4 dev=sde failed - https://phabricator.wikimedia.org/T98726#1279362 (10ArielGlenn) p:5Triage>3Normal a:3Papaul [13:47:44] 6operations, 10Maps, 6Scrum-of-Scrums, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1279368 (10ArielGlenn) p:5Triage>3Normal [13:49:05] 6operations, 10Maps, 6Scrum-of-Scrums, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1248778 (10ArielGlenn) @Yurik, have you worked with anyone on ops in the past on the maps project? If so maybe we can get them to we... [13:50:21] 6operations, 10Wikimedia-Site-requests: refreshLinks.php --dfn-only cron jobs do not seem to be running - https://phabricator.wikimedia.org/T97926#1279373 (10ArielGlenn) p:5Triage>3Normal [13:51:01] 6operations, 10Maps, 6Scrum-of-Scrums, 10hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1279375 (10Yurik) @ArielGlenn, thanks, we have discussed this issue with @mark & @akosiaris last week, so hopefully it will get resol... [13:51:06] (03PS1) 10Dereckson: Namespaces configuration on or.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210350 (https://phabricator.wikimedia.org/T98584) [13:52:20] (03CR) 10Dereckson: [C: 04-1] "Spaces should be underscore (spaces works, but we use as a convention underscores)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/210350 (https://phabricator.wikimedia.org/T98584) (owner: 10Dereckson) [13:54:18] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1279389 (10ArielGlenn) ah well we will re-add you to the mw channel once the cloak is there then :-) [13:55:12] bblack, around? [13:55:21] !log temporarily blocked an IP on uranium firewall. It was the cause of requests causing CPU load. http://ganglia.wikimedia.org/latest/graph.php?r=day&z=xlarge&h=uranium.wikimedia.org&m=cpu_report&s=descending&mc=2&g=cpu_report&c=Miscellaneous+eqiad [13:55:27] Logged the message, Master [13:56:40] yurik: the part of my brain that's awake is kind of around [13:56:51] the rest still needs more coffee [13:56:56] bblack, i take it its a small part, eh? :) [13:56:56] * apergos looks around for SWATters [13:57:28] bblack, i am thinking of how to get the caching in place before the graphoid is overrun with traffic - even if it is just for 1 minute :) [13:58:16] bblack, also, i suspect that external data change is ok for graphoid, people will be mostly concerned if editing the article itself is not updated right away [13:58:42] (03PS2) 10Dereckson: Namespaces configuration on or.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210350 (https://phabricator.wikimedia.org/T98584) [13:59:28] (03CR) 10Dereckson: "PS2: spaces to underscores in namespaces (config convention)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210350 (https://phabricator.wikimedia.org/T98584) (owner: 10Dereckson) [14:01:08] is it being overrun with traffic that it can't handle? or is it expected to be? [14:01:09] the continuous integration weekly meeting is starting now in #wikimedia-office . 
Short agenda is https://www.mediawiki.org/wiki/Continuous_integration/Meetings/2015-05-12 [14:01:19] bblack, i expect it to be [14:01:51] bblack, basically, if someone right now adds a graph to barack obama, i suspect it will die [14:02:08] the server, not the president :) [14:02:25] do the articles containing graphs end up directly linking URLs like your ticket example: https://graphoid.wikimedia.org/www.mediawiki.org/v1/png/Extension%3AGraph%2FDemo/1647673/3cbe2b968108670c001e230dca4682a9d03f8814.png ? [14:02:31] yes [14:03:24] (also, I wonder why the hostname is inside the URL, too. couldn't the hash just uniquely identify a global graph in a $domain-agnostic way?) [14:04:13] bblack, no, because graphoid needs to get graph definition, so it needs to use api of the proper host to get it with the page title or revId (1647673) [14:04:27] there is no 3rd party storage of the hash -> definition [14:04:34] I see, I guess I don't really understand the whole model here [14:04:51] bblack, https://www.mediawiki.org/wiki/Extension:Graph#Graphoid_Service [14:05:20] bblack, that schematic is slightly outdated in the sense that now ALL browsers are functioning like the old browsers [14:07:00] 6operations: SSL cert for svn.wikimedia.org has expired, should move behind misc-web - https://phabricator.wikimedia.org/T98723#1279441 (10ArielGlenn) Looks like the majority of hits are scripts or bots, so it's not a giant rush at any rate [14:07:03] I was about to say: shouldn't almost all browsers be locally rendering and not using us to generate PNGs? (which takes us out of the picture here completely other than serving the graph extension's javascript)? [14:07:12] 6operations: SSL cert for svn.wikimedia.org has expired, should move behind misc-web - https://phabricator.wikimedia.org/T98723#1279442 (10ArielGlenn) p:5Triage>3Normal [14:08:35] what's the reason they now all use PNGs instead?
[14:09:35] bblack, i just switched it to the graphoid service because it was terribly slow - all browsers had to download 184KB vega + 148KB d3 + all external data - and they could only download external data after all javascript is downloaded and graph definition was processed [14:10:28] oh, I didn't realize the JS rendering involved a bunch of dependencies. I would've thought it would've used something native, e.g. SVG-ish. [14:10:41] this compared with one simple PNG request for 25 KB that started as soon as the first HTML was loaded, etc [14:10:59] bblack, i can output things in SVG too, but it would be no different from PNG - will also be done by graphoid [14:12:09] well, I meant more like, when rendering the article HTML, (definition) becomes SVG embedded in the page, rather than wgGraphSpecs= [14:12:30] that would seem to be the most efficient route for the user and our end of things in the common case. [14:12:40] 10Ops-Access-Requests, 6operations: Grant access to stat1002 and stat1003 - https://phabricator.wikimedia.org/T98536#1279458 (10ArielGlenn) [14:12:41] 10Ops-Access-Requests, 6operations, 10Analytics: Access to stat1003 for jdouglas - https://phabricator.wikimedia.org/T98209#1279457 (10ArielGlenn) [14:13:28] (graphoid code would run as part of rendering pipeline then, but cache with the pages the graphs are embedded in, and no external deps for JS to work) [14:15:20] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant ebernhardson shell account access to the elasticsearch cluster - https://phabricator.wikimedia.org/T98766#1279478 (10ArielGlenn) @manybubbles, if you're the delegate then do the deed, I'd say. [14:16:09] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [14:17:39] 6operations, 6Services: apparmor for citoid - https://phabricator.wikimedia.org/T98851#1279495 (10faidon) Many of the interesting bits of AppArmor have not been merged upstream (e.g. 
network confinement) and newer versions of AppArmor don't even make it upstream anymore. Thus, with the switch to Debian, we kin... [14:19:07] (03PS1) 10Dereckson: Add *.sl.nsw.gov.au to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210356 (https://phabricator.wikimedia.org/T98744) [14:19:30] 6operations, 10ops-eqiad, 6Labs: Can labvirt* boxes take more RAM? - https://phabricator.wikimedia.org/T98658#1279518 (10Cmjohnson) No, all the slots are full. [14:20:53] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant ebernhardson shell account access to the elasticsearch cluster - https://phabricator.wikimedia.org/T98766#1279522 (10Manybubbles) >>! In T98766#1279478, @ArielGlenn wrote: > @manybubbles, if you're the delegate then do the deed, I'd say. Then I ap... [14:23:44] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant yurik access to sca1001 cluster for graphoid debugging/restarts - https://phabricator.wikimedia.org/T98371#1279536 (10ArielGlenn) No meeting on Monday so it waits another week. Can we get manager sign-off in the meantime? That would have been tfin... [14:24:01] 6operations, 10ops-eqiad, 6Labs: Can labvirt* boxes take more RAM? - https://phabricator.wikimedia.org/T98658#1279538 (10Andrew) 5Open>3declined ok, thanks [14:38:48] 6operations, 7Wikimedia-log-errors: internal_api_error_Exception: [22e05a83] Exception Caught: wfDiff(): popen() failed errors on English Wikipedia - https://phabricator.wikimedia.org/T97145#1279606 (10ArielGlenn) Gone from the exception logs so closing. 
[14:38:58] 6operations, 7Wikimedia-log-errors: internal_api_error_Exception: [22e05a83] Exception Caught: wfDiff(): popen() failed errors on English Wikipedia - https://phabricator.wikimedia.org/T97145#1279607 (10ArielGlenn) 5Open>3Resolved [14:41:26] 6operations, 10Citoid, 6Services: Separate service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#1279611 (10ArielGlenn) Er, which service is that? Citoid? [14:44:26] bblack, i created https://phabricator.wikimedia.org/T98872 - with this approach we could eventually switch to a on-the-fly hash generation, and possibly transclude SVG result into the HTML output (or else include it externally). In any case, this is for a fairly distant future - what headers should i output for now? [14:46:00] the standard HTTP headers for this stuff, which varnish understands. Cache-control and such: look to mediawiki's app-layer output for examples I guess. [14:49:06] and then I guess you can evolve the ticket into "Hey we're sending appropriate standard cache-control headers to varnish and it's totally ignoring them because it's set to force-pass all graphoid requests. Please let it cache them" :) [14:52:41] bblack, sounds like an awesome plan :) [14:54:15] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:54:35] yurik: for ref: https://www.varnish-software.com/static/book/HTTP.html#cache-related-headers [15:00:05] manybubbles, anomie, ^d, thcipriani, marktraceur, Dereckson: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150512T1500). [15:01:14] I can SWAT, this all seems fairly straightforward today [15:01:22] Dereckson: ping for swat? [15:01:42] Good morning. [15:01:59] okie doke, going. 
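As a concrete sketch of the "standard HTTP headers ... which varnish understands" that bblack asks graphoid to emit: a graphoid PNG response would carry something like the headers below, where Varnish derives the object's TTL from `s-maxage` (falling back to `max-age`) unless VCL overrides it. The values shown are illustrative assumptions, not what graphoid actually shipped.

```
HTTP/1.1 200 OK
Content-Type: image/png
Cache-Control: public, s-maxage=300, max-age=300
```

Even with these headers in place, the force-pass rule bblack mentions would still bypass the cache until the VCL is changed to let graphoid objects be stored.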
[15:03:54] (03CR) 10Thcipriani: [C: 032] Enable NewUserMessage on bh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209146 (https://phabricator.wikimedia.org/T97920) (owner: 10Dereckson) [15:05:11] (03Merged) 10jenkins-bot: Enable NewUserMessage on bh.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/209146 (https://phabricator.wikimedia.org/T97920) (owner: 10Dereckson) [15:07:33] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT enable NewUserMessage on bh.wikipedia [[gerrit:209146]] (duration: 00m 13s) [15:07:38] Logged the message, Master [15:07:51] Testing. [15:10:46] Works. [15:10:57] !log mediawiki-phpunit-hhvm Jenkins job is broken due to an hhvm upgrade {{bug|T98876}} [15:11:00] Logged the message, Master [15:11:29] Dereckson: cool, 210350 next [15:11:44] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [15:12:16] .. If I would quit getting signed out of gerrit :( [15:12:33] (03CR) 10Thcipriani: [C: 032] Namespaces configuration on or.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210350 (https://phabricator.wikimedia.org/T98584) (owner: 10Dereckson) [15:12:41] (03Merged) 10jenkins-bot: Namespaces configuration on or.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210350 (https://phabricator.wikimedia.org/T98584) (owner: 10Dereckson) [15:15:04] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT Namespaces configuration on or.wiktionary [[gerrit:210350]] (duration: 00m 12s) [15:15:08] Logged the message, Master [15:16:20] Seems to work. 
[15:17:10] kk, last one [15:18:14] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [15:18:34] (03CR) 10Thcipriani: [C: 032] Add *.sl.nsw.gov.au to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210356 (https://phabricator.wikimedia.org/T98744) (owner: 10Dereckson) [15:18:40] (03Merged) 10jenkins-bot: Add *.sl.nsw.gov.au to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210356 (https://phabricator.wikimedia.org/T98744) (owner: 10Dereckson) [15:20:33] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT Add *.sl.nsw.gov.au to wgCopyUploadsDomains [[gerrit:210356]] (duration: 00m 11s) [15:20:36] Logged the message, Master [15:20:48] Works. [15:21:19] Dereckson: Nice. ty! [15:21:24] Thanks for the deploy. [15:21:49] you are welcome. That seems to conclude this morning's SWAT. [15:22:42] bblack, i will set it to 30 seconds for now https://gerrit.wikimedia.org/r/#/c/210370/ [15:25:43] PROBLEM - Apache HTTP on mw1061 is CRITICAL - Socket timeout after 10 seconds [15:26:23] PROBLEM - HHVM rendering on mw1061 is CRITICAL - Socket timeout after 10 seconds [15:32:14] PROBLEM - HHVM busy threads on mw1061 is CRITICAL 33.33% of data above the critical threshold [86.4] [15:32:39] that.. sounds bad? [15:34:27] that's the standard thread lockup [15:34:41] _joe_: don't restart it yet, i want to take a look [15:34:43] (morning, btw) [15:34:57] heh, standard [15:35:09] greg-g, when updating a graphoid service, do i need to do it in a depl window? [15:35:15] <_joe_> ori: hi! 
[15:35:23] MatmaRex: standard as in https://phabricator.wikimedia.org/T89912 [15:35:29] <_joe_> looks like the usual deadlock [15:35:32] yeah [15:35:49] yurik: yeah, see the Services window on Mon and Wed: https://wikitech.wikimedia.org/wiki/Deployments#Monday.2C.C2.A0May.C2.A011 [15:35:51] _joe_: i found and i wanted to try the approach [15:36:43] <_joe_> ori: btw, hhvm 3.6 fails the core unit tests apparently [15:36:52] <_joe_> https://phabricator.wikimedia.org/T98876#1279707 [15:36:57] i'll take a look at that next :) [15:36:58] we have a system to prevent actual user requests from being sent to misbehaving hosts, right? [15:37:04] greg-g, gotcha, thx, i will insert myself after this SWAT then, want to set proper cache control :) [15:37:13] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [15:37:53] Krenair: yes, pybal [15:42:14] PROBLEM - HHVM queue size on mw1061 is CRITICAL 37.50% of data above the critical threshold [80.0] [15:42:37] again, please leave mw1061 be for another few mins [15:46:07] (03PS2) 10Andrew Bogott: Several improvements to the cold-migrate script. 
[puppet] - 10https://gerrit.wikimedia.org/r/209844 [15:47:31] :( who killed betalabs :( [15:49:20] yurik: there's a bug for it see: https://logstash-beta.wmflabs.org/#/dashboard/temp/WvMNokcORLilNqGJVAeHUw [15:50:13] er..https://phabricator.wikimedia.org/T98884 [15:50:16] FeaturedFeeds are now prominently featured [15:50:18] yurik: https://wikitech.wikimedia.org/wiki/Labs_labs_labs [15:50:43] greg-g, i think you sent me that one already :-P i like my betalabs, thank you very much [15:50:48] gilles, ^ [15:51:29] https://gerrit.wikimedia.org/r/#/c/207729/ [15:51:47] * gilles looks [15:51:57] 6operations, 10MediaWiki-DjVu, 10MediaWiki-General-or-Unknown, 6Multimedia, and 3 others: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1279956 (10GWicke) > It might make sense to revert those 1 of those patches, since it still hits getMetaTree... [15:51:58] yurik: please change how you refer to it, thanks [15:52:07] since the Labs team is actually working on creating a beta environment, "betalabs" is becoming a less and less useful label for "beta cluster" [15:52:25] bd808, they are??? bummer. I will call it betabeta [15:52:28] using $cache for two different things gilles? :p [15:52:59] hah, I'll fix it [15:53:14] I merely badly reviewed it, but I think AaronSchulz isn't around yet [15:53:27] gilles: thanks [15:53:34] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 8 seconds ago with 0 failures [15:54:36] (03CR) 10Jdlrobson: "Since Jared has left someone should feel free to override his +1 and merge." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175406 (https://phabricator.wikimedia.org/T73477) (owner: 10Glaisher) [15:55:22] (03CR) 10Greg Grossmeier: "Who's going to PM it? We have to have a PM own all Beta Features." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/175406 (https://phabricator.wikimedia.org/T73477) (owner: 10Glaisher) [15:56:00] (03CR) 10Glaisher: "I don't think it's actually ready to be enabled in production. There are some major issues with mw ui atm." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/175406 (https://phabricator.wikimedia.org/T73477) (owner: 10Glaisher) [15:56:21] greg-g: I nominate James_F|Away, who clearly doesn't have enough to do already. :P [15:56:52] (03CR) 10Andrew Bogott: [C: 032] Several improvements to the cold-migrate script. [puppet] - 10https://gerrit.wikimedia.org/r/209844 (owner: 10Andrew Bogott) [15:57:07] marktraceur: not my call :) [15:58:53] https://gerrit.wikimedia.org/r/210385 is the fix if someone feels like reviewing it [16:04:04] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:36] RECOVERY - Host mw2027 is UP: PING OK - Packet loss = 0%, RTA = 229.99 ms [16:08:16] I hadn't noticed the one yesterday, otherwise I would have paid a bit more attention to variable names [16:08:40] the main problem I had was that this extension is such a pain to set up, I couldn't make it work on my vm [16:08:57] lesson learned... never give up on setting the thing up on vm [16:10:35] grrrit-wm: Nominate me for what? [16:10:49] Bleh. [16:10:56] marktraceur: ^^^ [16:11:57] James_F: PMing the form refresh BetaFeature [16:12:00] Just kidding :) [16:12:09] marktraceur: I don't touch MW UI stuff. [16:12:31] marktraceur: And we're replacing it all with OOjs UI anyway… [16:12:50] (By "we" I mostly mean "you and MatmaRex and other lovely people, and I take the blame". ;-)) [16:12:58] oh, speaking of which, James_F [16:13:09] James_F: https://phabricator.wikimedia.org/T74715#1279614 [16:13:15] Ta. [16:13:22] (speaking of replacing, and hopefully not taking the blame ;) ) [16:14:19] MatmaRex: Is the narrower width because we're not setting it in VForm, or something else?
[16:14:36] MatmaRex: (Did the form not specify a width before, and just rely on the components growing not too much?) [16:15:39] James_F: VForms were supposed to be this narrow. [16:15:57] apparently. [16:16:01] MatmaRex: So should the OOjs UI-ification explicitly set a width so the max-width components don't grow? [16:16:17] it does, but it limits to a saner value [16:16:21] Hmm. [16:16:23] 50em or 60em or something [16:16:36] "Saner" == design change. :-P [16:16:57] there's literally no part of that design that has not changed at least twice since VForm was implemented [16:17:26] anyway, that's just one CSS rule to change if we decide to go with the stupid narrow forms [16:17:27] (This is totally the wrong channel for this conversation; sorry, opsen.) [16:17:36] Let's do least-change-at-once. [16:17:46] And then we can do the make-them-less-narrow later, as its own change. [16:18:05] greg-g, i posted about graphoid service depl for this hour [16:18:18] should be quick ... hopefully ) [16:18:39] greg-g, any ongoing deployments not on the depl page? [16:18:45] (03CR) 10Jdouglas: [C: 031] admin: ebernhardson for elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/210250 (https://phabricator.wikimedia.org/T98766) (owner: 10Dzahn) [16:19:25] !log restarted HHVM on mw1061; T89912 [16:19:29] Logged the message, Master [16:19:32] 6operations, 10ops-eqiad, 5Patch-For-Review: humidity sensors in eqiad row c/d showing alarms - https://phabricator.wikimedia.org/T98721#1280103 (10RobH) The issue is odd since they are humidity low alarms, which usually are when heating is run without adding moisture to the air. The fact its hot there, and... 
[16:20:04] RECOVERY - HHVM rendering on mw1061 is OK: HTTP OK: HTTP/1.1 200 OK - 66677 bytes in 0.175 second response time [16:20:11] <_joe_> ori: oh, same old same old, apparently [16:20:30] <_joe_> thanks for looking into that [16:21:46] RECOVERY - Apache HTTP on mw1061 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.061 second response time [16:22:36] RECOVERY - RAID on ms-be2007 is OK optimal, 13 logical, 13 physical [16:23:42] 6operations, 10ops-codfw: ms-be2007.codfw.wmnet: slot=4 dev=sde failed - https://phabricator.wikimedia.org/T98726#1280104 (10Papaul) a:5Papaul>3fgiunchedi Drive replacement complete [16:24:00] !log graphoid service synced, now supports Cache Control headers [16:24:04] Logged the message, Master [16:25:05] (03CR) 10Manybubbles: [C: 031] admin: ebernhardson for elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/210250 (https://phabricator.wikimedia.org/T98766) (owner: 10Dzahn) [16:25:06] RECOVERY - HHVM queue size on mw1061 is OK Less than 30.00% above the threshold [10.0] [16:25:15] RECOVERY - HHVM busy threads on mw1061 is OK Less than 30.00% above the threshold [57.6] [16:25:26] PROBLEM - High load average on labstore1001 is CRITICAL 75.00% of data above the critical threshold [24.0] [16:27:12] 6operations, 10Graphoid, 10RESTBase, 10Traffic: Varnish does not honor Cache-Control for Graphoid - https://phabricator.wikimedia.org/T98803#1280127 (10Yurik) [16:27:20] bblack, ^ [16:27:30] graphoid has been updated [16:28:52] ok, thanks! 
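[editor's note] The T98803 follow-up above is about Varnish actually honoring those headers. The way a shared cache turns a Cache-Control value into a TTL can be sketched roughly as follows — hypothetical Python, not actual Varnish logic, which is C/VCL and handles far more cases:

```python
import re


def ttl_from_cache_control(value, default_ttl=0):
    """Very simplified model of how a shared cache derives a TTL from
    a Cache-Control header: no-cache/no-store/private force a pass,
    and s-maxage takes precedence over max-age for shared caches.
    Illustrative only; real Varnish behavior is much more nuanced."""
    directives = [d.strip().lower() for d in value.split(',')]
    if any(d in ('no-cache', 'no-store', 'private') for d in directives):
        return 0
    for name in ('s-maxage', 'max-age'):  # shared-cache directive wins
        for d in directives:
            m = re.match(r'%s=(\d+)$' % name, d)
            if m:
                return int(m.group(1))
    return default_ttl
```

With graphoid now emitting these headers, the remaining step from the earlier discussion is removing the force-pass so Varnish applies this kind of logic instead of ignoring the headers.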
[16:37:56] RECOVERY - puppet last run on ms-be2007 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:41:09] (03CR) 10Daniel Kinzler: [C: 031] Add wb_changes_subscription table to xml dumps [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/210072 (https://phabricator.wikimedia.org/T98742) (owner: 10Aude) [16:42:08] 6operations, 10ops-codfw: ms-be2007.codfw.wmnet: slot=4 dev=sde failed - https://phabricator.wikimedia.org/T98726#1280182 (10fgiunchedi) 5Open>3Resolved disk is back in service, thanks @papaul ! [16:42:42] fgiunchedi: yw [16:43:05] <_joe_> papaul: fgiunchedi is godog here :) [16:43:16] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [16:43:19] * godog actually highlights on both [16:43:32] joe: thanks [16:43:48] 10Ops-Access-Requests, 6operations, 10Graphoid, 5Patch-For-Review: Grant yurik access to sca1001 cluster for graphoid debugging/restarts - https://phabricator.wikimedia.org/T98371#1280184 (10Yurik) [16:46:35] 6operations, 10ops-codfw: document network switch stack cables in use - https://phabricator.wikimedia.org/T98344#1280196 (10RobH) 5Open>3Resolved I've updated our planning sheet for the row expansion with this info, thanks @papaul! [16:46:55] Robh: yw [16:57:20] 6operations, 10MediaWiki-DjVu, 10MediaWiki-General-or-Unknown, 6Multimedia, and 3 others: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1280210 (10aaron) I was looking at 54816e2071fc3e38ad581a264967bd46cbb3647e. And the problem sean mentioned i... 
[16:59:27] (03PS1) 10Hashar: contint: disable unattended upgrade [puppet] - 10https://gerrit.wikimedia.org/r/210391 (https://phabricator.wikimedia.org/T98876) [17:00:25] (03CR) 10jenkins-bot: [V: 04-1] contint: disable unattended upgrade [puppet] - 10https://gerrit.wikimedia.org/r/210391 (https://phabricator.wikimedia.org/T98876) (owner: 10Hashar) [17:00:36] (03CR) 10Hashar: [C: 04-1] "Transient workaround to prevent HHVM from upgrading on CI slaves. No need to commit it." [puppet] - 10https://gerrit.wikimedia.org/r/210391 (https://phabricator.wikimedia.org/T98876) (owner: 10Hashar) [17:01:55] (03PS2) 10Hashar: contint: disable unattended upgrade [puppet] - 10https://gerrit.wikimedia.org/r/210391 (https://phabricator.wikimedia.org/T98876) [17:04:32] (03CR) 10Hashar: "Cherry picked on integration puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/210391 (https://phabricator.wikimedia.org/T98876) (owner: 10Hashar) [17:04:39] 10Ops-Access-Requests, 6operations, 10Graphoid, 5Patch-For-Review: Grant yurik access to sca1001 cluster for graphoid debugging/restarts - https://phabricator.wikimedia.org/T98371#1280222 (10Manybubbles) I agree Yurik should have this access. [17:06:34] (03PS1) 10Ori.livneh: statsv: migrate to text varnishes; nest under /beacon/. [puppet] - 10https://gerrit.wikimedia.org/r/210392 [17:06:38] ^ bblack [17:09:52] ori: needs mobile changes like text as well? [17:10:33] I don't know that mobile actually uses it of course, but trying to keep them from going further out of sync, expecting eventual code (at least) merger at some point [17:11:10] (Apps don't use statsv) [17:11:22] I guess you mean the varnishes [17:11:23] yeah, i guess mobile texts [17:11:35] right mobile as in the varnishes for en.m.wikipedia.org [17:11:40] yeah [17:12:08] i'll amend it [17:12:11] and yes when I said code (at least) merger, I meant on the varnish end. I know I can eventually get text+mobile running nearly-identical VCL. 
At some point they may merge clusters as well. [17:12:33] (03PS1) 10Ori.livneh: Set $wgWMEStatsdBaseUri to host-relative beacon/ path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210394 [17:13:06] we should file a task for consolidating all beacon/ loggers [17:13:53] (03PS2) 10Ori.livneh: statsv: migrate to text varnishes; nest under /beacon/. [puppet] - 10https://gerrit.wikimedia.org/r/210392 [17:15:31] bblack: ^ [17:15:36] (03CR) 10BBlack: [C: 031] statsv: migrate to text varnishes; nest under /beacon/. [puppet] - 10https://gerrit.wikimedia.org/r/210392 (owner: 10Ori.livneh) [17:15:55] (03CR) 10Ori.livneh: [C: 032] statsv: migrate to text varnishes; nest under /beacon/. [puppet] - 10https://gerrit.wikimedia.org/r/210392 (owner: 10Ori.livneh) [17:16:35] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [17:17:15] bblack: cool. the mobile app devs merged yuvipanda's patches so we can consider that done as well. (though with the caveat that it will take some time for users to upgrade and for traffic to bits to drop off) [17:17:47] They haven't made a release yet iirc [17:18:04] But yeah. A matter of time now [17:18:37] PROBLEM - Disk space on ms-be2007 is CRITICAL: DISK CRITICAL - free space: / 1027 MB (1% inode=96%) [17:18:47] right, but the social component of the change (communicating the migration and ensuring it doesn't leave anyone hanging) is at least done [17:20:11] bblack: for geoiplookup, we set a geoip cookie on text already for ipv4 clients, and dual-stack clients are instructed to hit the ipv4-only hostname to try and force them to use ipv4 [17:20:22] so is anything else needed, really? [17:21:20] we're good on geoiplookup I think, with the /geoiplookup URL working everywhere and the hostname moved over to text-lb in DNS [17:21:46] i wonder why /geoiplookup is needed at all [17:21:50] at least, good on the varnish end. 
I don't know if there's hardcoded refs to bits.wm.o/geoiplookup elsewhere [17:21:56] I have no idea :) [17:22:03] but code seems to be actively using both [17:22:38] https://github.com/search?q=%40wikimedia+bits.wikimedia.org&type=Code&utf8=%E2%9C%93 [17:22:47] all references to bits in wikimedia repos [17:23:19] ah, quoting the URL makes it not tokenize individual domain components [17:23:20] https://github.com/search?utf8=%E2%9C%93&q=%40wikimedia+%22bits.wikimedia.org%22&type=Code&ref=searchresults [17:25:07] so yeah, there will be a long tail to look at I'm sure :) [17:25:32] it's not too bad [17:26:33] (03PS1) 10Giuseppe Lavagetto: monitoring: add proper way to check systemd units [puppet] - 10https://gerrit.wikimedia.org/r/210396 [17:26:50] <_joe_> bblack: ^^ please take a look and tell me if it seems sane [17:27:10] <_joe_> as you're the major user of systemd in our group as of now [17:28:41] (03CR) 10Jforrester: "Caused T98196 maybe?" [puppet] - 10https://gerrit.wikimedia.org/r/210392 (owner: 10Ori.livneh) [17:29:57] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1280306 (10jcrespo) [17:30:24] James_F: no, fixed by that [17:30:27] 6operations, 10ops-eqiad, 5Patch-For-Review: humidity sensors in eqiad row c/d showing alarms - https://phabricator.wikimedia.org/T98721#1280312 (10fgiunchedi) another data point, from https://librenms.wikimedia.org/health/metric=humidity/ the rightmost column shows different range limits for recently discov... [17:30:40] (03CR) 10BBlack: [C: 031] "Not completely insane" [puppet] - 10https://gerrit.wikimedia.org/r/210396 (owner: 10Giuseppe Lavagetto) [17:32:36] ori: Thanks! 
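[editor's note] The statsv move under /beacon/ discussed above depends on varnishkafka selecting the right requests by URL; conceptually that filter is just an anchored prefix match. A hypothetical version (the production RxURL regex lives in the puppet repo and may differ):

```python
import re

# Hypothetical filter in the spirit of varnishkafka's RxURL match for
# statsv after the move under /beacon/. The real production regex is
# defined in puppet and may not look like this.
STATSV_RE = re.compile(r'^/beacon/statsv\b')


def is_statsv_request(url):
    """True if a request URL should be fed to the statsv consumer."""
    return bool(STATSV_RE.match(url))
```

The `\b` keeps `/beacon/statsv?metric=1ms` in while excluding longer path segments that merely start with the same letters.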
[17:34:17] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:37:25] (03CR) 10Ori.livneh: [C: 032] Set $wgWMEStatsdBaseUri to host-relative beacon/ path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210394 (owner: 10Ori.livneh) [17:37:31] (03Merged) 10jenkins-bot: Set $wgWMEStatsdBaseUri to host-relative beacon/ path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210394 (owner: 10Ori.livneh) [17:38:25] !log ori Synchronized wmf-config/CommonSettings.php: Ie4641b6e4: Set $wgWMEStatsdBaseUri to host-relative beacon/ path (duration: 00m 12s) [17:38:30] Logged the message, Master [17:39:51] (03CR) 10Merlijn van Deen: [C: 04-1] Initial commit (0315 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [17:41:20] (03PS1) 10Ori.livneh: Simplify statsv varnishkafka RxURL regex [puppet] - 10https://gerrit.wikimedia.org/r/210398 [17:41:25] RECOVERY - Disk space on ms-be2007 is OK: DISK OK [17:42:43] bblack: ^ . statsv is only used from the web, so the config change means everything has been migrated [17:44:24] i'm going to merge it since it's trivial [17:44:40] (03CR) 10Ori.livneh: [C: 032] Simplify statsv varnishkafka RxURL regex [puppet] - 10https://gerrit.wikimedia.org/r/210398 (owner: 10Ori.livneh) [17:58:47] 6operations, 10Citoid, 6Services: Separate service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#1280379 (10Mvolz) @arielglenn Yup, citoid. [18:00:04] twentyafterfour, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150512T1800). [18:03:50] ori: there's no issues with cached older content holding JS that hits the old statsv URL? 
[18:04:18] bblack: no, it's in startup.js which has a 5min cache expiry [18:04:36] ok cool [18:05:17] (03CR) 10Merlijn van Deen: "- scripts: maybe use entry points instead?" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [18:06:35] 10Ops-Access-Requests, 6operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1280398 (10Stu) Bump on this. I'm on a hangout with @wwes right now trying to show him some stuff but don't have access. :-( [18:11:25] valhallasw: am still on phone - forgot to bring key for drawer laptop is locked in [18:11:28] So heading back [18:11:41] yuvipanda: :D crowbar! [18:11:55] valhallasw: but os.getlogin is insecure - just reports value of env variable USER [18:12:10] robh: regarding https://phabricator.wikimedia.org/T98740… do you know if labnet1001 is a weird box or if we have spares that resemble it? It has 10g ethernet I believe, that might count as weird. [18:12:28] yuvipanda: it's run as the current user to begin with [18:12:30] and exec was explicit because there is no point in having the parent process stick around. And it gives us env inheritance for free [18:12:58] hrm. [18:13:20] andrewbogott: it has 10G? [18:13:31] …I think? [18:13:33] valhallasw: yes but I would rather not write code that looks insecure at first glance [18:13:42] yuvipanda: wat. [18:13:47] robh: how would I check? [18:14:00] ahh, it sure does [18:14:13] lshw will tell you everything, the key is knowing the class names to filter ;D [18:14:18] sudo lshw -class network [18:14:33] So, yea... I need to find out where we allocated this, cuz a 610 with 10g isnt common [18:14:39] valhallasw: os.getlogin() allows the calling user to pretend to be whatever user they want [18:15:01] yuvipanda: I can change what the imports refer to to begin with! 
[18:15:08] robh: dang [18:15:22] andrewbogott: looks like we simply ordered a new 10g card for it [18:15:26] https://phabricator.wikimedia.org/T83539 [18:15:33] yuvipanda: and then the next thing is that it *doesn't* use $USER but calls getlogin(3) [18:15:37] so, we should be able to do the same for another spare [18:15:53] valhallasw: https://docs.python.org/2/library/os.html#os.getlogin [18:15:58] robh: that doesn’t sound so bad [18:16:03] oh, but getlogin(3) also mentions 'Unfortunately, it is often rather easy to fool getlogin().' [18:16:08] yuvipanda: in any case, it doesn't matter [18:16:13] yuvipanda: the code runs as the user [18:16:13] It basically says dont use it [18:16:18] yuvipanda: so the user has full control [18:16:35] Yes but the current code is secure and correct. Why change? [18:16:47] how is it secure? [18:16:51] Secure-er [18:16:53] Than getlogin [18:16:59] there's no such thing as 'secure-er' [18:17:08] I can add my own pwd.py to the path [18:17:33] andrewbogott: im looking now for the spare so we can get ball rolling on order nic [18:17:38] on nic order even [18:17:48] thanks [18:18:01] (03PS1) 1020after4: Group1 wikis to 1.26wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210400 [18:18:31] valhallasw: the docs for getlogin themselves suggest you shouldn't use it in most cases [18:18:31] (03CR) 1020after4: [C: 032] Group1 wikis to 1.26wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210400 (owner: 1020after4) [18:18:36] (03Merged) 10jenkins-bot: Group1 wikis to 1.26wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210400 (owner: 1020after4) [18:18:42] valhallasw: also it won't work when you have no tty [18:19:00] yuvipanda: then use getpass.getuser(). [18:19:00] andrewbogott: ahh, i found the batch we took labnet1001 from [18:19:07] so i can allocate an identical one =] [18:19:21] cool [18:19:37] valhallasw: what exactly is your problem with the current code? 
[18:19:38] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: Group1 wikis to 1.26wmf5 [18:19:45] Logged the message, Master [18:19:46] oh, wait, no, maybe not, but still [18:19:48] close [18:19:54] That it is importing pwd just for that call? [18:20:17] yuvipanda: that it's hard to parse for humans [18:20:56] valhallasw: its just what is suggested in the getlogin docs [18:21:11] I'll just add a comment around it. [18:21:32] yuvipanda: why not use getpass.getuser()? that seems to be what ze interwebz suggests, and it's much more readable [18:21:41] The rest of the comments I'll fix once I get my stupid key [18:23:21] valhallasw: that also used env variables [18:23:29] yuvipanda: na und? [18:23:42] Seriously - the current code is what is suggested in the official docs of os.getlogin [18:23:42] yuvipanda: as I just explained, I control PYTHONPATH, so I control whatever you do. [18:23:50] I suggest you take it up to them [18:24:13] ... [18:24:38] Environ also doesn't take into account effective userid [18:24:55] 6operations, 10ops-eqiad, 5Patch-For-Review: humidity sensors in eqiad row c/d showing alarms - https://phabricator.wikimedia.org/T98721#1280437 (10Cmjohnson) That is probably the cause of the alarm we have low threshold set to high. Row's C and D were the latest rows and I must have entered the wrong low se... [18:25:12] yuvipanda: which we need why, exactly? [18:25:40] Because it is the right thing to do? [18:26:26] If I'm running a process as root and seteuid to a tool and call this it should work as expected [18:27:34] also, you're not using geteuid :P [18:27:36] And not think its username is root [18:27:41] your argument is valid, though. 
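[editor's note] The yuvipanda/valhallasw disagreement above distills to three ways of asking "who am I?" in Python. A sketch of the euid-based form that survives both a missing tty and a seteuid (the function name is made up for illustration):

```python
import getpass
import os
import pwd


def current_username():
    """Name of the user the process is effectively running as.

    os.getlogin() reports the name attached to the controlling tty
    (and fails without one); getpass.getuser() consults the LOGNAME /
    USER / LNAME / USERNAME environment variables first, so it still
    says "root" after a root process setuids down to a tool account.
    Resolving the effective uid through the passwd database avoids
    both problems."""
    return pwd.getpwuid(os.geteuid()).pw_name
```

For the common case where trusting environment variables is fine, `getpass.getuser()` is the stdlib-recommended replacement for `os.getlogin()`, which is valhallasw's point; the euid lookup only matters for the seteuid scenario yuvipanda describes.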
[18:27:57] :) [18:28:31] I was just using this because it was in the os.getlogin docs tho :) [18:29:17] 6operations, 6Labs, 10hardware-requests: labnet1002 - https://phabricator.wikimedia.org/T98740#1280447 (10RobH) So we had to order a 10G card for labnet1001, we'll need to do the same for whatever system is allocated for labnet1002. labnet1001 has the following: 2x Intel Xeon(R) CPU X5650 @ 2.67GHz (6 core... [18:29:40] valhallasw: ideally webservicemonitor would seteuid instead of sudo tho. [18:29:40] andrewbogott: ^ so i dont wanna make that decision for you, but we have a few choices for you to peruse =] [18:30:07] 6operations, 6Labs, 10hardware-requests: labnet1002 - https://phabricator.wikimedia.org/T98740#1280451 (10RobH) p:5Triage>3Normal [18:30:42] yuvipanda: huh? [18:30:49] yuvipanda: also, sudo doesn't just set euid [18:31:11] valhallasw: true. Hence 'ideally' :) [18:31:40] yuvipanda: I'd just use sudo. Less opportunity for security booboos [18:33:12] valhallasw: hmm probably. Me and Coren had a reason for not wanting to use sudo but I forgot [18:33:14] yuvipanda: anyway. At least abstract it into a function. [18:33:36] and use geteuid if you want to use the euid :P [18:34:00] valhallasw: yeah I'm not sure about what to do about the common code between webservice-new and runner [18:34:13] yuvipanda: webservice -type blah -run [18:34:17] And its -new so I can test it concurrently with current webservice [18:34:23] ok [18:34:34] yuvipanda: * webservice run --type blah [18:34:45] or, you know, some other name, maybe more internal :-p [18:34:48] valhallasw: hmm let's see. [18:35:02] Yeah the more internal bit is what I'm not sure how to do [18:35:26] valhallasw: dammit I missed my stop :p [18:35:30] Brb [18:43:17] 10Ops-Access-Requests, 6operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1280465 (10Stu) @damons can you help unstick this blocker for Wes and me? [18:52:28] ok really. a page sent 18.36 and I get it now? 
really not ok [18:53:10] and it seems to be invalid anyways, grrrr [18:58:59] 6operations, 6Labs, 10hardware-requests: labnet1002 - https://phabricator.wikimedia.org/T98740#1280508 (10Andrew) No need for SSDs; lets go with the 410. Thanks. [19:12:41] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1280546 (10Dzahn) [19:13:36] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: onboarding Jaime Crespo in ops - https://phabricator.wikimedia.org/T98775#1280550 (10Dzahn) 5Open>3Resolved Since apergos and Jaime did the Icinga contacts and IRC things, we can call this resolved. Thanks!! [19:21:56] PROBLEM - puppet last run on mw1102 is CRITICAL Puppet has 1 failures [19:22:22] PROBLEM - puppet last run on mw1101 is CRITICAL Puppet has 1 failures [19:22:23] PROBLEM - puppet last run on mw2077 is CRITICAL Puppet has 1 failures [19:23:40] (03PS4) 10Andrew Bogott: Add a couple of settings to the [libvirt] section. [puppet] - 10https://gerrit.wikimedia.org/r/205979 [19:24:02] (03CR) 10Andrew Bogott: [C: 032] Add a couple of settings to the [libvirt] section. [puppet] - 10https://gerrit.wikimedia.org/r/205979 (owner: 10Andrew Bogott) [19:29:11] 6operations, 10Datasets-General-or-Unknown: snaphot1004 running dumps very slowly, investigate - https://phabricator.wikimedia.org/T98585#1280595 (10Tbayer) [19:29:26] PROBLEM - puppet last run on mw1173 is CRITICAL Puppet has 1 failures [19:32:40] 10Ops-Access-Requests, 6operations, 10Graphoid, 5Patch-For-Review: Grant yurik access to sca1001 cluster for graphoid debugging/restarts - https://phabricator.wikimedia.org/T98371#1280606 (10Wwes) I approve as manager [19:36:16] PROBLEM - DPKG on mw1041 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:36:26] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [19:36:44] (03CR) 10BryanDavis: "Tested via cherry-pick in beta cluster." 
[puppet] - 10https://gerrit.wikimedia.org/r/210277 (https://phabricator.wikimedia.org/T98750) (owner: 10BryanDavis) [19:37:04] (03CR) 10BryanDavis: "Tested via cherry-pick in beta cluster." [puppet] - 10https://gerrit.wikimedia.org/r/210278 (owner: 10BryanDavis) [19:39:26] RECOVERY - puppet last run on mw1102 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:39:35] RECOVERY - DPKG on mw1041 is OK: All packages OK [19:39:46] RECOVERY - puppet last run on mw1101 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:39:56] RECOVERY - puppet last run on mw2077 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:42:33] 10Ops-Access-Requests, 6operations: Grant access to stat1002 and stat1003 - https://phabricator.wikimedia.org/T98536#1280637 (10Milimetric) The merged-in task says access is required for testing. I just wanted to point out that testing of Event Logging should happen in beta labs. Instructions for testing can... [19:44:16] PROBLEM - DPKG on mw1042 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:45:26] RECOVERY - puppet last run on mw1173 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:46:03] (03PS1) 10Dzahn: admin: jdouglas for eventlogging-admins [puppet] - 10https://gerrit.wikimedia.org/r/210413 (https://phabricator.wikimedia.org/T98536) [19:47:26] RECOVERY - DPKG on mw1042 is OK: All packages OK [19:48:43] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant access to stat1002 and stat1003 - https://phabricator.wikimedia.org/T98536#1280654 (10Dzahn) What @milimetric said! (thanks). Ignore the patch i uploaded if testing can be done in BetaLabs. If there are other reasons to need it we can go ahead wit... 
[19:50:16] !log Upgrading several app servers to new version of HHVM, expect transient 'DPKG CRITICAL' alerts [19:50:21] Logged the message, Master [19:53:16] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0] [19:53:36] PROBLEM - DPKG on mw1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:56:17] PROBLEM - DPKG on mw1043 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:59:36] RECOVERY - DPKG on mw1043 is OK: All packages OK [20:01:27] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [20:02:46] PROBLEM - puppet last run on mw1078 is CRITICAL Puppet has 1 failures [20:02:46] PROBLEM - puppet last run on db1049 is CRITICAL Puppet has 2 failures [20:02:57] PROBLEM - puppet last run on mw1083 is CRITICAL Puppet has 2 failures [20:03:06] PROBLEM - puppet last run on mw1095 is CRITICAL Puppet has 3 failures [20:03:06] PROBLEM - puppet last run on mw1085 is CRITICAL Puppet has 1 failures [20:03:15] RECOVERY - DPKG on mw1001 is OK: All packages OK [20:03:16] PROBLEM - puppet last run on mw1103 is CRITICAL Puppet has 1 failures [20:03:26] PROBLEM - puppet last run on mw1102 is CRITICAL Puppet has 1 failures [20:03:41] looks for those puppet fails [20:03:47] PROBLEM - puppet last run on mw1101 is CRITICAL Puppet has 2 failures [20:03:49] and expects to see none [20:04:51] mutante: it's possibly related to the upgrade, if apt-get update can't get a lock on the apt cache dir [20:04:53] transient if so [20:05:37] ori: ah, right, that would apply for the DPKG checks, *nod*, but the puppet fails too? 
[20:05:55] mutante: we run 'apt-get update' as part of the puppet run, so possibly [20:06:15] ok, yea, and confirmed those are ok [20:06:17] Notice: /Stage[main]/Hhvm::Debug/Exec[/usr/local/sbin/install-pkg-src hhvm]/returns: executed successfully [20:06:20] Notice: Finished catalog run in 110.60 seconds [20:06:23] [mw1043:~] $ [20:06:26] for example [20:07:28] nod [20:09:05] PROBLEM - DPKG on mw1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:10:36] RECOVERY - DPKG on mw1003 is OK: All packages OK [20:13:00] (03CR) 10Milimetric: [C: 04-1] "Nobody besides analytics should have eventlogging-admins. This is not needed and probably dangerous, unless approved by someone on the An" [puppet] - 10https://gerrit.wikimedia.org/r/210413 (https://phabricator.wikimedia.org/T98536) (owner: 10Dzahn) [20:13:06] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:14:29] (03CR) 10Dzahn: "fair! and thanks, less access is good" [puppet] - 10https://gerrit.wikimedia.org/r/210413 (https://phabricator.wikimedia.org/T98536) (owner: 10Dzahn) [20:14:34] (03Abandoned) 10Dzahn: admin: jdouglas for eventlogging-admins [puppet] - 10https://gerrit.wikimedia.org/r/210413 (https://phabricator.wikimedia.org/T98536) (owner: 10Dzahn) [20:15:27] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Grant access to stat1002 and stat1003 - https://phabricator.wikimedia.org/T98536#1280735 (10Milimetric) Definitely nobody should have eventlogging-admins outside of the analytics team. To look at data on stats1003, people just need access to this file:... 
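The transient DPKG and puppet failures mutante and ori trace above come from two processes contending for the apt/dpkg lock during the HHVM upgrade. A minimal sketch of the workaround ori's diagnosis implies (retry instead of failing the run); the wrapper name and parameters are hypothetical, not the actual puppet Exec:

```shell
#!/bin/sh
# Hypothetical retry wrapper: keep probing until the contended command
# succeeds, rather than failing the whole puppet run on a held dpkg lock.
retry_until() {
  tries=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$tries" ]; do
    if "$@"; then
      echo "ok after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "still failing after $tries attempts" >&2
  return 1
}

# In production this might wrap, e.g.: retry_until 5 2 apt-get update
retry_until 3 0 true
# → ok after 1 attempt(s)
```

The probe command is passed as arguments so the same wrapper works for any lock-contended step.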
[20:17:36] RECOVERY - puppet last run on mw1103 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [20:17:46] RECOVERY - puppet last run on mw1102 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:18:35] RECOVERY - puppet last run on mw1078 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:18:36] RECOVERY - puppet last run on db1049 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:18:56] RECOVERY - puppet last run on mw1083 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [20:19:06] RECOVERY - puppet last run on mw1095 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:19:06] RECOVERY - puppet last run on mw1085 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:19:46] RECOVERY - puppet last run on mw1101 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:26:55] PROBLEM - DPKG on mw1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:30:05] RECOVERY - DPKG on mw1002 is OK: All packages OK [20:32:39] 6operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 5Patch-For-Review: enwiki's job is about 28m atm and increasing - https://phabricator.wikimedia.org/T98621#1280788 (10EoRdE6) Job queue seems to have begun to drop slowly, still near 29 million jobs though. 
[20:33:05] PROBLEM - DPKG on mw1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [20:34:36] RECOVERY - DPKG on mw1004 is OK: All packages OK [20:38:30] 6operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 5Patch-For-Review: enwiki's job is about 28m atm and increasing - https://phabricator.wikimedia.org/T98621#1280807 (10Krenair) "mwscript showJobs.php enwiki --group" shows it as still going up [20:48:17] 6operations, 10MediaWiki-DjVu, 10MediaWiki-General-or-Unknown, 6Multimedia, and 3 others: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1280883 (10aaron) Ignore the last 3 gerrit comments. [20:56:24] !log upgrading salt packages on tin [20:56:30] Logged the message, Master [21:00:04] rmoen, kaldari: Dear anthropoid, the time has come. Please deploy Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150512T2100). [21:06:12] !log manually installed trigger-trebuchet update on tin after accidental salt upgrade there woops :-D [21:06:18] Logged the message, Master [21:06:33] :p [21:07:12] so kids, this package is not yet in the repo, as the code for it is not yet +2ed, it's the quick manual fix [21:07:27] * apergos goes to do a test testrepo check just to make sure [21:10:15] PROBLEM - DPKG on mw1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:11:46] RECOVERY - DPKG on mw1005 is OK: All packages OK [21:14:36] PROBLEM - DPKG on mw1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:14:50] rmoen: still waiting on Jenkins :P [21:15:24] rmoen: should have it up on test.wiki shortly [21:16:01] just merged :) [21:16:11] kaldari, sigh... jenkins is so slow. [21:16:16] RECOVERY - DPKG on mw1006 is OK: All packages OK [21:16:42] last week, i started verifying submodule updates to save time. swat would have been impossible otherwise. 
[21:16:56] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [21:17:51] (03CR) 10BryanDavis: Cleanup base::remote-syslog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/210253 (https://phabricator.wikimedia.org/T98289) (owner: 10BryanDavis) [21:18:05] (03PS3) 10Merlijn van Deen: Tools: Let bigbrother ignore empty lines and comments [puppet] - 10https://gerrit.wikimedia.org/r/202363 (https://phabricator.wikimedia.org/T94990) (owner: 10Tim Landscheidt) [21:18:49] bd808: bah, by ‘comment’ I mean put a comment in the code :D [21:19:22] valhallasw: let me merge. can you test? [21:19:34] yuvipanda: eeeeh [21:19:35] maybe [21:19:42] yuvipanda: sheesh. will do ;) [21:19:47] valhallasw: :P it’s ok to say no :) [21:19:56] yuvipanda: I'm not sure how :P [21:20:02] bigbrother runs... somewhere [21:20:21] valhallasw: it’s mostly: 1. run puppet on tools-submit, 2. verify that bigbrother is still up [21:20:25] valhallasw: it runs on tools-submit [21:20:54] rmoen: it's up on test.wiki now. Looks like it needs a scap though. [21:21:05] kaldari, yeah so that was going to be my question.. [21:21:05] PROBLEM - DPKG on mw1007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:21:28] rmoen: I'll start the scap now, but you can go ahead and test [21:21:31] kaldari, does the language cache get auto rebuilt on deployment train ? [21:21:54] rmoen: I believe so [21:21:59] (03PS4) 10BryanDavis: Cleanup base::remote-syslog [puppet] - 10https://gerrit.wikimedia.org/r/210253 (https://phabricator.wikimedia.org/T98289) [21:22:26] (03PS5) 10Yuvipanda: Cleanup base::remote-syslog [puppet] - 10https://gerrit.wikimedia.org/r/210253 (https://phabricator.wikimedia.org/T98289) (owner: 10BryanDavis) [21:22:29] bd808: <3 sorry about the pedantness :) [21:22:36] RECOVERY - DPKG on mw1007 is OK: All packages OK [21:22:51] yuvipanda: no worries. 
srsbizness [21:23:22] (03CR) 10Yuvipanda: [C: 032] Cleanup base::remote-syslog [puppet] - 10https://gerrit.wikimedia.org/r/210253 (https://phabricator.wikimedia.org/T98289) (owner: 10BryanDavis) [21:24:05] yuvipanda: re https://gerrit.wikimedia.org/r/#/c/194095/, I'm not sure why it shouldn't be in the role instead? I don't really get the argument [21:24:17] yuvipanda: it's something that the role requires, so the role should provision it, right? [21:24:32] !log kaldari Synchronized php-1.26wmf5/extensions/Gather: Updating Gather for 1.26wmf5 (duration: 00m 12s) [21:24:35] this hardcodes it for certain instances, so it wouldn't apply to tools-redis for instance [21:24:39] Logged the message, Master [21:24:44] ty kaldari [21:25:16] !log kaldari Started scap: updating i18n for Gather (1.26wmf5) [21:25:21] Logged the message, Master [21:26:37] bd808: yup, a nop :) [21:26:50] Info: Applying configuration version '1431465837' [21:26:50] Notice: Finished catalog run in 42.66 seconds [21:26:55] * bd808 wipes sweat from brow [21:27:11] and less crappy $::realm switches [21:28:18] bd808: ah [21:28:19] ESC[1;31mError: Could not retrieve catalog from remote server: Error 400 on SERVER: Must pass central_host to Class[Base::Remote_syslog] at /etc/puppet/modules/base/manifests/init.pp:62 on node i-00000bd8.eqiad.wmflabsESC[0m [21:28:22] bd808: on toollabs [21:28:36] * yuvipanda looks at cause [21:28:37] oh... [21:28:52] is there another hiera place to add things for toollabs? [21:29:00] a wiki page [21:29:01] bd808: ah, defaulting central_host to undef should work, I guess [21:29:21] bd808: you aren’t assigning central_host to anything for labs outside of beta [21:29:24] oh! right. empty string or some such [21:29:28] yeah [21:29:37] explicit undef should work better, I think [21:29:54] I guess I could set an undef in the class for that [21:30:07] yeah [21:30:22] then I should toss an error if enable=true and still undef [21:30:29] want me to make a followup? 
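The fix bd808 offers to write makes `central_host` optional when forwarding is off and an error only when `enable => true` with no host set. A hedged shell analogue of that guard, for illustration only (the function name and messages are not the real puppet manifest; the rendered rule mirrors the `30-remote-syslog.conf` line quoted later in this log):

```shell
#!/bin/sh
# Hypothetical analogue of the puppet logic: render no forwarding rule
# when disabled, and only require central_host when forwarding is on.
remote_syslog_rule() {
  enable=$1; central_host=$2
  if [ "$enable" != "true" ]; then
    echo "# forwarding disabled, nothing rendered"
    return 0
  fi
  if [ -z "$central_host" ]; then
    echo "error: central_host must be set when enable=true" >&2
    return 1
  fi
  echo "*.info;mail.none;authpriv.none;cron.none @${central_host}"
}

remote_syslog_rule true deployment-bastion.eqiad.wmflabs
# → *.info;mail.none;authpriv.none;cron.none @deployment-bastion.eqiad.wmflabs
```

Erroring only on the enabled-but-unset combination is what keeps the change a no-op for labs hosts outside beta.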
[21:31:05] PROBLEM - DPKG on mw1009 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:31:21] bd808: sure! [21:31:25] bd808: I can too if you’re busy with other stuff [21:31:59] (03CR) 10Dereckson: "If you don't plan to perform other operations if wmgVectorUseSimpleSearch is true, you can directly add in InitialiseSettings.php wgVecto" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186916 (https://phabricator.wikimedia.org/T19471) (owner: 10Nemo bis) [21:32:23] yuvipanda: on it [21:32:29] bd808: thanks! [21:32:36] RECOVERY - DPKG on mw1009 is OK: All packages OK [21:34:36] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:34:42] 6operations, 6WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1281013 (10Dzahn) >>! In T98722#1278624, @Krenair wrote: >>>! In T98722#1277985, @Dzahn wrote: >>>>! In T98722#1277350, @Krenair wrote: >>> #WMF-NDA-Requests is really supposed to be for volunteers to... [21:35:01] (03PS1) 10BryanDavis: base::remote_syslog: handle $enable=>false more gracefully [puppet] - 10https://gerrit.wikimedia.org/r/210614 [21:35:50] yuvipanda: ^ error message bikeshedding allowed [21:36:14] bd808: nah, lgtm [21:36:22] (03CR) 10Yuvipanda: [C: 032] base::remote_syslog: handle $enable=>false more gracefully [puppet] - 10https://gerrit.wikimedia.org/r/210614 (owner: 10BryanDavis) [21:37:37] 6operations, 6WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1281024 (10Slaporte) @Dzahn, I confirmed that @Zhouz is an employee and under NDA. [21:38:25] PROBLEM - DPKG on mw1011 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:38:59] 6operations, 6WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1281026 (10Dzahn) >>! In T98722#1278472, @Qgil wrote:> >Someone with permissions could update the description of #WMF-NDA and add @ZhouZ. 
@qgil agree, looking at history you have added over 100 peop... [21:40:55] 6operations, 10Citoid, 6Services: Separate service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#1281036 (10GWicke) @mvolz, I see deployment-zotero01 in deployment-prep, which has the puppet 'citoid' role applied. Is that what you were looking for? [21:41:45] RECOVERY - DPKG on mw1011 is OK: All packages OK [21:42:44] bd808: hmm, maybe not. so after a puppet run, I still see [21:42:46] root@tools-exec-1211:/home/yuvipanda# cat /etc/rsyslog.d/30-remote-syslog.conf [21:42:49] *.info;mail.none;authpriv.none;cron.none @deployment-bastion.eqiad.wmflabs [21:42:54] * yuvipanda looks at puppet [21:43:11] yuvipanda: hmm... [21:44:11] ::rsyslog should clean that out -- https://github.com/wikimedia/operations-puppet/blob/production/modules/rsyslog/manifests/init.pp#L17-L19 [21:44:42] unless the host you picked doesn't apply rsyslog? [21:45:08] yuvipanda: It could be that -- https://github.com/wikimedia/operations-puppet/blob/production/modules/rsyslog/manifests/conf.pp#L35 [21:45:29] bd808: yeah, that sounds like it [21:45:29] without the ::rsyslog::conf then ::rsyslog might not be applied at all [21:45:30] err [21:45:33] looks like it [21:45:38] hmm, should it be? [21:45:46] well, I guess it should - for it to be truly a noop [21:45:48] seems likely [21:46:01] well also so that there is a syslog server [21:46:06] PROBLEM - DPKG on mw1012 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:46:17] heh, right [21:46:54] so we could include ::rsyslog in ::base::remote_syslog [21:47:10] even if forwarding is disabled [21:47:21] hmm [21:47:22] or we could include in ::base [21:47:25] maybe we should just include it in base [21:47:32] and not include base::remote_syslog [21:47:44] hmm, that’ll confuse prod [21:47:59] yeah. 
that's not the way [21:48:03] yeah [21:48:06] I think ::rsyslog in ::base [21:48:10] yup, +1 [21:48:33] !log kaldari Finished scap: updating i18n for Gather (1.26wmf5) (duration: 23m 17s) [21:48:33] bd808 patch incoming [21:48:38] Logged the message, Master [21:49:04] (03PS1) 10Yuvipanda: base: Include ::rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/210615 [21:49:06] bd808: ^ [21:49:16] RECOVERY - DPKG on mw1012 is OK: All packages OK [21:49:16] rmoen: hmm, scap finished, but I'm still seeing a broken message on test.wiki [21:49:29] gather-create-collection-button-label [21:49:40] kaldari: loaded via javascript? [21:49:57] (03CR) 10BryanDavis: [C: 031] base: Include ::rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/210615 (owner: 10Yuvipanda) [21:50:42] 6operations, 6WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1281048 (10Dzahn) >>! In T98722#1281024, @Slaporte wrote: > @Dzahn, I confirmed that @Zhouz is an employee and under NDA. He will need to access WMF-NDA as part of his work. Ok, thanks! Added to the... [21:50:49] (03CR) 10Yuvipanda: [C: 032] base: Include ::rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/210615 (owner: 10Yuvipanda) [21:51:07] 6operations, 6WMF-NDA-Requests: ZhouZ needs access to WMF-NDA group - https://phabricator.wikimedia.org/T98722#1281053 (10Dzahn) 5Open>3Resolved [21:51:08] bd808: probably [21:51:24] I see it in Special::AllMessages [21:51:35] scap doesn't purge RL caches [21:51:36] PROBLEM - DPKG on mw1013 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:53:37] yuvipanda: the next question to ask yourself is where those labs host really should be sending their syslogs [21:53:53] bd808: in general nowhere, I guess. [21:53:57] unless explicitly configured [21:54:17] does that just effect VMs in labs? [21:54:46] RECOVERY - DPKG on mw1013 is OK: All packages OK [21:54:48] bd808: right now? yes [21:55:10] bd808: When do RL caches expire? 
or do they need to be purged explicitly? [21:55:13] bd808: it’s kind of status quo anyway - anything outside of beta has so far not been sending their logs to anywhere where they were being received [21:56:32] kaldari: the nightly l10nupdate job purges the RL caches. I never have internalized the rules for when they expire normally [21:56:43] bd808: so basically, we’re back to status quo now but just… cleaner [21:56:58] now I’m tempted to setup logstash for toollabs right away, but must…do…other…things…first [21:56:59] bd808: that's fine, as long as it's purged by tomorrow [21:57:23] It should be purged in ~5 hours or so by l10nupdate [21:57:51] yuvipanda: patience grasshopper [21:58:06] yeah, I do keep telling myself that :) [21:58:46] PROBLEM - DPKG on mw1014 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:59:30] 6operations, 10Architecture, 10MediaWiki-RfCs, 10RESTBase, and 4 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#1281085 (10GWicke) @arielglenn, we [discussed this last week at the IRC meeting](http://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015... [22:00:35] RECOVERY - DPKG on mw1014 is OK: All packages OK [22:02:36] Hi! Can anyone give me some advice about getting this core change merged and deployed? https://gerrit.wikimedia.org/r/#/c/202925/ [22:03:21] It changes a method signature that's overridden in several extensions [22:03:33] The change is needed for some new features in CentralNotice [22:05:07] PROBLEM - DPKG on mw1015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [22:05:08] legoktm: greg-g: ^ ? (sorry not sure whom to ping) [22:05:38] RL changes... RoanKattouw ? ^ [22:05:47] AndyRussG: a better commit message explaining the use-case and impact would go a long way [22:06:23] "This could be used by a CentralNotice feature that determines dependencies dynamically sever-side." is quite vague. 
You link to another patchset, but do the work for your reviewers and distill it for them into the commit message of this specific change that you're asking them to review. [22:06:49] RECOVERY - DPKG on mw1015 is OK: All packages OK [22:06:50] ori: sure...! an explanation is in my comment from Apr. 8, but I can certainly add more info to the commit message [22:06:54] for example: ' It changes a method signature that's overridden in several extensions' -- which extensions? how are you proposing to avoid breakage? [22:07:16] ori: all that is also in the Gerrit discussion, there are patches ready to go for all the extensions involved [22:07:29] I'll add those to the commit message too [22:07:33] cool [22:07:37] :) [22:08:13] Yeah, WRT RL itself, it's actually not that involved... I'm more concerned with the deploy side of things, coordinating it all, etc [22:12:38] 6operations, 10ops-eqiad, 5Patch-For-Review: humidity sensors in eqiad row c/d showing alarms - https://phabricator.wikimedia.org/T98721#1281106 (10Cmjohnson) Received this update Please find the Sites notes on the Ticket….The standard temperature service level agreements (SLA) are 64.4-80.6. The humidity... [22:13:50] 6operations: upgrade salt on production cluster - https://phabricator.wikimedia.org/T98580#1281107 (10ArielGlenn) so salt on tin was accidentally upgraded late tonight. I have a package built for trebuchet-trigger that I have installed manually on tin so that deployments aren't broken (see /home/ariel for the d... [22:19:12] (03CR) 10Andrew Bogott: "@Giuseppe, did you say on IRC that you have a working version of this someplace?" [puppet] - 10https://gerrit.wikimedia.org/r/202924 (owner: 10Andrew Bogott) [22:30:21] 6operations: Need WMF-NDA group access for Zhou Zhou (Legal Counsel, WMF) - https://phabricator.wikimedia.org/T98787#1281140 (10Dzahn) 5Open>3Resolved a:3Dzahn Hi ZhouZ, this looks like a duplicate of T98722. You have been added to the WMF-NDA group there already. 
Closing this as resolved. [22:55:07] (03PS1) 10Dzahn: hack needed for ru.wp being https-only [debs/wikistats] - 10https://gerrit.wikimedia.org/r/210617 (https://phabricator.wikimedia.org/T97476) [22:55:45] (03PS2) 10Dzahn: hack needed for ru.wp being https-only [debs/wikistats] - 10https://gerrit.wikimedia.org/r/210617 (https://phabricator.wikimedia.org/T97476) [22:56:20] (03CR) 10Dzahn: [C: 032] hack needed for ru.wp being https-only [debs/wikistats] - 10https://gerrit.wikimedia.org/r/210617 (https://phabricator.wikimedia.org/T97476) (owner: 10Dzahn) [22:56:28] (03CR) 10Dzahn: [V: 032] hack needed for ru.wp being https-only [debs/wikistats] - 10https://gerrit.wikimedia.org/r/210617 (https://phabricator.wikimedia.org/T97476) (owner: 10Dzahn) [23:00:08] RoanKattouw, ^d, kaldari, AaronSchulz, ^d, rmoen: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150512T2300). [23:00:20] I guess it's me since I have a patch :) [23:00:49] rmoen, AaronSchulz, kaldari: Pingity [23:01:03] howdy [23:01:05] tacotuesday, hi [23:01:17] tacotuesday: You doing the SWAT? 
[23:01:17] You guys are first, easy config [23:01:18] Yeah I shall [23:01:25] I have a patch myself today [23:01:25] thanks [23:02:06] (03CR) 10Chad: [C: 032] Updating trademark symbols in mobile per Legal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210254 (owner: 10Kaldari) [23:02:08] (03CR) 10Chad: [C: 032] Enable Gather on WikiVoyage and Hebrew wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) (owner: 10Jdlrobson) [23:02:14] (03Merged) 10jenkins-bot: Updating trademark symbols in mobile per Legal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210254 (owner: 10Kaldari) [23:02:17] (03Merged) 10jenkins-bot: Enable Gather on WikiVoyage and Hebrew wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208615 (https://phabricator.wikimedia.org/T97488) (owner: 10Jdlrobson) [23:03:11] !log demon Synchronized wmf-config/InitialiseSettings.php: swat (duration: 00m 12s) [23:03:17] Logged the message, Master [23:03:20] kaldari, rmoen: That's both of you, plz verify ^ [23:03:27] tacotuesday, db updates [23:03:29] checking... [23:03:55] on enwikivoyage and hewiki [23:04:23] tacotuesday: looks great [23:04:28] Is there some doc on how the extensions repo is supposed to be used? [23:04:48] No [23:04:59] Negative24: are you new around here? :P [23:04:59] rmoen: gather_list_item.sql and gather_list.sql? [23:05:15] tacotuesday, aye the sql files [23:05:38] (03PS1) 10Thcipriani: Add script from jenkins beta-update-databases [puppet] - 10https://gerrit.wikimedia.org/r/210618 [23:05:41] legoktm: relatively yes but the question isn't as obvious as it seems. :) I'm looking for something specific... [23:06:08] Negative24: I'm joking :P it's a reference to: are there docs for this anywhere? <@^demon> Hahahahaha. You must be new here. 
[23:06:10] rmoen: All done, both wikis [23:06:27] (03CR) 10jenkins-bot: [V: 04-1] Add script from jenkins beta-update-databases [puppet] - 10https://gerrit.wikimedia.org/r/210618 (owner: 10Thcipriani) [23:06:38] (03CR) 10Chad: [C: 032] Bumped the $wgJobBackoffThrottling refreshLinks limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210246 (https://phabricator.wikimedia.org/T98621) (owner: 10Aaron Schulz) [23:06:40] (03CR) 10jenkins-bot: [V: 04-1] Bumped the $wgJobBackoffThrottling refreshLinks limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210246 (https://phabricator.wikimedia.org/T98621) (owner: 10Aaron Schulz) [23:06:56] tacotuesday, looks good [23:06:59] tacotuesday, ty [23:07:07] legoktm: haha [23:07:16] (03PS4) 10Chad: Bumped the $wgJobBackoffThrottling refreshLinks limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210246 (https://phabricator.wikimedia.org/T98621) (owner: 10Aaron Schulz) [23:09:27] legoktm: That ^demon guy sounds like a real jerk [23:10:56] !log demon Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 11s) [23:11:01] Logged the message, Master [23:11:02] (03PS2) 10Thcipriani: Add script from jenkins beta-update-databases [puppet] - 10https://gerrit.wikimedia.org/r/210618 (https://phabricator.wikimedia.org/T96199) [23:11:03] AaronSchulz: That's you ^ [23:11:36] tacotuesday: he does, but he has some great quips :P [23:15:15] thedj: ori: greg-g: (and anyone else :) ) another related question: if I checkout a production branch of core and do submodule update --init --recursive, will I get all the extensions that are deployed on WMF wikis? Or are there any that aren't in there? [23:15:27] you'll get all of them [23:16:31] With a fair bit of luck we can stop using those awful submodules soon [23:16:47] ori: fantastic, thx! [23:18:01] tacotuesday: what are you alluding to? [23:18:03] !log Upgrading more HHVMs; DPKG alerts likely but they will be transient. 
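ori's answer to AndyRussG above (check out a deployment branch of core, run `git submodule update --init --recursive`, and you get every deployed extension) can be demonstrated with a throwaway local repo. Everything below is a stand-in for mediawiki/core, not the real clone, and the wrapper config is only there so the demo runs without global git settings:

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d); cd "$tmp"
# wrapper: identity and local-protocol settings for a self-contained demo
g() { git -c user.name=demo -c user.email=demo@example.org \
          -c protocol.file.allow=always "$@"; }

g init -q ext                                  # stand-in extension repo
(cd ext && g commit -q --allow-empty -m 'extension')
g init -q core                                 # stand-in for core
(cd core && g submodule add "$tmp/ext" extensions/Ext >/dev/null 2>&1 \
         && g commit -q -m 'register extension submodule')

g clone -q "$tmp/core" deployed && cd deployed
test ! -e extensions/Ext/.git                  # empty until the update
g submodule update --init --recursive >/dev/null
test -e extensions/Ext/.git && echo "extension checked out"
```

A fresh clone only records the submodule pointers in `.gitmodules`; the `update --init --recursive` step is what actually populates the extension working trees.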
[23:18:08] Logged the message, Master [23:18:18] * Negative24 looks suspiciously at ^demon [23:18:27] Negative24: Hopefully we can stop using submodules in the deployment branches anymore. [23:19:31] how would that happen? [23:19:46] PROBLEM - DPKG on mw1054 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:19:55] PROBLEM - DPKG on mw1056 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:19:55] PROBLEM - DPKG on mw1029 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:19:55] PROBLEM - DPKG on mw1036 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:19:56] PROBLEM - DPKG on mw1059 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:19:56] PROBLEM - DPKG on mw1063 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:19:56] PROBLEM - DPKG on mw1031 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:19:56] PROBLEM - DPKG on mw1044 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:06] PROBLEM - DPKG on mw1045 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:06] PROBLEM - DPKG on mw1050 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:06] PROBLEM - DPKG on mw1058 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:15] PROBLEM - DPKG on mw1016 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:15] PROBLEM - DPKG on mw1034 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:16] PROBLEM - DPKG on mw1035 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:16] PROBLEM - DPKG on mw1069 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:16] PROBLEM - DPKG on mw1046 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:16] PROBLEM - DPKG on mw1064 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:16] PROBLEM - DPKG on mw1040 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:16] PROBLEM - DPKG on mw1061 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:17] PROBLEM - DPKG on 
mw1037 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:21] Negative24: By finding a better way, likely using git subtrees. [23:20:21] that's me, see log message above. [23:20:26] PROBLEM - DPKG on mw1076 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:26] PROBLEM - DPKG on mw1068 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:26] PROBLEM - DPKG on mw1028 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:27] submodules are bleh [23:20:27] PROBLEM - DPKG on mw1032 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:36] PROBLEM - DPKG on mw1072 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:38] PROBLEM - DPKG on mw1039 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:38] PROBLEM - DPKG on mw1051 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:38] PROBLEM - DPKG on mw1057 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:45] PROBLEM - DPKG on mw1027 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:46] PROBLEM - DPKG on mw1033 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:46] PROBLEM - DPKG on mw1047 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:55] PROBLEM - DPKG on mw1073 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:56] PROBLEM - DPKG on mw1048 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:56] PROBLEM - DPKG on mw1055 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:56] PROBLEM - DPKG on mw1052 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:20:56] PROBLEM - DPKG on mw1038 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:21:05] PROBLEM - DPKG on mw1066 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:21:05] PROBLEM - DPKG on mw1067 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:21:05] tacotuesday: I have noticed. 
I think twentyafterfour was looking into that [23:21:06] PROBLEM - DPKG on mw1053 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:21:06] PROBLEM - DPKG on mw1071 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:21:06] PROBLEM - DPKG on mw1077 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:21:07] PROBLEM - DPKG on mw1070 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:21:14] oh... [23:21:15] PROBLEM - DPKG on mw1060 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:21:15] PROBLEM - DPKG on mw1065 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:21:15] PROBLEM - DPKG on mw1026 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:21:16] PROBLEM - DPKG on mw1078 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:21:16] PROBLEM - DPKG on mw1075 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:21:17] my... [23:21:25] PROBLEM - DPKG on mw1030 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:21:26] PROBLEM - DPKG on mw1074 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:21:29] Negative24: Yes, which is what I was talking about ;-) [23:21:35] sorry for the alert spam. 
the check should really return UNKNOWN if there is an apt process running [23:21:45] RECOVERY - DPKG on mw1045 is OK: All packages OK [23:21:56] RECOVERY - DPKG on mw1046 is OK: All packages OK [23:21:56] RECOVERY - DPKG on mw1037 is OK: All packages OK [23:22:06] PROBLEM - puppet last run on mw1070 is CRITICAL Puppet has 3 failures [23:22:06] PROBLEM - puppet last run on mw1058 is CRITICAL Puppet has 1 failures [23:22:15] RECOVERY - DPKG on mw1051 is OK: All packages OK [23:22:16] RECOVERY - DPKG on mw1057 is OK: All packages OK [23:22:17] RECOVERY - DPKG on mw1033 is OK: All packages OK [23:22:25] PROBLEM - puppet last run on mw1047 is CRITICAL Puppet has 2 failures [23:22:25] RECOVERY - DPKG on mw1047 is OK: All packages OK [23:22:26] RECOVERY - DPKG on mw1073 is OK: All packages OK [23:22:26] RECOVERY - DPKG on mw1048 is OK: All packages OK [23:22:36] RECOVERY - DPKG on mw1055 is OK: All packages OK [23:22:36] RECOVERY - DPKG on mw1052 is OK: All packages OK [23:22:36] RECOVERY - DPKG on mw1038 is OK: All packages OK [23:22:36] RECOVERY - DPKG on mw1066 is OK: All packages OK [23:22:36] RECOVERY - DPKG on mw1067 is OK: All packages OK [23:22:45] RECOVERY - DPKG on mw1071 is OK: All packages OK [23:22:46] RECOVERY - DPKG on mw1077 is OK: All packages OK [23:22:46] RECOVERY - DPKG on mw1070 is OK: All packages OK [23:22:47] RECOVERY - DPKG on mw1060 is OK: All packages OK [23:22:47] RECOVERY - DPKG on mw1065 is OK: All packages OK [23:22:56] RECOVERY - DPKG on mw1075 is OK: All packages OK [23:22:57] RECOVERY - DPKG on mw1030 is OK: All packages OK [23:23:06] RECOVERY - DPKG on mw1054 is OK: All packages OK [23:23:06] RECOVERY - DPKG on mw1074 is OK: All packages OK [23:23:07] RECOVERY - DPKG on mw1056 is OK: All packages OK [23:23:07] RECOVERY - DPKG on mw1029 is OK: All packages OK [23:23:07] RECOVERY - DPKG on mw1036 is OK: All packages OK [23:23:15] RECOVERY - DPKG on mw1059 is OK: All packages OK [23:23:15] RECOVERY - DPKG on mw1063 is OK: All packages OK 
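ori's point above, that the DPKG check should return UNKNOWN while an apt process is running, maps onto the Nagios plugin convention that exit code 3 means UNKNOWN. A sketch of such a wrapper, with the process probe injected as an argument so the logic is testable; names and messages are illustrative, not the real plugin:

```shell
#!/bin/sh
# Exit 3 (UNKNOWN) while the package manager holds the system, otherwise
# run the real check. In production the probe might be something like:
#   pgrep -x apt-get || pgrep -x dpkg
dpkg_check_wrapper() {
  probe=$1; shift
  if $probe >/dev/null 2>&1; then
    echo "DPKG UNKNOWN: package manager currently running"
    return 3
  fi
  "$@"    # e.g. the existing check_dpkg plugin
}

dpkg_check_wrapper false echo "DPKG OK: All packages OK"
# → DPKG OK: All packages OK
```

With a truthy probe the wrapper short-circuits to UNKNOWN instead of letting a half-upgraded package list page as CRITICAL.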
[23:23:16] RECOVERY - DPKG on mw1031 is OK: All packages OK [23:23:16] PROBLEM - puppet last run on mw1075 is CRITICAL Puppet has 2 failures [23:23:16] RECOVERY - DPKG on mw1044 is OK: All packages OK [23:23:17] RECOVERY - DPKG on mw1050 is OK: All packages OK [23:23:17] RECOVERY - DPKG on mw1058 is OK: All packages OK [23:23:26] !log demon Synchronized php-1.26wmf5/includes/media/DjVu.php: (no message) (duration: 00m 12s) [23:23:26] RECOVERY - DPKG on mw1016 is OK: All packages OK [23:23:26] RECOVERY - DPKG on mw1034 is OK: All packages OK [23:23:27] RECOVERY - DPKG on mw1035 is OK: All packages OK [23:23:27] RECOVERY - DPKG on mw1069 is OK: All packages OK [23:23:27] RECOVERY - DPKG on mw1064 is OK: All packages OK [23:23:27] RECOVERY - DPKG on mw1040 is OK: All packages OK [23:23:27] RECOVERY - DPKG on mw1061 is OK: All packages OK [23:23:31] Logged the message, Master [23:23:36] RECOVERY - DPKG on mw1076 is OK: All packages OK [23:23:36] RECOVERY - DPKG on mw1068 is OK: All packages OK [23:23:36] RECOVERY - DPKG on mw1028 is OK: All packages OK [23:23:37] RECOVERY - DPKG on mw1032 is OK: All packages OK [23:23:45] !log demon Synchronized php-1.26wmf5/includes/jobqueue/jobs/RefreshLinksJob.php: (no message) (duration: 00m 12s) [23:23:46] RECOVERY - DPKG on mw1072 is OK: All packages OK [23:23:46] RECOVERY - DPKG on mw1039 is OK: All packages OK [23:23:53] Logged the message, Master [23:23:56] RECOVERY - DPKG on mw1027 is OK: All packages OK [23:24:04] !log demon Synchronized php-1.26wmf4/includes/jobqueue/jobs/RefreshLinksJob.php: (no message) (duration: 00m 11s) [23:24:10] Logged the message, Master [23:24:15] AaronSchulz: All your swat stuff is live [23:24:16] PROBLEM - DPKG on mw1079 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:24:26] RECOVERY - DPKG on mw1026 is OK: All packages OK [23:24:26] RECOVERY - DPKG on mw1078 is OK: All packages OK [23:24:46] PROBLEM - DPKG on mw1080 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:06] 
PROBLEM - DPKG on mw1086 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:06] PROBLEM - DPKG on mw1085 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:13] mutante: (T97476) ah ;) - but why not access all of them via HTTPS? it gives me an uneasy feeling knowing that only the Russians are safe from having the NSA manipulate their article numbers using QUANTUM insert attacks [23:25:16] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 13.33% of data above the critical threshold [500.0] [23:25:27] PROBLEM - DPKG on mw1087 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:35] PROBLEM - DPKG on mw1081 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:35] PROBLEM - DPKG on mw1090 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:36] PROBLEM - DPKG on mw1096 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:46] PROBLEM - DPKG on mw1101 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:47] PROBLEM - DPKG on mw1088 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:47] PROBLEM - DPKG on mw1092 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:47] PROBLEM - DPKG on mw1084 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:47] PROBLEM - DPKG on mw1083 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:47] PROBLEM - DPKG on mw1097 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:55] RECOVERY - DPKG on mw1079 is OK: All packages OK [23:25:55] PROBLEM - DPKG on mw1104 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:25:57] PROBLEM - DPKG on mw1100 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:07] PROBLEM - DPKG on mw1103 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:09] !log demon Synchronized php-1.26wmf4/extensions/CirrusSearch/: (no message) (duration: 00m 12s) [23:26:14] Logged the message, Master [23:26:16] PROBLEM - DPKG on mw1125 is CRITICAL: DPKG CRITICAL dpkg reports 
broken packages [23:26:16] PROBLEM - DPKG on mw1129 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:16] PROBLEM - DPKG on mw1105 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:16] PROBLEM - DPKG on mw1093 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:16] PROBLEM - DPKG on mw1094 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:26] PROBLEM - DPKG on mw1082 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:26] PROBLEM - DPKG on mw1107 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:36] PROBLEM - DPKG on mw1108 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:36] PROBLEM - DPKG on mw1120 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:37] PROBLEM - DPKG on mw1130 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:37] PROBLEM - DPKG on mw1109 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:37] PROBLEM - DPKG on mw1098 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:45] PROBLEM - DPKG on mw1124 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:45] PROBLEM - DPKG on mw1133 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:46] PROBLEM - DPKG on mw1095 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:46] PROBLEM - DPKG on mw1089 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:46] PROBLEM - DPKG on mw1091 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:46] PROBLEM - DPKG on mw1106 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:46] PROBLEM - DPKG on mw1111 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:56] PROBLEM - DPKG on mw1099 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:56] PROBLEM - DPKG on mw1110 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:57] PROBLEM - DPKG on mw1102 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:26:57] PROBLEM - DPKG on mw1123 is CRITICAL: 
DPKG CRITICAL dpkg reports broken packages [23:26:57] PROBLEM - DPKG on mw1113 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:27:05] PROBLEM - DPKG on mw1112 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:27:06] PROBLEM - DPKG on mw1128 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:27:06] PROBLEM - DPKG on mw1122 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:27:25] PROBLEM - DPKG on mw1134 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:27:25] RECOVERY - DPKG on mw1101 is OK: All packages OK [23:27:25] PROBLEM - DPKG on mw1121 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:27:26] PROBLEM - DPKG on mw1126 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:27:26] RECOVERY - DPKG on mw1084 is OK: All packages OK [23:27:26] RECOVERY - DPKG on mw1053 is OK: All packages OK [23:27:26] PROBLEM - DPKG on mw1132 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:27:35] PROBLEM - DPKG on mw1127 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:27:56] PROBLEM - DPKG on mw1131 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:27:56] RECOVERY - DPKG on mw1093 is OK: All packages OK [23:27:56] RECOVERY - DPKG on mw1094 is OK: All packages OK [23:27:57] RECOVERY - DPKG on mw1082 is OK: All packages OK [23:27:57] RECOVERY - DPKG on mw1080 is OK: All packages OK [23:28:05] RECOVERY - DPKG on mw1107 is OK: All packages OK [23:28:07] RECOVERY - DPKG on mw1108 is OK: All packages OK [23:28:16] RECOVERY - DPKG on mw1120 is OK: All packages OK [23:28:16] RECOVERY - DPKG on mw1109 is OK: All packages OK [23:28:16] RECOVERY - DPKG on mw1098 is OK: All packages OK [23:28:16] RECOVERY - DPKG on mw1124 is OK: All packages OK [23:28:16] RECOVERY - DPKG on mw1095 is OK: All packages OK [23:28:17] RECOVERY - DPKG on mw1086 is OK: All packages OK [23:28:17] RECOVERY - DPKG on mw1089 is OK: All packages OK [23:28:17] RECOVERY - DPKG on mw1085 is OK: All packages OK [23:28:18] 
RECOVERY - DPKG on mw1091 is OK: All packages OK [23:28:25] RECOVERY - DPKG on mw1106 is OK: All packages OK [23:28:26] RECOVERY - DPKG on mw1111 is OK: All packages OK [23:28:26] RECOVERY - DPKG on mw1099 is OK: All packages OK [23:28:35] RECOVERY - DPKG on mw1110 is OK: All packages OK [23:28:35] RECOVERY - DPKG on mw1102 is OK: All packages OK [23:28:35] RECOVERY - DPKG on mw1123 is OK: All packages OK [23:28:36] RECOVERY - DPKG on mw1113 is OK: All packages OK [23:28:36] RECOVERY - DPKG on mw1112 is OK: All packages OK [23:28:36] RECOVERY - DPKG on mw1087 is OK: All packages OK [23:28:45] RECOVERY - DPKG on mw1128 is OK: All packages OK [23:28:45] RECOVERY - DPKG on mw1122 is OK: All packages OK [23:28:45] RECOVERY - DPKG on mw1081 is OK: All packages OK [23:28:46] RECOVERY - DPKG on mw1090 is OK: All packages OK [23:28:46] RECOVERY - DPKG on mw1096 is OK: All packages OK [23:28:56] RECOVERY - DPKG on mw1134 is OK: All packages OK [23:28:56] RECOVERY - DPKG on mw1088 is OK: All packages OK [23:28:56] RECOVERY - DPKG on mw1092 is OK: All packages OK [23:28:57] RECOVERY - DPKG on mw1121 is OK: All packages OK [23:28:57] RECOVERY - DPKG on mw1126 is OK: All packages OK [23:28:57] RECOVERY - DPKG on mw1083 is OK: All packages OK [23:28:57] RECOVERY - DPKG on mw1097 is OK: All packages OK [23:29:06] RECOVERY - DPKG on mw1132 is OK: All packages OK [23:29:06] RECOVERY - DPKG on mw1104 is OK: All packages OK [23:29:06] RECOVERY - DPKG on mw1127 is OK: All packages OK [23:29:15] RECOVERY - DPKG on mw1100 is OK: All packages OK [23:29:16] RECOVERY - DPKG on mw1103 is OK: All packages OK [23:29:27] RECOVERY - DPKG on mw1125 is OK: All packages OK [23:29:27] RECOVERY - DPKG on mw1131 is OK: All packages OK [23:29:27] RECOVERY - DPKG on mw1129 is OK: All packages OK [23:29:27] RECOVERY - DPKG on mw1105 is OK: All packages OK [23:29:55] RECOVERY - DPKG on mw1130 is OK: All packages OK [23:29:56] RECOVERY - DPKG on mw1133 is OK: All packages OK [23:30:06] PROBLEM - DPKG on 
mw1135 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:30:57] PROBLEM - DPKG on mw1136 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:06] PROBLEM - DPKG on mw1140 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:06] PROBLEM - DPKG on mw1166 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:06] PROBLEM - DPKG on mw1137 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:06] PROBLEM - DPKG on mw1142 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:07] PROBLEM - DPKG on mw1165 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:15] PROBLEM - DPKG on mw1178 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:25] PROBLEM - DPKG on mw1143 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:25] PROBLEM - DPKG on mw1167 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:26] PROBLEM - DPKG on mw1181 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:35] PROBLEM - DPKG on mw1139 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:36] PROBLEM - DPKG on mw1171 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:36] PROBLEM - DPKG on mw1144 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:37] PROBLEM - DPKG on mw1138 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:37] RECOVERY - DPKG on mw1135 is OK: All packages OK [23:31:45] PROBLEM - DPKG on mw1150 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:45] PROBLEM - DPKG on mw1141 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:46] PROBLEM - DPKG on mw1190 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:46] PROBLEM - DPKG on mw1184 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:46] PROBLEM - DPKG on mw1172 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:46] PROBLEM - DPKG on mw1175 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:46] PROBLEM - DPKG on mw1145 
is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:47] PROBLEM - DPKG on mw1189 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:55] PROBLEM - DPKG on mw1147 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:56] PROBLEM - DPKG on mw1162 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:56] PROBLEM - DPKG on mw1169 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:56] PROBLEM - DPKG on mw1149 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:56] PROBLEM - DPKG on mw1191 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:31:56] PROBLEM - DPKG on mw1186 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:06] PROBLEM - DPKG on mw1182 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:07] PROBLEM - DPKG on mw1185 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:07] PROBLEM - DPKG on mw1163 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:07] PROBLEM - DPKG on mw1148 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:13] (03PS1) 10EBernhardson: Disable leading wildcard searches in CirrusSearch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210620 (https://phabricator.wikimedia.org/T91666) [23:32:17] PROBLEM - DPKG on mw1176 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:17] PROBLEM - DPKG on mw1170 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:17] PROBLEM - DPKG on mw1161 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:25] PROBLEM - DPKG on mw1146 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:25] PROBLEM - DPKG on mw1183 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:25] PROBLEM - DPKG on mw1179 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:25] PROBLEM - DPKG on mw1177 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:26] PROBLEM - DPKG on mw1164 is CRITICAL: DPKG CRITICAL dpkg reports broken packages 
[23:32:26] PROBLEM - DPKG on mw1168 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:26] PROBLEM - DPKG on mw1174 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:36] RECOVERY - DPKG on mw1136 is OK: All packages OK [23:32:36] PROBLEM - DPKG on mw1194 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:36] PROBLEM - DPKG on mw1187 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:36] PROBLEM - DPKG on mw1180 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:36] PROBLEM - DPKG on mw1188 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:36] PROBLEM - DPKG on mw1173 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:32:46] RECOVERY - DPKG on mw1165 is OK: All packages OK [23:32:46] PROBLEM - puppet last run on mw1180 is CRITICAL Puppet has 1 failures [23:32:46] RECOVERY - DPKG on mw1178 is OK: All packages OK [23:32:56] RECOVERY - DPKG on mw1143 is OK: All packages OK [23:32:57] RECOVERY - DPKG on mw1167 is OK: All packages OK [23:33:06] PROBLEM - DPKG on mw1193 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:33:07] RECOVERY - DPKG on mw1171 is OK: All packages OK [23:33:07] RECOVERY - DPKG on mw1139 is OK: All packages OK [23:33:07] RECOVERY - DPKG on mw1144 is OK: All packages OK [23:33:15] PROBLEM - puppet last run on mw1151 is CRITICAL Puppet has 1 failures [23:33:16] RECOVERY - DPKG on mw1138 is OK: All packages OK [23:33:16] RECOVERY - DPKG on mw1150 is OK: All packages OK [23:33:16] PROBLEM - DPKG on mw1151 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:33:17] RECOVERY - DPKG on mw1141 is OK: All packages OK [23:33:17] RECOVERY - DPKG on mw1190 is OK: All packages OK [23:33:25] RECOVERY - DPKG on mw1184 is OK: All packages OK [23:33:25] RECOVERY - DPKG on mw1172 is OK: All packages OK [23:33:26] RECOVERY - DPKG on mw1175 is OK: All packages OK [23:33:26] RECOVERY - DPKG on mw1145 is OK: All packages OK [23:33:35] RECOVERY - DPKG on mw1147 is OK: All 
packages OK [23:33:35] RECOVERY - DPKG on mw1162 is OK: All packages OK [23:33:35] RECOVERY - DPKG on mw1169 is OK: All packages OK [23:33:35] RECOVERY - DPKG on mw1149 is OK: All packages OK [23:33:36] RECOVERY - DPKG on mw1191 is OK: All packages OK [23:33:36] RECOVERY - DPKG on mw1186 is OK: All packages OK [23:33:46] RECOVERY - DPKG on mw1182 is OK: All packages OK [23:33:46] RECOVERY - DPKG on mw1185 is OK: All packages OK [23:33:46] RECOVERY - DPKG on mw1148 is OK: All packages OK [23:33:56] RECOVERY - DPKG on mw1176 is OK: All packages OK [23:33:56] RECOVERY - DPKG on mw1170 is OK: All packages OK [23:33:56] RECOVERY - DPKG on mw1161 is OK: All packages OK [23:33:56] RECOVERY - DPKG on mw1146 is OK: All packages OK [23:33:56] RECOVERY - DPKG on mw1183 is OK: All packages OK [23:33:57] RECOVERY - DPKG on mw1179 is OK: All packages OK [23:33:57] RECOVERY - DPKG on mw1177 is OK: All packages OK [23:33:58] RECOVERY - DPKG on mw1164 is OK: All packages OK [23:34:06] RECOVERY - DPKG on mw1168 is OK: All packages OK [23:34:06] RECOVERY - DPKG on mw1174 is OK: All packages OK [23:34:12] (03PS1) 10EBernhardson: Enable the CirrusSearch per-user pool counter everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210622 (https://phabricator.wikimedia.org/T76497) [23:34:15] RECOVERY - DPKG on mw1194 is OK: All packages OK [23:34:15] RECOVERY - DPKG on mw1187 is OK: All packages OK [23:34:16] RECOVERY - DPKG on mw1180 is OK: All packages OK [23:34:16] RECOVERY - DPKG on mw1140 is OK: All packages OK [23:34:16] RECOVERY - DPKG on mw1166 is OK: All packages OK [23:34:16] RECOVERY - DPKG on mw1188 is OK: All packages OK [23:34:16] RECOVERY - DPKG on mw1173 is OK: All packages OK [23:34:17] RECOVERY - DPKG on mw1137 is OK: All packages OK [23:34:17] RECOVERY - DPKG on mw1142 is OK: All packages OK [23:34:26] PROBLEM - DPKG on mw1195 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:34:36] RECOVERY - DPKG on mw1181 is OK: All packages OK [23:34:45] RECOVERY 
- DPKG on mw1193 is OK: All packages OK [23:35:06] RECOVERY - DPKG on mw1189 is OK: All packages OK [23:35:25] RECOVERY - DPKG on mw1163 is OK: All packages OK [23:35:27] PROBLEM - DPKG on mw1197 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:05] RECOVERY - DPKG on mw1195 is OK: All packages OK [23:36:06] PROBLEM - DPKG on mw1207 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:16] PROBLEM - puppet last run on mw2140 is CRITICAL Puppet has 1 failures [23:36:16] PROBLEM - DPKG on mw1225 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:17] PROBLEM - DPKG on mw1229 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:25] PROBLEM - DPKG on mw1212 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:25] PROBLEM - DPKG on mw1208 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:25] PROBLEM - DPKG on mw1223 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:25] PROBLEM - DPKG on mw1237 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:26] PROBLEM - DPKG on mw1217 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:26] PROBLEM - DPKG on mw1206 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:26] PROBLEM - DPKG on mw1196 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:27] PROBLEM - DPKG on mw1213 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:27] PROBLEM - DPKG on mw1209 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:35] RECOVERY - DPKG on mw1151 is OK: All packages OK [23:36:36] PROBLEM - DPKG on mw1222 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:36] PROBLEM - DPKG on mw1211 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:36] PROBLEM - DPKG on mw1230 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:45] PROBLEM - DPKG on mw1232 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:45] PROBLEM - puppet last run on mw2187 is CRITICAL Puppet has 1 
failures [23:36:45] PROBLEM - DPKG on mw1205 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:45] PROBLEM - DPKG on mw1202 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:46] PROBLEM - DPKG on mw1218 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:46] PROBLEM - DPKG on mw1215 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:55] PROBLEM - DPKG on mw1227 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:56] PROBLEM - DPKG on mw1201 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:56] PROBLEM - DPKG on mw1219 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:36:56] PROBLEM - DPKG on mw1220 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:06] PROBLEM - DPKG on mw1226 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:06] PROBLEM - DPKG on mw1240 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:06] PROBLEM - DPKG on mw1236 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:06] PROBLEM - DPKG on mw1238 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:06] PROBLEM - DPKG on mw1210 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:07] RECOVERY - DPKG on mw1197 is OK: All packages OK [23:37:07] PROBLEM - DPKG on mw1203 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:16] PROBLEM - DPKG on mw1221 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:16] PROBLEM - DPKG on mw1199 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:16] PROBLEM - DPKG on mw1214 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:25] PROBLEM - DPKG on mw1200 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:26] PROBLEM - DPKG on mw1216 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:35] PROBLEM - DPKG on mw1228 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:35] PROBLEM - DPKG on mw1233 is CRITICAL: DPKG CRITICAL dpkg reports broken 
packages [23:37:36] PROBLEM - DPKG on mw1224 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:37] PROBLEM - DPKG on mw1235 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:46] RECOVERY - DPKG on mw1207 is OK: All packages OK [23:37:46] PROBLEM - DPKG on mw1231 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:37:56] RECOVERY - DPKG on mw1225 is OK: All packages OK [23:37:57] RECOVERY - DPKG on mw1229 is OK: All packages OK [23:37:57] RECOVERY - DPKG on mw1212 is OK: All packages OK [23:38:05] RECOVERY - DPKG on mw1208 is OK: All packages OK [23:38:05] RECOVERY - DPKG on mw1223 is OK: All packages OK [23:38:05] RECOVERY - DPKG on mw1217 is OK: All packages OK [23:38:05] RECOVERY - DPKG on mw1206 is OK: All packages OK [23:38:06] RECOVERY - DPKG on mw1196 is OK: All packages OK [23:38:06] RECOVERY - DPKG on mw1213 is OK: All packages OK [23:38:06] PROBLEM - DPKG on mw1253 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:38:06] RECOVERY - DPKG on mw1209 is OK: All packages OK [23:38:07] PROBLEM - DPKG on mw1241 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:38:08] PROBLEM - DPKG on mw1252 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:38:15] PROBLEM - DPKG on mw1239 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:38:15] RECOVERY - DPKG on mw1222 is OK: All packages OK [23:38:16] RECOVERY - DPKG on mw1211 is OK: All packages OK [23:38:16] RECOVERY - DPKG on mw1230 is OK: All packages OK [23:38:16] RECOVERY - DPKG on mw1232 is OK: All packages OK [23:38:16] PROBLEM - DPKG on mw1250 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:38:16] RECOVERY - puppet last run on mw1047 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [23:38:25] RECOVERY - DPKG on mw1205 is OK: All packages OK [23:38:25] RECOVERY - DPKG on mw1202 is OK: All packages OK [23:38:26] RECOVERY - DPKG on mw1218 is OK: All packages OK [23:38:26] RECOVERY - DPKG on mw1215 is OK: All 
packages OK [23:38:26] PROBLEM - puppet last run on mw1209 is CRITICAL Puppet has 2 failures [23:38:35] RECOVERY - DPKG on mw1227 is OK: All packages OK [23:38:35] RECOVERY - DPKG on mw1201 is OK: All packages OK [23:38:36] RECOVERY - DPKG on mw1219 is OK: All packages OK [23:38:36] RECOVERY - DPKG on mw1220 is OK: All packages OK [23:38:36] PROBLEM - DPKG on mw1258 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:38:36] PROBLEM - DPKG on mw1248 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:38:36] PROBLEM - DPKG on mw1257 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:38:37] RECOVERY - DPKG on mw1226 is OK: All packages OK [23:38:45] PROBLEM - DPKG on mw1247 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:38:45] RECOVERY - DPKG on mw1240 is OK: All packages OK [23:38:46] RECOVERY - DPKG on mw1236 is OK: All packages OK [23:38:46] PROBLEM - DPKG on mw1243 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:38:46] RECOVERY - DPKG on mw1238 is OK: All packages OK [23:38:46] RECOVERY - DPKG on mw1210 is OK: All packages OK [23:38:46] PROBLEM - DPKG on mw1245 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:38:47] RECOVERY - DPKG on mw1203 is OK: All packages OK [23:38:47] PROBLEM - DPKG on mw1249 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:38:47] RECOVERY - DPKG on mw1221 is OK: All packages OK [23:38:56] RECOVERY - DPKG on mw1199 is OK: All packages OK [23:38:56] RECOVERY - DPKG on mw1214 is OK: All packages OK [23:38:57] RECOVERY - DPKG on mw1200 is OK: All packages OK [23:39:01] (03CR) 10QChris: Adding task support instead of using Bug: which was for bugzilla (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/209741 (owner: 10Paladox) [23:39:05] RECOVERY - DPKG on mw1216 is OK: All packages OK [23:39:06] RECOVERY - DPKG on mw1233 is OK: All packages OK [23:39:06] RECOVERY - DPKG on mw1228 is OK: All packages OK [23:39:06] PROBLEM - DPKG on mw1254 is CRITICAL: DPKG CRITICAL 
dpkg reports broken packages [23:39:15] RECOVERY - puppet last run on mw1075 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [23:39:16] RECOVERY - DPKG on mw1224 is OK: All packages OK [23:39:16] RECOVERY - DPKG on mw1235 is OK: All packages OK [23:39:16] PROBLEM - puppet last run on mw2075 is CRITICAL Puppet has 1 failures [23:39:16] PROBLEM - puppet last run on mw2020 is CRITICAL Puppet has 1 failures [23:39:26] PROBLEM - DPKG on mw1251 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:39:26] PROBLEM - DPKG on mw1256 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:39:26] RECOVERY - DPKG on mw1231 is OK: All packages OK [23:39:35] PROBLEM - puppet last run on mw2069 is CRITICAL Puppet has 1 failures [23:39:36] PROBLEM - DPKG on mw1255 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:39:45] RECOVERY - DPKG on mw1237 is OK: All packages OK [23:39:45] RECOVERY - puppet last run on mw1070 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:39:46] RECOVERY - DPKG on mw1253 is OK: All packages OK [23:39:46] RECOVERY - puppet last run on mw1058 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:39:46] RECOVERY - DPKG on mw1241 is OK: All packages OK [23:39:46] PROBLEM - puppet last run on mw1241 is CRITICAL Puppet has 2 failures [23:39:46] RECOVERY - DPKG on mw1252 is OK: All packages OK [23:39:47] RECOVERY - DPKG on mw1239 is OK: All packages OK [23:39:47] (03CR) 10QChris: [C: 04-1] "Since the main issue has not yet been covered, bringing over" [puppet] - 10https://gerrit.wikimedia.org/r/209741 (owner: 10Paladox) [23:39:57] RECOVERY - DPKG on mw1250 is OK: All packages OK [23:40:16] RECOVERY - DPKG on mw1258 is OK: All packages OK [23:40:16] RECOVERY - DPKG on mw1248 is OK: All packages OK [23:40:25] RECOVERY - DPKG on mw1257 is OK: All packages OK [23:40:26] RECOVERY - DPKG on mw1247 is OK: All packages OK [23:40:26] RECOVERY - DPKG on mw1243 is OK: All 
packages OK [23:40:26] RECOVERY - DPKG on mw1245 is OK: All packages OK [23:40:35] RECOVERY - DPKG on mw1249 is OK: All packages OK [23:40:51] !log Upgraded all Apaches to HHVM 3.6.1+dfsg1-1+wm2 and Apache 2.4.7-1ubuntu4.4 [23:40:56] RECOVERY - DPKG on mw1254 is OK: All packages OK [23:40:56] PROBLEM - puppet last run on mw1255 is CRITICAL Puppet has 2 failures [23:40:56] Logged the message, Master [23:41:06] RECOVERY - DPKG on mw1251 is OK: All packages OK [23:41:06] RECOVERY - DPKG on mw1256 is OK: All packages OK [23:41:16] RECOVERY - DPKG on mw1255 is OK: All packages OK [23:50:26] RECOVERY - puppet last run on mw1180 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [23:50:55] RECOVERY - puppet last run on mw1151 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [23:52:46] RECOVERY - puppet last run on mw2187 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [23:53:56] RECOVERY - puppet last run on mw2140 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:54:06] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [23:54:26] RECOVERY - puppet last run on mw1209 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [23:55:27] RECOVERY - puppet last run on mw2069 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [23:55:46] RECOVERY - puppet last run on mw1241 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [23:56:11] (03PS6) 10Ori.livneh: Add commons to restbase config [puppet] - 10https://gerrit.wikimedia.org/r/208193 (https://phabricator.wikimedia.org/T97840) (owner: 10GWicke) [23:56:27] (03CR) 10Ori.livneh: [C: 032 V: 032] Add commons to restbase config [puppet] - 10https://gerrit.wikimedia.org/r/208193 (https://phabricator.wikimedia.org/T97840) (owner: 10GWicke) [23:56:46] RECOVERY - puppet last run on mw1255 is OK Puppet is currently enabled, last 
run 51 seconds ago with 0 failures [23:56:56] RECOVERY - puppet last run on mw2075 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:56:56] RECOVERY - puppet last run on mw2020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
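
The DPKG alert storm above is the expected side effect of the rolling HHVM/Apache upgrade: the check sees packages in a half-configured state while apt is mid-run. The first message in this chunk suggests the check should return UNKNOWN in that situation instead of CRITICAL. A minimal sketch of that idea, assuming a Nagios-style plugin (exit codes 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN); the lock-file probe and the `dpkg -l` state parsing are illustrative assumptions, not the actual check_dpkg implementation:

```python
import fcntl
import subprocess

# Held by dpkg/apt for the duration of a run; probing it is an assumed
# way to detect an in-progress upgrade, not what the real check does.
DPKG_LOCK = "/var/lib/dpkg/lock"

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios plugin exit codes


def apt_is_running() -> bool:
    """True if another process currently holds the dpkg lock."""
    try:
        with open(DPKG_LOCK, "w") as fh:
            fcntl.lockf(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.lockf(fh, fcntl.LOCK_UN)
        return False
    except OSError:
        # EAGAIN/EACCES: lock held (or no privileges/no dpkg at all);
        # either way the package state cannot be trusted right now.
        return True


def broken_packages(dpkg_l_output: str) -> list:
    """Parse `dpkg -l` output; return packages not in a clean state.

    'ii' (installed ok), 'rc' (removed, config left) and 'hi' (on hold)
    are treated as fine; anything else (iF, iU, ...) counts as broken.
    """
    broken = []
    for line in dpkg_l_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0][0] not in "ihurp":
            continue  # header and blank lines
        if fields[0] not in ("ii", "rc", "hi"):
            broken.append(fields[1])
    return broken


def check_dpkg() -> tuple:
    """Return a (exit_code, message) pair for an icinga-style check."""
    if apt_is_running():
        return UNKNOWN, "DPKG UNKNOWN: apt/dpkg run in progress, package state in flux"
    out = subprocess.run(["dpkg", "-l"], capture_output=True, text=True).stdout
    broken = broken_packages(out)
    if broken:
        return CRITICAL, "DPKG CRITICAL: dpkg reports broken packages: " + " ".join(broken)
    return OK, "DPKG OK: All packages OK"
```

With this shape, a host caught mid-upgrade would flap OK→UNKNOWN→OK rather than producing the CRITICAL/RECOVERY pairs that fill this log.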
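
The "HTTP 5xx req/min on graphite1001" alert in this log fires on a fraction-over-threshold rule: CRITICAL when some percentage of recent datapoints exceed a limit (here "13.33% of data above the critical threshold [500.0]", recovering at "Less than 1.00% above the threshold [250.0]"). A sketch of that logic under assumed names and cutoffs; this is not the actual check_graphite plugin:

```python
from typing import Optional, Sequence

# Assumed cutoffs, loosely matching the messages in this log; the real
# check also has a separate warning threshold (the [250.0] recovery line).
CRITICAL_THRESHOLD = 500.0  # 5xx requests/min per datapoint
CRITICAL_PERCENT = 10.0     # alert when this share of datapoints exceed it


def percent_above(datapoints: Sequence[Optional[float]], threshold: float) -> float:
    """Share (in percent) of non-null datapoints strictly above threshold.

    Graphite returns None for intervals with no data; those are ignored
    rather than counted as violations.
    """
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    return 100.0 * sum(v > threshold for v in values) / len(values)


def check(datapoints: Sequence[Optional[float]]) -> tuple:
    """Return a Nagios-style (exit_code, message) pair."""
    pct = percent_above(datapoints, CRITICAL_THRESHOLD)
    if pct > CRITICAL_PERCENT:
        return 2, f"CRITICAL {pct:.2f}% of data above the critical threshold [{CRITICAL_THRESHOLD}]"
    return 0, f"OK Less than {CRITICAL_PERCENT:.2f}% above the threshold [{CRITICAL_THRESHOLD}]"
```

For example, 2 violating datapoints out of 15 gives 13.33%, the figure reported when the alert fired above; the fraction-based rule tolerates a single spike but pages on a sustained error rate.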