[00:00:04] * YuviPanda should fix that [00:00:41] RECOVERY - Puppet freshness on db1010 is OK: puppet ran at Fri Aug 1 00:00:33 UTC 2014 [00:01:21] "SQL Quarry would like to have basic access on your behalf on all projects of this site." [00:01:26] exactly wth is basic access? [00:01:35] ori: ah, I see. it's the 'all the projects' that's probably terrifying, I'll change that to just mw.org [00:01:42] RECOVERY - Disk space on elastic1016 is OK: DISK OK [00:01:50] I assume that "basic access" means "all your permissions" because that's so vague [00:02:12] OAuth access messages are ... opaque [00:02:17] mwalker: hmm, it's just able to verify that you're who you are, get your userid, rights, groups [00:02:27] and delete all my instances [00:02:43] I mean, I know because I know you Yuvi [00:02:48] mwalker: no write actions, can't view watchlist or prefs either [00:02:52] but... that is just a horrible dialog [00:02:56] mwalker: :D indeed [00:02:57] Oh that would be cool. We could build a chaos monkey tool [00:03:01] hehe [00:03:08] 'randomly delete instances, see if it still fails' [00:03:43] do you sanitize this sql at all yuvi? [00:04:04] mwalker: I've a query killer that kills things after 1m (will be raised to 10m after 'launch') [00:04:09] mwalker: and the user has no rights other than SELECT [00:04:20] mwalker: labsdb pretty much takes care of the sanitizing for me [00:05:05] the UI / UX could use a bit more love [00:05:08] and I've to fix a goddamned google font [00:05:15] * YuviPanda does the fixing now [00:06:55] (03PS3) 10Ori.livneh: mediawiki: use HHVM module on trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/150873 [00:09:51] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [00:11:47] mwalker: hmm, I can't seem to see the results of your queries, but mine seem to run fine http://quarry.wmflabs.org/query/4 [00:12:19] wat's happening?! :O [00:12:20] * YuviPanda sighs [00:13:49] (03PS1) 10Ori.livneh: HHVM: set hhvm.server.gzip_compression_level = 0 [operations/puppet] - 10https://gerrit.wikimedia.org/r/150995 [00:14:28] (03CR) 10Ori.livneh: [C: 032 V: 032] "We declared this setting before, but I didn't port it to the new module." [operations/puppet] - 10https://gerrit.wikimedia.org/r/150995 (owner: 10Ori.livneh) [00:19:11] YuviPanda, muahahahahaha [00:19:22] basically; they've all failed [00:19:41] even "use enwiki; select * from users limit 10" which I thought should've succeeded [00:20:33] mwalker: yeah, weird, since http://quarry.wmflabs.org/query/1 succeeds [00:20:48] mwalker: I also did a full destroy/redeploy since you tried, so maybe try again? :D [00:27:41] YuviPanda, it appears to be attempting to repeat my queries [00:27:58] I... don't know what it's doing [00:28:08] mwalker: hmm, me neither :| [00:28:10] I'll investigate [00:28:14] YuviPanda +1 :p (/me just ran two queries as the first one; failed) [00:28:28] JohnLewis: :O why do mine succeed?!?! [00:28:52] YuviPanda: in all defense; I did use a nonexistent DB first time round :p [00:29:05] JohnLewis: heh [00:29:08] JohnLewis: oh, so http://quarry.wmflabs.org/query/4 succeeded [00:29:23] mwalker: http://quarry.wmflabs.org/query/3 seems to be fine?
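A minimal sketch of the query-killer pattern YuviPanda describes above, assuming a privileged MySQL account and the pymysql driver. The host, account names, and poll interval are illustrative; this is not Quarry's actual code.

    # Illustrative sketch only -- not Quarry's actual killer.
    import time
    import pymysql

    MAX_RUNTIME = 60        # seconds; "will be raised to 10m after 'launch'"
    TARGET_USER = "quarry"  # hypothetical name of the SELECT-only account

    def kill_long_queries(conn):
        with conn.cursor() as cur:
            cur.execute("SHOW FULL PROCESSLIST")
            for pid, user, host, db, command, runtime, state, info in cur.fetchall():
                # Only touch the restricted user's running queries; leave
                # replication and admin threads alone.
                if user == TARGET_USER and command == "Query" and runtime > MAX_RUNTIME:
                    cur.execute("KILL QUERY %d" % int(pid))

    if __name__ == "__main__":
        conn = pymysql.connect(host="labsdb", user="admin", password="...")
        while True:
            kill_long_queries(conn)
            time.sleep(5)

KILL QUERY (rather than KILL) aborts the statement but keeps the client's connection alive, which matches the "query killer" behaviour described, where a too-long query dies but the user can immediately submit another.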
[00:29:49] yes; but I had to submit it a couple of times before it executed the right query [00:29:53] yeah [00:29:53] it first gave me a list of tables [00:30:02] and then it errored out [00:30:08] and then it did something [00:30:13] and then it gave me the users [00:31:46] YuviPanda: I see you have yet to work on naming :p [00:35:10] JohnLewis: :) [00:35:26] mwalker: hmm, maybe I'm running into a (not so weird) race in the backend [00:35:29] I'll investigate [00:35:45] mwalker: oh, BAM, I know what's wrong! [00:35:52] mwalker: I didn't clear the results from the previous installations! [00:36:21] mwalker: clearing the database doesn't actually clear the output files, so naturally you're getting output files of random things that were run before that happened to have the exact query and user id [00:36:23] heh [00:36:38] that makes some sense [00:37:35] * mwalker now spots an evil method to kill your server [00:37:37] muhahahaha [00:37:48] create lots of biiiiig queries [00:37:53] mwalker: spam all of NFS with file output? :) [00:37:57] *nods* [00:38:09] mwalker: yeah, my primary concern so far is to not kill *labsdb*. Quarry going down is ok [00:38:41] hmm... so lots of tiny queries to labsdb? [00:38:51] do you limit the number of connections? [00:38:55] mwalker: well, they're lots of tiny queries, so labsdb should be fine [00:38:57] and ironholds is dragging me out [00:39:06] mwalker: yeah, I've a concurrency limit [00:39:13] mwalker: :D have fun! [00:39:16] mwalker: select * from users; with no limit? :D? [00:39:28] JohnLewis: it'll just run for a minute and get killed [00:39:39] YuviPanda: oh :( [00:43:34] JohnLewis: :P i'll limit it to 10m soon [00:44:25] YuviPanda: limit it to 10 hours by accident >:D [00:44:34] JohnLewis: :P [00:44:51] Actually, I shouldn't discuss ways to overload the servers in the operations channel.. [00:45:41] JohnLewis: mwalker reset everything properly now, things should be more stable. [00:45:50] good [00:45:58] JohnLewis: try? :D [00:47:51] queued two by accident :p [00:48:16] knowing me I mistyped something but meh :p [00:52:50] !log catrope Synchronized php-1.24wmf16/extensions/VisualEditor/lib/ve/modules/ve/ui/inspectors/ve.ui.CommentInspector.js: Fix typo in class name (duration: 00m 10s) [00:52:56] Logged the message, Master [00:54:01] JohnLewis: :) it should show you an error message if you mistyped things [00:54:21] YuviPanda: ah well it's queued then :p [00:55:17] JohnLewis: hmm, nothing executing? [00:55:22] nope [00:55:31] hmm, grri [00:55:36] just changed db to a different one and same thing [00:55:45] JohnLewis: try now? [00:56:17] that's it :) [00:57:40] JohnLewis: seems to work consistently now? [00:58:17] [17:19:42] even "use enwiki; select * from users limit 10" which I thought should've succeeded <-- the table is named "user"... [00:58:20] yeah [00:58:41] hmm, so error messages aren't being reported properly [00:58:54] nope; just results. Unless that's bad ;) [00:59:17] JohnLewis: hmm, http://quarry.wmflabs.org/query/5 [00:59:22] that shows a proper error [00:59:38] oh I thought you meant with my queries :p [00:59:44] heh [00:59:51] JohnLewis: yeah, I dunno what's up with your prior queries [00:59:53] * YuviPanda should add logging [01:00:35] YuviPanda: that's pretty nifty [01:00:41] (quarry) [01:01:16] ori: :D needs some bugs to be ironed out still... but hopefully all in place before the research hackathon on monday
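The bug YuviPanda diagnoses above falls out naturally if result files are keyed by (user id, query text) alone: a database wipe removes the bookkeeping but not the files, so old output answers new runs. A hypothetical illustration of that failure mode; the paths and function names here are invented, not Quarry's real layout.

    # Hypothetical illustration of the stale-results bug, not Quarry's code.
    import hashlib
    import os

    RESULT_DIR = "/data/results"  # invented output directory on NFS

    def result_path(user_id, sql):
        key = hashlib.sha1(("%s:%s" % (user_id, sql)).encode("utf-8")).hexdigest()
        return os.path.join(RESULT_DIR, key + ".json")

    def fetch_cached_result(user_id, sql):
        path = result_path(user_id, sql)
        if os.path.exists(path):
            # A leftover file from a previous installation matches the same
            # (user, query) key, so it gets served as if it were fresh --
            # unless a reset deletes the output directory as well.
            with open(path) as f:
                return f.read()
        return None  # nothing cached; the query must actually run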
[01:01:16] err [01:01:19] on wikimania day 1 [01:03:00] (03CR) 10Ori.livneh: [C: 04-1] "I'm not sure that this is a good addition yet." [operations/puppet] - 10https://gerrit.wikimedia.org/r/150992 (owner: 10Ori.livneh) [01:03:10] ori: also, ipython notebook guys have a grant from Sloan foundation to build a 'true' multiuser ipython notebook server, and expect it to be done by end of year. I'll get that on labs when that's done too [01:03:18] (tied into wikitech accounts, I suppose) [01:03:24] yeah ipython notebook took off in a big way over the past year [01:03:57] ori: yup [01:16:38] JohnLewis: quarry is now explicitly marked 'beta'! [01:17:25] YuviPanda: :o where [01:17:40] JohnLewis: hard refresh on quarry.wmflabs.org [01:17:54] ah [01:24:03] JohnLewis: I'm going to make the data display with http://www.datatables.net/ [01:24:53] nice [01:46:08] YuviPanda: http://quarry.wmflabs.org/query/6 query killed but status is running [01:46:56] JohnLewis: yeah, I need to fix that [01:46:59] JohnLewis: http://quarry.wmflabs.org/query/runs/all is a mess [01:47:13] kay [01:47:25] JohnLewis: but does the thing in general work? :) [01:47:32] pretty good [01:52:07] YuviPanda: right I'm getting off - keep me up to date with quarry :) [01:52:15] JohnLewis: will do [01:52:19] and enjoy the rest of the day mwalker :) [02:10:51] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [02:29:47] YuviPanda: anything hitting information_schema.tables without filtering by table_schema will get killed [02:30:13] springle: hmm? did something from my tool hit that? [02:30:15] 100000+ tables on labsdbs now. that churns the table cache up something horrid [02:30:33] ^ that query 6 from JohnLewis [02:30:50] springle: oh, right. [02:31:04] springle: right, so that's killed at the mysql level? [02:31:04] must we expose information_schema to the world? [02:31:09] springle: don't think so [02:31:13] correct [02:31:52] springle: I guess it's easier for you to kill its visibility than for me to add code that prevents queries to information_schema? [02:32:37] no, it isn't :) [02:33:15] besides people use it legitimately, just not via something like quarry [02:33:51] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 00:32:57 UTC [02:34:33] springle: :) if it's a 'must kill NOW', I can add a simple 'if information_schema in query then kill', if not, I'll add it to 'phase2' (which I'd probably do in two weeks time) where I'll actually use https://sqlparse.readthedocs.org/en/latest/ [02:34:46] cool [02:34:54] springle: :) thanks! [02:35:08] springle: hmm, also - do you think I can get a couple of 'special' labsdb accounts that have even more restricted permissions? [02:35:23] like, no create db and no information_schema? [02:36:34] quarry looks neat. [02:36:41] springle: or I can give you the userid I'm using, and you can kill its information_schema permissions [02:36:43] No DB selector? [02:36:53] Carmela: no, need to add. you can use 'use' for now tho [02:37:12] (03CR) 10TTO: ""there was no consensus in the first place for enabling it anyway"" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150301 (https://bugzilla.wikimedia.org/68815) (owner: 10Reedy) [02:37:13] Carmela: it has a 1m query killer right now, though [02:37:44] YuviPanda: no create db can be blocked.
information_schema isn't controlled by grants [02:38:05] Carmela: also note that there are no data guarantees with the current instance. I'll 'announce' it tomorrow, but until then queries / user logins can disappear anytime [02:38:15] springle: hmm, right. so can I have an account with createdb blocked? [02:38:19] it's a virtual database generated per use on the fly. which is also why it's horribly slow [02:38:24] K. [02:38:38] YuviPanda: i expect so. Coren is the gatekeeper for accounts [02:38:53] springle: hmm, he's going to be busy till wikimania, I suspect. I'll take a poke at some point, tho [02:38:55] I've thought about creating something like this many times. But direct queries are pretty nasty to deal with. Maybe you'll have good luck. [02:39:08] Carmela: ty [02:40:00] !log LocalisationUpdate completed (1.24wmf15) at 2014-08-01 02:38:56+00:00 [02:40:03] springle: I'm going to put in a 'dumb' check for information_schema anyway now. might as well. [02:40:07] Logged the message, Master [02:40:36] YuviPanda: btw each labs user has grant option for their own prefix dbs [02:40:53] springle: yup, and since the user account for this isn't public, it's not that easy to create dbs [02:40:57] which is why I'm not *that* concerned [02:41:29] springle: this project also follows Labs Privacy Policy, so nobody without NDA gets access either [02:41:39] to the project on labs, that is [02:41:55] springle: still, defence in depth and all that :) I'll poke Coren next week in person [02:50:21] PROBLEM - MySQL Replication Heartbeat on db1009 is CRITICAL: CRIT replication delay 306 seconds [02:50:22] PROBLEM - MySQL Slave Delay on db1009 is CRITICAL: CRIT replication delay 310 seconds [02:50:59] Reported server lag on en.wikipedia.org's Special:Contributions... [02:51:53] db1055 and db1051, I guess. [03:00:28] Carmela: it's the PopulateBacklinkNamespace::doDBUpdates jobs. every once in a while a certain page_id is causing lag [03:02:21] RECOVERY - MySQL Replication Heartbeat on db1009 is OK: OK replication delay -1 seconds [03:02:31] RECOVERY - MySQL Slave Delay on db1009 is OK: OK replication delay 0 seconds [03:04:49] Ah. [03:07:42] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:09:41] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
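Since springle can't revoke information_schema through grants, the guard has to live in Quarry itself. A sketch of the two phases YuviPanda outlines above, assuming queries arrive as plain strings; the function names are invented, and sqlparse (the library linked above) is the only real dependency.

    # Illustrative sketch of the two checks discussed above, not Quarry's code.
    import sqlparse
    from sqlparse import tokens

    def dumb_check(sql):
        # Phase 1: the "dumb" substring test -- crude and immediate, though
        # it also rejects harmless mentions inside string literals.
        return "information_schema" not in sql.lower()

    def parsed_check(sql):
        # Phase 2: walk the parsed token stream so only genuine identifier
        # references to information_schema are rejected.
        for statement in sqlparse.parse(sql):
            for token in statement.flatten():
                if token.ttype in tokens.Name and token.value.lower() == "information_schema":
                    return False
        return True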
[03:12:17] !log LocalisationUpdate completed (1.24wmf16) at 2014-08-01 03:11:14+00:00 [03:12:22] Logged the message, Master [03:26:21] PROBLEM - MySQL Replication Heartbeat on db1009 is CRITICAL: CRIT replication delay 304 seconds [03:26:31] PROBLEM - MySQL Slave Delay on db1009 is CRITICAL: CRIT replication delay 308 seconds [03:32:52] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Fri Aug 1 03:32:48 UTC 2014 [03:41:01] (03PS4) 10Ori.livneh: Use the new "dispatcher" config format and use curl with HHVM [operations/puppet] - 10https://gerrit.wikimedia.org/r/150900 (owner: 10Aaron Schulz) [03:43:41] RECOVERY - MySQL Replication Heartbeat on db74 is OK: OK replication delay 0 seconds [03:43:55] (03CR) 10Ori.livneh: [V: 032] Use the new "dispatcher" config format and use curl with HHVM [operations/puppet] - 10https://gerrit.wikimedia.org/r/150900 (owner: 10Aaron Schulz) [03:44:21] RECOVERY - MySQL Slave Delay on db74 is OK: OK replication delay 0 seconds [03:44:38] (03PS1) 10Ori.livneh: Revert "Use the new "dispatcher" config format and use curl with HHVM" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151010 [03:44:56] AaronSchulz: sorry, i noticed an issue with the patch [03:53:41] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [03:54:01] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [03:55:53] that's me, i'll sort it out in a moment [04:00:52] PROBLEM - MySQL Slave Delay on db1055 is CRITICAL: CRIT replication delay 308 seconds [04:00:55] PROBLEM - MySQL Replication Heartbeat on db1055 is CRITICAL: CRIT replication delay 308 seconds [04:03:51] (03PS2) 10Ori.livneh: mediawiki: Redo cURL dispatcher config for jobrunner [operations/puppet] - 10https://gerrit.wikimedia.org/r/151010 [04:06:36] (03CR) 10Ori.livneh: [C: 032] mediawiki: Redo cURL dispatcher config for jobrunner [operations/puppet] - 10https://gerrit.wikimedia.org/r/151010 (owner: 10Ori.livneh) [04:06:42] (03CR) 10Aaron Schulz: mediawiki: Redo cURL dispatcher config for jobrunner (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151010 (owner: 10Ori.livneh) [04:06:51] PROBLEM - LighttpdHTTP on dataset1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:08:31] PROBLEM - DPKG on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:08:51] PROBLEM - RAID on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:08:52] PROBLEM - puppet disabled on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:09:01] PROBLEM - check if dhclient is running on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:09:01] PROBLEM - check configured eth on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:09:01] PROBLEM - puppet last run on dataset1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[04:09:52] PROBLEM - SSH on dataset1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:10:00] (03PS1) 10Ori.livneh: jobrunner: update 'wrapper' parameter name to 'dispatcher' [operations/puppet] - 10https://gerrit.wikimedia.org/r/151013 [04:10:19] (03CR) 10Ori.livneh: [C: 032 V: 032] jobrunner: update 'wrapper' parameter name to 'dispatcher' [operations/puppet] - 10https://gerrit.wikimedia.org/r/151013 (owner: 10Ori.livneh) [04:11:01] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [04:11:41] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [04:11:51] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [04:21:02] PROBLEM - NTP on dataset1001 is CRITICAL: NTP CRITICAL: No response from NTP server [04:21:24] (03PS1) 10Ori.livneh: jobrunner: strip whitespace from dispatcher command [operations/puppet] - 10https://gerrit.wikimedia.org/r/151014 [04:21:31] PROBLEM - MySQL Replication Heartbeat on db1051 is CRITICAL: CRIT replication delay 303 seconds [04:21:41] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 305 seconds [04:21:52] PROBLEM - MySQL Slave Delay on db1051 is CRITICAL: CRIT replication delay 307 seconds [04:22:20] (03CR) 10Ori.livneh: [C: 032 V: 032] "trivial fix" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151014 (owner: 10Ori.livneh) [04:23:37] springle: are you around? what's up with the db alerts? [04:25:02] ah, i see your remark to Carmela in the scroll-back [04:26:31] PROBLEM - MySQL Slave Delay on db60 is CRITICAL: CRIT replication delay 375 seconds [04:27:21] RECOVERY - MySQL Replication Heartbeat on db1009 is OK: OK replication delay -0 seconds [04:27:31] RECOVERY - MySQL Slave Delay on db1009 is OK: OK replication delay 0 seconds [04:33:31] RECOVERY - MySQL Slave Delay on db60 is OK: OK replication delay 0 seconds [04:33:37] what's with the lag on the watchlist server for like the past hour? [04:36:31] PROBLEM - MySQL Slave Delay on db60 is CRITICAL: CRIT replication delay 471 seconds [04:38:31] RECOVERY - MySQL Slave Delay on db60 is OK: OK replication delay 0 seconds [04:41:31] PROBLEM - MySQL Slave Delay on db60 is CRITICAL: CRIT replication delay 485 seconds [04:42:28] this is getting silly [04:43:57] (03PS1) 10Aaron Schulz: Restore nice -19 command to job dispatcher command [operations/puppet] - 10https://gerrit.wikimedia.org/r/151018 [04:44:18] ori: ^ [04:44:26] springle: what's going on? [04:44:27] AaronSchulz: are the populateBacklinks jobs pauseable? [04:44:31] RECOVERY - MySQL Slave Delay on db60 is OK: OK replication delay 0 seconds [04:45:28] innodb logs filling up on some slaves. would be good to throttle the jobs for a while [04:45:50] * springle experimenting with flushing methods [04:46:02] springle: not really, since they don't have an option, but that could be added [04:46:22] so it's not a huge deal to stop them [04:46:55] what would it be throttled on? sleep()? [04:47:45] actually the issue seems to be only certain expensive page_ids [04:48:02] springle: can you merge https://gerrit.wikimedia.org/r/151018 btw ;) [04:48:22] eg, page_id with enough links to hit a checkpoint bottleneck on a 2G innodb logfile [04:48:24] springle: like a page full of links (that can probably barely render) [04:48:36] there aren't many, yeah [04:48:41] * AaronSchulz has seen a few of those [04:48:52] could it be LIMIT and a loop for each page_id?
[04:48:54] somehow [04:49:51] (03CR) 10Springle: [C: 032] Restore nice -19 command to job dispatcher command [operations/puppet] - 10https://gerrit.wikimedia.org/r/151018 (owner: 10Aaron Schulz) [04:50:32] merged ^ [04:51:11] springle: LIMIT + ORDER BY (*_from,*_to_namespace,*_to_title) maybe [04:51:21] yes [04:51:43] and WHERE *_from_namespace = 0 maybe [04:51:58] depending on how you choose the start point [04:52:31] PROBLEM - MySQL Slave Delay on db60 is CRITICAL: CRIT replication delay 651 seconds [04:56:04] springle: which dbs have trouble now? [04:56:29] mostly enwiki [04:56:48] something on s2 lagged for a while, but not badly [04:58:20] why the two slaves at http://noc.wikimedia.org/dbtree/ and not the others? Are their logs different sizes for some reason? [04:59:14] those are the slaves handling watchlist, vslow, etc [04:59:59] their logs are the same size, but buffer pools hold quite different type of data [05:00:06] the links there are mostly cold data [05:00:20] makes sense [05:02:07] they are also 96G boxes, while the other enwiki slaves are now 128G or 160G [05:02:26] small effects adding up [05:05:01] (03PS1) 10Aaron Schulz: Apply proper JSON encoding in dispatchers/trusty.erb [operations/puppet] - 10https://gerrit.wikimedia.org/r/151021 [05:05:09] springle: sorry, another random change ^ [05:05:33] hopefully not actually 'random' :) [05:06:00] (03CR) 10Springle: [C: 032] Apply proper JSON encoding in dispatchers/trusty.erb [operations/puppet] - 10https://gerrit.wikimedia.org/r/151021 (owner: 10Aaron Schulz) [05:14:13] (03PS1) 10Springle: Move enwiki api traffic away from lagging slaves [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/151022 [05:15:20] (03CR) 10Springle: [C: 032] Move enwiki api traffic away from lagging slaves [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/151022 (owner: 10Springle) [05:15:24] (03Merged) 10jenkins-bot: Move enwiki api traffic away from lagging slaves [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/151022 (owner: 10Springle) [05:16:15] !log springle Synchronized wmf-config/db-eqiad.php: Move enwiki api traffic away from lagging slaves (duration: 00m 07s) [05:16:20] Logged the message, Master [05:27:58] (03PS1) 10Ori.livneh: Beta: depool deployment-mediawiki02 to investigate HHVM lock-up [operations/puppet] - 10https://gerrit.wikimedia.org/r/151024 [05:45:45] AaronSchulz: does wfWaitForSlaves() give up if a slave falls too far behind? [05:46:36] just wondering why updates are still running on enwiki master with those slaves lagged... [05:46:48] I think you are right [05:46:50] $result = $conn->masterPosWait( $this->mWaitForPos, $this->mWaitTimeout ); [05:47:00] huh [05:47:12] how about we just fix that :) [05:47:13] doWait() will return false [05:47:22] it won't pass that up the call stack though [05:47:38] that seems patently stupid (it should at least pass it up or be overridable) [05:47:47] gah, +1 to todo list [05:47:51] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 03:46:50 UTC [05:48:14] i guess this would be why beta cluster lagged so much [05:48:33] springle: yeah, I thought it odd that a modest batch size would cause issues there [05:49:34] AaronSchulz: any idea how far through enwiki we are?
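An illustrative pseudocode sketch of the two fixes discussed above: batching the backfill with ORDER BY + LIMIT so no single pass floods the innodb logs, and treating a wait-for-slaves timeout as a hard stop instead of the swallowed doWait() failure. The real script is MediaWiki PHP; the `db` wrapper and its methods here are hypothetical stand-ins, and the columns just follow springle's pagelinks example.

    # Hypothetical pseudocode only -- the actual script is MediaWiki PHP.
    # `db` and its select/update_batch/wait_for_slaves methods are invented.
    BATCH = 500

    def populate_backlink_namespace(db, start=(0, 0, "")):
        last = start
        while True:
            rows = db.select(
                "SELECT pl_from, pl_namespace, pl_title FROM pagelinks"
                " WHERE (pl_from, pl_namespace, pl_title) > (%s, %s, %s)"
                " ORDER BY pl_from, pl_namespace, pl_title LIMIT %s",
                last + (BATCH,))
            if not rows:
                break
            db.update_batch(rows)  # backfill *_from_namespace for this slice
            # wfWaitForSlaves() currently ignores doWait() returning false;
            # propagating the timeout stops the loop before lag piles up,
            # and `last` doubles as the --start option discussed later.
            if not db.wait_for_slaves(timeout=10):
                raise RuntimeError("slave lag did not clear; stopped at %r" % (last,))
            last = tuple(rows[-1])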
[05:49:51] PROBLEM - Puppet freshness on stat1002 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 03:49:17 UTC [05:50:04] springle: ok, screw it, I'm stopping the script on enwiki for now [05:50:08] I just stopped it, it was at 8404974 [05:50:17] I'll fix that method and add a flag tomorrow [05:50:58] s/flag/option [05:53:51] PROBLEM - Puppet freshness on stat1003 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 03:53:16 UTC [05:55:51] RECOVERY - MySQL Slave Delay on db1051 is OK: OK replication delay 91 seconds [05:56:31] RECOVERY - MySQL Replication Heartbeat on db1051 is OK: OK replication delay -1 seconds [05:57:58] AaronSchulz: ok fair enough. thank you [06:00:51] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 04:00:22 UTC [06:12:51] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [06:13:44] springle: is there a bug about queries that change no rows still being in the binlog? Maybe the answer is just "use rbr". [06:13:52] RECOVERY - MySQL Replication Heartbeat on db1055 is OK: OK replication delay 145 seconds [06:13:58] that's probably the main reason for a --start option ;) [06:14:29] wasted time and I/O being considerations too of course [06:14:51] RECOVERY - MySQL Slave Delay on db1055 is OK: OK replication delay 0 seconds [06:18:22] AaronSchulz: rbr would also shunt the entire links tables around. bulk updates + rbr are troublesome [06:22:29] springle: I'm talking about the updates up to the point where it reached [06:22:39] those would affect very few rows [06:23:04] yep, i get the point. but after that rbr will suck :) [06:23:09] so RBR would be lightweight until it reached the spot where it left off [06:23:32] just saying, it's useful when the programmer is too lazy to change the script :) [06:23:40] bah [06:23:41] ;) [06:29:01] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:31] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:32] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:41] RECOVERY - Disk space on vanadium is OK: DISK OK [06:30:01] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:01] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:21] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:22] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:22] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:22] PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:32] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:51] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:55] springle: I guess you can move the loads back [06:31:01] * AaronSchulz wanders off [06:34:52] (03CR) 10Ori.livneh: [C: 04-2] "Cherry-picked on beta for debugging; this need not and should not be merged" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151024 (owner: 10Ori.livneh) [06:45:31] RECOVERY - puppet last run on lvs1005 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:46:01] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:46:01] RECOVERY -
puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [06:46:01] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:21] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:46:22] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:46:31] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [06:46:31] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:46:32] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [06:46:51] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [06:47:22] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:47:28] (03CR) 10Nemo bis: "Yes, so what, you just confirmed it yourself. I didn't say it was invalid because of lack of consensus, just that there was no consensus r" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150301 (https://bugzilla.wikimedia.org/68815) (owner: 10Reedy) [06:48:11] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 2 failures [07:00:08] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Fri Aug 1 06:59:54 UTC 2014 [07:05:11] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [07:08:54] (03PS2) 10Ori.livneh: Beta: depool deployment-mediawiki01 to investigate HHVM lock-up [operations/puppet] - 10https://gerrit.wikimedia.org/r/151024 [07:21:04] (03PS4) 10Giuseppe Lavagetto: mediawiki::web: get rid of envvars.appserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/147514 (owner: 10Ori.livneh) [07:24:34] <_joe_> !log stopping puppet on appservers to deploy a potentially dangerous case [07:24:40] Logged the message, Master [07:25:17] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web: get rid of envvars.appserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/147514 (owner: 10Ori.livneh) [07:33:51] PROBLEM - Puppet freshness on db1011 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 05:33:30 UTC [07:40:39] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix envvars permissions, remove cruft. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151032 [07:40:51] PROBLEM - Puppet freshness on db1010 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 05:40:11 UTC [07:41:25] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: fix envvars permissions, remove cruft. 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/151032 (owner: 10Giuseppe Lavagetto) [07:47:10] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Aug 1 07:46:04 UTC 2014 (duration 46m 3s) [07:47:14] Logged the message, Master [07:48:41] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet last ran 14423 seconds ago, expected 14400 [07:48:51] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 03:46:50 UTC [07:49:22] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: Puppet last ran 14412 seconds ago, expected 14400 [07:50:42] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: Puppet last ran 14423 seconds ago, expected 14400 [07:50:51] PROBLEM - Puppet freshness on stat1002 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 03:49:17 UTC [07:50:52] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet last ran 14457 seconds ago, expected 14400 [07:51:01] (03PS2) 10Giuseppe Lavagetto: apache: remove $rejected_pkgs [operations/puppet] - 10https://gerrit.wikimedia.org/r/150816 [07:51:31] (03CR) 10Giuseppe Lavagetto: [C: 032] apache: remove $rejected_pkgs [operations/puppet] - 10https://gerrit.wikimedia.org/r/150816 (owner: 10Giuseppe Lavagetto) [07:51:48] <_joe_> yes, I'm releasing patches on friday morning [07:54:41] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet last ran 14418 seconds ago, expected 14400 [07:54:51] PROBLEM - Puppet freshness on stat1003 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 03:53:16 UTC [07:55:07] (03CR) 10Alexandros Kosiaris: [C: 031] Separate kafka-mirror out into its own package [operations/debs/kafka] - 10https://gerrit.wikimedia.org/r/150883 (owner: 10Ottomata) [07:59:06] (03PS2) 10Giuseppe Lavagetto: mediawiki: get rid of envvars files in puppet. [operations/puppet] - 10https://gerrit.wikimedia.org/r/150492 [08:00:31] RECOVERY - Puppet freshness on db1010 is OK: puppet ran at Fri Aug 1 08:00:30 UTC 2014 [08:01:00] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: get rid of envvars files in puppet. [operations/puppet] - 10https://gerrit.wikimedia.org/r/150492 (owner: 10Giuseppe Lavagetto) [08:04:22] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: Puppet has 1 failures [08:04:42] PROBLEM - puppet last run on mw1041 is CRITICAL: CRITICAL: Puppet has 1 failures [08:04:52] PROBLEM - puppet last run on mw1038 is CRITICAL: CRITICAL: Puppet has 1 failures [08:04:52] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: Puppet last ran 14445 seconds ago, expected 14400 [08:04:52] PROBLEM - puppet last run on mw1141 is CRITICAL: CRITICAL: Puppet has 1 failures [08:04:52] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [08:05:22] PROBLEM - puppet last run on mw1031 is CRITICAL: CRITICAL: Puppet has 1 failures [08:05:41] <_joe_> wtf??? [08:06:14] buongiorno [08:06:50] <_joe_> bonjour hashar [08:07:22] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [08:08:23] (03PS1) 10Giuseppe Lavagetto: Revert "mediawiki: get rid of envvars files in puppet." [operations/puppet] - 10https://gerrit.wikimedia.org/r/151033 [08:08:40] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Revert "mediawiki: get rid of envvars files in puppet." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/151033 (owner: 10Giuseppe Lavagetto) [08:13:41] RECOVERY - Puppet freshness on db1011 is OK: puppet ran at Fri Aug 1 08:13:34 UTC 2014 [08:13:51] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [08:18:11] PROBLEM - puppetmaster https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:21:02] RECOVERY - puppetmaster https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.200 second response time [08:27:01] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [08:27:22] RECOVERY - puppet last run on mw1031 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [08:27:42] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [08:27:52] RECOVERY - puppet last run on mw1038 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [08:28:03] (03PS1) 10Giuseppe Lavagetto: Fix the envvars file to our specifications [operations/puppet] - 10https://gerrit.wikimedia.org/r/151036 [08:28:58] (03PS2) 10Giuseppe Lavagetto: Fix the envvars file to our specifications [operations/puppet] - 10https://gerrit.wikimedia.org/r/151036 [08:41:44] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Fix the envvars file to our specifications [operations/puppet] - 10https://gerrit.wikimedia.org/r/151036 (owner: 10Giuseppe Lavagetto) [08:43:31] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: Epic puppet fail [08:44:47] (03PS1) 10Giuseppe Lavagetto: fix file_line matching [operations/puppet] - 10https://gerrit.wikimedia.org/r/151037 [08:45:29] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] fix file_line matching [operations/puppet] - 10https://gerrit.wikimedia.org/r/151037 (owner: 10Giuseppe Lavagetto) [08:47:31] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [08:57:50] _joe_: any clue what is the git repo being used to build HHVM packages? [08:57:57] I can see [08:57:57] operations/debs/hhvm [08:57:57] operations/software/hhvm-dev [08:58:36] <_joe_> the first one [08:58:53] <_joe_> hashar: but building the thing is far from being straightforward atm [08:59:09] adding that to my backlog [08:59:10] <_joe_> and I don't have time right now to make it so [08:59:18] I want to have jenkins build the .deb for us when we send a patch :] [08:59:20] *evil* [08:59:28] <_joe_> hashar: foolish [08:59:41] <_joe_> building hhvm takes ~ 40 minutes with 4 cpus [08:59:46] <_joe_> ~25 with 8 [08:59:52] (03PS26) 10Alexandros Kosiaris: etherpad: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 (owner: 10Matanya) [09:00:13] or 200 minutes on 1 cpu :] [09:00:16] <_joe_> also, there are a few corner cases where that would be terrible [09:00:31] it was merely to report lintian / piuparts errors [09:01:06] <_joe_> (rebuilding for one commit) [09:01:17] <_joe_> right now, it's still a no-go [09:01:39] <_joe_> (and even when it's a jenkins job, I'd leave that out of CI, and do it on-demand) [09:02:58] (03PS1) 10Giuseppe Lavagetto: mediawiki: add envvars class, get rid of cleaning directive.
[operations/puppet] - 10https://gerrit.wikimedia.org/r/151041 [09:04:04] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: add envvars class, get rid of cleaning directive. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151041 (owner: 10Giuseppe Lavagetto) [09:06:20] (03CR) 10Alexandros Kosiaris: "I just did the refactoring to use the apache module. Let's see what the puppet compiler says now :-)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 (owner: 10Matanya) [09:07:51] (03PS1) 10Hashar: Merge tag 'v0.10.0' into gerrit-master [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/151042 [09:08:09] _joe_: fair :-) [09:10:02] PROBLEM - puppet last run on mw1141 is CRITICAL: CRITICAL: Puppet has 1 failures [09:10:22] PROBLEM - puppet last run on mw1059 is CRITICAL: CRITICAL: Epic puppet fail [09:10:31] PROBLEM - puppet last run on mw1145 is CRITICAL: CRITICAL: Puppet has 1 failures [09:10:31] PROBLEM - puppet last run on mw1197 is CRITICAL: CRITICAL: Puppet has 1 failures [09:10:35] <_joe_> !log apache mediawiki::web train finished its run. re-enabling puppet on all appservers [09:10:42] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: Puppet has 1 failures [09:10:42] Logged the message, Master [09:10:49] (03PS2) 10Hashar: Merge tag 'v0.10.0' into gerrit-master [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/151042 (https://bugzilla.wikimedia.org/68995) [09:11:59] (03CR) 10Filippo Giunchedi: "minor comments, looks good overall" (034 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/150781 (owner: 10Giuseppe Lavagetto) [09:12:14] <_joe_> godog: thanks! [09:12:25] <_joe_> ouch I forgot one change to add to the mix [09:13:12] np [09:13:14] springle: hey, I'm curious on how the contenthandler schema changes are coming along (https://bugzilla.wikimedia.org/show_bug.cgi?id=49193) [09:14:26] (03PS6) 10Giuseppe Lavagetto: mediawiki::web: compatability fixes for apache2.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/147994 (owner: 10Ori.livneh) [09:22:30] (03PS3) 10Hashar: Merge tag 'v0.10.0' into gerrit-master [operations/debs/jenkins-debian-glue] - 10https://gerrit.wikimedia.org/r/151042 (https://bugzilla.wikimedia.org/68995) [09:23:31] RECOVERY - puppet last run on mw1197 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [09:23:42] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [09:24:02] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [09:24:31] RECOVERY - puppet last run on mw1059 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [09:24:31] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [09:29:49] (03CR) 10Filippo Giunchedi: "LGTM, just one comment" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/150873 (owner: 10Ori.livneh) [09:32:31] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki::web: compatability fixes for apache2.conf [operations/puppet] - 10https://gerrit.wikimedia.org/r/147994 (owner: 10Ori.livneh) [09:35:51] (03PS1) 10Hashar: contint: tie android SDK packages to Precise [operations/puppet] - 10https://gerrit.wikimedia.org/r/151048 [09:37:51] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 07:37:03 UTC [09:39:45] (03CR) 10Hashar: [C: 031 V: 031] 
"Cherry picked on integration puppetmaster. Solve the wrong package issue on Trusty instances. Precise instances are happy." [operations/puppet] - 10https://gerrit.wikimedia.org/r/151048 (owner: 10Hashar) [09:39:55] easy merge ^^^ :D [09:40:53] (03CR) 10Filippo Giunchedi: "LGTM" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/149800 (owner: 10Ori.livneh) [09:45:51] (03PS27) 10Alexandros Kosiaris: etherpad: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 (owner: 10Matanya) [09:46:56] (03CR) 10Filippo Giunchedi: [C: 031] contint: tie android SDK packages to Precise (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151048 (owner: 10Hashar) [09:48:04] hashar: /win 40 [09:48:08] nope! [09:48:11] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [09:48:13] /fail 40 [09:48:58] gotta leave, kid sick :/ [09:49:17] <_joe_> :/ [09:49:51] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 03:46:50 UTC [09:51:51] PROBLEM - Puppet freshness on stat1002 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 03:49:17 UTC [09:55:51] PROBLEM - Puppet freshness on stat1003 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 03:53:16 UTC [09:57:31] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Fri Aug 1 09:57:21 UTC 2014 [10:00:51] PROBLEM - Puppet freshness on db1010 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 08:00:30 UTC [10:01:31] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "One very small comment, in general LGTM but I'd like to review the hhvm module in depth before merging, also in light of changes we're mak" (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/150873 (owner: 10Ori.livneh) [10:14:51] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [10:16:19] (03CR) 10Nikerabbit: Add HHVM module (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/150506 (owner: 10Ori.livneh) [11:00:21] RECOVERY - Puppet freshness on db1010 is OK: puppet ran at Fri Aug 1 11:00:14 UTC 2014 [11:31:33] (03CR) 10Mark Bergsma: [C: 04-2] Removed exim errors_to to support custom Return-Path (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 (owner: 1001tonythomas) [11:33:51] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 09:33:16 UTC [11:50:51] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 03:46:50 UTC [11:52:51] PROBLEM - Puppet freshness on stat1002 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 03:49:17 UTC [11:53:41] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Fri Aug 1 11:53:35 UTC 2014 [11:54:56] https://dumps.wikimedia.org/ is down for a couple of hours now - I also can't reach the dumps server via stat1002:/mnt/data (weirdly commands cd or df don't fail, just hang) [11:55:21] ezachte: ah, thanks... [11:55:46] do you know whom to ping? maybe? [11:56:18] jorn, is this the wrong channel? [11:56:51] PROBLEM - Puppet freshness on stat1003 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 03:53:16 UTC [11:57:14] no, i just was thinking about writing that here as well but didn't know if it's the right channel... [11:57:39] ezachte, jorn: andrewbogott_afk (see channel topic) would be the one to ask. 
[11:57:41] ezachte: thanks, i'll contact ariel [11:57:46] Bat as he is not arount. [11:57:59] * qchris cannot type :-( [11:58:10] he isn't I tried mail couple of hours ago [11:58:29] no, of course he's not around in this timezone ;) [11:58:40] but ariel is here now [11:58:43] ezachte: immediately before you joined the channel icinga-wm posted: PROBLEM - Puppet freshness on stat1002 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 03:49:17 UTC [11:58:50] ezachte: I'm here and looking at it [11:59:08] ok, thx :-) [12:06:01] PROBLEM - Host dataset1001 is DOWN: CRITICAL - Host Unreachable (208.80.154.11) [12:06:52] RECOVERY - Host dataset1001 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [12:06:52] RECOVERY - DPKG on dataset1001 is OK: All packages OK [12:06:53] RECOVERY - LighttpdHTTP on dataset1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5122 bytes in 0.009 second response time [12:06:53] RECOVERY - SSH on dataset1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [12:06:53] RECOVERY - RAID on dataset1001 is OK: OK: optimal, 2 logical, 24 physical [12:07:11] RECOVERY - Puppet freshness on dataset1001 is OK: puppet ran at Fri Aug 1 12:07:01 UTC 2014 [12:07:21] RECOVERY - puppet disabled on dataset1001 is OK: OK [12:07:22] (03PS11) 1001tonythomas: Removed exim errors_to to support custom Return-Path [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 [12:07:22] RECOVERY - check if dhclient is running on dataset1001 is OK: PROCS OK: 0 processes with command name dhclient [12:07:22] RECOVERY - check configured eth on dataset1001 is OK: NRPE: Unable to read output [12:07:28] !log powercycled dataset1001, inaccessible via mgmt console, only visible message was 'mnt.nfs failed' [12:07:34] Logged the message, Master [12:08:41] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [12:08:44] (03CR) 1001tonythomas: "@Mark:- My mistake. Looks like I got it wrong after the rebase. Now looks good ?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 (owner: 1001tonythomas) [12:08:46] (03CR) 10Alexandros Kosiaris: [C: 032] "Catalog compiled successfully. 
Finally merging" [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 (owner: 10Matanya) [12:08:51] RECOVERY - Puppet freshness on stat1002 is OK: puppet ran at Fri Aug 1 12:08:46 UTC 2014 [12:09:01] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [12:09:31] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [12:09:52] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [12:10:42] RECOVERY - puppet last run on snapshot1004 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [12:13:41] RECOVERY - Puppet freshness on stat1003 is OK: puppet ran at Fri Aug 1 12:13:35 UTC 2014 [12:13:42] PROBLEM - https.etherpad.wikimedia.org on zirconium is CRITICAL: HTTP CRITICAL: HTTP/1.1 302 Found - string titleEtherpad not found on https://etherpad.wikimedia.org:443/p/Etherpad - 566 bytes in 0.021 second response time [12:14:41] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [12:15:51] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [12:17:26] atop showed nothing, I don't see anything too crazy in the logs but going to keep an eye on it, maybe the switchover to the new nfs service went awry somehow [12:17:34] shouldn't really hang the box... anyways [12:23:23] PROBLEM - etherpad_lite_process_running on zirconium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^node node_modules/ep_etherpad-lite/node/server.js [12:23:44] ah yes... I know what this is... [12:23:47] (03PS1) 10Alexandros Kosiaris: Fix errors introduced in 487eead [operations/puppet] - 10https://gerrit.wikimedia.org/r/151063 [12:28:18] (03PS2) 10Alexandros Kosiaris: Fix errors introduced in 487eead [operations/puppet] - 10https://gerrit.wikimedia.org/r/151063 [12:28:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Fix errors introduced in 487eead [operations/puppet] - 10https://gerrit.wikimedia.org/r/151063 (owner: 10Alexandros Kosiaris) [12:31:24] RECOVERY - etherpad_lite_process_running on zirconium is OK: PROCS OK: 1 process with regex args ^node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [12:40:18] (03Abandoned) 10Giuseppe Lavagetto: hhvm: lintian fixes [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/150213 (owner: 10Giuseppe Lavagetto) [12:40:42] (03Abandoned) 10Giuseppe Lavagetto: hhvm: lintian fixes [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/150826 (owner: 10Giuseppe Lavagetto) [12:45:13] (03PS3) 10Giuseppe Lavagetto: hhvm: provide hhvm-api-$VERSION [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/150845 [12:45:36] Hi mark. When you are free, can you take one more look at https://gerrit.wikimedia.org/r/#/c/141287/ ? [12:49:15] (03CR) 10Mark Bergsma: "Please remove the mail.ini file from the repository as well in this commit. Then it's good to go." [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 (owner: 1001tonythomas) [12:54:36] (03PS12) 1001tonythomas: Removed exim errors_to to support custom Return-Path [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 [12:55:04] mark: :) done. Looks good ? 
[12:56:16] (03CR) 10Mark Bergsma: [C: 032] Removed exim errors_to to support custom Return-Path [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 (owner: 1001tonythomas) [12:59:12] mark: yay! Thanks [12:59:44] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:04] PROBLEM - puppet last run on mw1058 is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:04] PROBLEM - puppet last run on mw1102 is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:07] hm that broke [13:00:11] i'll fix that [13:00:14] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:18] (03CR) 10Alexandros Kosiaris: Mathoid configuration for beta labs (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/148836 (owner: 10Physikerwelt) [13:00:24] PROBLEM - puppet last run on mw1095 is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:37] (03PS16) 10Alexandros Kosiaris: Mathoid configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/148836 (owner: 10Physikerwelt) [13:00:43] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:43] PROBLEM - puppet last run on mw1157 is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:44] PROBLEM - puppet last run on mw1075 is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:53] PROBLEM - puppet last run on mw1101 is CRITICAL: CRITICAL: Puppet has 1 failures [13:00:53] PROBLEM - puppet last run on mw1070 is CRITICAL: CRITICAL: Puppet has 1 failures [13:01:03] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: Puppet has 1 failures [13:01:13] PROBLEM - puppet last run on mw1094 is CRITICAL: CRITICAL: Puppet has 1 failures [13:01:13] PROBLEM - puppet last run on tmh1002 is CRITICAL: CRITICAL: Puppet has 1 failures [13:01:14] PROBLEM - puppet last run on mw1015 is CRITICAL: CRITICAL: Puppet has 1 failures [13:01:24] PROBLEM - puppet last run on mw1078 is CRITICAL: CRITICAL: Puppet has 1 failures [13:01:31] <_joe_> mmmh [13:01:36] (03PS1) 10Mark Bergsma: Remove source reference to mail.ini [operations/puppet] - 10https://gerrit.wikimedia.org/r/151075 [13:01:43] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: Puppet has 1 failures [13:01:44] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Puppet has 1 failures [13:01:44] PROBLEM - puppet last run on mw1127 is CRITICAL: CRITICAL: Puppet has 1 failures [13:01:53] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 1 failures [13:01:54] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: Puppet has 1 failures [13:02:01] <_joe_> for the record [13:02:03] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Puppet has 1 failures [13:02:03] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Puppet has 1 failures [13:02:06] <_joe_> it's working fine [13:02:22] what is? 
[13:02:23] PROBLEM - puppet last run on mw1138 is CRITICAL: CRITICAL: Puppet has 1 failures [13:02:24] PROBLEM - puppet last run on mw1100 is CRITICAL: CRITICAL: Puppet has 1 failures [13:02:24] PROBLEM - puppet last run on mw1196 is CRITICAL: CRITICAL: Puppet has 1 failures [13:02:24] PROBLEM - puppet last run on mw1036 is CRITICAL: CRITICAL: Puppet has 1 failures [13:02:33] <_joe_> no sorry it isn't [13:02:35] (03CR) 10Mark Bergsma: [C: 032] Remove source reference to mail.ini [operations/puppet] - 10https://gerrit.wikimedia.org/r/151075 (owner: 10Mark Bergsma) [13:02:43] PROBLEM - puppet last run on mw1218 is CRITICAL: CRITICAL: Puppet has 1 failures [13:02:43] PROBLEM - puppet last run on mw1184 is CRITICAL: CRITICAL: Puppet has 1 failures [13:02:44] PROBLEM - puppet last run on mw1013 is CRITICAL: CRITICAL: Puppet has 1 failures [13:02:53] PROBLEM - puppet last run on mw1109 is CRITICAL: CRITICAL: Puppet has 1 failures [13:03:04] PROBLEM - puppet last run on mw1035 is CRITICAL: CRITICAL: Puppet has 1 failures [13:03:05] PROBLEM - puppet last run on mw1040 is CRITICAL: CRITICAL: Puppet has 1 failures [13:03:13] PROBLEM - puppet last run on mw1096 is CRITICAL: CRITICAL: Puppet has 1 failures [13:03:43] PROBLEM - puppet last run on mw1147 is CRITICAL: CRITICAL: Puppet has 1 failures [13:03:43] PROBLEM - puppet last run on mw1192 is CRITICAL: CRITICAL: Puppet has 1 failures [13:03:43] PROBLEM - puppet last run on mw1062 is CRITICAL: CRITICAL: Puppet has 1 failures [13:03:43] PROBLEM - puppet last run on mw1161 is CRITICAL: CRITICAL: Puppet has 1 failures [13:03:44] PROBLEM - puppet last run on mw1028 is CRITICAL: CRITICAL: Puppet has 1 failures [13:03:44] PROBLEM - puppet last run on mw1124 is CRITICAL: CRITICAL: Puppet has 1 failures [13:03:53] PROBLEM - puppet last run on mw1132 is CRITICAL: CRITICAL: Puppet has 1 failures [13:04:03] PROBLEM - puppet last run on mw1048 is CRITICAL: CRITICAL: Puppet has 1 failures [13:04:13] PROBLEM - puppet last run on mw1115 is CRITICAL: CRITICAL: Puppet has 1 failures [13:04:13] PROBLEM - puppet last run on mw1130 is CRITICAL: CRITICAL: Puppet has 1 failures [13:04:23] PROBLEM - puppet last run on mw1216 is CRITICAL: CRITICAL: Puppet has 1 failures [13:04:44] PROBLEM - puppet last run on mw1080 is CRITICAL: CRITICAL: Puppet has 1 failures [13:04:44] PROBLEM - puppet last run on mw1089 is CRITICAL: CRITICAL: Puppet has 1 failures [13:04:44] PROBLEM - puppet last run on mw1005 is CRITICAL: CRITICAL: Puppet has 1 failures [13:04:53] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: Puppet has 1 failures [13:04:53] PROBLEM - puppet last run on mw1067 is CRITICAL: CRITICAL: Puppet has 1 failures [13:04:53] PROBLEM - puppet last run on mw1200 is CRITICAL: CRITICAL: Puppet has 1 failures [13:05:03] PROBLEM - puppet last run on mw1038 is CRITICAL: CRITICAL: Puppet has 1 failures [13:05:04] PROBLEM - puppet last run on mw1106 is CRITICAL: CRITICAL: Puppet has 1 failures [13:05:13] PROBLEM - puppet last run on mw1134 is CRITICAL: CRITICAL: Puppet has 1 failures [13:05:13] PROBLEM - puppet last run on mw1141 is CRITICAL: CRITICAL: Puppet has 1 failures [13:05:13] PROBLEM - puppet last run on mw1031 is CRITICAL: CRITICAL: Puppet has 1 failures [13:05:14] PROBLEM - puppet last run on mw1072 is CRITICAL: CRITICAL: Puppet has 1 failures [13:05:23] RECOVERY - puppet last run on mw1100 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [13:05:34] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: Puppet has 1 failures 
[13:05:43] PROBLEM - puppet last run on mw1197 is CRITICAL: CRITICAL: Puppet has 1 failures [13:05:43] PROBLEM - puppet last run on mw1145 is CRITICAL: CRITICAL: Puppet has 1 failures [13:05:43] PROBLEM - puppet last run on mw1045 is CRITICAL: CRITICAL: Puppet has 1 failures [13:05:53] PROBLEM - puppet last run on mw1006 is CRITICAL: CRITICAL: Puppet has 1 failures [13:05:53] PROBLEM - puppet last run on mw1041 is CRITICAL: CRITICAL: Puppet has 1 failures [13:05:53] PROBLEM - puppet last run on mw1063 is CRITICAL: CRITICAL: Puppet has 1 failures [13:05:53] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: Puppet has 1 failures [13:06:13] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: Puppet has 1 failures [13:06:24] PROBLEM - puppet last run on mw1082 is CRITICAL: CRITICAL: Puppet has 1 failures [13:06:33] PROBLEM - puppet last run on mw1174 is CRITICAL: CRITICAL: Puppet has 1 failures [13:06:43] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Puppet has 1 failures [13:06:44] PROBLEM - puppet last run on mw1059 is CRITICAL: CRITICAL: Puppet has 1 failures [13:07:03] PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: Puppet has 1 failures [13:07:38] looks like we'll need to restart apache, puppet doesn't do so [13:13:03] PROBLEM - Puppet freshness on db1011 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 11:12:49 UTC [13:13:43] RECOVERY - Puppet freshness on db1011 is OK: puppet ran at Fri Aug 1 13:13:34 UTC 2014 [13:14:28] mark: oh no. [13:14:40] fixed now ? [13:18:04] RECOVERY - puppet last run on mw1058 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [13:18:04] RECOVERY - puppet last run on mw1102 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [13:18:14] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:18:14] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [13:18:24] RECOVERY - puppet last run on mw1095 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [13:18:33] RECOVERY - puppet last run on mw1078 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [13:18:43] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [13:18:43] RECOVERY - puppet last run on mw1157 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [13:18:44] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [13:18:44] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [13:18:44] RECOVERY - puppet last run on mw1075 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [13:18:53] RECOVERY - puppet last run on mw1101 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [13:18:53] RECOVERY - puppet last run on mw1070 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [13:19:03] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:19:43] RECOVERY - puppet last run on mw1184 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [13:19:43] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 13 
seconds ago with 0 failures [13:19:53] RECOVERY - puppet last run on mw1127 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [13:19:53] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [13:19:54] RECOVERY - puppet last run on mw1191 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [13:20:03] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [13:20:04] RECOVERY - puppet last run on mw1019 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [13:20:13] RECOVERY - puppet last run on mw1094 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [13:20:13] RECOVERY - puppet last run on tmh1002 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [13:20:23] RECOVERY - puppet last run on mw1138 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [13:20:43] RECOVERY - puppet last run on mw1218 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [13:20:44] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [13:21:04] RECOVERY - puppet last run on mw1035 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [13:21:04] RECOVERY - puppet last run on mw1040 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [13:21:15] RECOVERY - puppet last run on mw1096 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [13:21:15] RECOVERY - puppet last run on mw1130 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [13:21:23] RECOVERY - puppet last run on mw1216 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:21:24] RECOVERY - puppet last run on mw1196 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [13:21:24] RECOVERY - puppet last run on mw1036 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures [13:21:43] RECOVERY - puppet last run on mw1147 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [13:21:43] RECOVERY - puppet last run on mw1062 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [13:21:43] RECOVERY - puppet last run on mw1192 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [13:21:44] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [13:21:53] RECOVERY - puppet last run on mw1109 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [13:22:13] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [13:22:43] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [13:22:44] RECOVERY - puppet last run on mw1080 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:22:44] RECOVERY - puppet last run on mw1089 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [13:22:44] RECOVERY - puppet last run on mw1028 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [13:22:44] RECOVERY - puppet last run on mw1124 is OK: OK: Puppet is 
currently enabled, last run 36 seconds ago with 0 failures [13:22:44] RECOVERY - puppet last run on mw1005 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [13:22:53] RECOVERY - puppet last run on mw1067 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [13:22:54] RECOVERY - puppet last run on mw1132 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [13:23:03] RECOVERY - puppet last run on mw1038 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [13:23:03] RECOVERY - puppet last run on mw1048 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [13:23:13] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [13:23:14] RECOVERY - puppet last run on mw1031 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [13:23:14] RECOVERY - puppet last run on mw1072 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [13:23:43] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [13:23:44] RECOVERY - puppet last run on mw1045 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [13:23:53] RECOVERY - puppet last run on mw1006 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [13:23:54] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [13:24:13] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [13:24:13] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [13:24:24] RECOVERY - puppet last run on mw1082 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [13:24:33] RECOVERY - puppet last run on mw1174 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [13:24:43] RECOVERY - puppet last run on mw1197 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [13:24:44] RECOVERY - puppet last run on mw1059 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:24:53] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [13:24:54] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:24:54] RECOVERY - puppet last run on mw1063 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [13:24:54] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [13:25:03] RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [13:25:03] RECOVERY - puppet last run on mw1106 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [13:26:44] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [13:30:14] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Mathoid configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/148836 (owner: 10Physikerwelt) [13:31:22] (03PS2) 10Filippo Giunchedi: beta + hhvm: Add bt-hhvm dump script [operations/puppet] - 
10https://gerrit.wikimedia.org/r/150593 (owner: 10BryanDavis) [13:31:43] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [13:32:07] (03CR) 10Filippo Giunchedi: "updated bt-hhvm to also copy the binary as per bsimmers" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150593 (owner: 10BryanDavis) [13:43:17] (03PS3) 10BBlack: beta: Remove require of Ferm::Rule['bastion-ssh'] [operations/puppet] - 10https://gerrit.wikimedia.org/r/150576 (owner: 10BryanDavis) [13:43:23] (03CR) 10BBlack: [C: 032] beta: Remove require of Ferm::Rule['bastion-ssh'] [operations/puppet] - 10https://gerrit.wikimedia.org/r/150576 (owner: 10BryanDavis) [13:46:54] PROBLEM - puppetmaster backend https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:44] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.046 second response time [13:48:54] PROBLEM - MySQL Replication Heartbeat on db69 is CRITICAL: CRIT replication delay 326 seconds [13:49:13] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 333 seconds [13:49:24] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 340 seconds [13:50:03] (03PS2) 10Ottomata: stats: Install php commandline packages to the crunchers [operations/puppet] - 10https://gerrit.wikimedia.org/r/150985 (https://bugzilla.wikimedia.org/68937) (owner: 10Yuvipanda) [13:50:11] (03CR) 10Ottomata: [C: 032 V: 032] stats: Install php commandline packages to the crunchers [operations/puppet] - 10https://gerrit.wikimedia.org/r/150985 (https://bugzilla.wikimedia.org/68937) (owner: 10Yuvipanda) [13:51:43] (03PS2) 10Filippo Giunchedi: swift-thumb-stats: dump thumb stats from swift [operations/software] - 10https://gerrit.wikimedia.org/r/148997 [13:52:43] nice filippo :) [13:53:48] mark: ye it takes a while to walk 330M objects and swift returns only 10k pages :( [13:54:08] i'm sure :) [13:54:15] RECOVERY - MySQL Replication Heartbeat on db1018 is OK: OK replication delay 137 seconds [13:54:24] RECOVERY - MySQL Slave Delay on db1018 is OK: OK replication delay 130 seconds [13:54:54] RECOVERY - MySQL Replication Heartbeat on db69 is OK: OK replication delay 142 seconds [13:58:25] heya chasemp, q about admin data.yaml [13:58:28] yt? [13:59:43] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [14:07:21] strontium!? [14:07:48] weird, [14:07:56] fixed. but dunno why they hook wouldn't have worked [14:08:03] the* [14:08:43] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [14:08:53] ottomata: I noticed it fails asking for a name, and that in turn means it is trying to commit while merging; no idea why it doesn't consider that a fast forward, though [14:09:18] asking for a name? [14:09:31] asking for the user's identity [14:09:43] name/email that is [14:10:11] <_joe_> what did rsync did to us?
:P [14:10:21] <_joe_> s/did/do/ [14:16:05] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [14:27:39] (03PS2) 10Ottomata: Separate kafka-mirror out into its own package [operations/debs/kafka] - 10https://gerrit.wikimedia.org/r/150883 [14:34:03] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 12:32:57 UTC [14:42:00] and I am out for the weekend. Have family at home [14:42:04] see you on monday [14:44:35] akosiaris: Thank you for reviewing the Mathoid change [14:50:06] physikerwelt: you are welcome. I merged it so that part is done [14:52:24] ok… who understands about submodules? Specifically the 'worktree' setting in .git/modules/blahblah/config? [14:52:33] I ask because that contains an absolute path... [14:52:40] The next step would be to assign the mathoid role to at least one instance [14:52:44] whereas pretty much everything else about a git tree can be mv'd [14:52:47] am I right? [14:52:50] but apparently not if there are submodules? [14:52:57] YuviPanda: thoughts about ^ ? [14:53:14] hmm, unsure. [14:53:17] Reedy: ^ [14:53:43] andrewbogott: btw, RoanKattouw_away's issues with VE were apparently *exactly* the same. he apparently just did a fresh clone of the repo at the same path, and everything 'magically worked' [14:54:27] physikerwelt: yup. In labs I suppose first and well obviously production later on [14:55:02] bah, apparently this is a bug in git, fixed in 1.7.10 (and I'm now running 1.7.9.5. So close!) [14:55:44] andrewbogott: hmm, 'sudo apt-get install git'? I wonder if there's an update in precise [14:55:58] bah [14:55:59] nope [14:55:59] no [14:56:04] grr [14:59:27] YuviPanda: yep, that's what was happening last night. All the commands I was issuing were happening… elsewhere. [14:59:37] Due to that absolute path not pointing to the actual working dir. [14:59:40] ugh [14:59:56] first time I've ever hit a bug on git [14:59:58] itself [15:00:00] So, moved that to point to the right place, and all is well. [15:00:08] Well, maybe self-inflicted because I mv'd a repo? [15:00:15] But that works in more-or-less every other situation. [15:00:23] Anyway, now I just have to fix that path in… every other submodule :( [15:00:30] Want me to turn OAuth back on so you can try? [15:00:46] andrewbogott: yeah, sure [15:01:42] Hm, well, wikitech still loads, that's something! Try now? [15:01:48] :D [15:01:49] moment [15:04:33] andrewbogott: hmm, did you get a fatal this time? [15:04:46] nope [15:04:52] * YuviPanda checks [15:05:35] andrewbogott: ah, hmm. 'Error: An error occurred in the OAuth protocol: Invalid consumer' [15:05:43] andrewbogott: maybe I should register another one.
[15:05:44] * YuviPanda does [15:13:03] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Fri Aug 1 15:13:01 UTC 2014 [15:14:09] (03PS3) 10Ottomata: Separate kafka-mirror out into its own package [operations/debs/kafka] - 10https://gerrit.wikimedia.org/r/150883 [15:14:59] (03CR) 10Ottomata: [C: 032 V: 032] Separate kafka-mirror out into its own package [operations/debs/kafka] - 10https://gerrit.wikimedia.org/r/150883 (owner: 10Ottomata) [15:18:00] andrewbogott: Y'know, I'm sure we had a very similar issue on production originally [15:18:03] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 13:17:07 UTC [15:18:06] When we git cloned to one location, then moved it [15:18:09] then git hell broke loose [15:18:15] Reedy: yep [15:18:26] Seems to be a documented git bug. Or, lack of git feature :) [15:18:34] (03CR) 10BryanDavis: "Cherry-picked patch set #2 to deployment-salt" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150593 (owner: 10BryanDavis) [15:18:34] I think I commented yesterday to Roan/Yuvi that we're probably running a "buggy" git version [15:19:29] Folks on stack overflow think that I can sed the workdir in all these submodule configs to be a relative path. [15:19:33] Relative to what, I'm wondering? [15:19:52] lol [15:20:01] I'd be tempted to just nuke the tree and reclone [15:20:11] then move cache/images etc back in [15:20:37] Well, that will /guarantee/ an outage, and still get me a tree that can't be moved... [15:20:59] I guess since I'm in the middle of a SWAT window I should just stand back. [15:21:09] I'll try to sed in a bit. [15:21:55] symlinks, moving and stuff [15:21:56] god knows :) [15:22:44] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [15:26:19] andrewbogott: YUSSSS, IT WORKSSSSS [15:26:20] thanks :D [15:26:25] cool! [15:32:15] (03PS1) 10Ottomata: [WIP] Create icinga alerts if hourly HDFS webrequest imports have missing or duplicate data [operations/puppet] - 10https://gerrit.wikimedia.org/r/151095 [15:33:48] andrewbogott: merge https://gerrit.wikimedia.org/r/#/c/150970/? trivial, and is already running on quarry.wmflabs [15:36:28] (03CR) 10Andrew Bogott: [C: 032] quarry: Switch to halfak's mwoauth library [operations/puppet] - 10https://gerrit.wikimedia.org/r/150970 (owner: 10Yuvipanda) [15:36:34] andrewbogott: ty [15:36:46] Reedy, andrewbogott: Oh. I know about the git submodules making non-relocatable paths. That's why I wrote https://gerrit.wikimedia.org/r/#/c/130498/. The version of git we have from precise makes submodule paths absolute rather than relative in the working copy. [15:37:20] This has been changed in newer versions of git to use relative paths which is much more sane IMHO [15:37:22] yeah, I noticed :( [15:38:42] (03PS2) 10Ottomata: [WIP] Create icinga alerts if hourly HDFS webrequest imports have missing or duplicate data [operations/puppet] - 10https://gerrit.wikimedia.org/r/151095 [15:38:55] hey is cmjohnson1 out today? [15:42:43] !next [15:42:50] Reedy: That talk of relative links in submodules reminds me of something else. I think we should use relative links in the .gitmodules file for wmf branches too. -- http://blog.tremily.us/posts/Relative_submodules/ [15:43:03] jouncebot: next [15:43:20] jouncebot: reload [15:43:30] jouncebot: OP PLZ REPLY [15:43:31] SWAT window [15:43:50] on friday? [15:44:03] Does it restart after being killed yet?
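A minimal sketch of the inspect-then-sed fix for the submodule worktree problem discussed above, assuming a repo whose submodules were registered by an old git with absolute core.worktree values; /old/path and /srv/mediawiki are illustrative placeholders, not paths from the log:

    cd /srv/mediawiki    # placeholder: the repo that was mv'd
    # show the worktree recorded for each submodule (absolute on git < 1.7.10)
    for cfg in .git/modules/*/config; do
        echo "$cfg: $(git config --file "$cfg" core.worktree)"
    done
    # rewrite the stale absolute prefix to the repo's new location
    for cfg in .git/modules/*/config; do
        sed -i 's|worktree = /old/path|worktree = /srv/mediawiki|' "$cfg"
    done

Nested submodules would need the same treatment one level deeper, under .git/modules/*/modules/.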
[15:44:08] <_joe_> afternoon [15:44:21] c'mon jenkins, hurry merge kthx [15:46:04] PROBLEM - Unmerged changes on repository puppet on virt0 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [15:48:21] !log reedy Synchronized php-1.24wmf16/includes/specials/SpecialRecentchangeslinked.php: (no message) (duration: 00m 14s) [15:48:27] Logged the message, Master [15:53:27] (03PS4) 10Ori.livneh: mediawiki: use HHVM module on trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/150873 [15:53:39] ^ _joe_, godog. (and: afternoon) [15:56:06] Wooo I actually closed an RT ticket [15:57:33] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Fri Aug 1 15:57:24 UTC 2014 [15:58:13] * YuviPanda congratulates andrewbogott [16:01:03] PROBLEM - Puppet freshness on db1010 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 14:00:08 UTC [16:01:24] ori: hey [16:01:33] hello [16:03:23] ori: hiya [16:03:24] haa [16:03:30] how do I fix vagrant?! [16:03:43] when running setup.sh [16:03:44] Message: The mediawiki-vagrant plugin hasn't been installed yet. Please run `setup.sh`. [16:03:44] Failed to execute command `vagrant plugin list` (pid 13482 exit 1) [16:03:44] ottomata: How did you break it? [16:03:50] i pulled! [16:04:03] Upgrade your Vagrant install [16:04:06] i did! [16:04:19] i think... [16:04:28] vagrant --version [16:04:37] hm, no it is old, i thought i installed the package already though [16:04:38] grr [16:04:40] ok, thanks... [16:04:54] hm, maybe I just upgraded virtual box? [16:04:55] ok.. [16:07:24] (03PS1) 10BryanDavis: beta: puppet rebase script [operations/puppet] - 10https://gerrit.wikimedia.org/r/151099 (https://bugzilla.wikimedia.org/66683) [16:08:37] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "Please note that all this does is adding CSS class names to the HTML output. At the moment there is no CSS assigned to these class names." (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/149918 (https://bugzilla.wikimedia.org/40810) (owner: 10Bene) [16:11:18] (03CR) 10Filippo Giunchedi: [C: 031] mediawiki: use HHVM module on trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/150873 (owner: 10Ori.livneh) [16:11:35] ori: LGTM, did you have a chance to give it to the puppet compiler? [16:12:35] godog: no, it was broken yesterday when i tried [16:12:43] (the compiler, not the patch) [16:12:52] i did apply it on beta, where it did the right thing [16:13:05] but not with the very latest patchset [16:13:11] if you'll be around in an hour i'd like to try then [16:13:48] <_joe_> the compiler is broken because something broke labs ldap [16:13:55] <_joe_> and I had no time to look at it [16:14:40] (03PS2) 10BryanDavis: beta: puppet rebase script [operations/puppet] - 10https://gerrit.wikimedia.org/r/151099 (https://bugzilla.wikimedia.org/66683) [16:15:23] 'something broke labs ldap'? [16:16:08] <_joe_> andrewbogott: two days ago, in the european morning, I guess somebody did something that fixed it [16:16:46] hm… there was a brief wikitech outage around then. But I can't think how that's related to ldap. [16:16:49] You don't know any more?
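For context on the labs ldap breakage: the diagnosis that surfaces a few lines below is that a libpam-ldapd upgrade clobbered the labs-specific /etc/nslcd.conf until the next puppet run. A rough way to confirm and fix an affected instance (the username is a placeholder):

    getent passwd someuser    # LDAP-backed lookup; fails while nslcd.conf is wrong
    sudo puppet agent -tv     # a full puppet run restores the managed /etc/nslcd.conf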
[16:17:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [16:18:08] ori: I should be around, I may play the paranoid card though on a friday before wikimania :)) [16:18:13] (03PS3) 10BryanDavis: beta: puppet rebase script [operations/puppet] - 10https://gerrit.wikimedia.org/r/151099 (https://bugzilla.wikimedia.org/66683) [16:20:08] godog, andrewbogott: There was an apt package installed that modifies an /etc file. Until puppet puts the right contents back instances can't see labs ldap [16:20:20] It caused problems in beta too [16:20:32] bd808: so things are still broken? [16:20:40] Or just during the period between two puppet runs? [16:20:44] A puppet run fixes it [16:20:48] I see, ok. [16:20:53] RECOVERY - Puppet freshness on db1010 is OK: puppet ran at Fri Aug 1 16:20:49 UTC 2014 [16:21:20] but we had 4 hosts where puppet manifests were breaking the run so I had to fix that first for us [16:21:31] Anybody who has puppet disabled will be messed up [16:21:51] anyone who has puppet disabled won't get the apt update will they? [16:21:58] Or was that salted? [16:22:34] I don't think it was sent with salt because it happened in beta and we are attached to our own salt master [16:22:53] But the package was updated on hosts that had puppet disabled [16:23:20] So maybe we have something cron'd for security updates? [16:23:39] Doubt it. [16:24:07] hashar tracked down the package change that caused the problem... I could try to find it in my irc logs [16:24:17] UHGHGHGHHG TRUSTY [16:24:24] growl growl growl [16:24:36] i can't use vagrant to develop puppet anymore :( :( [16:25:05] ottomata: There is an mw-vagrant branch to stay on precise [16:25:20] (03CR) 10Thiemo Mättig (WMDE): [C: 031] "Let's try this!" (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/149928 (https://bugzilla.wikimedia.org/40810) (owner: 10Bene) [16:25:33] ottomata: Checkout the precise-compat branch [16:25:56] You won't be getting new hotness; it's unmaintained at this point [16:25:57] ah ok! [16:26:01] thank you! [16:26:06] yay useful branch? :) [16:26:17] yes think so! [16:30:36] (03CR) 10QChris: "Except for the exit code, only Nits." (0312 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151095 (owner: 10Ottomata) [16:32:46] godog, andrewbogott: Apparently it was partial puppet runs that caused the problem in beta. libpam-ldapd went from '0.8.4ubuntu0.2' to '0.8.4ubuntu0.3' and then /etc/nslcd.conf wasn't changed back to the labs custom version. [16:33:01] (03CR) 10QChris: [WIP] Create icinga alerts if hourly HDFS webrequest imports have missing or duplicate data (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151095 (owner: 10Ottomata) [16:33:11] bd808: ok, that makes a lot more sense [16:36:59] gah, ensure => latest alright, still baffled that it used the package version rather than the local version tho [16:37:58] godog: hashar complained about the same thing. [16:38:28] (03PS1) 10Alexandros Kosiaris: Improve package in various way [operations/debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/151106 [16:38:56] a tad late, but upstart support for etherpad-lite !! [16:39:05] (03CR) 10BryanDavis: "Cherry-picked to deployment-salt and manually added cron for bd808 removed in favor of this."
[operations/puppet] - 10https://gerrit.wikimedia.org/r/151099 (https://bugzilla.wikimedia.org/66683) (owner: 10BryanDavis) [16:39:07] (03CR) 10QChris: [WIP] Create icinga alerts if hourly HDFS webrequest imports have missing or duplicate data (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151095 (owner: 10Ottomata) [16:39:38] need to add systemd too at some point [16:40:42] akosiaris: heh, I wonder which LTS version would be on systemd [16:41:54] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2035: active_shards: 6104: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1 [16:41:54] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2035: active_shards: 6104: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1 [16:41:55] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2035: active_shards: 6104: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1 [16:41:55] PROBLEM - ElasticSearch health check on elastic1019 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2035: active_shards: 6104: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1 [16:41:55] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2035: active_shards: 6104: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1 [16:41:59] YuviPanda: that is easy. 16.04 [16:42:03] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2035: active_shards: 6104: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 1 [16:42:04] so in 2 years [16:42:08] ah, nice [16:42:13] RECOVERY - Unmerged changes on repository puppet on virt0 is OK: No changes to merge. [16:42:14] (03PS1) 10Ottomata: chown /var/spool/kafka/ in kafka-server.postinst [operations/debs/kafka] - 10https://gerrit.wikimedia.org/r/151107 [16:42:33] uh oh red elasticsearch... [16:42:54] RECOVERY - ElasticSearch health check on elastic1019 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2036: active_shards: 6107: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [16:42:54] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2036: active_shards: 6107: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [16:42:54] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2036: active_shards: 6107: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [16:42:54] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2036: active_shards: 6107: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [16:43:03] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2036: active_shards: 6107: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [16:43:03] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 19: number_of_data_nodes: 19: active_primary_shards: 2036: active_shards: 6107: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [16:43:03] flapping ? false alarm ? [16:44:59] (03PS1) 10Chad: Adding missing Swift dependencies [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/151108 [16:45:43] rejected execution (queue capacity 1000) [16:46:00] looks like not quite false...i think they are busy nodes [16:47:12] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Improve package in various way [operations/debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/151106 (owner: 10Alexandros Kosiaris) [16:47:44] <^d> Hmm. [16:47:46] <^d> What? [16:47:59] ^d, dunno, just saw those alerts and am looking for any clues [16:48:01] saw that in the logs [16:48:15] [DEBUG][action.search.type ] [elastic1016] [351389940] Failed to execute fetch phase [16:48:15] org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.search.action.SearchServiceTransportAction$23@790ea0d6 [16:48:32] oh that's old though [16:48:37] 14:40 [16:48:42] <^d> 1016 has way too much load. [16:48:42] 2 hours old [16:48:44] yeah [16:48:46] <^d> I'm going to roll it. [16:48:50] that was from 1016 [16:48:51] roll it? [16:49:13] <^d> Well, move shards off, restart ES, move back in. [16:49:49] <^d> Hmm. [16:51:15] <^d> The heck? [16:51:21] <^d> Snapshots? [16:51:23] (03CR) 10Filippo Giunchedi: beta: puppet rebase script (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151099 (https://bugzilla.wikimedia.org/66683) (owner: 10BryanDavis) [16:51:54] (03PS5) 10Ori.livneh: mediawiki: use HHVM module on trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/150873 [16:52:37] <^d> Oh, hmm. [16:52:48] (03PS4) 10BryanDavis: beta: puppet rebase script [operations/puppet] - 10https://gerrit.wikimedia.org/r/151099 (https://bugzilla.wikimedia.org/66683) [16:52:58] (03CR) 10BryanDavis: beta: puppet rebase script (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151099 (https://bugzilla.wikimedia.org/66683) (owner: 10BryanDavis) [16:53:57] <^d> Yo manybubbles!
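The status and shard counts in the alerts above are read straight from the Elasticsearch cluster health API, which can be polled by hand while debugging, along with the per-node hot threads view that ^d consults just below (elastic1016 is the busy node from the conversation):

    curl -s localhost:9200/_cluster/health?pretty
    curl -s localhost:9200/_nodes/elastic1016/hot_threads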
[16:54:11] ^d: I'm back! [16:54:23] <^d> 1016 is kind of unhappy. hot_threads says its churning on merging enwiki. [16:54:48] andrewbogott: it's still failing: https://integration.wikimedia.org/ci/computer/puppet-compiler02.eqiad.wmflabs/log [16:55:01] "ERROR: Server rejected the 1 private key(s) for jenkins-deploy" [16:55:36] <^d> ottomata: That error you see is ES rejecting a search query because it's too full doing other things. [16:56:02] (03CR) 10Filippo Giunchedi: [C: 031] beta: puppet rebase script [operations/puppet] - 10https://gerrit.wikimedia.org/r/151099 (https://bugzilla.wikimedia.org/66683) (owner: 10BryanDavis) [16:56:17] ^d: I don't see any such in the log - but its certainly possible something is up [16:56:23] bd808: r/151099 looks good, happy to merge too if need be [16:57:12] the message i saw was 2 hours old [16:57:17] ^d: that one has too many large shards on it - the best thing we can do to it if we think it is unstable is to swap a large shard on it with a small one using the force allocation api [16:57:23] manybubbles: cluster state was red for a few minutes [16:57:28] ori: Is that the ldap thing? Or… [16:57:31] godog: Cool. Fastest puppet review I've ever had. :) [16:57:40] ottomata: ah - possible - I'm rebuilding shards [16:58:27] oh ok [16:58:28] phew [16:58:59] bd808: haha easy enough, eyeballed it and it is beta only :)) [16:59:06] ori: Want to quickly give tonythomas shell on beta? [16:59:22] gsoc student [16:59:35] hoo: Does he have a signed NDA? [16:59:38] hoo: I think he needs to sign a... [16:59:42] yeah, what bd808 said :) [16:59:51] signed NDA ? [17:00:14] beta has logs with private info (real ip addresses) [17:00:39] bd808: Do we require that now? [17:00:50] hoo: As long as I've been here, yeah [17:00:51] * YuviPanda thinks it's a terrible idea [17:00:54] bd808: uh [17:00:59] really? [17:01:00] not for shell in general [17:01:04] I don't disagree, just wonder [17:01:07] I thought that was just for root [17:01:14] shell can see logs, logs have ips [17:01:18] and that was also partly because there was talk of getting ssl certs. [17:01:20] there's a lot of volunteers even having root there [17:01:22] bd808: do you have a source for that? [17:01:29] YuviPanda: greg-g [17:01:30] https://wikitech.wikimedia.org/wiki/Help:Getting_Started#Request_Shell_Access [17:01:48] <^d> +1 to NDA for beta being freaking stupid. [17:02:04] bd808: beta labs !== beta cluster [17:02:06] ottomata and ^d: so if you want to fix it two things should happen - 1. I should keep working on the feature that I'm writing now that should prevent this from happening and 2. one of you should use the allocation api to bounce a big shard from elastic1016 to another node and swap it with a small shard on that node. [17:02:32] ori: thanks. will fill that right away [17:02:53] ori: now I'm confused. WMFLabs !== beta labs, but beta labs == beta cluster, no? [17:02:57] or is beta cluster something new? [17:03:13] NDA is for deployment-prep project [17:03:24] I don't know what "beta labs" is [17:03:28] (03PS1) 10Alexandros Kosiaris: Fix a JSON syntax error introduced in 9a90eb7 [operations/debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/151112 [17:03:32] <^d> ^ that's stupid [17:03:36] <^d> NDA for that ^ [17:03:37] <^d> Stupid [17:03:38] <^d> Stupid [17:03:52] greg-g: if it is indeed policy to require NDA for betalabs shell, I think it should be documented with rationale somewhere [17:03:53] ottomata: another thing that'd be nice - we should reduce the merge threads to 1.
I've seen more of these spikes related to two merges [17:03:59] * YuviPanda doesn't see a reason for it either [17:04:05] I'm pretty sure that parameter isn't dynamic so we have to schedule it for the next release with puppet [17:04:14] it was supposed to be dynamic but it didn't work [17:04:19] beta has logs with private info (real ip addresses) [17:04:25] good thing we don't expose that on wikipedia anywhere [17:04:33] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Fix a JSON syntax error introduced in 9a90eb7 [operations/debs/etherpad-lite] - 10https://gerrit.wikimedia.org/r/151112 (owner: 10Alexandros Kosiaris) [17:04:47] though - we're not dropping searches on the ground - no pool queue fulls. [17:04:53] that node is just sad [17:05:16] ori: just saw this one line, not responding to rest of context, what does "beta labs" mean to you that "beta cluster" doesn't? [17:05:17] The rationale (and I'm just stating, not defending) is that *.beta.wmflabs.org is open to the "real internet" not behind an ip stripping proxy. This means that logs there are just as sensitive as in production. [17:05:29] ori: I ask because I'm worried about the confusion in the org/community on what is what [17:05:45] bd808: err, labs general proxy doesn't strip ip addresses [17:05:47] bd808: it sends XFF [17:06:05] bd808: and lots of projects have their own public IPs as well. Tools is the only one that strips IP [17:06:46] greg-g: sorry, i meant labs != beta cluster [17:06:51] A) the reasoning bd808 stated was the reasoning I was under the impression mattered [17:06:54] * bd808 would love to remove the password from logstash-beta.wmflabs.org and give everyone root in deployment-prep [17:06:54] ori: ah, right [17:07:03] B) if that's not valid, let's make something clear [17:07:21] C) if we still decide to stick with NDAs for *beta cluster* shell, yes, let's document [17:07:33] let's not stick to that [17:07:38] <^d> No, let's not. [17:07:39] <^d> That same argument (omg IPs) could be extended to almost anything in labs. If we're going to require NDA to be root on any labs project then we've just defeated the entire purpose of labs. [17:07:39] let's not stick to that [17:08:35] ori: https://wikitech.wikimedia.org/wiki/Shell_Request/01tonythomas [17:08:42] k, I feel slightly unqualified to make the final decision, is it ok if I confirm with one of robla or mark? [17:09:05] I mean, I should be qualified, but I want to make sure we're not missing something I don't know about, org-legally-whatever [17:09:10] Best to include louis in the conversation as well [17:09:11] tonythomas: 14:35, October 30, 2013 Coren (Talk | contribs | block) changed group membership for User:01tonythomas from (none) to shell [17:09:15] andrewbogott: touche [17:09:20] grr, legal [17:09:21] :) [17:09:25] <^d> Who will say omg ips. [17:09:29] <^d> And we're back to square one. [17:09:46] The root sudoers group in deployment-prep is named "under_NDA" so ... it was truth at some point. Maybe hashar has ancient wisdom. 
[17:09:49] well, I'd rather not have luis yell at me, lawyers are scary [17:09:51] !log aaron Synchronized php-1.24wmf16/includes: f1a8ff7f802b57cc9f452d47c4c762a185ed93c2 (duration: 00m 06s) [17:09:52] greg-g: I'm kind of with ^d -- yes, take it up with legal, but frame it in a way that won't provoke the reflexive ZOMG IP NUMBERS [17:09:55] manybubbles: sorry, makin lunch [17:09:58] Logged the message, Master [17:09:59] ori: agreed [17:10:01] ok, so, swap a big shard for a small shard [17:10:17] ottomata: doesn't sound like a fair trade [17:10:24] ^d or ottomata: yeah, that. it should help balance it. You'll have to do it a bunch of times..... [17:10:26] bd808: that's from https://bugzilla.wikimedia.org/show_bug.cgi?id=48501 [17:10:30] alright guys, it's now successfully in the management-o-sphere, you'll hear something soon [17:10:32] <^d> ori: It is when one node is hogging the big shards! [17:10:35] <^d> :) [17:10:39] ori: its a winning trade :) [17:10:51] a bunch of times? [17:10:52] bd808: which never happened. [17:11:07] YuviPanda: good point..... [17:11:27] YuviPanda: if we go with no NDA needed, would that make that not possible in the future if we decide we were to proceed? [17:11:33] There is another open bug about that. ssl doesn't work at all right now for beta [17:11:56] https://bugzilla.wikimedia.org/show_bug.cgi?id=48501#c65 was when they were removed. [17:11:58] !log aaron Synchronized php-1.24wmf15/includes: d218d86dff90a5f0110353c492bd2e8ddaf35497 (duration: 00m 08s) [17:12:04] Logged the message, Master [17:12:05] greg-g: only for roots, though. [17:12:13] PROBLEM - DPKG on zirconium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:12:14] <^d> ottomata: https://wikitech.wikimedia.org/wiki/Search#Unbalanced_numbers_of_shards.2FConstantly_moving_shards_around.2FOne_node_using_much_more_disk_then_the_others this is the section you're looking for. [17:12:25] greg-g: so you could grant people access, just not root. [17:12:26] greg-g: We could give shell without giving root level sudo. That's what the group I mentioned does now. [17:12:36] * greg-g nods [17:12:41] greg-g: you could just put them in different sudoers groups, and give them root *now*, and then perhaps take it away if we ever do ssl [17:12:47] kk [17:12:50] just making sure [17:13:00] my current reading of the bug doesn't make it appear as if we're going to do that anytime soon tho [17:13:09] also we could just not grant non-NDA folks root on the ssl terminators [17:13:11] (when we have them) [17:13:20] * greg-g nods [17:13:21] And you can do a lot in beta without root, just as deployers can do in prod [17:13:30] ja was reading that [17:13:42] , ok ^d, manybubbles, lemme eat my lunch, then let's do that together? [17:13:50] ok, so, JohnLewis, you just have shell on beta cluster not root right? [17:13:54] <^d> ottomata: Can do, sure. [17:14:08] <^d> I'm going to start working on my tool for this too. [17:14:11] <^d> Moving shards is annoying :) [17:14:13] RECOVERY - DPKG on zirconium is OK: All packages OK [17:14:21] greg-g: I've not tried but since I'm a member; I assume just shell. [17:14:26] * greg-g nods [17:15:34] greg-g: I'll try if you want :p [17:15:54] sure, you'll just let Santa know you're on the naughty list [17:16:04] !log upgraded etherpad-lite on zirconium to 1.4.0-2. Uploaded etherpad-lite_1.4.0-2 on apt.wikimedia.org [17:16:05] JohnLewis: please.
`sudo ls /` should be enough to tell [17:16:09] Logged the message, Master [17:16:21] I think you'll be asked for a sudoer password [17:16:27] I am :p [17:16:32] bd808 ori greg-g I think we required NDA for beta labs at the time we implemented SSL there [17:16:34] (asked for the password) [17:16:35] you're a root? [17:16:36] Cool. Then you don't have root :) [17:16:38] oh [17:16:48] bd808: Cool? I say that's bad ;) [17:16:57] makes our lives simpler :) [17:17:23] chrismcmahon: Yeah. Which apparently never really happened by https://bugzilla.wikimedia.org/show_bug.cgi?id=48501 [17:17:44] bd808: right, never really happened in a meaningful way at least [17:18:20] ok, things are starting to re-make sense [17:18:30] See also https://bugzilla.wikimedia.org/show_bug.cgi?id=68387 [17:18:47] Where we broke whatever ssl there was when we moved to eqiad [17:18:55] bd808: greg-g because the early origins of beta labs was a project petan started and we took over [17:19:16] oh really? didn't know that [17:19:39] <^d> I've still not figured out why we have to purchase SSL certs and can't make our own CA and do them ourselves. [17:19:46] greg-g: remind me to tell you that story sometime, thrills, chills, and spills [17:20:00] galore [17:20:17] (03PS1) 10Alexandros Kosiaris: Have etherpad log directly to file [operations/puppet] - 10https://gerrit.wikimedia.org/r/151115 [17:20:27] chrismcmahon: over a beer in London? :) [17:20:35] ^d: browser tests, mostly [17:20:40] is my understanding [17:20:45] ^d: We totally can do self signed but it will make browser testing harder if it uses ssl [17:20:59] auth stuff for Steipp from time to time also [17:21:19] Getting an automated browser test to use a self-signed cert is an adventure in pain [17:21:32] chrismcmahon: heh, yeah, will want to know the story as well :) [17:22:04] <^d> *sigh* [17:22:40] bd808: yeah, and also basically pointless, needless overhead [17:27:40] (03CR) 10Alexandros Kosiaris: [C: 032] Have etherpad log directly to file [operations/puppet] - 10https://gerrit.wikimedia.org/r/151115 (owner: 10Alexandros Kosiaris) [17:31:39] !log aaron Synchronized php-1.24wmf15/maintenance/populateBacklinkNamespace.php: e1cea29342f964cd9a720310185b09ca41eb1a4a (duration: 00m 04s) [17:31:45] Logged the message, Master [17:32:39] !log Restarted maintenance/populateBacklinkNamespace.php on enwiki [17:32:45] Logged the message, Master [17:33:43] PROBLEM - etherpad_lite_process_running on zirconium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js [17:38:33] ok, ^d [17:38:40] shall we? [17:38:51] i'm looking at large shards on 1016 (thanks manybubbles for the really nice commands) [17:39:05] does the number of shards matter? [17:39:14] there are 20 enwiki_general_1403910324 shards [17:39:29] which is relatively a lot [17:40:14] <^d> There's 20 total. [17:40:34] <^d> manybubbles: How do you list the shards on an individual node again? From _cluster/state? [17:40:59] curl -s localhost:9200/_stats?level=shards > /tmp/stats [17:41:00] jq '.indices | keys[] as $index | { [17:41:00] index: $index, [17:41:00] shards: ([.[$index].shards[]] | length), [17:41:01] average_size: ([.[$index].shards[][].store.size_in_bytes] | add / length / 1024 / 1024 / 1024) [17:41:03] } [17:41:05] | select(.average_size > 2)' /tmp/stats | jq -s 'sort_by(.average_size)' [17:41:14] sorry about that [17:41:19] I meant that to be a private message.....
and, also, I _thought_ I jammed that into the page of goodies [17:42:00] <^d> That's just for the one node? [17:42:04] <^d> I thought that was cluster wide. [17:42:07] ^d: I forget [17:42:09] manybubbles: you worry about paste flood in a channel where icinga-wm lives? ;) [17:42:27] Nemo_bis: icinga is a lot less bad than grrit-wm [17:43:37] <^d> curl -s localhost:9200/_nodes/elastic1016/stats?level=shards ? [17:43:44] its there manybubbles [17:43:44] that command [17:44:31] ok so, 20 enwiki shards on this machine, ^d, you are trying to find which other machines also have enwiki shards? [17:45:03] <^d> Trying to list location of all enwiki shards. [17:45:59] <^d> _cat/shards [17:46:02] <^d> That's what I want. [17:47:04] <^d> curl -s localhost:9200/_cat/shards | grep -P '^enwiki_' [17:48:03] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:48:25] <^d> Here's our 1016 offenders: http://p.defau.lt/?8gyJN1Ojt2NK0a0WkN8HZQ [17:49:34] <^d> ottomata: ^ There's your list of candidates for moving off [17:49:38] looking [17:50:26] ok 1016 has 4 enwiki generals [17:50:37] 1017 has 15? [17:50:44] 1017 i guess is still not part of the cluster? [17:50:47] or not eligible for searchers? [17:51:13] <^d> 17-18 are. [17:51:15] all others have 2 or 3 enwiki_general shards [17:51:17] oh ok [17:51:28] <^d> 19 isn't for SSD testing. [17:51:47] ok [17:52:00] why does 1017 have 15 shards on it? [17:52:59] <^d> It's not allocating fairly. [17:53:10] greg-g: Mind if I deploy https://gerrit.wikimedia.org/r/#/c/151120/ to fix some old users logging in on mw.o? [17:54:05] <^d> ottomata: One of my favorite commands. curl -s localhost:9200/_cat/allocation?v [17:54:55] <^d> elastic1016-17 have fewer shards but bigger ones. because bugs. [17:55:00] ah [17:55:05] ok... [17:55:09] csteipp: doit [17:55:48] <^d> ottomata: 1017 is handling it better because it's got newer/nicer cpus. [17:55:54] hm [17:56:00] but less ram, right? [17:56:12] <^d> No, ram is upgraded! [17:56:18] oh! awesome! [17:56:21] cool [17:56:24] so, ok, in your paste [17:56:31] the shards are of varying sizes [17:56:36] what's the first number? [17:56:40] after the shard name? [17:56:43] http://p.defau.lt/?8gyJN1Ojt2NK0a0WkN8HZQ [17:56:44] <^d> Shard # [17:56:49] ah k :p [17:56:55] <^d> index name, shard # [17:56:57] and the number after STARTED? [17:57:14] documents? [17:57:16] <^d> yep [17:57:18] k [17:57:23] glad I guessed the right term there :p [17:57:26] <^d> r means "replica" [17:57:31] ok [17:57:36] so, how do we decide which to move? [17:57:39] i suppose we should only move one, right? [17:57:43] as most of the others have 2 or 3 [17:57:45] and this has 4 [17:57:52] so, its only misbalanced by 1 [17:57:52] <^d> Yeah just one should be enough. [17:58:09] so um, how about the 15g one then? [17:58:14] its not 10, and its not 30 :) [17:58:18] <^d> Sounds good as any :) [17:58:22] in the middle :) [17:58:23] ok [17:58:31] so shard 0 [17:59:24] to 1018 maybe? [17:59:53] keep it on the same hardware? [17:59:57] hm, 1018 has 359 shards on it already [17:59:58] (03PS1) 10Andrew Bogott: Add a filter for Special:HideBanners on Meta. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151124 [17:59:59] Jeff_Green: Like this, approximately? ^ [18:00:12] hm, that's about what most have [18:00:13] I chose 'erbium' more or less at random [18:00:13] cept 1016 and 1017 [18:00:33] oh sorry, 1018 is the newer hardware [18:00:56] soooo, maybe that is good anyway? [18:00:58] ^d, thoughts?
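For readability, manybubbles's accidental multi-line paste further up, reassembled into the single pipeline it was meant to be; it averages per-index shard sizes (in GB) from a node-stats dump, keeps only indexes whose average shard is over 2 GB, and sorts ascending:

    curl -s localhost:9200/_stats?level=shards > /tmp/stats
    jq '.indices | keys[] as $index | {
          index: $index,
          shards: ([.[$index].shards[]] | length),
          average_size: ([.[$index].shards[][].store.size_in_bytes] | add / length / 1024 / 1024 / 1024)
        } | select(.average_size > 2)' /tmp/stats | jq -s 'sort_by(.average_size)'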
[18:01:13] <^d> 17 already has a lot. maybe 18? [18:01:45] yeah, that's what i'm saying [18:01:46] 1018 [18:01:49] either that or 1002 i think [18:01:52] <^d> Oh dur. [18:01:54] <^d> I misread. [18:01:58] <^d> Yeah 18 [18:02:00] k [18:02:32] ok, going to run this... [18:02:42] move enwiki_general_1403910324 shard 0 from 1016 to 1018 [18:02:43] ja? [18:02:48] andrewbogott: I can't load any graph on gdash, e.g. http://gdash.wikimedia.org/dashboards/frontend/ [18:03:46] <^d> ottomata: goforit :) [18:03:48] Nemo_bis: has that worked in the past? It's new enough that I've never looked at it. [18:04:12] andrewbogott: yes, it usually does :) [18:04:19] <^d> ottomata: And cluster's moving an itwiki shard to 1004 to compensate. Looks good :) [18:04:47] (03CR) 10Jgreen: [C: 031] Add a filter for Special:HideBanners on Meta. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151124 (owner: 10Andrew Bogott) [18:04:50] to 1004? [18:04:53] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 29 data above and 0 below the confidence bounds [18:04:53] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 29 data above and 0 below the confidence bounds [18:04:53] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.012 second response time [18:04:56] from 1018? [18:04:59] <^d> Yep [18:05:06] interesting [18:05:43] and now it works indeed, you're right icinga-wm [18:06:04] (03PS1) 10Aaron Schulz: Made RunJobs use the MW exception handler [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/151129 [18:06:25] (03CR) 10Andrew Bogott: [C: 032] Add a filter for Special:HideBanners on Meta. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151124 (owner: 10Andrew Bogott) [18:07:25] Nemo_bis: I can't ssh into that box, might be OOM [18:08:54] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Epic puppet fail [18:10:21] andrewbogott: ok thanks; is tweaking an HTML file on datasets something appropriate to poke you about? https://bugzilla.wikimedia.org/show_bug.cgi?id=44464 [18:10:31] (03PS1) 10Jgreen: add fundraising log rotation for hideBanners-sampled100.tsv [operations/puppet] - 10https://gerrit.wikimedia.org/r/151130 [18:10:44] you = the person on RT dity :p [18:10:48] duty [18:10:54] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [18:10:56] !log csteipp Synchronized php-1.24wmf16/extensions/CentralAuth: Fix for bug 69007 - logins failing for old style hashes (duration: 00m 06s) [18:11:00] (03CR) 10Andrew Bogott: [C: 031] "Thanks!" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151130 (owner: 10Jgreen) [18:11:03] Logged the message, Master [18:11:28] Nemo_bis: I didn't do anything to gdash, it seems to have recovered on its own. [18:12:35] Nemo_bis: I don't know about datasets, lemme see if I can figure out how to do that... [18:13:12] Nemo_bis: do you know anything about that html file? Like, where is it, is it puppetized, etc? [18:13:37] (03PS1) 10BryanDavis: beta: Clone mediawiki/vendor instead of mediawiki/core/vendor [operations/puppet] - 10https://gerrit.wikimedia.org/r/151131 (https://bugzilla.wikimedia.org/68485) [18:13:58] andrewbogott: AFAIK it's not puppetized, or I'd submit a patch [18:14:25] ori: you might want to keep an eye on beta labs for the next little while. We're running simultaneous browser test builds and I think I already see some slowness.
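The move ottomata runs here corresponds to the standard Elasticsearch cluster reroute API; a sketch of the call, with the index, shard number and node names taken from the conversation (the exact invocation may have differed):

    curl -s -XPOST localhost:9200/_cluster/reroute -d '{
      "commands": [
        { "move": { "index": "enwiki_general_1403910324", "shard": 0,
                    "from_node": "elastic1016", "to_node": "elastic1018" } }
      ]
    }'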
[18:15:29] andrewbogott: found something, operations/puppet/modules/dataset/files/pagecounts/generate-pagecount-main-index.sh [18:15:33] (03CR) 10BryanDavis: "Cherry-picked to deployment-salt." [operations/puppet] - 10https://gerrit.wikimedia.org/r/151131 (https://bugzilla.wikimedia.org/68485) (owner: 10BryanDavis) [18:15:46] * Nemo_bis might have missed it before because it's .sh [18:15:46] Nemo_bis: ok, sorry, I can't understand that bug. It says "Please replace the link with https://archive.org/search.php?query=wikipedia_visitor_stats" please replace /what/ link with that? And then there are two other links right after that, which I can't… tell what are. [18:16:33] (03CR) 10Jgreen: [C: 032 V: 031] add fundraising log rotation for hideBanners-sampled100.tsv [operations/puppet] - 10https://gerrit.wikimedia.org/r/151130 (owner: 10Jgreen) [18:18:02] andrewbogott: thanks for pointing it out, I now clarified: That is, replace the archive.org link in "are also ato be replaced is the one vailable at the Internet Archive" with https://archive.org/search.php?query=wikipedia_visitor_stats [18:18:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [18:23:29] Nemo_bis: sorry, I'm still not following. I need more context for this, is the thing that you pasted in quote marks from the .sh file you mentioned? [18:25:00] oh, wait, maybe I found it... [18:25:02] or something like it [18:25:58] (03PS1) 10Mwalker: Let OCG do file cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/151133 [18:26:42] Jeff_Green, ^ at your leisure [18:27:15] (03PS1) 10Andrew Bogott: Change to a more stable stats link at archive.org. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151134 [18:27:17] Nemo_bis: is this what you're after? ^ [18:29:28] (03CR) 10Nemo bis: [C: 031] Change to a more stable stats link at archive.org. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151134 (owner: 10Andrew Bogott) [18:29:32] andrewbogott: yes :) [18:29:42] (03PS2) 10Nemo bis: Change to a more stable stats link at archive.org. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151134 (https://bugzilla.wikimedia.org/44464) (owner: 10Andrew Bogott) [18:31:44] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 3 below the confidence bounds [18:34:39] (03CR) 10Jgreen: [C: 032 V: 031] Let OCG do file cleanup [operations/puppet] - 10https://gerrit.wikimedia.org/r/151133 (owner: 10Mwalker) [18:35:20] mwalker: deployed [18:35:37] thankee [18:35:41] np [18:36:06] as a note; your nagios alerts, if they go off, the garbage collector is the first thing that should be suspect [18:39:14] (03PS1) 10Aaron Schulz: Set a 300M memory limit for fcgi job runners [operations/puppet] - 10https://gerrit.wikimedia.org/r/151138 [18:47:31] (03PS1) 10Aaron Schulz: Set 300M memory_limit default for HHVM fcgi servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/151139 [18:49:01] aude: any chance the "Uncaught exception 'BadMethodCallException' with message 'Call to a member function getGuid() on a non-object (NULL)" could be fixed in prod branches? [18:49:14] (03PS6) 10Ori.livneh: mediawiki: use HHVM module on trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/150873 [18:52:48] (03CR) 10Ori.livneh: [C: 032] "Verified on beta. Made apache service restarts call /bin/true to make this easier to verify; will revert that in another commit." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/150873 (owner: 10Ori.livneh) [19:03:44] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [19:11:30] (03CR) 10Andrew Bogott: [C: 032] Change to a more stable stats link at archive.org. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151134 (https://bugzilla.wikimedia.org/44464) (owner: 10Andrew Bogott) [19:17:46] Nemo_bis: I deployed that change on dataset1001, but i can't tell if puppet actually runs that script. Did the link update? [19:20:18] (03PS1) 10Ori.livneh: Tidy up after I4aa104920, dropping declaration of absented resources [operations/puppet] - 10https://gerrit.wikimedia.org/r/151150 [19:27:44] manybubbles: new ssds are installed on elastic1019...needs OS again [19:28:02] cmjohnson: cool! ottomata - I imagine that is you? [19:28:07] (03PS2) 10Ori.livneh: Tidy up after I4aa104920, dropping declaration of absented resources [operations/puppet] - 10https://gerrit.wikimedia.org/r/151150 [19:28:42] i will do base install but will wait on puppet/salt for ottomata [19:30:15] manybubbles: elastic1019 has been removed from shards right? [19:30:43] <^d> Yes. [19:30:45] cmjohnson: yes but backwards - all shards have been removed from elastic1019 [19:31:13] PROBLEM - Host elastic1019 is DOWN: PING CRITICAL - Packet loss = 100% [19:31:31] sorry about icinga noise [19:31:35] <^d> tango down! [19:32:32] I can add puppet certs back then and will leave it at that state. Then you will have ssh [19:32:43] (03CR) 10Ori.livneh: [C: 032 V: 032] Tidy up after I4aa104920, dropping declaration of absented resources [operations/puppet] - 10https://gerrit.wikimedia.org/r/151150 (owner: 10Ori.livneh) [19:33:06] oh awesome [19:33:09] so, what's up with 1019 [19:33:14] its getting new SSDs? [19:33:22] new ssds to test [19:33:33] ah ok [19:33:34] cool [19:33:34] ok [19:33:40] ja i can do the rest [19:33:41] thanks [19:36:21] (03PS1) 10Ottomata: Update for kafka 0.8.1.1-2 packaging [operations/puppet/kafka] - 10https://gerrit.wikimedia.org/r/151193 [19:39:31] great...carbon's not serving up image [19:39:34] (03CR) 10Ottomata: [C: 032 V: 032] chown /var/spool/kafka/ in kafka-server.postinst [operations/debs/kafka] - 10https://gerrit.wikimedia.org/r/151107 (owner: 10Ottomata) [19:40:02] wohooo [19:42:04] (03PS1) 10Ottomata: kafka-server.postinst needs to chown /var/spool/kafka [operations/debs/kafka] - 10https://gerrit.wikimedia.org/r/151196 [20:00:43] PROBLEM - Disk space on vanadium is CRITICAL: DISK CRITICAL - free space: / 4271 MB (3% inode=94%): [20:03:46] (03PS2) 10Ottomata: kafka-server.postinst needs to chown /var/spool/kafka [operations/debs/kafka] - 10https://gerrit.wikimedia.org/r/151196 [20:12:23] ottomata: do you know about eventlogging? [20:12:43] Are we intentionally keeping logs for 60 days? [20:12:53] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [20:14:18] andrewbogott: I think EventLogging is under analytics umbrella nowadays [20:15:08] andrewbogott: I don't know if we have a process in place to purge them yet, actually [20:15:14] andrewbogott: qchris would also know [20:15:47] andrewbogott: i do not know, qchris and milimetric and nuria are your best bets [20:16:07] I think milimetric and nuria are on vacation [20:16:16] andrewbogott: We keep them even for longer on stats1002+stats1003 :-( [20:16:32] qchris: I'm looking at the logrotate rule, it says 90 days. 
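A sketch of the kind of logrotate stanza under discussion; the file path and option set are assumptions for illustration, not the actual puppet-managed rule on vanadium:

    # /etc/logrotate.d/eventlogging (hypothetical)
    /var/log/eventlogging/*.log {
        daily
        rotate 90    # the window about to be dropped to 45
        compress
        missingok
        notifempty
    }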
[20:16:42] But we're at ~58 now, and vanadium's HD is full
[20:16:52] Oh. :-)
[20:17:10] So, I would like to change that to 45… but I don't want to erase history in important ways...
[20:17:13] We're rsyncing to other hosts, so it would be ok to shorten that window a bit.
[20:17:21] I'm not sure if that same rule applies on stats1002 and stats1003, lemme check.
[20:17:24] Let me double-check on stat1002+stat1003.
[20:17:27] Unless you know that it doesn't?
[20:17:30] (03CR) 10Ottomata: [C: 032 V: 032] kafka-server.postinst needs to chown /var/spool/kafka [operations/debs/kafka] - 10https://gerrit.wikimedia.org/r/151196 (owner: 10Ottomata)
[20:18:13] andrewbogott: they still have all the old files.
[20:18:24] PROBLEM - SSH on db1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:18:24] So it's ok to shorten the window on vanadium
[20:18:33] (03PS1) 10Andrew Bogott: Save logs on vanadium for 45 days, no 90. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151204
[20:18:34] qchris: ^
[20:18:44] PROBLEM - check configured eth on db1062 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:18:49] oops, lemme fix that typo
[20:19:00] andrewbogott: vanadium has plenty of space, btw, just not on /
[20:19:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[20:19:03] PROBLEM - MySQL InnoDB on db1062 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:19:03] PROBLEM - MySQL Slave Running on db1062 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:19:03] PROBLEM - MySQL Replication Heartbeat on db1062 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:19:03] PROBLEM - RAID on db1062 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:19:04] PROBLEM - puppet last run on db1062 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:19:19] andrewbogott: /dev/md1 111G 101G 4.1G 97% /
[20:19:23] RECOVERY - SSH on db1062 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0)
[20:19:24] /dev/mapper/vg0-lv0 352G 97G 256G 28% /srv
[20:19:34] ori: I don't think it has room for 90 days even on /srv
[20:19:43] RECOVERY - check configured eth on db1062 is OK: NRPE: Unable to read output
[20:19:51] andrewbogott: oh, I haven't looked at usage lately. that may be.
[20:19:53] RECOVERY - MySQL InnoDB on db1062 is OK: OK longest blocking idle transaction sleeps for 0 seconds
[20:19:54] RECOVERY - MySQL Slave Running on db1062 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[20:19:54] RECOVERY - MySQL Replication Heartbeat on db1062 is OK: OK replication delay 29 seconds
[20:19:54] RECOVERY - RAID on db1062 is OK: OK: optimal, 1 logical, 2 physical
[20:19:54] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 561 seconds ago with 0 failures
[20:19:59] (03CR) 10QChris: [C: 031] "LGTM." [operations/puppet] - 10https://gerrit.wikimedia.org/r/151204 (owner: 10Andrew Bogott)
[20:20:09] (03PS2) 10Andrew Bogott: Save logs on vanadium for 45 days, not 90. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151204
[20:20:55] ori: but I agree that dumping these enormous logs onto / is a bit wrong. qchris, do you know who set this up originally?
[20:21:11] andrewbogott: no clue.
[20:21:20] andrewbogott: it was set up by notpeter for solr
[20:21:24] and then repurposed for eventlogging
[20:21:34] Ah, ok.
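The retention change in Gerrit 151204 above comes down to a single number in a logrotate rule. A minimal sketch of what such a rule could look like when shipped from Puppet; the file name, log path, and surrounding directives are assumptions, not the actual manifest:

    # Hypothetical sketch: daily rotation keeping 45 days of history
    # instead of 90, shipped as a plain logrotate drop-in file.
    file { '/etc/logrotate.d/eventlogging':
        ensure  => present,
        mode    => '0444',
        content => "/var/log/eventlogging/*.log {\n    daily\n    rotate 45\n    compress\n    missingok\n    notifempty\n}\n",
    }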
[20:21:40] and then solr got decommissioned and some aspects of the setup were never revisited
[20:21:47] (03CR) 10QChris: [C: 031] "Bringing over vote from PS1." [operations/puppet] - 10https://gerrit.wikimedia.org/r/151204 (owner: 10Andrew Bogott)
[20:21:50] (03CR) 10Andrew Bogott: [C: 032] Save logs on vanadium for 45 days, not 90. [operations/puppet] - 10https://gerrit.wikimedia.org/r/151204 (owner: 10Andrew Bogott)
[20:22:05] +1 to moving the logs to /srv (mounting that disk at /var/log)
[20:22:10] *or mounting, rather
[20:23:26] (03CR) 10Ori.livneh: [C: 032] Set 300M memory_limit default for HHVM fcgi servers [operations/puppet] - 10https://gerrit.wikimedia.org/r/151139 (owner: 10Aaron Schulz)
[20:24:36] ori: moving now seems hard without leaving a gap in the logs
[20:25:15] !log shortened the logrotate interval on vanadium; disk space critical should resolve soon
[20:25:21] Logged the message, Master
[20:25:24] I don't manage that setup anymore, so it's not my call anyhow
[20:25:38] ori: :-)
[20:26:15] * ori waves hello
[20:26:21] andrewbogott: moving files or changing mount points should be fine
[20:26:33] people only care about the files on stat1002+stat1003
[20:26:46] Seems like it could cause a gap there too...
[20:26:53] As long as we do not interfere with the files there, everything is fine.
[20:26:56] well, I guess if we just move it and start with brand new empty logs
[20:27:36] We only rsync the new files over, but do not kill the ones on stat1002+stat1003 that got logrotated on vanadium.
[20:27:48] qchris: ok, so, we have half a log file (maybe or maybe not rsynced to stat1002) in /. We remount /srv to /var/log, at which point it starts creating brand new log files there.
[20:28:12] So, later on, the new file on /srv will get log-rotated. But the half-a-file that was there before, still on /? It never gets rsynced; that data is forgotten.
[20:28:48] Since that happens once only, we could just stitch this file together by hand. Put it in place on vanadium. Dane.
[20:28:53] s/Dane/Done/
[20:29:16] OK -- as long as you don't mind doing the cleanup work afterwards :)
[20:29:29] I do not have access to vanadium :-/
[20:29:48] Yeah, I can grab the files for you.
[20:30:05] But it's a bit late in the day/week/wikimania season to start this now. So I will file an RT ticket.
[20:30:14] Ok. Cool.
[20:30:35] Thanks.
[20:54:23] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 316 seconds
[20:54:33] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 325 seconds
[20:55:03] PROBLEM - MySQL Replication Heartbeat on db69 is CRITICAL: CRIT replication delay 324 seconds
[21:02:23] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 324 seconds
[21:02:33] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 332 seconds
[21:05:23] PROBLEM - MySQL Replication Heartbeat on db1018 is CRITICAL: CRIT replication delay 312 seconds
[21:05:33] PROBLEM - MySQL Slave Delay on db1018 is CRITICAL: CRIT replication delay 320 seconds
[21:06:03] PROBLEM - MySQL Replication Heartbeat on db69 is CRITICAL: CRIT replication delay 349 seconds
[21:09:35] (03CR) 10Chad: "This thing has way too many dependencies." [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/151108 (owner: 10Chad)
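ori's remount idea above (put the large /srv volume under the log path so log growth stops filling /) could be expressed in Puppet roughly as below. The device name is taken from the df output earlier in the log; the mount point and filesystem type are assumptions:

    # Hypothetical sketch: mount the big LVM volume where the logs are
    # written, so the 111G root disk no longer absorbs log growth.
    mount { '/var/log/eventlogging':
        ensure => mounted,
        device => '/dev/mapper/vg0-lv0',
        fstype => 'ext4',
        atboot => true,
    }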
[21:18:03] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 19:17:00 UTC
[21:38:03] PROBLEM - Puppet freshness on db1009 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 19:37:07 UTC
[21:41:27] (03PS3) 10Ottomata: [WIP] Create icinga alerts if hourly HDFS webrequest imports have missing or duplicate data [operations/puppet] - 10https://gerrit.wikimedia.org/r/151095
[21:42:21] (03CR) 10Ottomata: [WIP] Create icinga alerts if hourly HDFS webrequest imports have missing or duplicate data (0310 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/151095 (owner: 10Ottomata)
[21:43:30] (03PS4) 10Ottomata: [WIP] Create icinga alerts if hourly HDFS webrequest imports have missing or duplicate data [operations/puppet] - 10https://gerrit.wikimedia.org/r/151095
[21:48:51] (03PS5) 10Ottomata: [WIP] Create icinga alerts if hourly HDFS webrequest imports have missing or duplicate data [operations/puppet] - 10https://gerrit.wikimedia.org/r/151095
[21:52:01] (03Abandoned) 10Aaron Schulz: Set a 300M memory limit for fcgi job runners [operations/puppet] - 10https://gerrit.wikimedia.org/r/151138 (owner: 10Aaron Schulz)
[22:12:19] (03CR) 10QChris: "To keep archives happy:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151204 (owner: 10Andrew Bogott)
[22:14:03] PROBLEM - MySQL Slave Delay on db74 is CRITICAL: CRIT replication delay 309 seconds
[22:14:05] PROBLEM - MySQL Replication Heartbeat on db74 is CRITICAL: CRIT replication delay 308 seconds
[22:20:03] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC
[22:25:06] (03PS1) 10Aaron Schulz: Bumped SlowQueryThreshold since the log is spammy and untruncated [operations/puppet] - 10https://gerrit.wikimedia.org/r/151229
[22:32:35] (03PS1) 10Aaron Schulz: Bumped SlowQueryThreshold since the log is spammy and untruncated [operations/puppet] - 10https://gerrit.wikimedia.org/r/151233
[22:32:56] (03Abandoned) 10Aaron Schulz: Bumped SlowQueryThreshold since the log is spammy and untruncated [operations/puppet] - 10https://gerrit.wikimedia.org/r/151229 (owner: 10Aaron Schulz)
[22:40:03] PROBLEM - MySQL Slave Delay on db69 is CRITICAL: CRIT replication delay 2766 seconds
[22:41:03] RECOVERY - MySQL Slave Delay on db69 is OK: OK replication delay 0 seconds
[23:17:33] RECOVERY - Puppet freshness on db1009 is OK: puppet ran at Fri Aug 1 23:17:26 UTC 2014
[23:19:03] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Last successful Puppet run was Fri 01 Aug 2014 19:17:00 UTC
[23:48:43] (03CR) 1020after4: [C: 031] "this looks really useful" [operations/puppet] - 10https://gerrit.wikimedia.org/r/151099 (https://bugzilla.wikimedia.org/66683) (owner: 10BryanDavis)
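On the SlowQueryThreshold patches above (Gerrit 151229/151233): assuming this refers to HHVM's MySQL slow-query logging threshold, raising it could look like the sketch below. Both the ini key's placement and the value are assumptions; the actual patch content is not visible in this log.

    # Hypothetical sketch: raise the slow-query log threshold (in
    # milliseconds) so the log is less spammy. 10000 is illustrative only.
    file { '/etc/hhvm/fcgi.ini':
        content => "hhvm.mysql.slow_query_threshold = 10000\n",
    }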