[00:00:51] PROBLEM - SSH on db29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:01:54] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [00:03:31] !log db29 os installed, handed off to peter [00:03:33] Logged the message, RobH [00:03:33] RECOVERY - SSH on sq48 is OK: SSH OK - OpenSSH_5.3p1 Debian-3ubuntu7 (protocol 2.0) [00:04:18] RECOVERY - SSH on db29 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [00:07:12] New patchset: RobH; "db61 in sandbox" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48976 [00:07:27] New review: Pyoungmeister; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48862 [00:07:54] PROBLEM - Puppet freshness on tin is CRITICAL: Puppet has not run in the last 10 hours [00:08:31] New review: RobH; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48976 [00:08:40] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48976 [00:09:28] New patchset: Pyoungmeister; "reviving db29 for use by pgehres" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48862 [00:09:33] RECOVERY - Frontend Squid HTTP on sq48 is OK: HTTP OK HTTP/1.0 200 OK - 631 bytes in 0.007 seconds [00:09:57] New review: Pyoungmeister; "Patch Set 3: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48862 [00:10:07] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48862 [00:13:14] New patchset: RobH; "reclaiming copper, magnesium, and zinc for use elsewhere" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48977 [00:15:52] New patchset: RobH; "reclaiming copper, magnesium, and zinc for use elsewhere" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48977 [00:16:39] New review: RobH; "Patch Set 2: Verified+2 Code-Review+2" 
[operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48977 [00:16:51] Change merged: RobH; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48977 [00:17:03] RECOVERY - swift-account-reaper on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [00:17:30] paravoid: ok, i just removed copper, magnesium, and zinc from site.pp and added to decom, so reclaimed =] [00:17:48] great, thanks [00:18:22] srv266, srv278? [00:18:31] and virt1004? [00:18:42] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [00:18:44] virt1004 is in decom so it should come out of nagios [00:18:49] it'll come back later [00:18:51] cool, great [00:19:08] the srvs i have not gotten to yet but will, i wanna make sure i have a long history in each ticket so no one complains when i kill them [00:19:28] I don't think anyone will complain [00:19:36] they've been acked by mark too [00:24:21] New patchset: Hoo man; "Fix CORS for wikidata-test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48979 [00:25:27] PROBLEM - Frontend Squid HTTP on sq48 is CRITICAL: Connection refused [00:28:45] PROBLEM - NTP on db29 is CRITICAL: NTP CRITICAL: Offset unknown [00:30:14] New patchset: Faidon; "Adjust knsq24 for its broken disk" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48980 [00:30:50] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48980 [00:30:59] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48980 [00:32:12] RECOVERY - NTP on db29 is OK: NTP OK: Offset -0.007966518402 secs [00:35:52] New patchset: Hoo man; "Fix CORS for wikidata-test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48979 [00:37:17] New patchset: Aaron Schulz; "Modified jobs-loop script to keep a fuller 
pipeline." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48981 [00:37:44] New review: Aaron Schulz; "Patch Set 1: Code-Review-1" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/48981 [00:38:44] New patchset: Hoo man; "Fix CORS for wikidata testing instances" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48979 [00:46:51] New patchset: Faidon; "Fix silly typo with previous commit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48982 [00:47:20] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48982 [00:47:28] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48982 [00:47:53] New patchset: Ryan Lane; "Add salt reactors for bootstrapping puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48983 [00:48:06] RECOVERY - Frontend Squid HTTP on sq48 is OK: HTTP OK HTTP/1.0 200 OK - 631 bytes in 0.007 seconds [00:52:28] http://test.m.wikipedia.org/ appears to be returning 503s [00:54:10] New patchset: Ryan Lane; "Add salt reactors for bootstrapping puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48983 [00:57:13] New patchset: Ryan Lane; "Add salt reactors for bootstrapping puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48983 [00:59:39] PROBLEM - Frontend Squid HTTP on knsq24 is CRITICAL: Connection refused [01:00:15] PROBLEM - Frontend Squid HTTP on knsq26 is CRITICAL: Connection refused [01:00:18] New review: Andrew Bogott; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48941 [01:00:30] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48941 [01:00:51] PROBLEM - Frontend Squid HTTP on knsq20 is CRITICAL: Connection refused [01:01:27] PROBLEM - Backend Squid HTTP on knsq26 is CRITICAL: Connection 
refused [01:01:54] PROBLEM - Frontend Squid HTTP on knsq16 is CRITICAL: Connection refused [01:02:30] PROBLEM - Backend Squid HTTP on knsq20 is CRITICAL: Connection refused [01:02:48] PROBLEM - Frontend Squid HTTP on knsq22 is CRITICAL: Connection refused [01:02:57] PROBLEM - Backend Squid HTTP on knsq16 is CRITICAL: Connection refused [01:03:15] PROBLEM - Backend Squid HTTP on knsq22 is CRITICAL: Connection refused [01:03:51] PROBLEM - Frontend Squid HTTP on knsq27 is CRITICAL: Connection refused [01:04:13] that's me, fixing [01:05:30] PROBLEM - Backend Squid HTTP on knsq27 is CRITICAL: Connection refused [01:05:37] dammit [01:08:02] New patchset: Ryan Lane; "Add salt reactors for bootstrapping puppet" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48983 [01:11:30] RECOVERY - Frontend Squid HTTP on knsq22 is OK: HTTP OK HTTP/1.0 200 OK - 821 bytes in 0.236 seconds [01:11:30] RECOVERY - Backend Squid HTTP on knsq20 is OK: HTTP OK HTTP/1.0 200 OK - 661 bytes in 0.358 seconds [01:11:39] RECOVERY - Backend Squid HTTP on knsq16 is OK: HTTP OK HTTP/1.0 200 OK - 661 bytes in 0.469 seconds [01:11:48] RECOVERY - Frontend Squid HTTP on knsq24 is OK: HTTP OK HTTP/1.0 200 OK - 1573 bytes in 0.525 seconds [01:11:57] RECOVERY - Backend Squid HTTP on knsq26 is OK: HTTP OK HTTP/1.0 200 OK - 1419 bytes in 0.237 seconds [01:11:58] RECOVERY - Backend Squid HTTP on knsq22 is OK: HTTP OK HTTP/1.0 200 OK - 659 bytes in 0.238 seconds [01:12:15] RECOVERY - Frontend Squid HTTP on knsq16 is OK: HTTP OK HTTP/1.0 200 OK - 819 bytes in 0.234 seconds [01:12:33] RECOVERY - Backend Squid HTTP on knsq27 is OK: HTTP OK HTTP/1.0 200 OK - 1419 bytes in 0.235 seconds [01:12:33] RECOVERY - Frontend Squid HTTP on knsq27 is OK: HTTP OK HTTP/1.0 200 OK - 1580 bytes in 0.237 seconds [01:12:45] RECOVERY - Frontend Squid HTTP on knsq26 is OK: HTTP OK HTTP/1.0 200 OK - 1580 bytes in 0.235 seconds [01:13:00] RECOVERY - Frontend Squid HTTP on knsq20 is OK: HTTP OK HTTP/1.0 200 OK - 819 bytes 
in 0.238 seconds [01:13:11] !log removing mysql & apache from emery(!) [01:13:14] Logged the message, Master [01:15:53] New review: Ryan Lane; "Patch Set 4: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48983 [01:16:02] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48983 [01:17:11] !log updated OpenStackManager to git head on virt0. Previous state is preserved (temporarily) in the local 'premerge' branch. [01:17:12] Logged the message, Master [01:21:07] New patchset: Ryan Lane; "Firewall off ldap replication ports" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48988 [01:22:19] New patchset: Ryan Lane; "Fix salt_reactor argument" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48989 [01:22:39] andrewbogott: was there something unmerged? [01:22:44] I didn't think it had any live hacks [01:22:55] New review: Ryan Lane; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48988 [01:23:04] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48988 [01:23:14] New review: Ryan Lane; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48989 [01:23:21] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48989 [01:23:38] Ryan_Lane: The patch that merged netadmin and sysadmin was unmerged but the changes were there locally [01:23:54] And also the patch that created the default sudo policy wasn't merged for some reason [01:24:07] (the latter was what I was looking for) [01:24:34] andrewbogott: ah ok [01:27:55] New patchset: Ryan Lane; "Fix reference to moved directory for salt reactors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48990 [01:28:28] New review: Ryan Lane; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - 
https://gerrit.wikimedia.org/r/48990 [01:28:37] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48990 [01:42:43] New patchset: Tim Starling; "add cgroup to limit memory of sub processes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40784 [01:42:49] New review: Tim Starling; "Patch Set 4: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/40784 [01:42:58] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/40784 [01:43:13] New patchset: Ryan Lane; "Fix typo for ldap_replication service" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48991 [01:43:30] New review: Ryan Lane; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48991 [01:43:37] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48991 [01:43:54] TimStarling: merging in your cgroup stuff on sockpuppet [01:44:08] ok [01:46:13] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [01:48:02] New patchset: Ryan Lane; "Sort array entries for reactors" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48992 [01:48:33] New review: Ryan Lane; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48992 [01:48:41] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48992 [02:03:41] New patchset: Tim Starling; "Fix MW cgroup installation" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48993 [02:04:58] New review: Tim Starling; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48993 [02:05:06] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48993 [02:10:52] New 
review: Ryan Lane; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/47097 [02:11:01] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47097 [02:22:05] New patchset: Ryan Lane; "Set server explicitly for first puppet run" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48996 [02:23:39] New patchset: Ryan Lane; "Set server explicitly for first puppet run" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48996 [02:29:06] !log LocalisationUpdate completed (1.21wmf9) at Thu Feb 14 02:29:05 UTC 2013 [02:29:09] Logged the message, Master [02:30:29] RECOVERY - Puppet freshness on tin is OK: puppet ran at Thu Feb 14 02:30:00 UTC 2013 [02:32:56] drdee: locke's / is 100% full [02:34:09] New patchset: Catrope; "Add a symlink to the node_modules dir in the Parsoid config dir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48998 [02:35:17] Ryan_Lane: --^^ [02:36:46] New review: Ryan Lane; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48998 [02:36:55] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48998 [02:37:21] Yay thanks [02:37:27] yw [02:42:13] Hmm so that needs puppet runs on all of the minions then? [02:42:28] no [02:42:32] Just on tin? 
[02:42:38] just on sockpuppet [02:46:59] RECOVERY - Puppet freshness on locke is OK: puppet ran at Thu Feb 14 02:46:24 UTC 2013 [02:48:02] New patchset: Catrope; "LVS for Parsoid Varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38457 [02:48:19] New review: Catrope; "Patch Set 3:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38457 [02:49:16] New patchset: Catrope; "LVS for Parsoid Varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38457 [02:50:25] New review: Catrope; "Patch Set 4:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38457 [03:06:17] New patchset: Faidon; "mediawiki cgroup: cleanup the manifest" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48999 [03:07:23] I'll merge it later or tomorrow [03:09:32] what was wrong with the way I did it? I was just reading the puppet docs and couldn't work it out [03:11:54] the manual seems pretty explicit [03:12:08] class wordpress { [03:12:08] require apache [03:12:08] require mysql [03:12:08] ... [03:12:08] } [03:12:14] "The above example would cause every resource in the apache and mysql classes to be applied before any of the resources in the wordpress class. " [03:12:40] so you would think that there would be no way the service class could be applied before the conf file was created [04:01:44] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [04:04:33] New patchset: Tim Starling; "Use a cgroup for command execution" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49000 [04:05:18] New patchset: Aaron Schulz; "Modified jobs-loop script to keep a fuller pipeline." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/48981 [04:17:48] New review: Aaron Schulz; "Patch Set 2:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48981 [04:19:02] ugh [04:19:11] * Aaron|home fucking hates files/ssl/*.wikimedia.org.crt [04:25:20] https://bugzilla.wikimedia.org/44982 [04:25:31] ^ TimStarling Aaron|home [04:25:57] if anyone has any idea how to mitigate that issue, comments on the bug or such would be appreciated [04:27:03] it's one of the reasons we are developing lua [04:27:13] which is going to be deployed on enwiki early next week [04:27:34] good! [04:27:43] in the meantime, can we raise the time out? [04:28:01] it's impossible to save and have the page reload w/o error [04:28:11] you know this has been a problem for years, right? [04:28:14] yup, tried again [04:28:21] and 3 days before the fix, you want to raise the timeout? [04:28:22] yes but seems especially bad [04:28:42] the timeout is there to protect the servers [04:28:43] well, i don't know how magically lua will fix [04:28:51] i'm sure it will help [04:29:10] another 10 seconds or something might help [04:29:13] or 20 seconds [04:29:28] Request: POST http://en.wikipedia.org/w/index.php?title=New_York_City&action=submit, from 84.175.74.62 via amssq40.esams.wikimedia.org (squid/2.7.STABLE9) to 91.198.174.42 (91.198.174.42) [04:29:32] Error: ERR_READ_TIMEOUT, errno [No Error] at Thu, 14 Feb 2013 04:28:00 GMT [04:29:35] again [04:29:50] the fix we normally recommend is to modify the article to make it take less long to render [04:30:01] true [04:30:38] * aude tries different article [04:31:29] RECOVERY - MySQL disk space on neon is OK: DISK OK [04:31:51] TimStarling: why must we have that one file with a * in the name? [04:31:56] * Aaron|home kicks ntfs [04:32:19] you can change it can't you? 
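Note on the asterisk-named file discussed above (files/ssl/*.wikimedia.org.crt): a minimal Python sketch, not taken from the log, of why an unescaped * in a filename is awkward for tooling. It assumes a POSIX filesystem, since the name cannot be created on NTFS. The helper name `demo` and the two sibling filenames are hypothetical and exist only for illustration. Passed unescaped to a glob, the name acts as a pattern and matches every sibling certificate; `glob.escape` makes it match only the literal file.

```python
import glob
import os
import tempfile


def demo():
    """Return (unescaped_matches, escaped_matches) for a file literally named '*.wikimedia.org.crt'."""
    star_name = "*.wikimedia.org.crt"
    # Hypothetical sibling files, for illustration only.
    names = [star_name, "star.wikimedia.org.crt", "test.wikimedia.org.crt"]
    with tempfile.TemporaryDirectory() as tmp:
        for name in names:
            open(os.path.join(tmp, name), "w").close()
        # Escape the directory part so only the filename is ever treated as a pattern.
        base = glob.escape(tmp)
        # Unescaped: the leading '*' is a wildcard and matches every sibling too.
        unescaped = sorted(
            os.path.basename(p) for p in glob.glob(os.path.join(base, star_name))
        )
        # Escaped: the '*' is matched literally, so only the one file is found.
        escaped = sorted(
            os.path.basename(p)
            for p in glob.glob(os.path.join(base, glob.escape(star_name)))
        )
    return unescaped, escaped


unescaped, escaped = demo()
print(unescaped)
print(escaped)
```

The shell equivalent is to quote or backslash-escape the name whenever it is passed as an argument, so the shell does not expand it as a pattern.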
[04:32:45] it's not just windows that is broken by filenames with asterisks in them [04:32:58] shell scripts can be made to do interesting things with such filenames [04:34:04] I'll have to wait till tomorrow when I'm on my linux box [04:35:09] * aude waits to see if i can save [[Barack Obama]] :) [04:35:20] New review: Aaron Schulz; "Patch Set 2: Code-Review-1" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/48981 [04:35:42] aude: that article is actually quicker than it used to be [04:36:01] it did save :) [04:36:05] slow-parse.log shows it takes 35-40s now [04:36:13] in december, it was 40-45 [04:36:28] ah [04:36:43] september shows it occasionally taking over 60 [04:36:58] TimStarling: I recall ryan saying that file was needed for something [04:37:09] for some hack, I can't recall [04:37:18] hmmm [04:37:36] i'll see what i can do to remove templates from [[New York City]] [04:37:44] and poke around more at other pages [04:38:04] * aude knows it's been an issue for years but seems especially bad recently [04:38:09] I measure 22 seconds with the removed [04:38:28] which may mean that there is a lot of non-reference fat to cut out [04:38:29] ah [04:39:01] porting cite/core to lua is the thing that will magically make everything faster after lua deployment [04:39:30] ah [04:39:36] * aude can't wait [04:39:55] I see it uses country data templates [04:40:06] * aude wants it on now :D [04:40:06] that's an obvious thing to target [04:40:11] sure [04:40:54] let's try removing that climate box, it looks a lot like some templates that were causing trouble elsewhere [04:41:08] sure [04:41:57] why climate templates have to be so complicated :P [04:42:19] maybe there is climate model software embedded in the templates [04:42:24] lol [04:42:31] PROBLEM - Puppet freshness on cp3022 is CRITICAL: Puppet has not run in the last 10 hours [04:43:11] oh no [04:43:26] but wouldn't be surprised if it converts units, etc. 
[04:43:36] there are many {{convert}} calls [04:43:42] not sure if they are the problem [04:43:53] ah, right [04:44:02] removing the climate box brings it down from 22 to 13 [04:44:27] I'm just using preview, you know, this is nothing advanced [04:47:02] i'm removing it [04:47:07] will leave a link to it [04:47:39] at least something that maybe will allow saving the page again [04:47:40] New patchset: Aaron Schulz; "Modified jobs-loop script to keep a fuller pipeline." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48981 [04:48:02] we can wait a week to maybe reintroduce the template :) [04:48:11] ah, I can reset it, I was forgetting to escape the * so it thought I was globbing [04:48:35] it saved! [04:48:40] magic [04:49:23] [04:49:31] slightly better, under the timeou [04:49:33] t [04:50:26] people who introduce these templates should be beaten over the head a bit [04:50:57] in many cases, they know what they are doing to parser performance, they just don't care [04:52:19] yeah [04:52:57] Why would an article about a city have information about its climate? [04:53:08] That's just outrageous for an encyclopedia. Don't these editors know this? [04:53:28] Why aren't they profiling the page for performance? [04:54:09] heh [04:54:22] my point is that they actually do profile the page for performance [04:54:27] New patchset: Tim Starling; "Fix sql script so it can run as wikidev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49002 [04:54:41] they work on their templates until they start rendering in less than the timeout, then that's job done [04:54:52] sure [04:55:18] New review: Tim Starling; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49002 [04:55:25] Change merged: Tim Starling; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49002 [04:55:28] Yes, they've been trained to work below the ceiling. Raise the roof. 
[04:55:46] you're saying I should raise the roof? [04:55:51] like what aude wants? [04:56:01] increase the timeout to 60? [04:56:18] TimStarling: probably slightly higher would help [04:56:20] No, I commented on the bug. [04:56:33] but nothing insane [04:56:33] I think that bug is a dupe and I think Scribunto is the appropriate answer. [04:56:34] I think actually Susan just wants YouTube videos of you dancing. [04:56:44] Susan: feel free to mark it as a dup [04:56:48] Getting down, as it were. [04:56:56] * aude just got fed up [04:56:59] aude: You seemed frustrated, so I didn't want to antagonize. [04:57:00] While you bring it up. [04:57:05] sure :) [04:57:18] It's a legitimate bug, it just happens to be a known issue. [04:57:27] [[Q60]] has been my test item for wikidata, so having such a test page for wikipedia, .... [04:57:42] well it's not a good example for testing editing [04:57:43] I think Scribunto should come with global templates, so that we don't end up with {{convert}} on every wiki, but I got shouted down on that. [04:57:59] Susan: that would be cool [04:58:01] So... there will be an improvement, but the code fragmentation will almost certainly eventually be painful. [04:58:08] An improvement next week, it sounds like. [04:58:55] * aude notes the same article on italian wikipedia is quite fast [04:58:57] to edit [04:59:07] fewer templates.... [04:59:14] so before I got distracted by the sql script not working, I was about to tell you that there are now only 7603 pages that use {{Weather_box}} [04:59:23] presumably down from 7604 [04:59:27] oh [05:00:08] baby steps [05:00:14] heh [05:00:17] https://toolserver.org/~jarry/templatecount/index.php?lang=en&name=Template%3AWeather+box#bottom [05:00:27] TimStarling, did we ever make much progress on automatic translation of traditional wiki-templates to Lua? (I apologize if I somehow missed the answer on-wiki or a list) [05:00:36] dschoon: no [05:00:45] It's going to be a long tail. 
[05:00:50] That whole Turing Complete thing is a bitch. [05:00:57] But the main culprits should be pretty easy to knock out. [05:01:26] coord, cite, and convert, mostly, as I recall. [05:01:30] I guess it won't give us a list, eh? [05:01:31] The triple Cs. [05:01:39] TimStarling: would it complicate Scribunto deployment if I were to enable CodeEditor on metawiki tomorrow? [05:01:48] It'd be kind of interesting to see the histogram by views (perhaps via stats.grok.se) [05:02:13] It'd be more interesting if there were a real analytics infrastructure so we could stop using stats.grok.se. [05:02:54] ori-l: maybe you should first set $wgCodeEditorEnableCore = true [05:03:04] and deploy it to the wikis that it's running on already [05:03:31] Word, Susan. [05:03:40] currently mediawiki.org and test2.wikipedia.org [05:03:54] Out of curiosity, is there a literal asterisk template? https://toolserver.org/~jarry/templatecount/index.php?lang=en&namespace=10&name=Template%3A*#bottom or is that everything? [05:03:55] that would make things a bit simpler [05:04:03] OK, good idea. [05:04:05] (relatedly https://toolserver.org/~jarry/templatecount/index.php?lang=en&namespace=10&name=Template%3A.*#bottom ) [05:06:01] https://en.wikipedia.org/wiki/Template:* [05:06:07] It's a redirect for a bold middot. [05:06:45] Or bullet, I guess. [05:06:59] It's slowly being deprecated. [05:07:24] Huh. So there really is. [05:07:25] Yeah. [05:07:31] Just read the code. [05:07:48] (of the search) [05:10:06] TimStarling: in a7a74cf8 you added a comment advising not to enable it until it has been reviewed. It hasn't been formally reviewed yet but I figure that it's Brion's code and it has been running long enough to amount to something of a fact on the ground. [05:10:37] So I'm inclined to enable it, unless there was something specific that made you nervous. [05:13:08] have you tested it? [05:14:03] Yes, my dev instance has it enabled since I've been editing lots of JSON documents. 
[05:14:38] Periodically the syntax rules seem not to load and I haven't chased that down fully, but I haven't seen that affect anything else. [05:15:21] The comment didn't make much sense to me, because the extension is already deployed on a few production wikis, including mediawiki.org. [05:15:51] And the code didn't seem branched in a way that that disabled variable was stopping anything (additional) of consequence. [05:16:18] It's disabling the use of CodeEditor for .js and .css articles [05:16:25] the automatic use, rather. [05:16:56] Right. I didn't see how that made a difference in terms of code safety/deployment-readiness. [05:17:16] it took a fair amount of work to get it working sufficiently well to support lua editing [05:17:37] From where Brion left off, you mean? [05:17:44] yes [05:18:28] I hadn't tested it for JS or CSS editing, and I didn't really want to have to do that prior to the scribunto deployment [05:18:33] Well, it's available to be used by extensions, but the only extensions that make use of the hook are Scribunto and EventLogging, and the latter for functionality that is not enabled on mediawiki.org or test2wiki. [05:19:10] I could add the schema: namespace to test2, which would have the effect of enabling CodeEditor there. [05:19:16] so splitting the deployments seemed to make sense [05:19:38] I'd like CodeEditor to be enabled on Meta-Wiki for JSON, CSS, and JS. [05:19:47] I think Meta-Wiki is a good development/testing ground. [05:19:55] Because it's mostly experienced users. [05:20:10] generally a deployment is preceded by an announcement and some beta testing [05:20:11] And the extension needs more use to find more bugs and be ready for wider deployment. [05:20:19] and followed by gathering feedback [05:20:29] Heh, since when? :-) [05:20:30] TimStarling: Susan did.. [05:20:54] Most deployments in Wikimedia's history weren't preceded by anything more than an IRC comment saying "time to scap!" 
[05:21:01] http://meta.wikimedia.org/wiki/Meta:Babel?uselang=it#CodeEditor_extension_deployment_on_Schema: [05:21:25] Interesting that you went for Italian. [05:21:34] Blame the Googles. [05:21:44] if you're fine with taking responsibility for it, then I am fine with it being deployed [05:22:03] huge JavaScript applications are not my normal area of expertise [05:22:18] I'll just enable the Schema: namespace on test2 for now, I think. [05:22:58] I'm surprised EventLogging isn't already on there. [05:23:03] Or on test.wikipedia.org. [05:23:12] Then $wgCodeEditorEnableCore = true the week after that, and enable the extension on metawiki after that. [05:23:12] You were testing on Labs, I guess? [05:23:34] It's enabled on both test and labs, just not the component that manages a namespace. [05:23:50] Ah, weird. [05:23:50] And yes, on labs. [05:24:04] I thought the extension didn't do much of anything without the Schema namespace. [05:24:30] Yes, but rather than deploy and look after a bunch of schema articles scattered across Wikimedia wikis, they're all centralized on meta. [05:24:56] But data collection jobs on other WMF wikis can reference them. [05:25:03] Ah. [05:25:15] Yay for centralization. :-) [05:25:43] Well, it was where Dario, Faulkner & co were documenting data models anyhow. [05:39:46] New patchset: Ori.livneh; "Set $wgEventLoggingDBname on test2wiki to test2wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49003 [05:40:41] New review: Ori.livneh; "Patch Set 1: Code-Review-2" [operations/mediawiki-config] (master) C: -2; - https://gerrit.wikimedia.org/r/49003 [05:46:14] TimStarling: So my plan is to only enable CodeEditor for Schema: articles on test2 for now. I will report back if I notice anything funny that could affect Scribunto. Thanks for the advice. [05:47:37] ok, they didn't like me removing the climate box (lol) [05:50:17] there's a joke about global warming and climate science in here somewhere. 
[06:24:53] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [06:41:13] PROBLEM - MySQL Replication Heartbeat on db32 is CRITICAL: CRIT replication delay 204 seconds [06:41:40] PROBLEM - MySQL Slave Delay on db32 is CRITICAL: CRIT replication delay 221 seconds [06:48:07] RECOVERY - MySQL Replication Heartbeat on db32 is OK: OK replication delay 9 seconds [06:48:34] RECOVERY - MySQL Slave Delay on db32 is OK: OK replication delay 0 seconds [08:19:10] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [08:36:33] PROBLEM - Puppet freshness on stafford is CRITICAL: Puppet has not run in the last 10 hours [08:48:33] RECOVERY - MySQL disk space on neon is OK: DISK OK [09:10:50] New patchset: Catrope; "Fix stupid copy/paste mistake that broke Nagios checks for Parsoid Varnish" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49009 [09:38:10] dumps site is slow... downloading small rss file takes 10 secs... :-/ [09:43:57] there's a bug [09:45:09] https://bugzilla.wikimedia.org/show_bug.cgi?id=43647 [09:45:55] hm, for big files it eventually stabilises around some higher speed [09:46:44] network quite low http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Miscellaneous+pmtpa&h=dataset2.wikimedia.org&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [10:01:36] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [10:03:33] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [10:32:08] RECOVERY - MySQL disk space on neon is OK: DISK OK [10:34:12] apergos: apparently dumps.wm.o has some bandwidth / disk speed issue : https://bugzilla.wikimedia.org/show_bug.cgi?id=43647 ;D [10:34:16] out for accounting [11:28:43] RECOVERY - Puppet freshness on ms1004 is OK: puppet ran at Thu Feb 14 11:28:12 UTC 2013 [11:31:58] !log ms1004 - full disk, deleted large HTCPpurger.log file, fixed puppet runs [11:32:01] Logged the message, Master [11:34:47] 
mutante: any chance I could get you to reinstall the two remaining lucid mobile varnish boxes to precise? :) [11:35:00] that would complete our varnish precise migration [11:39:23] mark: ah, what are their names? ok.. [11:41:02] ok, cp1043 and cp1044 i suppose [11:42:37] set to enabled: False in pybal first, right [11:43:20] yep [11:45:04] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [11:45:24] keep in mind that the backend varnish instances are still used by other frontends, independent of pybal [11:47:37] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours [11:50:47] and do them one by one too [11:53:09] yea, one by one, about the other frontends, is this ./templates/varnish/mobile-frontend.inc.vcl.erb .. ? [11:54:10] no need to edit them [11:54:27] i just want you to be aware that depooling in pybal doesn't mean the box is then idle [11:54:38] but if you stop the backend varnish instance the others will notice and it's ok [11:54:57] ah, ok [11:56:37] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [12:10:14] !log disabling cp1043 in pybal [12:10:16] Logged the message, Master [12:14:28] !log stopping varnish backend on cp1043 and starting reinstall [12:14:29] Logged the message, Master [12:16:06] !log Inserted varnish 3.0.3plus~rc1-wm7 into the precise-wikimedia APT repository [12:16:07] RECOVERY - MySQL disk space on neon is OK: DISK OK [12:16:07] Logged the message, Master [12:19:01] PROBLEM - Host cp1043 is DOWN: PING CRITICAL - Packet loss = 100% [12:24:49] RECOVERY - Host cp1043 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [12:29:01] PROBLEM - Varnish HTTP mobile-backend on cp1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:29:01] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:29:10] PROBLEM - Varnish HTTP mobile-frontend on cp1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:29:29] PROBLEM - SSH on cp1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:29:46] PROBLEM - Varnish HTCP daemon on cp1043 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:31:07] RECOVERY - SSH on cp1043 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [12:43:52] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [12:44:17] New patchset: Mark Bergsma; "Update streaming range patch to M.B.Grydeland's updated version" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/49031 [12:44:17] New patchset: Mark Bergsma; "varnish (3.0.3plus~rc1-wm7) precise; urgency=low" [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/49032 [12:44:19] RECOVERY - Varnish HTCP daemon on cp1043 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [12:45:33] mark: after reinstall on cp1043, it couldn't start varnish, because it's missing /a/sda/varnish.persist. 
does not have /a/ [12:45:58] huh [12:46:00] * mark looks [12:46:44] oh I see [12:46:52] check role/cache.pp [12:46:58] near the mobile-backend section [12:47:06] there's a separate stanza for cp104[12] [12:47:08] and for the rest [12:47:13] so that needs updating between every reinstall [12:47:35] line 623 [12:48:08] gotcha:) [12:48:32] after you're done, that selector can be removed [12:48:34] they're all consistent then [12:49:19] also that FIXME some lines above it [12:49:41] New review: Mark Bergsma; "Patch Set 1: Verified+1 Code-Review+2" [operations/debs/varnish] (testing/3.0.3plus-rc1); V: 1 C: 2; - https://gerrit.wikimedia.org/r/49031 [12:49:48] New review: Mark Bergsma; "Patch Set 1: Verified+2" [operations/debs/varnish] (testing/3.0.3plus-rc1); V: 2 - https://gerrit.wikimedia.org/r/49031 [12:49:49] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/49031 [12:50:12] New review: Mark Bergsma; "Patch Set 1: Verified+2 Code-Review+2" [operations/debs/varnish] (testing/3.0.3plus-rc1); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49032 [12:50:13] Change merged: Mark Bergsma; [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/49032 [12:51:17] New patchset: Dzahn; "let cp1043 use /srv/ instead of /a/ for varnish.persist file after reinstall with precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49033 [12:52:18] yep, good [12:52:29] New review: Dzahn; "Patch Set 1: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49033 [12:52:36] k:) [12:52:38] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49033 [12:55:08] RECOVERY - Varnish HTTP mobile-backend on cp1043 is OK: HTTP OK HTTP/1.1 200 OK - 698 bytes in 0.053 seconds [12:55:34] RECOVERY - Varnish HTTP mobile-frontend on cp1043 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds [12:55:50] hmm so 
[12:55:55] better leave it running for a while before you do the next [12:55:59] cp1043 is empty now [12:57:20] why is ryan up [12:58:51] mark: ok, i see hits in varnishstat [12:59:19] not re-enabled in pybal yet [13:08:20] mark: he might be jet lagged following his trip in Europe :-D [13:10:59] hehe [13:11:12] !log Upgraded varnish to latest version with updated range patch on cp1021-1028 [13:11:14] Logged the message, Master [13:19:44] mutante: you can proceed with the next box [13:19:47] mobile hit rate sucks anyway [13:20:01] it's the same on the reinstalled box with an empty cache as on the existing boxes [13:21:23] ah, thanks. i was looking at hit rate and wondering why it goes up so slow [13:22:16] !log re-enabling cp1043 in pybal [13:22:16] RECOVERY - Puppet freshness on cp3022 is OK: puppet ran at Thu Feb 14 13:22:07 UTC 2013 [13:22:17] Logged the message, Master [13:27:10] !log disabling cp1044 in pybal, reinstalling with precise [13:27:12] Logged the message, Master [13:32:48] New patchset: Dzahn; "use /srv/ pathes to varnish.persist as default for mobile varnish cache servers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49042 [13:34:52] PROBLEM - Host bits-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:04] !log Fixed fixing [13:35:06] Logged the message, Master [13:35:07] heh [13:35:09] hehe [13:35:20] PROBLEM - Host cp1044 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:46] RECOVERY - Host bits-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 26.58 ms [13:36:03] mark: setting default like this https://gerrit.wikimedia.org/r/#/c/49042/1/manifests/role/cache.pp [13:36:04] ... 
that becomes more fun if the fix didn't quite work since then you'd be able to 'Fixed fixing fix' :-) [13:36:46] mutante: no [13:36:51] remove the entire selector [13:36:58] there's no point anymore [13:37:06] also remove the if with the FIXME a bit above [13:37:15] ok, wasn't sure if it gets those options from elsewhere [13:41:10] RECOVERY - Host cp1044 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [13:43:10] New patchset: Dzahn; "remove selector / FIXME for mobile varnish backends on lucid after reinstalls with precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49042 [13:44:55] PROBLEM - Varnish HTTP mobile-frontend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:44:55] PROBLEM - Varnish HTCP daemon on cp1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:45:40] PROBLEM - SSH on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:46:16] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:46:17] PROBLEM - Varnish HTTP mobile-backend on cp1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:47:19] RECOVERY - SSH on cp1044 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [13:50:06] New review: Dzahn; "Patch Set 2: Verified+2 Code-Review+2" [operations/puppet] (production); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49042 [13:50:14] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49042 [13:50:19] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 184 seconds [13:51:13] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 201 seconds [13:51:40] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 217 seconds [13:51:54] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 240 seconds [13:54:47] New patchset: Mark Bergsma; "Fully qualify $hostname" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49043 [13:54:47] New patchset: Mark Bergsma; "Configure LVS cross row interfaces for row C" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49044 [13:54:49] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [13:55:16] RECOVERY - Varnish HTCP daemon on cp1044 is OK: PROCS OK: 1 process with UID = 997 (varnishhtcpd), args varnishhtcpd worker [14:01:06] mutante: why the fuck do you merge a change which doesn't pass puppet syntax verification? 
[14:02:07] yeah, sry, looking at that syntax error [14:02:09] argh [14:02:15] and you removed the entire section inside the if too [14:02:27] please don't merge stuff if you don't understand what it's doing [14:02:34] New patchset: Mark Bergsma; "Revert "remove selector / FIXME for mobile varnish backends on lucid after reinstalls with precise"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49046 [14:03:01] New review: Mark Bergsma; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49046 [14:03:09] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49046 [14:03:26] sorry, i should have seen the jenkins failure [14:03:38] why did you even set verified +2? [14:03:44] jenkins normally does that [14:04:14] out of habit when we had to do both to merge [14:04:14] New patchset: Mark Bergsma; "Configure LVS cross row interfaces for row C" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49044 [14:04:15] New patchset: Mark Bergsma; "Fully qualify $hostname" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49043 [14:04:21] er no [14:04:23] that's never been the case [14:04:29] rarely, when the check fails or something [14:04:43] you have the option to do that in case of emergency [14:04:50] you should certainly not do that out of habit [14:04:54] that defeats the entire purpose of it [14:05:37] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds [14:06:07] New patchset: Mark Bergsma; "Configure LVS cross row interfaces for row C" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49044 [14:06:49] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds [14:07:23] true, it defeats the purpose. 
not doing it anymore [14:08:21] thanks [14:08:45] New patchset: Mark Bergsma; "Configure LVS cross row interfaces for row C" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49044 [14:09:39] New review: Mark Bergsma; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49043 [14:09:48] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49043 [14:10:12] New review: Mark Bergsma; "Patch Set 4: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49044 [14:10:21] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49044 [14:11:48] New patchset: Mark Bergsma; "Remove legacy settings now all mobile varnish servers are precise" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49047 [14:12:18] New review: Mark Bergsma; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49047 [14:12:27] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49047 [14:12:56] ok, i saw you did it right. 
thanks [14:13:29] New patchset: Demon; "Adding --topic to one last gerrit hook" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49048 [14:13:43] <^demon> Super easy one-line fix if someone's got a second ^ [14:14:03] not me right now, busy [14:14:10] RECOVERY - Varnish HTTP mobile-frontend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 641 bytes in 0.053 seconds [14:15:25] ^demon: can do [14:15:31] RECOVERY - Varnish HTTP mobile-backend on cp1044 is OK: HTTP OK HTTP/1.1 200 OK - 696 bytes in 0.053 seconds [14:15:53] <^demon> mutante: Thanks [14:16:08] New review: Dzahn; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49048 [14:16:18] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49048 [14:26:20] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 203 seconds [14:26:28] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 206 seconds [14:30:48] !log jenkins : updated all jobs using latest version jjb-config [14:30:50] Logged the message, Master [14:39:54] !log Jenkins: adding in several extension to Zuul configuration {{gerrit|49041}} [14:39:55] Logged the message, Master [14:50:37] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [14:52:44] !log re-enabled cp1044 in pybal - mobile varnish cache servers all on precise now [14:52:45] Logged the message, Master [14:52:47] mark: done [14:53:04] thanks [14:53:38] yw, sorry for that bad merge [14:58:58] mutante: mark: would you mind reviewing https://gerrit.wikimedia.org/r/#/c/45115/ ? 
That is to get rid of a duplicate definition in beta for /usr/local/apache [14:59:06] it is a directory in production but a symlink in beta : / [15:01:56] New patchset: Aude; "enwiki should have alphabetic sort order, not code order" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49052 [15:03:16] * aude requests one more update for wikibase for enwiki  [15:03:22] ^demon: or Reedy (sorry!) [15:03:24] https://gerrit.wikimedia.org/r/#/c/49051/ [15:03:34] https://gerrit.wikimedia.org/r/#/c/49052/ [15:03:59] enwiki has a different sort order for the interwiki links than the other wikipedias that have wikidata [15:04:16] nobody has complained, fortunately :D [15:05:05] New review: Aude; "Patch Set 1:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49052 [15:10:55] <^demon> aude: Looking. [15:11:25] ^demon: thanks [15:11:46] we might have a couple little bug fixes probably on monday, but otherwise everything seems quite good with enwiki [15:12:00] might depend when the next deployment will be for enwiki [15:12:41] i lost our ops :D [15:12:56] * aude will have to take a lot of time to get the sort orders right and all for deploying to the other wikipedias [15:13:15] details details [15:15:36] <^demon> for the client, I just need to sync SortUtils, right? [15:16:37] New review: Demon; "Patch Set 1: Code-Review+2" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/49052 [15:17:00] ^demon: yes [15:17:07] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49052 [15:17:15] <^demon> Okie dokie. Once jenkins merges the core change we'll sync this out. [15:17:34] i put minwiki in the wrong order and now enwiki needs a custom order [15:17:39] someone will notice [15:18:17] AnjaJ_WMDE: we are updating the sort order for enwiki, just fyi [15:18:35] they use a different ordering than hu / he and it [15:19:51] <^demon> aude: It won't merge! 
[15:19:53] <^demon> :\ [15:19:55] ack [15:20:05] <^demon> There is goes. [15:20:06] <^demon> Weird. [15:20:07] i'll try again [15:20:11] <^demon> *it [15:20:13] <^demon> It's fine now [15:20:17] oh, ok [15:20:42] <^demon> Aaron was complaining about that the other day. Hard to replicate...no errors in the log. [15:20:51] hmmm [15:21:40] RECOVERY - MySQL disk space on neon is OK: DISK OK [15:21:46] gerrit is mostly nicer for me now, though i miss the +2 and jenkins +1 on my merged patches [15:21:49] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 181 seconds [15:21:52] those seem to disappear after merge [15:22:09] !log demon synchronized php-1.21wmf9/extensions/Wikibase/client/includes/SortUtils.php 'Updating wikibase sort order' [15:22:11] Logged the message, Master [15:22:15] thanks :) [15:22:56] !log demon synchronized wmf-config/InitialiseSettings.php 'Updating wikibase sort order' [15:22:57] Logged the message, Master [15:23:28] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 194 seconds [15:25:44] <^demon> aude: Everything look good? 
[15:25:52] * aude checking [15:29:14] ^demon: good [15:29:43] i remember now that Japanese, for example, is sorted as "nihongo" -- so in the n's [15:29:51] <^demon> *nod* [15:29:52] hungarian is "maygar" [15:30:17] thanks [15:38:12] <^demon> np [15:46:26] New review: ArielGlenn; "Patch Set 3: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/45115 [15:46:37] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/45115 [15:56:17] New patchset: Mark Bergsma; "insert 'realm' in role::cache::configuration::active_nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47067 [15:57:18] New review: Mark Bergsma; "Patch Set 3: Code-Review-2" [operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/47067 [15:58:28] New review: Mark Bergsma; "Patch Set 3:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47067 [16:06:31] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 190 seconds [16:06:40] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 188 seconds [16:20:20] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 1 seconds [16:20:28] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [16:20:28] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:34:05] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [16:53:12] New patchset: Aude; "Redirect wikidata.org to www.wikidata.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [16:54:26] New patchset: Aude; "(bug 45005) Redirect wikidata.org to www.wikidata.org" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/49069 [16:54:35] mutante: https://gerrit.wikimedia.org/r/#/c/49069/ [16:55:04] i'm not 100% sure that's the right way but we'd like to have wikidata.org redirect to www.wikidata.org so it's consistent 
[16:55:55] jeremyb_: maybe you want to review that also? :) [16:59:01] aude: what about the other subdomains? [16:59:23] i.e. why are you removing the wildcard? [17:00:43] i think we want to redirect the other subdomains but probably not permanently [17:01:14] eventually we want the language subdomains to work automagically to resolve to language specific content [17:01:50] RECOVERY - MySQL disk space on neon is OK: DISK OK [17:26:53] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 182 seconds [17:27:38] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 200 seconds [17:36:32] apergos: did you see the newer patch? [17:38:53] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 210 seconds [17:41:06] preilly: csteipp robla ^demon https://gerrit.wikimedia.org/r/#/c/49069/ [17:41:08] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 26.49 ms [17:41:09] for apache [17:41:30] i am modelling it on mediawiki.org but not sure that's the right way to solve this [17:42:26] i think we want to leave open the possibility for language/site subdomains (e.g. de) to redirect to language specific content [17:42:47] yet have wikidata.org redirect to www.wikidata.org like mediawiki does [17:43:54] aude: Yeah, although getting rid of *.wikidata.org in main.conf will mean that the VirtualHost won't apply for, e.g., en.wikidata.org [17:44:36] hmmmm [17:45:02] i think we want those to redirect to www.wikidata also [17:45:12] but not set it as permanent, i suppose [17:45:15] ? [17:45:25] Ah, gotcha. That would work, yes. [17:45:36] it'd be a 302 then? [17:45:40] temporary or what? [17:46:23] i realise the www is pretty important for the whole routing code that's done for mediawiki, though not sure i know all the implications completely [17:46:30] like bits and stuff [17:46:32] For the apache config, it should be as permanent as possible, so Squid can cache the redirect. 
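(Editor's sketch, not part of the channel log.) The point made at 17:46:32 about preferring a permanent redirect can be illustrated with the default cacheability rules that HTTP defines for redirect status codes: a 301 may be stored by a shared cache by default, while a 302 needs explicit freshness headers. The function name `redirect_is_cacheable` and the header handling are hypothetical and simplified; whether the Squid version in use at the time followed these defaults exactly is an assumption.

```python
# Redirect status codes that HTTP treats as cacheable by default
# (only the one relevant to this discussion is listed).
DEFAULT_CACHEABLE_REDIRECTS = {301}


def redirect_is_cacheable(status, headers=None):
    """Return True if a shared cache may store this redirect response.

    Simplified: only looks at Cache-Control and Expires, and does not
    evaluate how long the response would stay fresh.
    """
    headers = {k.lower(): v for k, v in (headers or {}).items()}
    cache_control = headers.get("cache-control", "").lower()
    # Explicit opt-outs win over any default.
    if "no-store" in cache_control or "private" in cache_control:
        return False
    if status in DEFAULT_CACHEABLE_REDIRECTS:
        return True
    # A 302 is only storable when the response carries explicit
    # freshness information.
    return (
        "max-age" in cache_control
        or "s-maxage" in cache_control
        or "expires" in headers
    )


if __name__ == "__main__":
    print(redirect_is_cacheable(301))
    print(redirect_is_cacheable(302))
    print(redirect_is_cacheable(302, {"Cache-Control": "max-age=3600"}))
```

Under these assumptions a plain 302 for wikidata.org would reach the Apaches on every request, which is the concern raised in the channel.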
[17:46:38] ok [17:47:28] feel free to -1 my patch or upload new patch or whatever or -2 it :D [17:48:23] So is a redir for en.wikidata.org => www.wikidata.org/something in apache something that is likely to happen in the next couple of months? [17:48:27] for the client, we are trying to do the ajax stuff and think the inconsistency might also be a problem with getting the right edit token [17:48:35] probably [17:49:14] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [17:50:08] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds [17:51:04] i think it.wikidata.org/wiki/New_York would redirect to www.wikidata.org/wiki/Special:ItemByTitle/itwiki/New_York [17:51:32] why itwiki not itwikisource? [17:51:39] for stuff in namespaces, like project namespace, it's trickier [17:51:55] jeremyb_: that would be when we support wikisource, then it would be itwikisource [17:52:06] err, it won't work [17:52:28] right, but it.wikidata.org means itwiki ? or just means italian? [17:52:28] i don't know if /how we'd be able to do that [17:52:33] it = itwiki [17:52:38] New review: Hashar; "Patch Set 3:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47067 [17:52:42] New patchset: Ottomata; "Setting up sync from HDFS /wmf/public directory to stat1001 stats.wikimedia.org site." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49074 [17:52:50] maybe we could have yet another subdomain, like mobile for the other projects, but that's way in the future [17:52:56] New patchset: Hashar; "insert 'realm' in role::cache::configuration::active_nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47067 [17:53:03] en.s.wikidata.org maybe? 
(ugly) [17:53:20] no, multilevel domain is bad [17:53:22] for SSL [17:53:26] sure [17:53:40] i think we wouldn't support that then, though up to denny [17:54:06] for now it.wikidata.org or wikidata.org can all go to www.wikidata.org [17:54:42] alright, hate to skip out of here.... but need to go to pub quiz in a few minutes :D [17:54:46] * aude online later [17:54:58] there's also a bug ticket https://bugzilla.wikimedia.org/show_bug.cgi?id=45005 [18:00:47] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [18:02:09] ^demon: here is a quick read: http://squirrelshaterobots.com/programming/php/on-the-speed-of-functions-and-namespaces/ and http://squirrelshaterobots.com/programming/php/more-on-the-speed-of-functions-and-namespaces/ [18:03:26] * Aaron|home glances at http://www.teslamotors.com/models [18:03:43] New patchset: Hashar; "insert 'realm' in role::cache::configuration::active_nodes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47067 [18:04:11] New review: Hashar; "Patch Set 5:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47067 [18:06:38] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 26.47 ms [18:14:49] <^demon> preilly: Interesting. [18:15:20] ^demon: don't get me wrong I love the idea of namespaces [18:15:40] <^demon> I was never sold on them, but I know there's a subset of developers who *really like them* [18:16:04] yeah for sure [18:16:16] <^demon> Did you see that Zend open sourced the zend optimizer+? [18:18:53] ^demon: you mean Zend Optimizer™ [18:19:05] ^demon: I think you forgot the ™ [18:19:08] <^demon> https://github.com/zend-dev/ZendOptimizerPlus/ [18:19:10] ;-) [18:19:22] <^demon> They seem to have dropped the (tm) for + :) [18:20:26] ^demon: but yes I heard about it — sorry, I was just trying to be funny [18:20:53] <^demon> I've never used it. Wonder how its opcode caching compares to apc. 
[18:21:42] ^demon: I trust APC more [18:22:05] someone already ITPed for Debian [18:22:07] hehe [18:22:15] ^demon: Dmitry Stogov is a good person don't get me wrong it's just he can be a little progressive for my taste [18:22:37] * Aaron|home wonders where peter is [18:22:52] he walked in to the office minutes ago [18:23:21] paravoid: IPT === Intent to package right? [18:23:25] yes [18:23:35] s/IPT/ITP [18:23:47] * preilly today must be my typo day [18:24:02] https://lists.debian.org/debian-devel/2013/02/msg00244.html [18:24:14] paravoid: The site's security certificate is not trusted! [18:24:36] it's issued by SPI's CA [18:25:04] SPI is Software in the Public Interest Inc., Debian's umbrella organization [18:25:33] and the non-profit that hosts and audits WMF's board elections FWIW :) [18:27:04] woosters: were blocked on https://rt.wikimedia.org/Ticket/Display.html?id=4519 and notpeter (who's on duty) isn't around to move the ticket. who else can we reach out to? [18:27:08] paravoid: yeah, I know that: SPI has hosted Wikimedia Foundation board elections and audited the tally as a neutral third party since 2007. [18:27:26] tfinc: Peter is right next to me [18:27:39] preilly: can you ask him to join irc please [18:27:55] tfinc: he is working on it right now [18:29:23] notpeter: howdy. what else can we provide to move https://rt.wikimedia.org/Ticket/Display.html?id=4519 forward ? [18:29:32] tfinc - normally we have a 3-day window for comments [18:29:36] for any access request [18:29:40] yeah [18:29:42] * YuviPanda looks around [18:29:53] this seems pretty clear cut, so I can do it now [18:29:58] notpeter: thanks! [18:30:02] but 72 hour period for comments is standard [18:30:15] that was in yesterday … and if no ops folks object, it will be approved without deliberation [18:30:31] notpeter: I have a fixed patch for that runner script now [18:30:39] runs fine on mw1010 [18:31:03] notpeter: the fix was moving 1 line of code :) [18:31:08] Aaron|home: woo! 
[18:31:09] awesome [18:31:09] woosters: thanks. are those guidelines published anywhere? i tried searching wikitech and found nothing. [18:31:27] it's actually a lot easier to test than I thought [18:31:56] it's an erb file, but really just hardcoding one puppet var makes it valid bash [18:32:53] notpeter: the problem was subprocscreate not getting reset to 0 between calls, so the loop would always break out and 12 more procs would spawn [18:33:05] it seems kind of obvious, not sure how we missed that [18:33:41] * Aaron|home isn't used to passing variables as "side effects" and always avoids it in non-crap languages where it can be avoided [18:34:56] notpeter: I'd added you as a reviewer [18:34:58] New review: Faidon; "Patch Set 1: Code-Review-1" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/49074 [18:36:37] *I've [18:36:42] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49009 [18:36:43] * Aaron|home can't type [18:36:51] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49009 [18:37:00] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48999 [18:37:06] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48999 [18:37:32] PROBLEM - Puppet freshness on stafford is CRITICAL: Puppet has not run in the last 10 hours [18:37:32] tfinc: I have updated http://wikitech.wikimedia.org/view/Requesting_shell_access accordingly [18:37:40] Aaron|home: ah, makes sense [18:38:08] tfinc: I thought that this had been communicated to the rest of the eng team. if it had not, I apologize [18:38:51] notpeter: yup. i and i'm sure many others had never heard of this very legitimate window. thanks for the update. that page hadn't been touched in over two years [18:38:57] tfinc .. i thought i mentioned that during staff .. 
sorry [18:39:21] woosters: if it's not written down then it's easily forgotten and anyone who joins after doesn't benefit from it [18:40:01] notpeter: is there any way to show this page within rt *when* someone is requesting access? that way they don't have to go elsewhere to know the expectations [18:40:05] i mentioned that we reduced it from discussing access request during ops staff mtg to this 3-day window [18:40:32] should have updated wikitech …. [18:40:42] notpeter - thanks for doing so [18:41:38] tfinc: I'm just going to email out to eng@ for clarity/reminder/expectation setting [18:42:26] great [18:43:33] MaxSem: hehe nice one http://wikitech.wikimedia.org/index.php?title=Requesting_shell_access&diff=56460&oldid=56459 [18:43:43] * tfinc steps away [18:50:09] New patchset: Faidon; "Decomission srv266 & srv278" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49077 [18:52:01] New review: Faidon; "Patch Set 1: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49077 [18:52:09] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49077 [18:52:14] !log rebooting labstore1 [18:52:15] Logged the message, Master [18:52:18] RobH: ^^^ [18:52:26] RobH: not sure what else you need to do to decom them [18:53:04] so the next steps are to drop a ticket in pmtpa queue for each one to be wiped. they also need to come out of any node lists for dsh and pybal [18:53:10] (i assume you did the latter two already?) [18:53:17] no [18:53:23] wanna or shall i? [18:53:38] i can drop ticket if you pull out of pybal and dsh nodes [18:53:39] feel free :) [18:54:04] * Ryan_Lane grumbles [18:54:22] this is not how I wanted to spend my day [18:54:29] New patchset: Ottomata; "Setting up sync from HDFS /wmf/public directory to stat1001 stats.wikimedia.org site." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/49074 [18:55:00] RobH: BTW I am now all set up so I can swap out the eqiad Parsoid machines whenever you want me to (except for Mondays or Wednesdays, cause I'm OOTO then) [18:55:23] coolness, lets plan to do some of them tomorrow =] [18:55:31] Alright [18:55:32] i'll get the OS spun up and the systems ready [18:55:36] Great [18:55:47] thx [18:56:07] paravoid: fyi: other than those items, the last thing we do on decom server is tell me, so i can tell accounting =] [18:56:21] (anyone can tell accounting, but I keep a running list for replacements and the like) [18:56:25] so easier if i do it [18:56:29] and racktables I guess [18:56:33] As for the Parsoid Varnish boxes (cesium and titanium), if you want to do anything to those that'll have to wait a bit longer, because I need to set up LVS for that service first, which means I need Leslie [18:56:38] so the datacenter tech will move in racktables [18:56:41] once they wipe and remove from rack [18:56:52] k [18:56:58] RoanKattouw: no problem, just getting back a few of them should help a lot [18:57:21] unfortunately, there is no 'disabled' hardware in racktables, just active or deleted [18:57:27] so we move the decom stuff to 'decom racks' heh. 
[18:59:28] RobH: removed from pybal & dsh groups [18:59:38] coolness [18:59:46] good riddance to wonky servers [19:01:38] !log rebooting labstore2 [19:01:40] Logged the message, Master [19:08:27] !log rebooting labstore4 [19:08:28] Logged the message, Master [19:14:40] New patchset: Hashar; "gallium blessed with misc::package-builder" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47725 [19:15:15] New review: Hashar; "Patch Set 3:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47725 [19:15:58] !log rebooting labstore3 [19:16:00] Logged the message, Master [19:16:25] New review: Faidon; "Patch Set 3: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/47725 [19:16:37] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/47725 [19:19:32] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [19:19:59] RECOVERY - Host labstore3 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:22:24] !log just added mw1161-1200 into node lists for eventual service, some may throw sync script errors as I track them down [19:22:24] Logged the message, RobH [19:22:24] !log new servers are NOT in pybal yet, and thus not getting traffic [19:22:24] Logged the message, RobH [19:24:47] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [19:25:00] ottomata: ^^^ [19:25:10] yeah, that's been down fooooever [19:25:13] chjohnson1 is working on it [19:25:21] like, that machine has actually never been online [19:25:43] it hits a raid error in installer [19:25:50] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 181 seconds [19:25:53] which is odd, cuz there is no raid setup on the controller (as intended) [19:26:01] so it shouldn't throw the error, it's some disk or partitioning problem [19:26:08] (odd since it's identical to the other servers that installed fine) [19:26:26] next step is to yank it out of netboot auto 
partition and see what it does [19:26:36] cuz atleast then we can see whats up [19:26:43] i spent multiple hours troubleshooting it last week =P [19:26:48] (updated ticket too) [19:28:06] aye [19:28:15] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 185 seconds [19:30:02] PROBLEM - SSH on labstore1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:30] Aaron|home: no, I didn't see there was one [19:31:41] RECOVERY - SSH on labstore1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [19:31:42] lol, I just started typing, heh [19:32:46] unfortunately in about half an hour I'm in a meeting (non work) [19:33:05] and while I kow this "should" take only a few minutes.... [19:33:11] that was yesterday too :-P [19:33:18] I can always bug notpeter [19:33:32] so there are two things I can do [19:33:46] 1) I can look at the actual commit (gotta link?) [19:33:47] RECOVERY - Host analytics1007 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [19:34:10] 2) if you stick the script on mw10-choose one (how about 1010), I can run it form the command line and see what it does [19:34:14] RECOVERY - Parsoid Varnish on titanium is OK: HTTP OK HTTP/1.1 200 OK - 1324 bytes in 0.054 seconds [19:34:16] shooting it if it behaves stuidly [19:34:20] *stupidly [19:34:34] those are both things that I could do without leaving things in a mess in half ah nour [19:35:08] RECOVERY - Parsoid Varnish on celsus is OK: HTTP OK HTTP/1.1 200 OK - 1314 bytes in 0.004 seconds [19:35:08] RECOVERY - Parsoid Varnish on constable is OK: HTTP OK HTTP/1.1 200 OK - 1326 bytes in 0.010 seconds [19:35:08] RECOVERY - Parsoid Varnish on cerium is OK: HTTP OK HTTP/1.1 200 OK - 1324 bytes in 0.056 seconds [19:36:39] apergos: https://gerrit.wikimedia.org/r/#/c/48981/3 [19:37:01] I have a version in my home dir [19:37:06] on mw1010 [19:37:59] paravoid: Yay, Parsoid Varnish checks are working now :) ---^^ [19:38:16] RoanKattouw: thanks a lot! 
[19:38:54] ok just a sec [19:39:13] Hmm, wtf [19:47:35] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [19:47:40] New review: Ori.livneh; "Patch Set 1:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49003 [19:48:16] RoanKattouw: I think Parsoid Varnish ought to be properly known as "punish" [19:48:43] that did not go well [19:49:21] I saw some out of memory fly by from php too [19:49:37] this looks much like yesterday's run did, Aaron|home [19:49:59] what are you running? [19:50:44] your jobs-loops2.sh [19:50:48] well it's dead now [19:50:54] but I tried it over there [19:51:02] spawned a billion things [19:51:07] then fell over [19:51:11] (thank goodness) [19:51:25] I ran it several times today [19:51:55] sudu -u apache bash /home/aaron/jobs-loop2.sh that's how I'm running it [19:52:06] er [19:52:08] sudo [19:52:14] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 29 seconds [19:52:50] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 0 seconds [19:54:01] you can see a bunch of whining in syslog too [19:54:37] Feb 14 09:44:26 mw1010 php: "" is not a valid magic word for "int" [19:54:40] hmm interesting [19:55:05] but that's already after an oom so could be a side efffect of brokenness [19:55:50] New patchset: Pyoungmeister; "giving yuvi panda access to stat1" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49087 [19:57:18] apergos: you are not running anything now though right? 
[19:57:41] nope [19:57:55] the regular job looops script is stopped too til puppet runs it again [19:58:03] ps -ef | grep runJobs | wc -l looks stable [19:58:05] PROBLEM - MySQL disk space on neon is CRITICAL: Connection refused by host [19:58:09] ah I see you started it [19:58:09] wtf [19:58:12] at 31 [19:58:22] it's not insane to me [19:58:51] so there are the normal 12 plus 12, seems like there are some extra procs though [19:59:14] it's like there are three sh scripts running [19:59:21] well that is so not what I had [19:59:33] like I said there are a ton of oom messages in the syslog from having it explode [19:59:40] and then die [19:59:40] ah, there are three [19:59:49] mine, yours, and the puppet one (so says ps) [19:59:52] that makes sense then [20:00:04] so yeah, ~36 [20:00:14] apergos: maybe you are running into https://bugzilla.wikimedia.org/show_bug.cgi?id=39493 ? [20:00:34] refreshLinks still ooms on occasion, has for months :( [20:00:43] well mine wasn't running earlier [20:00:44] ps axuww | grep job [20:00:44] root 11506 0.0 0.0 9384 924 pts/1 S+ 19:48 0:00 grep --color=auto job [20:00:52] so how is it running now when I did not start it?? 
[20:09:54] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours [20:09:55] New patchset: Andrew Bogott; "Fixing broken dependencies for updating mw-extensions" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48474 [20:09:55] New review: Andrew Bogott; "Patch Set 4:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48474 [20:09:55] apergos: there are 3-4 procs at a time on average now [20:09:55] New review: Andrew Bogott; "Patch Set 3: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/48979 [20:09:56] I wonder if adding 12 is just too high for refreshlinks [20:09:56] so the syslog is full of oom messages until about 2 mins ago when all the new job scipts got shot [20:09:57] funny, since we nominally have 12 [20:09:58] well the current one runs with --procs=12 [20:10:01] and I see [20:10:14] but really it's 3-4 since it is bad at keeping the pipeline full [20:10:20] 13 of them [20:10:29] right this second I mean [20:10:40] that means it just started the script [20:10:47] but when I ran it by hand I saw like hundreds of refreshlink lines go by [20:11:24] well again this is the old script with normal behavior [20:11:34] I expect it to look like this, where there's 12 or so [20:11:43] which eventually die off and then a new batch starts [20:12:07] Feb 14 20:10:02 t last message repeated 2 times [20:12:07] Feb 14 20:10:02 mw1010 kernel: [12670150.585429] php[4084]: segfault at 2c ip 00007f5b0ec7d7a4 sp 00007fffaf5bf510 error 6 in libmemcached.so.10.0.0[7f5b0ec63000+32000] [20:12:07] Feb 14 20:10:03 mw1010 t of memory [4079] [20:12:08] new [20:12:26] that's not good [20:12:31] like right now? 
[20:12:46] Thu Feb 14 20:12:35 UTC 2013 [20:12:52] yep, as in 2 mins ago [20:13:21] I'm not doing anything [20:13:33] I know [20:14:24] might need rebooting :-( [20:14:53] that whole business of 'it shows up in the prcess table, now it doesn't, now it does again' makes me nervous anyways [20:15:07] going to reboot, mid droppipng off for a minute? [20:15:10] *mind [20:15:30] already off [20:15:41] New patchset: Kaldari; "Updating wgEchoDefaultNotificationTypes for new key name" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49090 [20:15:54] !log rebooting mw1010 after some issues with OOM [20:15:55] Logged the message, Master [20:15:57] New review: Kaldari; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49090 [20:15:58] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49090 [20:16:41] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 192 seconds [20:17:17] PROBLEM - MySQL Slave Delay on db33 is CRITICAL: CRIT replication delay 199 seconds [20:17:56] hmm looks like my meeting isn't tonight after all but next week :-) (but I will disappear soon anyways) [20:18:51] host is back up, if you and peter play, be sure to have him keep an eye on the syslog [20:18:57] apergos: ok [20:24:31] !log kaldari synchronized wmf-config/CommonSettings.php 'updating CommonSettings.php for Echo' [20:24:33] Logged the message, Master [20:29:26] RECOVERY - MySQL Slave Delay on db33 is OK: OK replication delay 24 seconds [20:30:47] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds [20:33:03] New patchset: Demon; "Include diffs in Gerrit e-mails" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49133 [20:33:33] <^demon> Ryan_Lane: ^ Would make people very happy :) [20:34:03] handling an outage... [20:34:46] <^demon> Oh, nvm. 
[20:36:43] New review: Demon; "Patch Set 1: Code-Review+1" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/49133 [20:41:50] notpeter: is there anything in the syslog for mw1010? [20:46:19] oh, the runner hasn't started yet, so probably not [20:47:08] RECOVERY - MySQL disk space on neon is OK: DISK OK [20:56:10] whee, Solr went bonkers again [20:56:32] anyone around to investigate? [21:00:04] notpeter, ^^^ [21:03:36] !log bsitu Started syncing Wikimedia installation... : Update ArticleFeedbackv5, Echo, PageTriage to master [21:03:38] Logged the message, Master [21:05:20] New review: Aaron Schulz; "Patch Set 3: Code-Review-1" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/48981 [21:05:36] New patchset: MaxSem; "Enable concurrent GC for Solr" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49159 [21:08:58] anyone around? [21:18:20] New review: Hashar; "Patch Set 5: Verified-1 Code-Review-1" [operations/puppet] (production); V: -1 C: -1; - https://gerrit.wikimedia.org/r/47664 [21:18:27] MaxSem, RT. [21:18:57] ori-l, RT to resolve an outage?:P [21:19:14] If no one is available on IRC, yes. [21:20:06] fatcache cool? [21:20:06] https://github.com/twitter/fatcache [21:20:46] i think everyone must be out to lunch on the west coast [21:21:00] ori-l, yell real loud, tell us if the office echoes [21:21:06] ottomata, maybe you can help mre? [21:21:14] maayyyyyyybe, i will try! [21:21:23] i'm waiting for some ops response too, so let's hang [21:21:27] wazzaaaaap? [21:21:35] Solr eh? [21:21:53] ottomata: :) [21:21:55] ottomata, 2 Solr servers are choked [21:22:15] https://gerrit.wikimedia.org/r/49159 should address this [21:23:07] i dunno if I feel confident merging that, since I don't know any of the context, is there anything I can do to help other than that? bumping solr server or something? 
[21:23:32] restarting jetty would fix it for now, yes [21:23:48] service jetty restart on solr3 and solr1003 [21:24:40] ok... [21:25:44] !log restarted solr jetty on solr3 and solr1003 for MaxSem [21:25:46] Logged the message, Master [21:25:53] thanks!:) [21:25:54] MaxSem, i'm not sure how to check on that, how's it look? [21:26:38] ottomata, curl http://:8983/solr/admin/stats.jsp [21:26:51] loads of XML = OK [21:27:14] infinite delay or HTTP 500 = bad [21:27:22] looking good! [21:27:23] on both [21:27:49] funny about this is that monitoring on this stuff is broken [21:28:40] oh shi~ [21:28:52] wha? [21:29:05] New patchset: Ori.livneh; "Enable PostEdit on {mediawiki,outreach,be}wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49164 [21:29:07] ottomata, now solr2 also broke:P [21:29:40] ha, ok, bump it? [21:29:48] yes please [21:30:41] !log bsitu Finished syncing Wikimedia installation... : Update ArticleFeedbackv5, Echo, PageTriage to master [21:30:42] Logged the message, Master [21:31:46] !log restarted solr jetty on solr2 [21:31:48] Logged the message, Master [21:32:05] gah, i love it when the internet cuts out while in the middle of restarting something! [21:32:36] heh. [21:32:45] New review: Hashar; "Patch Set 1:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49133 [21:33:55] thx again:) [21:34:07] yup! [21:34:49] New review: Ori.livneh; "Patch Set 1: Code-Review+2" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/49164 [21:37:30] New patchset: MaxSem; "Enable concurrent GC for Solr" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49159 [21:38:14] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49164 [21:39:34] !log taking down srv266 and srv268 for decommissioning ... 
https://rt.wikimedia.org/Ticket/Display.html?id=4534 [21:39:35] Logged the message, Master [21:40:23] PROBLEM - Host srv268 is DOWN: PING CRITICAL - Packet loss = 100% [21:41:10] New review: Demon; "Patch Set 1:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49133 [21:43:04] Hey, sync-dir appears to have hung for me. I got asked for password on mw1194 and mw1200. [21:43:59] I think it proceeded to do the remaining hosts and it's just those two that are outstanding, so presumably ctrl-c wouldn't hurt. What is the proper way to handle this? [21:45:06] ori-l: RobH is setting up those hosts [21:45:20] you can skip them [21:45:24] The best way to handle it is to press enter as many times as there are password prompts [21:45:29] !log olivneh synchronized php-1.21wmf9/extensions/EventLogging [21:45:30] Logged the message, Master [21:45:34] Well, that was easy. [21:45:36] yea, they will fail and i will ensure i make them work when i push them later [21:45:38] Thanks! [21:45:39] That'll cause wrong password errors for those, but allows the others to continue [21:45:49] sorry about that guys [21:46:08] No problem. Thanks RoanKattouw and RobH. 
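Note on the Solr check quoted at 21:26 above ("loads of XML = OK", "infinite delay or HTTP 500 = bad"): a minimal sketch of that rule, assuming Python. The function names, the 10-second timeout and the "body starts with <" test for XML are illustrative assumptions, not anything from the channel. The host is left as a parameter because the log does not give one.

```python
import urllib.error
import urllib.request


def classify_solr_response(status, body, timed_out=False):
    """Apply the rule from the log: XML body = ok; timeout or HTTP error = bad."""
    if timed_out or status is None:
        return "bad"
    # Assumption: "loads of XML" is approximated as a 200 whose body starts with "<".
    if status == 200 and body.lstrip().startswith("<"):
        return "ok"
    return "bad"


def check_solr(host, port=8983, timeout=10.0):
    """Fetch stats.jsp from a Solr host (hypothetical helper) and classify it."""
    url = f"http://{host}:{port}/solr/admin/stats.jsp"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return classify_solr_response(resp.status, body)
    except urllib.error.HTTPError as err:
        # Covers the HTTP 500 case mentioned in the log.
        return classify_solr_response(err.code, "")
    except OSError:
        # Connection failures and socket timeouts (the "infinite delay" case).
        return classify_solr_response(None, "", timed_out=True)
```

Calling check_solr once per Solr host and alerting on "bad" would be one way to cover the monitoring gap mentioned at 21:27, though that is a suggestion rather than what was deployed.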
[21:46:14] RECOVERY - Host srv268 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [21:48:55] New patchset: Mattflaschen; "Add GuidedTour to be_x_oldwiki and mediawikiwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49169 [21:49:34] !log olivneh synchronized wmf-config/InitialiseSettings.php 'Enabling PostEdit on {be|mediawiki|outreach}wikis' [21:49:35] Logged the message, Master [21:49:53] New review: Ori.livneh; "Patch Set 1: Code-Review+2" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/49169 [21:50:35] PROBLEM - Apache HTTP on srv268 is CRITICAL: Connection refused [21:50:49] New patchset: Reedy; "(bug 44974) Add localised/v2 logos for Wikipedias without one (first installment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48952 [21:50:56] New review: Reedy; "Patch Set 2: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48952 [21:50:56] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48952 [21:52:14] New review: Ori.livneh; "Patch Set 1: Code-Review+2" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/49003 [21:53:31] New patchset: Reedy; "Add JP to Wikimedia Shop link config for sidebar" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48038 [21:53:39] New review: Reedy; "Patch Set 2: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48038 [21:54:14] New patchset: Reedy; "(bug 44843) Change English Wiktionary favicon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48351 [21:54:51] New review: Reedy; "Patch Set 3: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48351 [21:54:54] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48351 
[21:55:15] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48038 [21:56:00] New patchset: Reedy; "(bug 44103) Set up autopromotion to autoreview group on dewiktionary" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48634 [21:56:04] New review: Reedy; "Patch Set 2: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48634 [21:56:05] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48634 [21:56:29] New review: Ori.livneh; "Patch Set 1: Verified+2" [operations/mediawiki-config] (master); V: 2 - https://gerrit.wikimedia.org/r/49003 [21:56:29] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49003 [21:56:43] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49169 [21:57:03] New patchset: Reedy; "(bug 44032) Deploy Universal Language Selector to oldwikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47732 [21:57:07] New review: Reedy; "Patch Set 5: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/47732 [21:57:08] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/47732 [21:57:20] PROBLEM - Puppet freshness on virt0 is CRITICAL: Puppet has not run in the last 10 hours [22:00:52] !log olivneh synchronized wmf-config [22:00:53] Logged the message, Master [22:01:53] !log reedy synchronized docroot [22:01:54] Logged the message, Master [22:02:31] Can someone please run sync-common as root on mw1161 and mw1165 [22:03:26] !log replacing gluster upstarts on all labstore nodes with a regular init script [22:03:28] Logged the message, Master [22:06:17] New patchset: Reedy; "(bug 44587) Multiple changes for trwiki" [operations/mediawiki-config] (master) - 
https://gerrit.wikimedia.org/r/48841 [22:06:22] New review: Reedy; "Patch Set 2: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48841 [22:06:23] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/48841 [22:10:42] New patchset: Ori.livneh; "wgEventLoggingSchemaIndexUri on test2wiki should point at test2wiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49172 [22:12:28] New review: Ori.livneh; "Patch Set 1: Verified+2 Code-Review+2" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/49172 [22:12:29] Change merged: Ori.livneh; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49172 [22:14:08] !log olivneh synchronized php-1.21wmf9/extensions/Math [22:14:10] Logged the message, Master [22:16:09] !log olivneh synchronized wmf-config/CommonSettings.php 'Updating for test2wiki' [22:16:10] Logged the message, Master [22:16:24] New patchset: Reedy; "Language Template fixup definition for UploadWizard." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39026 [22:18:37] !log pushing mw1161-1188 into apache service (sans 1161/1164) from false to true in pybal, otherwise all online [22:18:39] Logged the message, RobH [22:18:55] MaxSem: hey, was at lunch [22:18:56] RECOVERY - Apache HTTP on srv268 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.062 second response time [22:18:58] still need halp? 
[22:19:18] New review: MaxSem; "Patch Set 1: Verified+2 Code-Review+2" [operations/debs/mod_tile] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48478 [22:19:18] Change merged: MaxSem; [operations/debs/mod_tile] (master) - https://gerrit.wikimedia.org/r/48478 [22:19:41] notpeter, yes please - with https://gerrit.wikimedia.org/r/49159 [22:19:52] please site no break [22:19:58] no reports of oddness, so yay [22:20:04] (my apaches arent breaking things it seems) [22:20:14] New review: MaxSem; "Patch Set 1: Verified+2 Code-Review+2" [operations/debs/osm2pgsql] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48479 [22:20:14] Change merged: MaxSem; [operations/debs/osm2pgsql] (master) - https://gerrit.wikimedia.org/r/48479 [22:20:29] and their active connections are actually moving this time [22:20:31] \o/ [22:21:05] New review: MaxSem; "Patch Set 1: Verified+2 Code-Review+2" [operations/debs/osm-mapnik-style] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/48480 [22:21:05] Change merged: MaxSem; [operations/debs/osm-mapnik-style] (master) - https://gerrit.wikimedia.org/r/48480 [22:21:26] !log taking storage1 offline for decommissioning per: https://rt.wikimedia.org/Ticket/Display.html?id=4529 [22:21:27] Logged the message, Master [22:21:52] MaxSem: ok, I'll deploy now [22:22:01] thanks! [22:22:02] New review: Pyoungmeister; "Patch Set 2: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/49159 [22:22:11] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/49159 [22:22:12] MaxSem: should I force puppet runs on anything? [22:22:20] hmm [22:22:59] they seem to hang at a 1-2 servers/day rate - I guess they can wait for 30 minutes [22:23:07] ok, cool [22:23:17] they will be restarted after that change anyway [22:23:27] okie dokie [22:23:34] * jeremyb_ is being made to do captcha on wikitechwiki [22:23:46] 21-2 = ? 
:) [22:24:11] MaxSem: ok, merged on sockpuppet now [22:26:16] jeremyb_, 21-2 = 19 ;) [22:26:28] MaxSem: thank you!!! [22:26:40] (but too late, i already saved) [22:29:59] !log mflaschen Started syncing Wikimedia installation... : [22:30:00] Logged the message, Master [22:34:39] PROBLEM - Apache HTTP on mw1200 is CRITICAL: Connection refused [22:34:39] PROBLEM - Apache HTTP on mw1194 is CRITICAL: Connection refused [22:37:09] !log mflaschen Finished syncing Wikimedia installation... : [22:37:10] Logged the message, Master [22:40:32] just over 7 minutes.. [22:41:55] Reedy, are you referring to my scap? [22:42:08] yeah [22:42:17] I actually started 20 minutes ago, but the start log message is way off. [22:42:21] lol [22:42:33] It is done, though. [22:44:05] !log mw1189-1199 (sans 1194) now online as eqiad api apaches [22:44:06] Logged the message, RobH [22:46:48] RECOVERY - Apache HTTP on mw1200 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.130 second response time [22:48:08] !log mw1200 added to api pool, online and in service [22:48:10] Logged the message, RobH [22:48:27] RECOVERY - Apache HTTP on mw1194 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.132 second response time [22:49:07] !log mw1194 online and now in service as api apache [22:49:08] Logged the message, RobH [22:49:30] before the first log message, it is all NFS latency [22:49:49] moving the source off NFS would make it much much faster [23:02:37] New patchset: Aaron Schulz; "Modified jobs-loop script to keep a fuller pipeline." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48981 [23:07:05] notpeter: ^ [23:07:15] last change I swear :p [23:09:07] Aaron|home: has it been tested? 
[23:09:25] on mw1010 [23:09:40] okie dokie [23:09:48] New patchset: Nemo bis; "(bug 44796) Updating logo for Telugu Wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49180 [23:10:29] cool, I'll merge now, then [23:11:06] New review: Pyoungmeister; "Patch Set 4: Code-Review+2" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/48981 [23:11:16] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/48981 [23:11:53] Aaron|home: merged [23:12:26] notpeter: \o/ are you messing with sockpuppet? [23:12:32] ?? [23:12:36] I merged it on sockpuppet [23:12:44] does that count as messing? [23:13:08] yes, very much :) [23:13:25] New patchset: Nemo bis; "(bug 44796) Updating logo for Telugu Wikisource" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49180 [23:13:45] notpeter: I mean forcing a puppet run [23:14:06] btw how often are runs? [23:18:46] Aaron|home: 'runinterval' in /etc/puppet/puppet.conf (specified in seconds); 30 minutes by default. 
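Note on the 'runinterval' answer above: a minimal sketch, assuming Python, of reading that setting from puppet.conf text. It assumes the file parses as INI with [agent] and [main] sections and that the value is a plain integer number of seconds; the function name is hypothetical. The 1800-second fallback is the 30-minute default stated in the log.

```python
import configparser

# 30 minutes, the default stated in the log above.
DEFAULT_RUNINTERVAL = 1800


def runinterval_seconds(conf_text):
    """Return runinterval in seconds from puppet.conf text, or the default."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    # Assumption: a value under [agent] takes precedence over one under [main].
    for section in ("agent", "main"):
        if parser.has_option(section, "runinterval"):
            # Only plain integers are handled; a value with a unit suffix
            # would raise ValueError here.
            return int(parser.get(section, "runinterval"))
    return DEFAULT_RUNINTERVAL
```

On a host this could be fed the contents of /etc/puppet/puppet.conf, the path given in the log.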
[23:19:40] Aaron|home: every 30 minutes [23:19:47] but sure, i can force them [23:33:01] New review: Nemo bis; "Patch Set 1:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/45273 [23:33:54] Feb 14 23:32:58 mw1010 t of memory [6773] [23:33:55] Feb 14 23:32:58 mw1010 t of memory [6772] [23:33:55] Feb 14 23:32:58 mw1010 kernel: [11732.508401] php[6772]: segfault at 2c ip 00007f703c7727a4 sp 00007fffd459de10 error 6 in libmemcached.so.10.0.0[7f703c758000+32000] [23:33:55] Feb 14 23:33:09 mw1010 /usr/sbin/gmond[1013]: slurpfile() read() buffer overflow on file /proc/stat [23:33:57] Thu Feb 14 23:33:49 UTC 2013 [23:34:01] that's now [23:34:04] I need to sleep [23:34:14] good luck and hope you get it sorted [23:34:48] Aaron|home: ^^ [23:35:09] New review: Mattflaschen; "Patch Set 1:" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/49003 [23:36:24] apergos: how long have those happened? [23:38:32] * Aaron|home wonders if the 'DOMXPath::query(): Memory allocation failed' bits are bug 39493 [23:38:59] PROBLEM - MySQL Replication Heartbeat on db53 is CRITICAL: CRIT replication delay 200 seconds [23:39:36] PROBLEM - MySQL Slave Delay on db53 is CRITICAL: CRIT replication delay 216 seconds [23:42:26] RECOVERY - MySQL Replication Heartbeat on db53 is OK: OK replication delay 0 seconds [23:43:11] RECOVERY - MySQL Slave Delay on db53 is OK: OK replication delay 0 seconds
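Note on the "fuller pipeline" discussion (20:09 to 20:11 and change 48981): a toy model, assuming Python, of why starting jobs in batches keeps fewer runners busy than refilling a slot as soon as it frees up. This is not the jobs-loop script; the function names and the job durations in the usage note are made up for illustration.

```python
import heapq


def batch_makespan(durations, procs):
    """Total time when jobs start in batches of `procs` and each batch
    must finish completely before the next batch starts."""
    total = 0
    for i in range(0, len(durations), procs):
        # A batch lasts as long as its slowest job.
        total += max(durations[i:i + procs])
    return total


def pipelined_makespan(durations, procs):
    """Total time when a new job starts as soon as any slot becomes free."""
    # Each heap entry is the time at which one runner slot becomes free.
    slots = [0] * procs
    heapq.heapify(slots)
    for duration in durations:
        free_at = heapq.heappop(slots)  # earliest available slot
        heapq.heappush(slots, free_at + duration)
    return max(slots)
```

With durations [5, 1, 1, 1, 1, 1] and procs=3, the batch model takes 6 time units (two slots sit idle while the 5-unit job finishes), and the pipelined model takes 5.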