[00:17:47] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [00:17:47] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [00:17:47] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [00:19:47] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [00:41:14] PROBLEM - Puppet freshness on mw1171 is CRITICAL: No successful Puppet run in the last 10 hours [01:01:35] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset -0.006460666656 secs [01:03:03] PROBLEM - Host mw1173 is DOWN: PING CRITICAL - Packet loss = 100% [01:08:01] hmm, getting 503 from bits.w.o [01:08:58] http://paste.debian.net/7212/ [01:09:02] New patchset: Hazard-SJ; "(bug 29902) Tidied up CommonSettings and InitialiseSettings" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65860 [01:09:29] just uploaded a new version of twinkle, I hope I didn't trigger something funny [01:16:23] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:16:43] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [01:16:46] want me to file a bug report instead? [01:17:06] don't know which component to file in then [01:19:23] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.467 second response time [01:19:23] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:20:17] New patchset: Hazard-SJ; "Added a comment for the wikidata hostname" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65861 [01:21:24] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.161 second response time [01:28:23] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:29:13] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.044 second response time [01:31:23] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:31:33] RECOVERY - NTP on ssl3003 is OK: NTP OK: Offset 0.003872513771 secs [01:32:43] RECOVERY - NTP on ssl3002 is OK: NTP OK: Offset -0.0047301054 secs [01:35:14] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.048 second response time [01:36:23] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:39:16] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.192 second response time [01:43:25] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:46:15] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.049 second response time [01:47:25] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:48:15] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time [01:48:25] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:49:15] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.001 second response time [01:49:25] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:50:15] RECOVERY - Apache HTTP on mw1151 is 
OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.068 second response time [01:57:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:58:25] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:59:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 5.912 second response time [02:00:25] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.127 second response time [02:05:48] !log LocalisationUpdate completed (1.22wmf4) at Wed May 29 02:05:47 UTC 2013 [02:05:57] Logged the message, Master [02:07:29] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:09:28] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.891 second response time [02:09:38] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:10:28] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.406 second response time [02:10:52] !log LocalisationUpdate completed (1.22wmf5) at Wed May 29 02:10:48 UTC 2013 [02:10:58] Logged the message, Master [02:16:28] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:18:28] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.031 second response time [02:25:38] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:26:28] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:27:29] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.879 second response time [02:27:29] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.836 second response time [02:31:28] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:32:28] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 6.543 second response time [02:33:19] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed May 29 02:33:19 UTC 2013 [02:33:28] Logged the message, Master [02:34:38] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:36:26] PROBLEM - Apache HTTP on mw1151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:36:26] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.901 second response time [02:37:17] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.175 second response time [02:40:26] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:42:16] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.305 second response time [02:48:10] I'm seeing lots of "Fatal error: Allowed memory size of 183500800 bytes exhausted" in fatal.log for enwiki load.php URLs, starting at about 01:04. I wonder if the edit to en:MediaWiki:Gadget-Twinkle.js at 01:04 UTC pushed things over a threshold; all the URLs seem to include reference to Twinkle, and even https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=ext.gadget.Twinkle&skin=vector&version=20130529T013531Z&* bombs out. 
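The failing module request above can be reproduced from the command line; a minimal sketch using the exact load.php URL from the report, checking only the returned status code:

    curl -s -o /dev/null -w '%{http_code}\n' \
      'https://en.wikipedia.org/w/load.php?debug=false&lang=en&modules=ext.gadget.Twinkle&skin=vector&version=20130529T013531Z&*'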
[02:50:45] There are several reports of Twinkle being broken. [02:50:53] And the entire site is sluggish due to bits.wikimedia.org being slow. [02:51:01] But nobody seems to be around. [02:51:26] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:51:37] PROBLEM - Apache HTTP on mw1152 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:52:16] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.060 second response time [02:52:26] RECOVERY - Apache HTTP on mw1152 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time [02:58:28] PROBLEM - Apache HTTP on mw1149 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:59:05] the errors reference jsminplus.php, so i'm guessing something in twinkle or something else is proving difficult to minify [03:00:03] ori-l: Or else having a half-megabyte gadget is just too much. [03:00:16] RECOVERY - Apache HTTP on mw1149 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.656 second response time [03:00:28] can't an enwiki admin just roll it back or blank it temporarily? [03:01:10] ori-l: I'd suggest reverting the most recent edit. I'd do it, except I'm suffering from hat confusion. [03:01:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:02:03] * anomie doesn't have enwiki admin with staff hat, but isn't sure that volunteer hat can be worn after doing staffy-stuff to diagnose the problem [03:02:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.137 second response time [03:08:12] i'd revert it and worry about hats later if it's affecting the production site in a serious way [03:08:22] i'm trying to figure out the exact cause meanwhile [03:09:22] * anomie talks to self [03:09:43] .... And silence in the log. [03:16:59] well, if i time -v a run of jsminplus.php against the twinkle source, i get [03:17:11] Maximum resident set size (kbytes): 277840 [03:17:22] or ~271 megabytes [03:17:52] which is over the maximum [03:18:02] so twinkle is indeed a likely culprit [03:20:38] I ran into that locally before [03:21:43] * Aaron|home ended up disabling jsminplus or something and people like trevor seemed surprised we still used it IIRC [03:23:36] anomie: thanks [03:24:42] now i'll pretend like i know how to use valgrind and see if i can figure out why memory consumption balloons with that revision [03:30:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:31:01] running jsminplus on twinkle with xdebug.profiler_enable = 1 is giving my lap second degree burns [03:32:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [03:35:39] ori-l, do you know if jsmin runs before or after files in a module are concatenated together? [03:36:02] If before, it might help for them to use multiple files, which is allowed (even for a gadget). [03:36:43] They already have modules (a lot) in the git repo. [03:38:22] the call is in ResourceLoader.php, ~line 840 [03:41:35] it finished, cachegrind file is 153M [03:43:23] ori-l, it minifies the whole module at once. [03:43:40] It's a simpler architecture, but doing it the other way might allow better performance. [03:45:02] Though I think that's a breaking change. 
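A rough sketch of the profiling run mentioned above, assuming xdebug 2.x option names and that jsminplus.php has been given a small command-line entry point (a one-line JSMinPlus::minify() call appended to the file, as described later in this log); the output directory is an arbitrary choice:

    php -d xdebug.profiler_enable=1 -d xdebug.profiler_output_dir=/tmp \
        jsminplus.php twinkle.js
    ls -lh /tmp/cachegrind.out.*   # inspect with kcachegrind/qcachegrind or webgrind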
[03:50:18] Actually, looking at ResourceLoaderWikiModule::getScript(), it looks like it calls jsminplus on each page in the module separately, then concatenates them all. [03:56:00] anomie|away: right, for syntax checking [04:00:17] anomie|away, you're right, I was looking at the wrong ResourceLoader subclass. [04:00:52] However, like Aaron says, the actual minification isn't done there. [04:01:24] This suggests that if Twinkle used multiple files in the gadget, it might solve the problem. [04:17:27] PROBLEM - Puppet freshness on db1032 is CRITICAL: No successful Puppet run in the last 10 hours [04:23:29] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [04:23:29] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [04:23:29] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [04:26:27] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [04:26:28] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [04:38:32] PROBLEM - RAID on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:39:31] RECOVERY - RAID on mc15 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [05:02:51] PROBLEM - NTP on ssl3003 is CRITICAL: NTP CRITICAL: No response from NTP server [05:03:02] PROBLEM - NTP on ssl3002 is CRITICAL: NTP CRITICAL: No response from NTP server [05:07:09] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [05:07:49] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 0.45 ms [05:11:09] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [06:18:09] PROBLEM - Puppet freshness on ms-fe3001 is CRITICAL: No successful Puppet run in the last 10 hours [06:19:09] PROBLEM - Puppet freshness on erzurumi is CRITICAL: No successful Puppet run in the last 10 hours [06:22:09] PROBLEM - Puppet freshness on cp1029 is CRITICAL: No successful Puppet run in the last 10 hours [06:39:36] New review: Physikerwelt; "Thanks for your comments." 
[operations/puppet] (production) - https://gerrit.wikimedia.org/r/61767 [07:36:43] New patchset: Petrb; "new tool for easy sql replica access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65634 [07:36:44] New patchset: Petrb; "inserted 7z to list of packages on dev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65876 [10:09:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:10:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [10:17:52] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: No successful Puppet run in the last 10 hours [10:17:52] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: No successful Puppet run in the last 10 hours [10:17:52] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: No successful Puppet run in the last 10 hours [10:19:52] PROBLEM - Puppet freshness on mc15 is CRITICAL: No successful Puppet run in the last 10 hours [10:26:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:27:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [10:41:49] PROBLEM - Puppet freshness on mw1171 is CRITICAL: No successful Puppet run in the last 10 hours [10:42:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:43:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [10:45:40] PROBLEM - RAID on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:46:29] RECOVERY - RAID on mc15 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [10:56:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:57:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time [11:13:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:14:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [11:22:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [11:31:34] paravoid, i suspect that Zero refactoring your did is not working. Can we revert it for a day? [11:33:35] why do you suspect that? [11:35:15] New patchset: Yurik; "Revert "Merge "Varnish: separate Zero-related carrier stuff" into production"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65961 [11:37:17] mark, because i tried spoofing an IP from hacakthon, and it worked for a bit, and then it stopped working [11:37:45] and i think that was the period after first patch was deployed but before refactoring [11:37:52] mark - https://gerrit.wikimedia.org/r/#/c/65961/ [11:37:54] New review: Mark Bergsma; "Could you please provide some data / log output / debugging aids so we can actually look at the prob..." 
[operations/puppet] (production) C: -2; - https://gerrit.wikimedia.org/r/65961 [11:38:01] my temp revert [11:38:09] yeah we're not going to merge that [11:38:22] to test -- spoof XFF header, proxy through bast1001 or from office, and access zero.wikipedia.org [11:39:05] mark, why not? [11:39:22] because I'd rather look at the actual problem instead of going on your hunch [11:39:55] mark, i would love to look at the actual problem, but if this is the case, we are currently failing on all opera users [11:40:14] which is much bigger issue than trying to test things in production :) [11:40:45] hence - i think a temporary revert of a complex change is a better testing strategy [11:47:59] New review: Yurik; "To test:" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65961 [11:50:22] mark, i updated the patch with testing strategy, but i still think it is not a great idea to test complex changes in production like that refactoring (Although I totally support that separation) https://gerrit.wikimedia.org/r/#/c/65961/ [11:53:09] yurik: I just confirmed that opera XFF spoofing is working fine [11:54:35] mark, in that case i really don't understand why it was working for a few minutes during hackathon, but hasn't worked since. Could you see if connecting from the office network allows spoofing? [11:54:57] or should we wait for SF to wake up [11:55:02] no, because I'm not in the office [11:57:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:58:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [12:00:09] yurik: i've just confirmed spoofing is broken when connecting using IPv6 [12:00:14] if you connect using IPv4, it works fine [12:01:15] Change abandoned: Mark Bergsma; "Abandoning this because XFF spoofing brokenness is caused by IPv6, and has nothing to do with this c..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65961 [12:01:43] mark, thanks! how did you force it to use IPv4 over proxy? [12:01:49] i didn't use a proxy [12:02:20] I wonder if this is a bug in that C code that can't convert from [12:02:33] string to IP [12:02:33] i'm sure it is [12:02:44] it's not exactly very good code [12:03:02] fortunately it will be replaced soon [12:03:09] yep, not at all. Can't wait :) [12:03:22] ok, will try to force ip4 for testing [12:04:19] you could write a script that runs on fenari or bast1001 and does that, and only uses ipv4 [12:04:31] because, you know, you might make a mistake when doing it manually [12:05:04] script to automate testing? my testing involves lots of various browsing [12:05:31] as long as all requests are treated as coming from a predefined IP, its ok [12:05:45] will see if i can force proxy to ipv4 all the time [12:05:48] that's exactly something you could automate for the test I would think [12:07:11] i need to spoof carrier's IP, after which visit many different sites and click many links to see if the new functionality works. It might be good to automate browsing testing in general, but might be too complex at this point. Besides, most of the time I just need to verify that an issue carrier is seeing is really there [12:10:05] anyway, mark thanks for all your help! I still don't know how it was working before, but will deal with it later. Good thing zero is not broken [12:10:24] * yurik is off to break it in today's deployment [12:12:57] So... [12:13:01] what's the verdict? [12:13:30] AzaToth, abt what? 
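A sketch of the repeatable, IPv4-only check suggested above (run from bast1001 or fenari); the spoofed carrier IP and the Opera User-Agent are the values used later in this log, and the output file name is arbitrary:

    curl -4 -s -D - -o /tmp/zero-test.html \
      -H 'X-Forwarded-For: 115.164.0.0' \
      -H 'User-Agent: Opera' \
      'http://en.zero.wikipedia.org/wiki/Main_Page?zero'
    # -4 forces IPv4; inspect the dumped headers and /tmp/zero-test.html for the expected carrier banner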
[12:13:46] twinkle and the bits fuckup [12:15:23] yurik: http://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Problem_using_Twinkle [12:15:45] see Bjorsch comment [12:16:00] he here? [12:16:29] anomie|away: ping [12:16:32] AzaToth, no idea about twinkle, mark seems to be around [12:16:43] anomie is probably still asleep [12:16:53] westcoaster? [12:16:59] east [12:17:08] ok [12:17:19] then again, we all just got back from europe, so might be getting up early [12:17:31] so if he is a normal being, he should get awake any time now [12:17:47] awake != online [12:17:55] no? [12:18:30] AzaToth: It looks like Twinkle just got too large, so PHP was running out of memory when processing the code. Probably splitting it up into multiple files will work. [12:18:52] is the limit per file? [12:20:01] Well, the place where it was running out of memory processes one file at a time. And a ResourceLoader-using gadget can consist of multiple files. [12:20:54] yea, we might want to return to the multiple files layout again [12:24:21] anomie|away: for the moment, would it be possible to increase the limit some, at least until we've split it up? [12:25:00] AzaToth: You'd have to ask an ops person about that. I wouldn't count on it though. [12:25:45] afaik it's a bug if bits returns 503 just because the script "is too big" [12:26:48] it's only 526K [12:27:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:28:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [12:31:31] The jsminplus thing is for syntax checking in user scripts or something like that [12:31:47] It's somewhat experimental and I think we've disabled it before [12:40:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:41:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.127 second response time [13:01:35] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:02:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.126 second response time [13:07:13] New patchset: BBlack; "A few minor fixups..." [operations/software/varnish/vhtcpd] (master) - https://gerrit.wikimedia.org/r/65972 [13:07:13] New patchset: BBlack; "Bump version to 0.0.5, add NEWS file" [operations/software/varnish/vhtcpd] (master) - https://gerrit.wikimedia.org/r/65973 [13:11:43] anomie|away: which ops should I ask? [13:13:49] AzaToth: Not sure. Probably just ask in general in about 3 hours, when the SF people start getting in.
[13:13:59] okidoki [13:16:32] Change merged: BBlack; [operations/software/varnish/vhtcpd] (master) - https://gerrit.wikimedia.org/r/65972 [13:16:45] Change merged: BBlack; [operations/software/varnish/vhtcpd] (master) - https://gerrit.wikimedia.org/r/65973 [13:18:03] New patchset: BBlack; "Merge branch 'master' into debian" [operations/software/varnish/vhtcpd] (debian) - https://gerrit.wikimedia.org/r/65974 [13:18:04] New patchset: BBlack; "bump pkg version, fix a lintian warning" [operations/software/varnish/vhtcpd] (debian) - https://gerrit.wikimedia.org/r/65975 [13:18:16] Change merged: BBlack; [operations/software/varnish/vhtcpd] (debian) - https://gerrit.wikimedia.org/r/65974 [13:18:27] Change merged: BBlack; [operations/software/varnish/vhtcpd] (debian) - https://gerrit.wikimedia.org/r/65975 [14:05:24] !log ms-be4..fixing raid cfg [14:05:33] Logged the message, Master [14:07:09] PROBLEM - Host ms-be4 is DOWN: PING CRITICAL - Packet loss = 100% [14:07:19] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:08:19] RECOVERY - RAID on searchidx1001 is OK: OK: State is Optimal, checked 4 logical device(s) [14:17:49] PROBLEM - Puppet freshness on db1032 is CRITICAL: No successful Puppet run in the last 10 hours [14:21:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:23:00] RECOVERY - Host ms-be4 is UP: PING OK - Packet loss = 0%, RTA = 26.59 ms [14:23:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.130 second response time [14:23:49] PROBLEM - Puppet freshness on virt1 is CRITICAL: No successful Puppet run in the last 10 hours [14:23:49] PROBLEM - Puppet freshness on virt3 is CRITICAL: No successful Puppet run in the last 10 hours [14:23:49] PROBLEM - Puppet freshness on virt4 is CRITICAL: No successful Puppet run in the last 10 hours [14:25:10] PROBLEM - swift-object-auditor on ms-be4 is CRITICAL: Timeout while attempting connection [14:25:10] PROBLEM - swift-account-auditor on ms-be4 is CRITICAL: Timeout while attempting connection [14:25:10] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: Timeout while attempting connection [14:25:10] PROBLEM - swift-account-reaper on ms-be4 is CRITICAL: Timeout while attempting connection [14:25:19] PROBLEM - swift-object-replicator on ms-be4 is CRITICAL: Timeout while attempting connection [14:25:20] PROBLEM - swift-object-server on ms-be4 is CRITICAL: Timeout while attempting connection [14:25:29] PROBLEM - swift-container-server on ms-be4 is CRITICAL: Timeout while attempting connection [14:25:30] PROBLEM - RAID on ms-be4 is CRITICAL: Timeout while attempting connection [14:25:30] PROBLEM - SSH on ms-be4 is CRITICAL: Connection timed out [14:25:39] PROBLEM - swift-container-updater on ms-be4 is CRITICAL: Timeout while attempting connection [14:25:39] PROBLEM - swift-container-replicator on ms-be4 is CRITICAL: Timeout while attempting connection [14:25:39] PROBLEM - Disk space on ms-be4 is CRITICAL: Timeout while attempting connection [14:25:59] PROBLEM - swift-account-replicator on ms-be4 is CRITICAL: Timeout while attempting connection [14:25:59] PROBLEM - DPKG on ms-be4 is CRITICAL: Timeout while attempting connection [14:25:59] PROBLEM - swift-object-updater on ms-be4 is CRITICAL: Timeout while attempting connection [14:25:59] PROBLEM - swift-account-server on ms-be4 is CRITICAL: Timeout while attempting connection [14:26:28] what's the status with ms-bes? 
[14:26:48] are you replacing the last c2100s? [14:26:49] PROBLEM - Puppet freshness on pdf1 is CRITICAL: No successful Puppet run in the last 10 hours [14:26:50] PROBLEM - Puppet freshness on pdf2 is CRITICAL: No successful Puppet run in the last 10 hours [14:29:39] PROBLEM - Host ms-be4 is DOWN: PING CRITICAL - Packet loss = 100% [14:40:09] RECOVERY - Host ms-be4 is UP: PING OK - Packet loss = 0%, RTA = 26.55 ms [14:40:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:41:06] New patchset: BBlack; "Turn on vhtcpd in production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65991 [14:41:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time [14:44:29] ^ if anyone wants to review that for stupid errors, please do ( https://gerrit.wikimedia.org/r/65991 ) [14:46:48] PROBLEM - Host ms-be4 is DOWN: PING CRITICAL - Packet loss = 100% [14:48:42] New review: Faidon; "LGTM" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/65991 [14:52:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:52:30] bblack: I'd say go for it :-) [14:54:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [14:59:25] New review: BBlack; "LGTMT" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/65991 [14:59:26] Change merged: BBlack; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65991 [15:01:34] !log deployed vhtcpd puppet stuff via sockpuppet... [15:01:42] Logged the message, Master [15:03:44] of course, now I realize the problem with my commit. the new daemon doesn't like the hostname "localhost", wants 127.0.0.1 :P [15:03:58] RECOVERY - Varnish HTCP daemon on cp1029 is OK: PROCS OK: 0 processes with UID = 997 (varnishhtcpd), args varnishhtcpd worker [15:05:29] New patchset: BBlack; "s/localhost/127.0.0.1/g for varnish::htcppurger args" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65996 [15:06:36] New review: BBlack; "This corrects an issue with my previous related commit, already pushed. New daemon doesn't understa..." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/65996 [15:07:12] New review: Hashar; "that is unfortunate :-)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/65996 [15:07:59] RECOVERY - Host ms-be4 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms [15:08:23] shouldn't jenkins have verified that by now? [15:09:25] ah there it goes [15:09:40] Change merged: BBlack; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/65996 [15:12:09] PROBLEM - Puppet freshness on db45 is CRITICAL: No successful Puppet run in the last 10 hours [15:14:49] PROBLEM - Host google is DOWN: PING CRITICAL - Packet loss = 100% [15:16:10] RECOVERY - Host google is UP: PING OK - Packet loss = 0%, RTA = 17.92 ms [15:17:29] PROBLEM - RAID on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:19:19] RECOVERY - RAID on mc15 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [15:19:31] PROBLEM - Disk space on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
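A quick post-deploy sanity check along these lines can confirm the daemon picked up the corrected argument; the process name "vhtcpd" is an assumption based on the package name above:

    ps -C vhtcpd -o pid=,args=   # the args should now show 127.0.0.1 rather than "localhost"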
[15:20:19] RECOVERY - Disk space on mc15 is OK: DISK OK [15:22:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:22:39] PROBLEM - NTP on ms-be4 is CRITICAL: NTP CRITICAL: No response from NTP server [15:23:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [15:23:30] PROBLEM - RAID on mc15 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:24:29] RECOVERY - RAID on mc15 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [15:26:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:27:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [15:48:37] New review: Alex Monk; "Why are you adding an extra space before each comment?" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/65860 [15:56:43] any ops around atm with time to spare? [15:57:27] wanna discuss the bits issue [15:59:43] AzaToth: i'm not ops, but i investigated it a bit yesterday [16:00:02] jsminplus.php's memory usage balloons to ~270 megabytes with the problematic revision of twinkle [16:00:12] i have cachegrind files and some other profiling data [16:00:13] wtf? [16:00:25] 270MB? [16:00:34] well, 271 MB [16:00:40] to be more exact :P [16:00:47] that's like 500 times the code size ヾ [16:01:26] there's a bad leak in jsminplus that we should fix [16:01:35] the current version (the one reverted to) how much does it use? [16:01:42] is it like 269MB? [16:02:11] i don't know, but i can check. let me first of all capture all the data i have that was outputted to a terminal into a file, so i don't lose it [16:02:12] or did I write some kind of code it choked on [16:02:15] i'll pastebin it for you [16:02:19] ok [16:05:08] Reedy: greg-g https://gerrit.wikimedia.org/r/#/c/65988/ and https://gerrit.wikimedia.org/r/#/c/66006/ needed for deployment [16:05:18] also we have a settings change [16:13:58] New patchset: Aude; "Add dataSquidMaxage Wikibase setting for entity data special page" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66007 [16:14:22] and https://gerrit.wikimedia.org/r/#/c/66007/ [16:14:26] greg-g: Reedy ^ [16:16:22] AzaToth: http://paste.debian.net/7361/ [16:16:47] to run jsminplus.php on its own, i just added the line "$minified = JSMinPlus::minify( file_get_contents('twinkle.js' ) ); " the the bottom of it [16:17:18] !log replacing failed disk on dataset1001 slot9 [16:17:26] Logged the message, Master [16:18:00] AzaToth: let me also confirm the exact revision i was checking; a bit stupid of me not to note that [16:19:23] AzaToth: it's revision 557261853 [16:20:22] superm401 noted yesterday that you may be able to dodge the problem by splitting up the gadget into multiple pages, but I think it'd be better to fix jsminplus if we could [16:21:36] http://en.wikipedia.org/w/index.php?diff=557261853 [16:22:16] where (if even) is jsminplus available from? [16:22:17] yeah, i was just about to check memory usage again the previous revision [16:23:04] AzaToth: it's a proprietary, patented software we can license to you for a fee with DRM restrictions [16:23:15] just joking, it's in core, under includes/libs [16:23:18] hehe [16:23:56] and before you waste your time checking: i looked, and what we use is the latest version available [16:24:40] :( /me wanna waste time [16:25:24] has it modified narcissus? 
[16:25:36] I know narcissus is buggy [16:26:16] (narcissus can't handle breaks in subblocks of switch statements [16:27:23] erm, meant continue [16:28:10] nooooo idea [16:30:24] ori-l: 2.2s for me [16:30:56] the time, you mean? but it wasn't the execution time limit that was exceeded; rather the memory limit. [16:32:24] now the previous version seems even worse, strange [16:32:29] $ /usr/bin/time -v php jsminplus.php twinkle-557226569.js 2>&1 | grep Maximum [16:32:29] Maximum resident set size (kbytes): 280432 [16:32:36] $ /usr/bin/time -v php jsminplus.php twinkle-557261853.js 2>&1 | grep Maximum [16:32:37] Maximum resident set size (kbytes): 215552 [16:33:39] Maximum resident set size (kbytes): 37240 [16:34:17] huh. you edited the file to make it actually parse the twinkle file, right? [16:34:22] by itself it's just an inert class [16:34:37] add "JSMinPlus::minify( file_get_contents( $argv[1] ) );" at the end of the file [16:35:09] I added "$minified = JSMinPlus::minify( file_get_contents('/home/azatoth/build/twinkle/twinkle.js' ) );" [16:35:42] output of php -v? [16:35:42] azatoth@azaboxen:~/build/twinkle «(257276e...) $»$ ll /home/azatoth/build/twinkle/twinkle.js [16:35:42] -rw-r--r-- 1 azatoth azatoth 537982 maj 29 18:34 /home/azatoth/build/twinkle/twinkle.js [16:35:48] 5.4.4-14+deb7u1 [16:35:59] i'm running PHP 5.3.10-1ubuntu3.6+wmf1 with Suhosin-Patch (cli) (built: May 15 2013 23:19:18), same as prod [16:36:26] I'm using "PHP 5.4.4-14+deb7u1 (cli) (built: May 23 2013 09:11:48)" [16:37:07] ok, let me try to upgrade the version of php on the same machine, and try again. if i get the same result as you we know it's a memory management issue that was fixed in php [16:38:34] and it generated minified output (402K) [16:39:20] ori-l: I'm using twinkle.js generated from git directly though, not downloading it from the wiki [16:39:26] dunno if it's relevant or not [16:41:21] AzaToth: well, it just adds further potential for error; just run [16:41:21] curl "https://en.wikipedia.org/w/index.php?title=MediaWiki:Gadget-Twinkle.js&action=raw&oldid=557261853" -o twinkle-557261853.js [16:44:43] azatoth@azaboxen:~/build/core/includes/libs «master *% u=»$ command time -v php jsminplus.php twinkle-557261853.js 2>&1| grep Maximum [16:44:43] Maximum resident set size (kbytes): 35976 [16:45:51] ori-l: seems consistent [16:46:17] i'm still upgrading to 5.4 [16:46:23] okai [16:48:08] ori-l: manual or via package? [16:48:40] package, from the https://launchpad.net/~ondrej/+archive/php5 ppa [16:50:18] $ command time -v php jsminplus.php twinkle-557261853.js 2>&1 | grep Maximum [16:50:18] Maximum resident set size (kbytes): 150320 [16:50:26] uhhhh. [16:50:31] this is with 5.4.15. [16:51:05] I don't have suhosin btw [16:51:34] neither does this build of 5.4 that i'm now running. i'm trying to figure out how to run php without loading any extensions [16:51:55] in /etc/php [16:52:22] you need to break the symlinks and for cli/conf.d remove all [16:52:54] i think --no-php-ini does the trick [16:53:42] Maximum resident set size (kbytes): 30992 [16:53:44] still 127 megabytes [16:53:47] with --no-php-ini [16:55:13] * AzaToth scratches head [16:55:44] while 127 MB is better than 271MB, it's still way higher than 35MB [16:56:28] ori-l: 32 or 64bit? [16:56:34] 64 [16:56:45] * AzaToth too [16:56:59] pastebin the output of php-config? [16:57:35] http://paste.debian.net/7371/ [17:02:16] ori-l: you are not out of RAM so it needs to swap or something? [17:02:27] you have swap space (even if minimal) on the server?
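Putting the commands above together, the standalone reproduction looks roughly like this; the revision number, the curl fetch, and the appended one-line entry point are taken from the discussion, while the memory_limit override is an extra assumption so the measurement itself does not fatal:

    curl -s 'https://en.wikipedia.org/w/index.php?title=MediaWiki:Gadget-Twinkle.js&action=raw&oldid=557261853' \
      -o twinkle-557261853.js
    # append the CLI entry point inside the PHP block at the end of includes/libs/jsminplus.php:
    #   JSMinPlus::minify( file_get_contents( $argv[1] ) );
    command time -v php -n -d memory_limit=256M jsminplus.php twinkle-557261853.js 2>&1 | grep Maximum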
[17:02:50] about to start deployment, any issues with tin? [17:03:48] yurik: other than the submodules issue Reedy mentioned on engineering@ [17:04:04] greg-g, is that for new submodules? [17:04:08] add a "nothing that I know of" to the beginning of that sentence [17:04:10] i need to deploy zero extension [17:05:00] I think it is just new ones... [17:05:20] Reedy: clarify for yurik ^ [17:05:22] ok, will let you know if there are any issues [17:05:59] AzaToth: yes to both. i was running under a VM but just tried again on my actual environment (php 5.3, 8gb ram), and Maximum resident set size (kbytes): 251658240 [17:06:38] ori-l: all I can say is that it works for me™ [17:06:51] yurik: god speed, good sir. [17:07:00] thanks greg-g ! [17:07:01] AzaToth: we just need to redirect traffic to your IP [17:07:05] hehe [17:07:12] * AzaToth sends the bill [17:07:37] ori-l: anyone else here with a higher level of intelligence regarding matters in hand? [17:08:28] yes, but they're keeping quiet :) [17:08:40] gif me names! [17:11:25] heh. i have to head out to the office (this started out as a "quick thing while i have a coffee"..) but if you post a detailed description to wikitech-l (include links to the function trace) someone will reply [17:12:08] fwiw, i also just ran it on one of the mw apaches where it took 181 megabytes [17:12:22] btw, I never got the function trace [17:12:54] ori-l: http://paste.debian.net/7376/ [17:13:09] perhaps I missed config something [17:13:12] http://paste.debian.net/7361/ scroll down [17:13:25] the full trace is huge [17:14:01] i'll upload it from the office later if you want [17:14:12] just wondered how you got that trace [17:15:00] if you have access to labs, try it on a labs instance, and see if you can add a judicious unset() or gc_collect_cycles() somewhere to resolve it [17:15:26] can see [17:15:27] though there's probably something fundamentally dumb about how it endlessly creates copies of the code from the location of the cursor to eof [17:15:46] dunno which instances I have access to [17:15:48] i got the trace by using xdebug and then running tracefile-analyser.php, which you can find in the contrib/ dir of the xdebug repo [17:15:55] k [17:16:03] dir-syncing zero ext... [17:16:14] but you got 180+MB witout debug trace? [17:16:29] yeah [17:17:34] !log yurik synchronized php-1.22wmf4/extensions/ZeroRatedMobileAccess/ [17:17:43] Logged the message, Master [17:17:48] mw1171: ssh_exchange_identification: Connection closed by remote host [17:17:50] mw1173: ssh: connect to host mw1173 port 22: Connection timed out [17:19:35] AzaToth: i added you to the editor-engagement labs project so you don't have to wait for an instance/project to be created. you can use https://wikitech.wikimedia.org/wiki/Nova_Resource:I-000003dd and do whatever you want to it [17:20:08] but: behave! :) [17:20:17] hehe [17:22:04] * ori-l will bbl [17:26:37] mark mark- -- dr0ptp4kt tried to spoof ip and it doesn't look like its working. Are you sure about IPv4? 
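For the function trace mentioned above, something along these lines should work, assuming xdebug 2.x option names (the output location is arbitrary, and the analyser's exact arguments should be checked against its own usage output):

    php -d xdebug.auto_trace=1 -d xdebug.trace_format=1 \
        -d xdebug.trace_output_dir=/tmp -d xdebug.trace_output_name=jsminplus \
        jsminplus.php twinkle-557261853.js
    # then feed /tmp/jsminplus.xt to contrib/tracefile-analyser.php from the xdebug repo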
[17:27:00] adam did it from the office - 216.38.130.163 [17:27:40] paravoid, ^^ [17:28:08] mark, the X-Forwarded-For header i specified while testing with yurik was 115.164.0.0 [17:28:22] +paravoid ^^ [17:28:39] for a split moment during hackathon it was working, but then the refactoring patch went in, and either that or something else stopped handling the spoofing [17:40:22] !log yurik synchronized php-1.22wmf4/extensions/ZeroRatedMobileAccess/ [17:40:30] Logged the message, Master [17:46:19] !log mw1085 powering down to reseat DIMM [17:46:27] Logged the message, Master [18:09:24] !log reedy synchronized database lists files: private and wikivoyage to 1.22wmf5 [18:09:33] Logged the message, Master [18:09:53] Reedy: poke [18:10:23] :) [18:10:40] https://gerrit.wikimedia.org/r/#/c/66006/ https://gerrit.wikimedia.org/r/#/c/65988/ and https://gerrit.wikimedia.org/r/#/c/66007/ [18:15:54] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: closed, special and wikimedia to 1.22wmf5 (not wikidata) [18:16:03] Logged the message, Master [18:21:15] dr0ptp4kt: yurik: if you used that IP, did you also specify a User-Agent of Opera? [18:21:29] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikibooks and wikiversity to 1.22wmf5 [18:21:38] Logged the message, Master [18:21:57] mark, User agent was set as android i think - but that If statement should have worked as it only checks if XFF is set [18:22:08] that if statement works [18:22:11] but X-CS wouldn't get set [18:24:02] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikisource and wiktionary to 1.22wmf5 [18:24:09] Logged the message, Master [18:27:59] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikiquote and wikinews to 1.22wmf5 [18:28:07] Logged the message, Master [18:31:49] mark, good point, will try with the useragent of opera [18:36:20] mark, yurik, no luck with user agent of opera and forged xff [18:37:54] it most definitely works for me [18:38:03] GET /wiki/Main_Page?zero HTTP/1.1 [18:38:03] Host: en.zero.wikipedia.org [18:38:03] Connection: close [18:38:03] X-Forwarded-For: 115.164.0.0 [18:38:03] User-Agent: Opera [18:38:12] the ?zero is just to find it back in the log [18:45:22] New patchset: Dr0ptp4kt; "Adding Urdu as a supported language for Mobilink Pakistan." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66026 [18:47:39] mark, uno momento, will try that [18:48:06] mark, i've been using /wiki/Special:ZeroRatedMobileAccess? 
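mark's request above can also be replayed byte-for-byte to rule out client-side differences; a sketch using the OpenBSD netcat variant, forced onto IPv4 over plain HTTP (adjust the target host if testing against a specific cache):

    printf 'GET /wiki/Main_Page?zero HTTP/1.1\r\nHost: en.zero.wikipedia.org\r\nConnection: close\r\nX-Forwarded-For: 115.164.0.0\r\nUser-Agent: Opera\r\n\r\n' \
      | nc -4 en.zero.wikipedia.org 80 | head -n 20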
[18:48:48] New review: Reedy; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66007 [18:52:33] mark, i'm getting the "Sorry" warning about IP address 216.38.130.163 with the following submission: [18:52:37] GET /wiki/Pearl_Jam?cachebuster=13 HTTP/1.1 [18:52:37] Host: en.zero.wikipedia.org [18:52:39] User-Agent: Opera [18:52:40] Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 [18:52:42] Accept-Language: en-us,en;q=0.5 [18:52:43] Accept-Encoding: gzip, deflate [18:52:45] DNT: 1 [18:52:46] X-Forwarded-For: 115.164.0.0 [18:52:46] Connection: keep-alive [18:53:07] New review: Aude; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66007 [18:55:21] !log reedy synchronized php-1.22wmf4/extensions/Wikibase [18:55:30] Logged the message, Master [18:56:16] !log reedy synchronized php-1.22wmf5/extensions/Wikibase [18:56:23] Logged the message, Master [18:56:53] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66007 [18:57:46] !log reedy synchronized wmf-config/CommonSettings.php [18:57:54] Logged the message, Master [18:58:50] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikidatawiki to 1.22wmf5 [18:58:54] :) [18:58:58] Logged the message, Master [18:59:42] oh, we might have another settings change [18:59:43] Reedy: [18:59:47] one sec [19:00:18] New patchset: Reedy; "Anything non 'pedia to 1.22wmf5" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66029 [19:00:26] aude: I wasn't about to run away :p [19:01:22] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66029 [19:01:50] New patchset: Aude; "remove data types settings, use default setting in Wikibase extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66030 [19:02:13] https://gerrit.wikimedia.org/r/#/c/66030/ [19:02:15] ok [19:02:31] this is needed to enable the time data type [19:05:24] New patchset: Reedy; "remove data types settings, use default setting in Wikibase extension" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66030 [19:05:56] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66030 [19:07:01] !log reedy synchronized wmf-config/CommonSettings.php [19:07:09] Logged the message, Master [19:16:23] New review: Yurik; "I am checking in a different patch that removes all conditional stuff - now it will simply match up ..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/66026 [19:27:39] New patchset: Yurik; "Removed carrier-specific filtering" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66034 [19:30:37] New review: Yurik; "See https://gerrit.wikimedia.org/r/#/c/66034/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66026 [19:33:10] ori-l: no luck, managed to get down some megs, but nothing intresting [19:33:26] i am just updating https://bugzilla.wikimedia.org/show_bug.cgi?id=29784 [19:35:50] okai [19:36:12] done, added you as CC [19:42:49] ty [19:44:00] andre__: do you know what is the maximum attachment size on bugzilla? [19:45:34] ori-l, 10240kb [19:45:41] ori-l: do you know if it's possible temporary to increase the memory limit? [19:46:28] ori-l: you compressed the log right? [19:46:32] yeah [19:46:40] AzaToth: on production? 
there'd have to be a pretty compelling reason, and I have a feeling that a better PHP programmer than me will be able to identify and fix the memory leak, and there are plenty of better PHP programmers than me around [19:46:54] hehe [19:47:01] greg-g, could i do a dir sync of WMF5 zero ext? WMF4 seems to be fine, not even sure if WMF5 is in prod anywhere [19:47:24] i think it just holds references to substring copies (from cursor to end of buffer) of the input buffer [19:47:52] ori-l: the functionality in twinkle that pushed it over is simplified reporting of edit warriors so more edit warriors might get reported [19:47:54] it's probably not rocket science to resolve, just requires reading and stepping through the code carefully [19:48:30] I ran it through valgrind but there was no "real" memort leak [19:49:04] New patchset: Wpmirrordev; "fix typos in Makefile, mwxml2sql.c, and sqlfilter.c; rename dist" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/65706 [19:49:06] yurik: it is in production on test/test/mediawiki and all non-pedia sites [19:49:30] yurik: but, yeah, go ahead [19:49:35] cool, thx [19:49:53] AzaToth: you can try splitting it up into multiple modules as a temporary workaround, but I recommend just waiting a bit to see if anyone responds to the bug [19:52:42] !log yurik synchronized php-1.22wmf5/extensions/ZeroRatedMobileAccess/ [19:52:53] Logged the message, Master [19:53:28] greg-g, thx, done [19:56:10] v [19:56:11] ^ oops, wrong window. that's gmail for "label" :) [19:56:54] yurik: np, thank you :) [19:59:05] ori-l: http://paste.debian.net/7410/ [19:59:51] that was with 5.3 [20:00:21] hi, looking for help flushing varnish -- https://rt.wikimedia.org/Ticket/Display.html?id=5225 [20:01:09] i'm getting fairly frequent log entries from there, probably because it was incorrectly configured for a while [20:01:31] that carrier has not gone public yet, we should not show any banners for them [20:04:34] New review: Dr0ptp4kt; "Can we get this one +1'd and +2'd? The partner wants their stuff now. I agree, the other change shou..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66026 [20:06:14] New review: Yurik; "ok, won't hurt, I will need to rebase the other patch" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/66026 [20:06:55] also, this is a tiny change that carrier quickly needs - pls +2 https://gerrit.wikimedia.org/r/#/c/66026/ [20:07:03] yurik, dank je [20:29:16] New patchset: Wpmirrordev; "fix typos in Makefile, mwxml2sql.c, and sqlfilter.c; rename dist" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/65706 [21:14:34] * AaronSchulz wonders if LeslieCarr is around [21:14:45] AaronSchulz: nope [21:15:17] rats :) [21:16:33] apergos: did you get a chance to look at that xff thing more? [21:20:44] what xff thing? [21:21:45] but seriously do you need something from me ? [21:26:03] LeslieCarr: I pinged mark (I thought he was away but I guess not) [21:26:09] ok [21:43:59] mark, seems like people at the office are also having issues with it :( [21:44:05] (xff) [22:28:12] New patchset: Ori.livneh; "Use $wmgUseVForm to toggle VForm for both login & account creation" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66048 [22:40:14] New review: Dr0ptp4kt; "Reviewing. I believe this may result in *every* non-cached .m.wikipedia.org page invoking ..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/66034 [22:50:08] New review: Spage; "Looks good. 
This is progress towards bug 46333 (only having the new forms)." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/66048 [22:50:57] New patchset: Ori.livneh; "Use $wmgUseVForm to toggle VForm for both login & account creation" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66048 [22:51:45] New patchset: Ori.livneh; "Use $wmgUseVForm to toggle VForm for both login & account creation" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66048 [23:01:51] 2013-05-29 22:57:48 TTMServerMessageUpdateJob Translations:Wikimedia_Foundation_elections/Board_elections/2013/Candidates/36 STARTING [23:01:56] well, it was going ok until that [23:03:03] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66048 [23:07:45] Nemo_bis: not sure why that would be slow [23:12:41] AaronSchulz: I see an explosion of CPU wait at same time, cause or consequence? https://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=vanadium.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [23:13:50] that message doesn't even have any translation https://meta.wikimedia.org/wiki/Special:PrefixIndex/Translations:Wikimedia_Foundation_elections/Board_elections/2013/Candidates/36 [23:14:49] it lines up with when I started [23:15:32] 2013-05-29 23:15:05 TTMServerMessageUpdateJob Translations:Wikimedia_Foundation_elections/Board_elections/2013/Candidates/75 STARTING [23:16:34] heh, Board_elections/2013/Candidates/65 took 266943 ms [23:17:27] 75 also has no translations [23:17:51] what about 65? [23:18:09] 6 characters ^^ https://meta.wikimedia.org/wiki/Translations:Wikimedia_Foundation_elections/Board_elections/2013/Candidates/65/en [23:18:31] and no translations [23:18:38] is this a pattern? [23:18:39] before, there were a few hundred fast ones, like: [23:18:40] 2013-05-29 22:56:38 TTMServerMessageUpdateJob CNBanner:B13_plaintext_PleaseReadPict-text-1 t=389 good [23:18:43] AaronSchulz, basically, Solr is deadlocked on vanadium [23:18:43] !log olivneh synchronized wmf-config/InitialiseSettings-labs.php 'Labs: consolidate 'wgUseVFormCreateAccount' & 'wgUseVFormUserLogin' into 'wmgUseVForm'' [23:18:52] Logged the message, Master [23:19:03] !log olivneh synchronized wmf-config/InitialiseSettings.php 'Enabling new 'VForm' layout for account creation & login interfaces on 31 wikis. (1/2)' [23:19:11] Logged the message, Master [23:19:14] as long as there are no concurrent updates, it's fine [23:19:19] MaxSem: must have been great when 192 job runners were on it [23:19:21] !log olivneh synchronized wmf-config/CommonSettings.php 'Enabling new 'VForm' layout for account creation & login interfaces on 31 wikis. (2/2)' [23:19:25] what else is doing updates? [23:19:30] Logged the message, Master [23:19:41] other jobs of the same type [23:19:46] * AaronSchulz has a single threaded script [23:19:50] Nikerabbit, ^^^^^ [23:19:57] that fast one had a "translation" https://meta.wikimedia.org/w/index.php?title=Special%3APrefixIndex&prefix=B13_plaintext_PleaseReadPict-text-1&namespace=866 [23:20:10] no runners are doing TTMServerMessageUpdateJob normally, which was disabled weeks ago [23:20:11] (I don't know if qqq is excluded) [23:20:30] eh? [23:20:39] interesting [23:21:07] Nemo_bis: https://dpaste.de/2Bmog/ [23:21:24] what's wrong with icinga, btw? 
[23:21:56] (note "t" is ms) [23:22:24] please be careful with vanadium; it's the eventlogging event processor and an outage affects multiple teams [23:23:39] ori-l, luckily Solr updates are single-threaded [23:23:56] "luckily" but not for translators :) [23:24:11] and apparently it's not enough for them [23:24:36] Nemo_bis: do you think there is some bug with empty translations? [23:25:18] MaxSem: is it possible that Solr consumes more resources looking for matches with very short strings rather than long ones? [23:25:31] no idea [23:25:42] I remember it only looks for translations with similar length [23:25:50] !log Tried running some metawiki TTMServerMessageUpdateJob jobs again, still very slow for some jobs (though at least not hours) [23:25:59] Logged the message, Master [23:26:06] I just help Niklas along when he asks for help, I don't know the inner workings of his schema [23:26:07] It's possibly much slower for strings with "common" lengths [23:26:45] are there other wikis with Translate? [23:27:00] a dozen [23:28:09] 10 excluding testwiki and future wikimania [23:35:56] !log olivneh synchronized wmf-config/InitialiseSettings.php 'enwikispecies -> specieswiki' [23:36:05] Logged the message, Master [23:36:31] New patchset: Ori.livneh; "Fix typo: enwikispecies -> specieswiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66055 [23:36:42] ori-l: whenever I see "vanadium" it makes me think of some kind of drug [23:36:47] * AaronSchulz is terrible [23:39:26] Nemo_bis: want to comment on https://bugzilla.wikimedia.org/show_bug.cgi?id=48164? :) [23:41:32] New review: Spage; "specieswiki is in all.dblist and maps to //species.wikimedia.org as intended" [operations/mediawiki-config] (master) C: 2; - https://gerrit.wikimedia.org/r/66055 [23:41:40] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/66055 [23:41:45] AaronSchulz: Do you have any idea what could be the underlying issue of https://bugzilla.wikimedia.org/show_bug.cgi?id=48164 [23:42:34] AaronSchulz: I'd rather go to bed now :) can you continue the script, maybe at a lower rate, to gather more testcases/data? [23:42:50] from the profiling of it, Nemo was suggesting it has a hard time with empty translations [23:43:04] Nemo_bis: not sure how much I want to mess with vanadium though [23:43:35] AaronSchulz: talking to me, Nemo, or MaxSem ? [23:43:39] * AaronSchulz would have to sprinkle some sleep calls around [23:43:59] AzaToth: my profiling remark was to you :) [23:44:03] oh [23:44:20] AaronSchulz: as MaxSem said, slow solr updates don't seem to actually affect the rest [23:44:42] don't think it has any translations in the context though [23:46:12] * Nemo_bis to bed now [23:50:35] robla: epic bacon? [23:50:54] whut? [23:51:12] AaronSchulz: translations? [23:51:21] you are making my head spin [23:51:35] AaronSchulz: not that I'm anti-bacon or anything, but I'm missing some context [23:51:43] wtf [23:51:50] AzaToth: also see the scrollback with nemo here [23:51:53] https://bugzilla.wikimedia.org/show_bug.cgi?id=29784 [23:52:00] I damn posted the wrong bug [23:52:12] sorry [23:52:12] robla: you don't smell that?
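A rough sketch of the kind of low-rate, single-threaded re-run of the stuck job type discussed above, using the stock runJobs.php maintenance script rather than whatever custom script was actually in use; the mwscript wrapper, the target wiki, the batch size, and the sleep interval are all assumptions:

    while true; do
        mwscript runJobs.php --wiki=metawiki --type=TTMServerMessageUpdateJob --maxjobs=5
        sleep 30    # keep the update rate low so Solr on vanadium is not hammered
    done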