[00:55:39] !log reedy updated /a/common to {{Gerrit|I3543a34aa}}: db1040 back to full steam
[00:55:43] (PS1) Reedy: Swap method for closure [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/105471
[00:55:50] (CR) jenkins-bot: [V: -1] Swap method for closure [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/105471 (owner: Reedy)
[00:55:56] Logged the message, Master
[01:02:55] PROBLEM - RAID on analytics1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:04:26] Reedy: is amazon.com down?
[01:04:44] Nope...
[01:04:54] RECOVERY - RAID on analytics1009 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0
[01:04:59] it times out for me
[01:05:02] other sites work fine
[01:14:10] Home page loads for me.
[01:16:53] dns flush changed IP from 176.32.98.166 to 72.21.194.212...ping still times out
[01:18:19] same with 205.251.242.54
[01:20:35] http://sitedown.co/amazoncom, heh, reports are just bay area
[01:21:47] 6 18 ms 15 ms 16 ms he-3-9-0-0-cr01.sanjose.ca.ibone.comcast.net [68.86.91.45]
[01:21:59] fail after that...always f*cking ca.ibone
[01:22:24] such deja vu
[01:24:08] James_F: are you in SF now?
[01:24:53] (PS1) Reedy: Swap global functions for closures [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/105473
[01:25:30] Aaron|home: No.
[01:25:35] He's upside down
[01:25:37] Aaron|home: I'm in Perth, Australia.
[01:26:52] (PS2) Reedy: Swap method for closure [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/105471
[01:27:03] Aaron|home: yeah no route for me either
[01:27:07] on Sonic.net
[01:27:18] my traceroute stops at layer42
[01:27:53] although I can reach www.amazon.com w/ my browser...
[01:28:21] (looks like they disabled ICMP...)
[01:36:48] some sort of routing weirdness w/ them, I tried from a VPS and it traces all right, but then ping doesn't get anything back
[02:01:58] !log LocalisationUpdate completed (1.23wmf8) at Sun Jan 5 02:01:57 UTC 2014
[02:02:17] Logged the message, Master
[02:03:44] !log LocalisationUpdate completed (1.23wmf9) at Sun Jan 5 02:03:44 UTC 2014
[02:04:01] Logged the message, Master
[02:07:11] (PS1) Reedy: Disable oai auditing. No one uses it for anything [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/105475
[02:07:14] amazon is working fine for me
[02:08:47] (PS1) Reedy: Bump various epochs to start of 2013 [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/105476
[02:08:47] !log LocalisationUpdate ResourceLoader cache refresh completed at Sun Jan 5 02:08:47 UTC 2014
[02:09:03] Logged the message, Master
[02:10:46] legoktm: the funny part is that http seems to work
[02:10:51] but ICMP's going haywire
[02:10:59] oh, I was using http
[02:11:13] eh, https is redirecting to http for me.
[02:11:32] (oops, accidentally pressed control w somehow...)
[02:12:04] idk lego, Microsoft does the same thing (blocking ICMP while doing http just fine)
[02:12:10] probably b/c it's an attack vector
[02:35:16] works for me now
[02:38:50] hhm ICMP still funny
[02:38:58] means that it can't be used for Amazon
[02:56:22] Amazon uk responds...
[02:56:32] to ICMP?
[02:56:52] C:\Users\Reedy>ping amazon.co.uk
[02:56:52] Pinging amazon.co.uk [176.32.108.186] with 32 bytes of data:
[02:56:52] Reply from 176.32.108.186: bytes=32 time=37ms TTL=245
[02:56:52] Reply from 176.32.108.186: bytes=32 time=38ms TTL=245
[02:57:14] oh I didn't use that IP
[02:57:31] funnily enough that times out for me
[02:58:27] this time it ends at tinet's end of the transatlantic
[03:02:08] hops around quite a bit on ips with no rnds
[03:02:12] rdns
[03:41:14] PROBLEM - Host gadolinium is DOWN: PING CRITICAL - Packet loss = 100%
[03:54:20] (PS6) TTO: Add templateeditor right, group, and restriction [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/104912 (owner: Rschen7754)
[04:33:38] PROBLEM - udp2log log age for oxygen on oxygen is CRITICAL: CRITICAL: log files /var/log/udp2log/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[04:47:16] PROBLEM - udp2log log age for erbium on erbium is CRITICAL: CRITICAL: log files /var/log/udp2log/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[05:07:34] (PS1) Chad: Finalize commons config for Cirrus [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/105479
[06:25:34] RECOVERY - udp2log log age for oxygen on oxygen is OK: OK: all log files active
[06:59:35] (PS1) Ori.livneh: Lint [operations/software/mwprof] - https://gerrit.wikimedia.org/r/105487
[06:59:36] (PS1) Ori.livneh: Release under terms of Apache license [operations/software/mwprof] - https://gerrit.wikimedia.org/r/105488
[07:33:34] PROBLEM - udp2log log age for oxygen on oxygen is CRITICAL: CRITICAL: log files /var/log/udp2log/packet-loss.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[07:35:25] (CR) Ori.livneh: [C: 2 V: 2] Lint [operations/software/mwprof] - https://gerrit.wikimedia.org/r/105487 (owner: Ori.livneh)
[07:35:45] (CR) Ori.livneh: [C: 2 V: 2] Release under terms of Apache license [operations/software/mwprof] - https://gerrit.wikimedia.org/r/105488 (owner: Ori.livneh)
[07:58:44] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out
[07:58:54] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out
[07:58:54] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:58:54] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:59:04] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out
[07:59:14] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out
[07:59:44] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:59:44] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:00:24] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[08:00:54] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:01:44] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.055 second response time
[08:01:44] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time
[08:02:34] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.059 second response time
[08:02:44] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.065 second response time
[08:02:45] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 8.829 second response time
[08:02:54] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 65010 bytes in 0.250 second response time
[08:03:14] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.036 second response time
[08:04:34] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.057 second response time
[08:06:44] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 808 bytes in 0.038 second response time
[08:21:13] !log Power cycled ms-be1012
[08:21:31] Logged the message, Master
[08:23:44] RECOVERY - Host ms-be1012 is UP: PING WARNING - Packet loss = 86%, RTA = 0.27 ms
[09:03:24] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[13:31:58] ACKNOWLEDGEMENT - RAID on db1047 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Sean Pringle RT 6463
[14:17:54] (PS1) Yuvipanda: dynamicproxy: Add XFF support [operations/puppet] - https://gerrit.wikimedia.org/r/105509
[15:21:33] (PS1) Andrew Bogott: Add support for miscellaneous proxy-specific settings. [operations/puppet] - https://gerrit.wikimedia.org/r/105513
[15:23:40] (CR) Andrew Bogott: [C: -1] "Untested pseudocode, please do not merge" [operations/puppet] - https://gerrit.wikimedia.org/r/105513 (owner: Andrew Bogott)
[18:42:23] (PS1) BBlack: remove http/0.9 patch [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/105523
[18:42:24] (PS1) BBlack: varnish (3.0.3plus~rc1-wm27) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/105524
[18:42:45] (CR) BBlack: [C: 2 V: 2] remove http/0.9 patch [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/105523 (owner: BBlack)
[18:42:56] (CR) BBlack: [C: 2 V: 2] varnish (3.0.3plus~rc1-wm27) precise; urgency=low [operations/debs/varnish] (testing/3.0.3plus-rc1) - https://gerrit.wikimedia.org/r/105524 (owner: BBlack)
[18:51:44] (CR) Yuvipanda: [C: -1] "does two redis calls per HTTP request, definitely a bad idea." [operations/puppet] - https://gerrit.wikimedia.org/r/105513 (owner: Andrew Bogott)
[19:17:24] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[19:21:02] (PS4) Nemo bis: Make logscale in reqerror graphs actually work [operations/puppet] - https://gerrit.wikimedia.org/r/101065
[19:21:09] (CR) jenkins-bot: [V: -1] Make logscale in reqerror graphs actually work [operations/puppet] - https://gerrit.wikimedia.org/r/101065 (owner: Nemo bis)
[19:26:21] !log varnish mobile caches updated to -wm27 package (crash fix, http/0.9 patch removal to test effect on logging anomalies)
[19:26:27] (PS5) Nemo bis: Make logscale in reqerror graphs actually work [operations/puppet] - https://gerrit.wikimedia.org/r/101065
[19:26:39] Logged the message, Master
[19:29:21] (CR) Nemo bis: "Rebased. I think nothing else needs changing after the update/migration?" [operations/puppet] - https://gerrit.wikimedia.org/r/101065 (owner: Nemo bis)
[19:35:52] (CR) Ori.livneh: [C: 2] Make logscale in reqerror graphs actually work [operations/puppet] - https://gerrit.wikimedia.org/r/101065 (owner: Nemo bis)
[19:36:05] (CR) Ori.livneh: [V: 2] Make logscale in reqerror graphs actually work [operations/puppet] - https://gerrit.wikimedia.org/r/101065 (owner: Nemo bis)
[19:41:23] Nemo_bis: thanks for the patch
[19:41:58] ori: thanks for the speedy merge :)
[19:42:22] the wrong title I had introduced was a bit embarrassing, especially now that the graphs show it so well :P
[19:42:23] 'wmf stats' is a bit gross as the title, no?
[19:42:36] should it be 'Wikimedia Stats'?
[19:43:04] well it's surely just wmf
[19:43:18] but "WMF stats" would be less incorrect if not more descriptive
[19:43:39] after all we have N "wikistats", wikimetrics etc. etc.
[19:44:01] awwwww log scale is SO LOVELY
[19:44:04] WMF Stats or WMF stats?
[19:44:30] * Nemo_bis doesn't like Title Case
[19:44:40] Fine By Me
[19:45:37] such a pity that electronic calculators made the general population forget the power of logarithms
[19:46:54] (PS1) Ori.livneh: gdash: tweak letter case of site header [operations/puppet] - https://gerrit.wikimedia.org/r/105526
[19:47:59] why is the text not vertically centered in the top bar
[19:48:07] grr
[19:48:15] hmm maybe we also need log2 for the more recent ones
[19:48:46] yep, the items in it jump up and down a bit :)
[19:49:54] (CR) Nemo bis: [C: 1] "Combat The Abuse Of Stylish All-Lowercase" [operations/puppet] - https://gerrit.wikimedia.org/r/105526 (owner: Ori.livneh)
[19:50:00] heh
[19:50:12] (CR) Ori.livneh: [C: 2] gdash: tweak letter case of site header [operations/puppet] - https://gerrit.wikimedia.org/r/105526 (owner: Ori.livneh)
[19:55:24] Nemo_bis: https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&logBase=2&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color(cactiStyle(alias(reqstats.500,%22500%20resp/min%22)),%22red%22)&target=color(cactiStyle(alias(reqstats.5xx,%225xx%20resp/min%22)),%22blue%22)
[19:56:09] this page lists the various URL parameters you can use to manipulate the graph: https://graphite.readthedocs.org/en/latest/render_api.html
[19:56:42] IIRC they should all work with gdash as long as it maps the graph definition file to url params
[20:03:25] I don't think "WMF" is nice here
[20:03:49] I mean sure, there's a distinction between the wikimedia community and the foundation
[20:04:03] but these are stats for the wikimedia sites
[20:04:09] and it's still under wikimedia.org :)
[20:05:03] log scale in reqerror, hmm
[20:05:08] not sure how I feel about that
[20:05:34] i like it, personally
[20:06:33] !log aaron started scap: timing test
[20:06:35] but discuss @ https://bugzilla.wikimedia.org/41754
[20:06:50] Logged the message, Master
[20:06:57] paravoid: talking about stats. Do you know why pagecount dumps currently dried out at 3 AM? http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-01/
[20:07:11] I do not
[20:07:15] I know nothing about dumps
[20:07:21] apergos would be the person to ask
[20:07:34] paravoid: thx
[20:07:58] apergos: ping
[20:09:57] ori: I think 500 & 5xx graphs should probably be split
[20:10:07] 500s are mediawiki, and I tend to ignore these
[20:10:23] 503s are varnish and I'm sure most mediawiki developers tend to ignore those
[20:10:24] PROBLEM - Varnish HTTP mobile-backend on cp3011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:10:42] it's different issues altogether and potentially different audiences
[20:10:44] PROBLEM - Varnish HTCP daemon on cp3011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:10:59] bblack: ping?
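A minimal Python sketch of how a render URL like the one pasted at 19:55:24 can be assembled from its parameters. The helper name `build_render_url` is hypothetical; the parameter names (`title`, `logBase`, `from`, `until`, `width`, `height`, `target`) are the ones that appear in that URL, and nothing here is taken from gdash itself.

```python
from urllib.parse import urlencode


def build_render_url(base_url, targets, **options):
    """Assemble a Graphite render URL (hypothetical helper, not part of gdash).

    `targets` becomes repeated `target=` parameters; `options` carries the
    remaining query parameters such as logBase or width.
    """
    query = [(name, str(value)) for name, value in options.items()]
    # Graphite accepts several target parameters, one per plotted series.
    query += [("target", target) for target in targets]
    return base_url + "?" + urlencode(query)


url = build_render_url(
    "https://graphite.wikimedia.org/render/",
    targets=[
        'color(cactiStyle(alias(reqstats.500,"500 resp/min")),"red")',
        'color(cactiStyle(alias(reqstats.5xx,"5xx resp/min")),"blue")',
    ],
    title="HTTP 5xx Responses -8hours",
    logBase=2,          # log2 y-axis, as discussed in the channel
    width=1024,
    height=500,
    until="now",
    **{"from": "-8hours"},  # "from" is a Python keyword, so pass it via a dict
)
print(url)
```

The encoding differs cosmetically from the pasted URL (`urlencode` writes spaces as `+` and percent-encodes the parentheses), but the decoded parameters are the same.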
[20:11:03] bblack: ^^^ that you?
[20:11:04] PROBLEM - Varnish traffic logger on cp3011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:11:04] PROBLEM - Varnishkafka log producer on cp3011 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:11:20] [14073035.723997] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)
[20:11:23] grumble
[20:11:30] !log rebooting cp3011, kmem_allock deadlock
[20:11:48] Logged the message, Master
[20:13:07] !log aaron finished scap: timing test (duration: 06m 51s)
[20:13:22] Logged the message, Master
[20:13:44] PROBLEM - Host cp3011 is DOWN: PING CRITICAL - Packet loss = 100%
[20:14:45] RECOVERY - Host cp3011 is UP: PING OK - Packet loss = 0%, RTA = 95.32 ms
[20:14:54] RECOVERY - Varnish traffic logger on cp3011 is OK: PROCS OK: 2 processes with command name varnishncsa
[20:15:26] !log aaron started scap: timing test (beta)
[20:15:34] RECOVERY - Varnish HTCP daemon on cp3011 is OK: PROCS OK: 1 process with UID = 110 (vhtcpd), args vhtcpd
[20:15:44] Logged the message, Master
[20:16:54] RECOVERY - Varnishkafka log producer on cp3011 is OK: PROCS OK: 1 process with command name varnishkafka
[20:18:03] !log aaron finished scap: timing test (beta) (duration: 02m 49s)
[20:18:18] (PS1) Nemo bis: Also logbase 2 for the shorter reqerror graphs [operations/puppet] - https://gerrit.wikimedia.org/r/105614
[20:18:19] Logged the message, Master
[20:19:46] ori: if you mean that logbase 2 should work, yes, it's the only logbase used in other gdash graphs :)
[20:20:10] (PS2) Nemo bis: Also logbase 2 for the shorter reqerror graphs [operations/puppet] - https://gerrit.wikimedia.org/r/105614
[20:22:36] gadolinium is down or at least unreachable, can't get to mgmt console either; that's the host that collects the pagecount files
[20:22:58] http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Miscellaneous+eqiad&h=gadolinium.wikimedia.org&tab=m&vn=&hide-hf=false&metric_group=
[20:23:27] apergos: thx for the info.
[20:23:30] ping to mgmt gives no response from palladium
[20:23:36] yw
[20:24:35] !log gadolinium down/unreachable, can't get to mgmt console, no ping even
[20:24:59] * apergos glares at morebots
[20:56:45] (PS1) Aaron Schulz: Minor scap fixes [operations/puppet] - https://gerrit.wikimedia.org/r/105616
[21:13:23] Aaron|home: are you aware of any serious problems with LocalisationUpdate?
[21:14:44] I don't know if it was filed or not, but we have a rather specific report of a message not updated as it should:
[21:15:24] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[21:39:09] paravoid: no, that wasn't me, although perhaps the restarting of varnish contributed to that eventuality :)
[21:52:24] apergos: is there an ETA when gadolinium will be up and
[21:52:50] apergos: delivering dumps again?
[22:27:19] (CR) Andrew Bogott: [C: 2] "Works for me! This isn't any less private than the public-IP solution we were using previously." [operations/puppet] - https://gerrit.wikimedia.org/r/105509 (owner: Yuvipanda)
[22:38:38] ori: https://gerrit.wikimedia.org/r/#/c/105616/ :)
[23:37:06] !ops http://www.youtube.com/watch?v=qSEmycI7uro
[23:37:25] leaking_style: Stop that.
[23:37:54] sire
[23:37:56] is it interesting?
[23:38:23] No.
[23:38:23] i leik to tink so
[23:38:37] probably not on topic, though, right?
[23:38:40] Oh lord
[23:38:45] MY EARS
[23:38:46] * greg-g hasn't clicked
[23:39:52] * Gloria NP: "Put The Gun Down" by ZZ Ward from "Til The Casket Drops"
[23:39:56] it is gut coding music
[23:40:03] I've been on a female singer kick lately.
[23:40:37] ZZ Ward, Lana Del Rey, Lorde, et al.
[23:42:03] https://login.wikimedia.org/wiki/Special:ListGroupRights is pretty random.
[23:42:17] Special:VipsTest is okay for everyone.
[23:42:55] * Gloria files a bug.
[23:43:23] there should be an easy syntax to revoke all permissions, no?
[23:43:41] ['*']['go-away'] = true;
[23:45:24] https://bugzilla.wikimedia.org/show_bug.cgi?id=59701
[23:45:53] $wgGroupPermissions['*']= array();
[23:46:34] There's a handful of other extensions that probably want disabling on it
[23:46:36] But extensions do crazy shit.
[23:46:45] Right.
[23:46:49] I'm not sure why CU is enabled there.
[23:46:54] Or Oversight.
[23:47:01] Or OAuth, for that matter.
[23:47:05] (CR) Ori.livneh: [C: 2] Minor scap fixes [operations/puppet] - https://gerrit.wikimedia.org/r/105616 (owner: Aaron Schulz)
[23:47:24] * Gloria files a bug about that.
[23:47:43] Gloria: CU is there for a very good reason.
[23:47:59] Gloria: OAuth and Oversight, OTOH, are entirely unnecessary.
[23:48:10] $wgGroupPermissions['*']= array();
[23:48:15] James_F: There's a concern that stewards are using loginwiki as a CU backdoor of sorts.
[23:48:15] really?
[23:48:21] But that's probably a discussion for wikimedia-l.
[23:48:21] thought one had to explicitly revoke them
[23:48:43] Gloria: oh, is there? I said so a few days ago but was just kidding
[23:49:01] Gloria: neko case
[23:49:16] Nemo_bis: Almost all the activity that the wiki gets is stewards +/- CU.
[23:49:18] 2013-12-28 20.29 < Nemo_bis> who needs local CU if everyone's logged on loginwiki anyway ;)
[23:49:28] yep
[23:49:42] ori: Hmmm, any tracks in particular?
[23:50:01] neko case ♥
[23:51:09] heh, i wouldn't have guessed
[23:51:28] ZZ Ward reminds me of Amy Winehouse just a little. And Janis Joplin.
[23:51:35] But not as dark.
[23:53:37] https://bugzilla.wikimedia.org/show_bug.cgi?id=59702 is the bug about installed extensions on loginwiki.
[23:53:39] i like all of 'fox confessor brings the flood' the best, but maybe you need something more pop-y as a hook
[23:54:28] I usually just look at all songs on iTunes and sort by most popular. ;-)
[23:54:45] http://www.youtube.com/watch?v=zi6keFpm-BY
[23:56:52] paravoid: what else do you listen to?
[23:56:53] wouldn't have guessed what?
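On the `$wgGroupPermissions['*']= array();` versus "explicitly revoke" question raised between 23:45 and 23:48: the sketch below is a simplified Python model, not MediaWiki code. It assumes the usual description of the mechanism, namely that a user's rights are the union of what their groups grant in `$wgGroupPermissions`, minus anything listed for those groups in `$wgRevokePermissions`. The function name `effective_rights` and the sample group data are hypothetical.

```python
def effective_rights(user_groups, group_permissions, revoke_permissions=None):
    """Simplified model: union of granted rights minus explicitly revoked ones."""
    revoke_permissions = revoke_permissions or {}
    granted, revoked = set(), set()
    for group in user_groups:
        granted |= {r for r, on in group_permissions.get(group, {}).items() if on}
        revoked |= {r for r, on in revoke_permissions.get(group, {}).items() if on}
    return granted - revoked


# Hypothetical sample configuration, for illustration only.
group_permissions = {
    "*": {"read": True, "createaccount": True},
    "user": {"edit": True},
}

# Analogue of $wgGroupPermissions['*'] = array(): empties only the '*' group.
group_permissions["*"] = {}

anon_rights = effective_rights(["*"], group_permissions)
user_rights = effective_rights(["*", "user"], group_permissions)

# Rights granted through another group survive unless explicitly revoked.
user_rights_revoked = effective_rights(
    ["*", "user"], group_permissions, {"user": {"edit": True}}
)
```

Under this model, emptying the `'*'` entry removes only what `'*'` itself granted; anything another group grants, or that an extension adds back to `'*'` later in the configuration, would still need an explicit revoke or a later override.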
[23:57:05] (PS1) Reedy: Disable more stuff on loginwiki and votewiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/105622
[23:58:34] (CR) Reedy: [C: 2] Disable more stuff on loginwiki and votewiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/105622 (owner: Reedy)
[23:58:43] (Merged) jenkins-bot: Disable more stuff on loginwiki and votewiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/105622 (owner: Reedy)
[23:59:03] greg-g: Get 'im!
[23:59:42] !log reedy synchronized wmf-config/InitialiseSettings.php
[23:59:56] Logged the message, Master