[00:20:06] New patchset: Anomie; "Add wikivoyage to captcha whitelist" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44839
[00:25:01] New review: Anomie; "Ignore the Jenkins failure. Unit tests are broken since I873ab295; I97fd6b1d will fix them." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/44839
[00:28:33] If anyone happens to be around and bored enough to review stuff, ^
[00:30:59] hmm interesting, my review didn't show up? (I did it just after your comment about the jenkins failure)
[00:32:42] Thehelpfulone- I see it in there. I don't think gerrit-wm notifies if you don't enter a comment along with your review, if that's what you're referring to.
[00:46:57] ah, yeah makes sense
[01:19:39] New patchset: Asher; "little python script that reads stdin and calculates some percentiles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44842
[01:20:23] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44842
[02:03:55] New patchset: Asher; "select api backend certain mobile requests are routed to based on $::mw_primary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44844
[02:07:22] New patchset: Asher; "select api backend certain mobile requests are routed to based on $::mw_primary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44844
[02:18:35] New patchset: Asher; "select api backend certain mobile requests are routed to based on $::mw_primary" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/44844
[02:25:48] !log LocalisationUpdate completed (1.21wmf7) at Sun Jan 20 02:25:48 UTC 2013
[02:26:03] Logged the message, Master
[02:37:30] New patchset: Asher; "EQIAD SWITCH: set $wgReadOnly in pmtpa" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44845
[02:48:06] !log LocalisationUpdate completed (1.21wmf8) at Sun Jan 20 02:48:05 UTC 2013
[02:48:17] Logged the message, Master
[02:53:38] New patchset: Asher; "EQIAD MIGRATION: turn OFF $wgReadOnly in eqiad" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44847
[13:26:54] New patchset: Raimond Spekking; "Cleanup: Remove last bit of revisionmove feature" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44856
[15:33:45] hello
[15:33:52] I realize it's sunday
[15:34:14] but can someone please help me activate a progress bar template on www.mediawiki.org please ?
[15:49:19] average_drifter, wrong channel
[15:49:43] MaxSem: hi, what's the right channel ?
[15:49:51] #mediawiki
[15:49:55] will do
[15:49:56] thanks
[15:52:49] meanwhile, we have a problem with esams upload caches
[15:52:55] https://ganglia.wikimedia.org/latest/graph.php?c=Upload%20squids%20esams&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1358697028&g=network_report&z=medium&r=hour
[15:53:06] https://ganglia.wikimedia.org/latest/graph.php?c=Upload%20squids%20pmtpa&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1358697028&g=network_report&z=medium&r=hour
[15:53:54] apergos, paravoid ^^^
[15:54:37] hm... maybe is that the reason why I got ´504 Gateway Time-out´ on commons
[15:55:31] some of the other graphs also look unusual
[16:06:27] eek https://ganglia.wikimedia.org/latest/graph.php?c=SSL%20cluster%20pmtpa&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1358697885&g=network_report&z=medium&r=hour
[16:42:28] New patchset: Dereckson; "(no subject)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44861
[16:44:42] New review: Dereckson; "Publishing draft: (bug 44163) Namespace configuration on is.wikisource" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/44861
[16:49:10] * Nemo_bis feels excluded
[16:52:07] ah, but I can see that 5xx errors are exploding https://gdash.wikimedia.org/dashboards/reqerror/
[16:56:57] mark, notpeter, mutante, LeslieCarr, preilly, woosters ^^^
[18:09:50] MaxSem: do I need to try and raise some people?
[18:11:24] Reedy, you should be hitting esams too - are images loading slowly for you?
[18:12:07] in principle, I know myself how to look up phone numbers too but I'm not sure how serious it is
[18:12:48] yeah, it's pretty slow
[18:13:11] though, let me unclog my connection first
[18:13:27] mine's already unclogged
[18:13:58] so is mine, thought the downloads were running
[18:14:41] It's probably worth trying some people. For everyone bar Tim it is currently a reasonable time, so it shouldn't disturb many people
[18:15:11] ok
[18:15:16] whom to start with?
[18:19:06] Reedy, ^^
[18:19:17] I was just looking at graphs
[18:19:33] would be good if we had the procedure written down somewhere...
[18:20:00] I guess we don't know the source of 5xx, other than mobile being slow
[18:22:28] Probably mark, paravoid and asher for starters
[18:22:59] Just mention it's not a real outage, so they all don't need to rush :p
[18:23:11] Do you want me to do it? I've got most of the numbers in my mobile already
[18:23:39] okay, go for it
[18:23:40] s/real/big/
[18:27:31] Done
[18:28:37] * MaxSem grabs popcorn
[18:30:44] Reedy: hey
[18:31:00] wheee
[18:31:04] binasher, https://gdash.wikimedia.org/dashboards/reqerror/
[18:31:05] binasher: https://gdash.wikimedia.org/dashboards/reqerror/
[18:31:11] spamspamspam
[18:31:31] I send the message and now 500 has jumped to 5k/s
[18:31:50] wee
[18:33:17] upload.wm.o is slow in Europe, also https://ganglia.wikimedia.org/latest/graph.php?c=Upload%20squids%20esams&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1358697028&g=network_report&z=medium&r=day
[18:41:45] Any luck binasher?
[18:42:21] binasher: Just got a reply from paravoid. He mentioned knsq19 needed a restart last time
[18:44:46] a pretty high percentage of the 500's are for http://upload.wikimedia.org/wikipedia/commons/thumb/3/35/AroniminkLogo2.jpg/180px-AroniminkLogo2.jpg
[18:45:15] from a single ip address
[18:45:54] https://commons.wikimedia.org/wiki/File:AroniminkLogo2.jpg
[18:47:25] mmm, it's non-compliant for Commons
[18:47:36] the 500's are mostly from the US, a lot are for that
[18:48:18] the non-500s mostly look like 502s from europe via the nginx ipv6 proxy for upload
[18:50:51] know what this useragent is? Wikihood%20iPad/1.3.3%20CFNetwork/609%20Darwin/13.0.0? all the 500's are from that UA, no referrer, from one ip
[18:51:21] are they DoSing via SSL? https://ganglia.wikimedia.org/latest/graph.php?c=SSL%20cluster%20pmtpa&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1358697885&g=network_report&z=medium&r=day
[18:51:56] nginx 502 = bad gateway, so will look at the eu upload caches
[18:52:10] https://itunes.apple.com/en/app/wikihood-for-ipad/id378364975?mt=8
[18:52:29] "Experience a brand new Wikihood, fully redesigned for iPad, with lots of extra information and a compelling visual"
[18:52:36] MaxSem: the 502's from the ssl hosts are mostly all non-ssl
[18:52:46] Looks like it's somewhat google goggles
[18:53:05] i wonder if blocking that ip would totally kill the app
[18:53:51] reverse lookup and see where it's coming from?
[18:53:56] Could be a mobile provider's proxy.
[18:55:03] looking at all wikihood requests, there are some other ips.. but they're all 404's or 500's
[18:56:09] both US cable modem providers
[18:58:02] New review: Dereckson; "Proofreadpage configuration part missing." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/44861
[19:26:02] oooh, the graph has jumped up
[19:27:14] major packetloss between esams and pmtpa
[19:27:36] but not between eqiad and esams hosts it seems
[19:27:58] 1m30s and images for TFA aren't loaded
[19:28:08] could someone sms leslie? i'm going to fail out of esams
[19:28:50] binasher, on it
[19:28:57] ty!
[19:30:31] !log seeing major packetloss between esams and pmtpa, failing out of esams
[19:30:46] Logged the message, Master
[19:31:33] ok, what is going on
[19:31:43] thank goodness ryan wasn't lazy and checked my phone
[19:32:13] hey
[19:32:33] i see 30-60% packet loss when pinging any esams hosts from pmtpa
[19:32:54] why'd you do that? ;)
[19:32:56] lemme see what's up
[19:33:45] hehe
[19:35:11] ok, 5xx rate is looking pretty normal now
[19:36:24] packet loss is only 3-8% now when i ping -f esams hosts
[19:37:44] it looks like there were some issues with our selected path but since they're gone now ...
[19:37:45] oh and 5xx's are back
[19:37:49] I see "Error generating thumbnail" on that image now
[19:37:57] oh, lemme see, maybe back
[19:38:03] lemme try switching around selected path
[19:40:02] !log deactivating 2828 as a preferred intermediate route between 14907 and 43821
[19:40:17] Logged the message, Mistress of the network gear.
[19:40:58] LeslieCarr: is there network saturation in pmtpa? https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=LVS%20loadbalancers%20pmtpa&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1358710334&g=network_report&z=large
[19:42:13] wow
[19:43:18] (checking this out now , little bit slower in tampa than in eqiad since i need to cross reference racktables all the time)
[19:52:40] Ewww. There's a tonne of swift warnings on the apache logs too
[19:52:48] hrm
[19:52:52] Mainly timeouts
[19:52:58] so no overloaded ports on the foundry
[19:55:01] has this IP been banned?
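
(Editorial sketch, not from the log.) The triage above came down to grouping 5xx responses by client IP and user agent to spot a single noisy client. A minimal way to do that tally is sketched below. The tab-separated field layout (ip, status, url, user agent) and the 192.0.2.x/198.51.100.x addresses are assumptions made for illustration only; the real cache log format and the real client address are not shown in the log.

```python
from collections import Counter


def top_5xx_clients(lines, limit=3):
    """Count 5xx responses per (client IP, user agent) pair.

    Assumes a hypothetical tab-separated layout: ip, status, url, user_agent.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed layout
        ip, status, _url, user_agent = fields
        if len(status) == 3 and status.startswith("5"):
            counts[(ip, user_agent)] += 1
    return counts.most_common(limit)


# Made-up sample lines; the IPs come from documentation-only address ranges.
THUMB = "/wikipedia/commons/thumb/3/35/AroniminkLogo2.jpg/180px-AroniminkLogo2.jpg"
WIKIHOOD = "Wikihood%20iPad/1.3.3%20CFNetwork/609%20Darwin/13.0.0"
sample = [
    "192.0.2.10\t500\t" + THUMB + "\t" + WIKIHOOD,
    "192.0.2.10\t500\t" + THUMB + "\t" + WIKIHOOD,
    "198.51.100.7\t502\t/some/other/thumb.jpg\tExampleBrowser/1.0",
    "198.51.100.7\t200\t/some/other/thumb.jpg\tExampleBrowser/1.0",
]

for (ip, agent), hits in top_5xx_clients(sample):
    print(hits, ip, agent)
```

Feeding it real log lines from a file or stdin instead of `sample` would give the same kind of ranking, provided the split is adjusted to whatever the actual log format is.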
[19:57:30] !log authdns scenario back to normal
[19:57:41] Logged the message, Master
[19:59:10] the "esams down" scenario currently moves all upload traffic to pmtpa, while normal is eqiad+esams
[20:02:48] so i looked over all the network and i can't find any place that should be spiking
[20:07:19] ok, with traffic moved back to esams, i'm not seeing packet loss between there and pmtpa any more
[20:08:58] ok, i'm going to get on with all of my boring chores i had planned today :)
[20:09:50] i wonder if the route change you made helped, thanks for getting on
[20:10:00] no worries
[20:10:07] 5xx's look ok now too, after moving traffic back to esams
[20:10:10] woot
[20:10:27] feel free to ping me if something else goes wrong - i should be around most of the day being boring :)
[20:10:29] bye
[20:12:29] "Error creating thumbnail: Image was not scaled, is the requested width bigger than the source?"
[20:13:07] did that thing just repeat requesting the image ad infinitum after receiving a 5xx?
[20:13:41] reminds me of http://thedailywtf.com/Articles/The-Most-Favoritest-Icon.aspx
[20:17:32] Sounds like it tbh
[20:18:25] a very reasonable reason to ban this supermegaclient, then
[20:19:26] fatalmonitor looks back to normal
[20:20:49] What Would Domas Do?
[20:21:03] drop a nuke?
[21:11:21] Reedy, I thought you changed rc_params everywhere, but I still see warnings related to it...
[21:26:26] Change abandoned: Dereckson; "After a discussion on IRC, Tpt and me agreed such further requests will be handled by extension code..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/44861
[21:26:50] MaxSem: Reedy was https://gerrit.wikimedia.org/r/#/c/44332/ ever deployed?
[21:27:26] if people try to view recent change or watchlist entries that have malformed rc_params, the error may still occur without that patch
[21:28:21] maxsem@fenari:/home/wikipedia/common/php-1.21wmf7/extensions/Wikibase$ git log -1
[21:28:21] commit c298dc95dd8d6745245b536ca9735d79f5404d63
[21:28:21] Author: aude
[21:28:21] Date: Tue Jan 15 09:46:57 2013 +0000
[21:28:29] check rc_params is array in client
[21:28:49] hmmm....ok
[21:29:01] so it's ok?
[21:29:17] should be but i'm curious about the warnings
[21:29:44] if it's malformed, it should return false and we check it's the right thing
[21:29:49] or not
[21:33:42] aude, hmm - the deployed revision is from Jan 15 while your rev is from Jan 17
[21:34:04] huh?
[21:34:15] that might be okay
[21:34:47] i reordered a couple commits and reapplied that one
[21:34:59] eeeeeeeeeeeeeeeeek
[21:35:04] heh
[21:35:06] you're scaring me:P
[21:35:22] we have one thing that we ended up not needing to deploy
[21:35:54] it was in case our synchronization broke when we moved wikidata to wmf8 but clients are still wmf7
[21:36:08] but nothing broke, afaik
[21:37:29] https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/extensions/Wikibase.git;a=log;h=refs/heads/mw1.21-wmf7
[21:37:31] that's what I see in prod: http://dpaste.org/y648h/
[21:38:35] that's fine, thanks
[21:38:56] fine as in "something else needs fixing", yes:)
[21:39:00] it was just slightly confusing on wednesday with the git deploy stuff
[21:39:13] MaxSem: depends on the warnings you see
[21:39:42] Warning: Wikibase\ClientHooks : $rc_params is not unserialized correctly. It has been returned as boolean in /usr/local/apache/common-local/php-1.21wmf7/extensions/Wikibase/client/WikibaseClient.hooks.php on line 433
[21:39:58] that's the same thing that should be fixed
[21:40:06] wheeeee
[21:40:08] * aude checks
[21:41:14] oh, ok
[21:41:23] danielk wanted us to put a warning there
[21:41:46] the code is working correctly, and otherwise it skips the malformed change entry
[21:41:57] so it will spam the logs forever?
[21:42:30] it should be more infrequent
[21:42:48] i'm not sure what point we change it to something else or what's best?
[21:42:57] vs. let it fail silently
[21:43:23] if it is malformed sometime later, i'd be curious why
[21:44:07] wfDebug()?
[21:44:37] sure, we could .... where does that get logged to?
[21:44:39] is it always on?
[21:45:01] how would we know about warnings?
[21:46:01] * aude would love to know about all warnings or debug stuff relating to wikibase
[21:48:32] aude, error log has only these today
[21:49:24] ok
[21:50:30] i think daniel doesn't want it to fail silently so if there's a log for wfDebug, that might be ok
[21:50:50] * Riley is trying to figure out why he got pinged by ^
[21:51:19] Riley, due to "fail"?:p
[21:51:55] aude, it's not that easy
[21:52:13] you need to define a debug group and log specifically to it
[21:52:20] right
[21:52:31] wfDebugLog?
[21:52:31] MaxSem: Oh right, yes because I do so many fails that I !stalk fail ;)
[21:52:36] and what's the value of these anyway?
[21:52:56] the ^^ was to aude
[21:53:26] MaxSem: agree, but will have to ask daniel again
[21:53:55] if we are comfortable the problem is solved, i think if could be wfDebugLog but don't actually have to log everything
[22:12:11] aude: there's still going to be bad rows
[22:12:40] Reedy: huh?
[22:12:58] fixing the length/datatype won't fix already bad data in the database
[22:13:03] right
[22:13:28] do we want to run a maintenance script or something to clean those up?
[22:13:45] I'm not sure how much of an issue it is
[22:13:51] * aude nods
[22:13:56] it will go away eventually
[22:14:01] time of database + max rc age then it will be fine
[22:14:11] and people won't be viewing old rc (older they get...)
[22:14:40] question is if we want it to be warning level if rc_params is invalid or what
[22:18:36] if they will die off, there's no problem
[22:25:58] MaxSem: ok
[23:40:16] !log rebooting virt0 to determine which dimm is bad
[23:40:28] Logged the message, Master
[23:42:03] o_0
[23:42:29] so it was RAM not memcached
[23:43:15] well, likely, yes
[23:45:44] !log tstarling Started syncing Wikimedia installation... :
[23:45:56] Logged the message, Master
[23:56:38] !log tstarling Finished syncing Wikimedia installation... :
[23:56:48] Logged the message, Master
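
(Editorial sketch, not from the log.) The rc_params thread earlier in the evening settled on a pattern: decode the stored value, skip the entry if it did not come back as the expected structure, and send the notice to a dedicated debug group rather than the warning log. The Python below illustrates that pattern only. It is not the Wikibase code: JSON stands in for PHP serialization, the logger name `wikibase-client` is a hypothetical stand-in for a wfDebugLog-style group, and the `change_id` key is made up.

```python
import json
import logging

# Hypothetical debug group name, standing in for a wfDebugLog-style group.
log = logging.getLogger("wikibase-client")


def load_rc_params(raw):
    """Return the decoded params as a dict, or None if the value is malformed."""
    try:
        params = json.loads(raw)  # JSON is a stand-in for PHP unserialize()
    except (TypeError, ValueError):
        params = None
    if not isinstance(params, dict):
        # Debug level, so old bad rows do not flood the warning log.
        log.debug("rc_params is malformed, skipping entry: %r", raw)
        return None
    return params


def usable_changes(rows):
    """Yield (row_id, params) only for rows whose rc_params decoded cleanly."""
    for row_id, raw in rows:
        params = load_rc_params(raw)
        if params is not None:
            yield row_id, params


# Made-up rows: one good, one truncated (as if cut off by a too-short column),
# and one that decodes to a boolean instead of a structure.
rows = [
    (1, '{"change_id": 5}'),
    (2, '{"change_id": '),
    (3, "false"),
]
print(list(usable_changes(rows)))
```

Rows like 2 and 3 stay in the table until they age out of recent changes, which matches the conclusion in the channel that no cleanup script is strictly needed.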