[00:03:37] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [00:06:26] fyi: analytics gave me a list of stale pages that are referring to an old version of CentralNotice; I'm purging them all currently (which will last a while) [00:06:43] which... is why... if anyone is curious... the cluster is running a bit warm right now [00:07:04] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27485 [00:07:04] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26423 [00:11:44] !log reedy synchronized php-1.21wmf1/includes/api/ApiUpload.php [00:11:55] Logged the message, Master [00:13:48] fyi folks office network will be going off and on for a bit [00:15:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:30:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.194 seconds [00:33:22] going to drain AMS again [00:33:30] !log draining esams via authdns-scenarios [00:33:42] Logged the message, Mistress of the network gear. [00:51:37] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [01:04:04] PROBLEM - Host amssq32 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:13] PROBLEM - Host amssq36 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:13] PROBLEM - Host amssq38 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:13] PROBLEM - Host amssq39 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:13] PROBLEM - Host amssq35 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:13] PROBLEM - Host amssq40 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:14] PROBLEM - Host amssq46 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:14] PROBLEM - Host amssq37 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:15] PROBLEM - Host amssq52 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:15] PROBLEM - Host amssq41 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:16] PROBLEM - Host amssq43 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:16] PROBLEM - Host amssq45 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:17] PROBLEM - Host amssq61 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:17] PROBLEM - Host amssq54 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:18] PROBLEM - Host amssq53 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:18] PROBLEM - Host amssq58 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:19] PROBLEM - Host amssq48 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:19] PROBLEM - Host amssq60 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:20] PROBLEM - Host amssq55 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:20] PROBLEM - Host amssq44 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:21] PROBLEM - Host amssq59 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:21] PROBLEM - Host amssq57 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:22] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:22] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:23] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:05:52] PROBLEM - Host amssq33 is DOWN: PING CRITICAL - Packet loss = 100% [01:05:52] PROBLEM - Host amssq47 is DOWN: PING CRITICAL - Packet loss = 100% [01:05:52] PROBLEM - Host amssq42 is DOWN: PING CRITICAL - Packet loss = 100% [01:05:52] PROBLEM - 
Host amssq49 is DOWN: PING CRITICAL - Packet loss = 100% [01:05:52] PROBLEM - Host amssq50 is DOWN: PING CRITICAL - Packet loss = 100% [01:05:53] PROBLEM - Host amssq31 is DOWN: PING CRITICAL - Packet loss = 100% [01:05:53] PROBLEM - Host amssq34 is DOWN: PING CRITICAL - Packet loss = 100% [01:06:37] PROBLEM - LVS HTTP IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [01:06:46] PROBLEM - Host ms6 is DOWN: PING CRITICAL - Packet loss = 100% [01:06:55] PROBLEM - LVS HTTP IPv4 on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:06:56] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 54%, RTA = 6785.20 ms [01:07:04] PROBLEM - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:07:13] PROBLEM - Host csw2-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.244) [01:07:22] PROBLEM - LVS HTTP IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [01:07:26] all those are expected, please ignore [01:07:49] PROBLEM - LVS HTTP IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [01:07:49] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:07:53] YOU HAVE ANGERED THE NAGIOS! [01:07:59] :) [01:08:07] RECOVERY - LVS HTTP IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 59307 bytes in 0.574 seconds [01:08:25] RECOVERY - LVS HTTP IPv4 on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 652 bytes in 0.218 seconds [01:08:57] oh man.. i scrolled up before reading LeslieCarr's last msg [01:09:19] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 64821 bytes in 1.180 seconds [01:09:46] PROBLEM - LVS HTTP IPv4 on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:10:04] RECOVERY - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.228 seconds [01:10:40] PROBLEM - Host knsq16 is DOWN: PING CRITICAL - Packet loss = 100% [01:10:40] PROBLEM - Host amslvs1 is DOWN: PING CRITICAL - Packet loss = 100% [01:10:40] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100% [01:10:58] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:10:58] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [01:10:58] PROBLEM - Host knsq27 is DOWN: PING CRITICAL - Packet loss = 100% [01:11:07] RECOVERY - LVS HTTP IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 59307 bytes in 0.569 seconds [01:11:25] RECOVERY - Host knsq27 is UP: PING WARNING - Packet loss = 58%, RTA = 108.06 ms [01:11:25] RECOVERY - Host amslvs1 is UP: PING OK - Packet loss = 0%, RTA = 107.83 ms [01:11:25] PROBLEM - Host amslvs3 is DOWN: PING CRITICAL - Packet loss = 100% [01:11:25] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:11:25] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6_https is DOWN: PING CRITICAL - Packet loss = 100% [01:11:34] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:11:34] RECOVERY - Host knsq16 is UP: PING OK - Packet loss = 0%, RTA = 108.04 ms [01:11:34] PROBLEM - Host 91.198.174.6 is DOWN: 
PING CRITICAL - Packet loss = 100% [01:11:34] PROBLEM - Host amslvs4 is DOWN: PING CRITICAL - Packet loss = 100% [01:11:43] RECOVERY - Host knsq19 is UP: PING OK - Packet loss = 0%, RTA = 108.17 ms [01:11:43] RECOVERY - Host knsq24 is UP: PING OK - Packet loss = 0%, RTA = 107.78 ms [01:11:43] PROBLEM - Host ns2.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:11:43] PROBLEM - Host nescio is DOWN: PING CRITICAL - Packet loss = 100% [01:11:52] PROBLEM - Host hooft is DOWN: PING CRITICAL - Packet loss = 100% [01:11:52] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:12:01] PROBLEM - LVS HTTP IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [01:12:01] PROBLEM - Host wikinews-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:12:01] PROBLEM - Host bits-lb.esams.wikimedia.org_ipv6_https is DOWN: PING CRITICAL - Packet loss = 100% [01:12:02] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:12:02] PROBLEM - Host bits-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:12:29] PROBLEM - Host wikimedia-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:12:29] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39745 bytes in 0.766 seconds [01:12:31] is this still all expected? [01:12:37] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 107.94 ms [01:12:46] RECOVERY - Host 91.198.174.6 is UP: PING OK - Packet loss = 0%, RTA = 108.05 ms [01:12:55] RECOVERY - LVS HTTP IPv4 on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 0.216 second response time [01:12:55] RECOVERY - Host ns2.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.27 ms [01:12:55] RECOVERY - Host csw2-esams is UP: PING OK - Packet loss = 0%, RTA = 109.25 ms [01:13:04] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 79039 bytes in 0.768 seconds [01:13:23] !log disabled nagios notifications [01:13:37] Logged the message, Master [01:13:45] thanks binasher [01:17:56] New patchset: Asher; "- bits event.gif logging to vanadium - esams traffic needs a bouncer on oxygen, tbd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27496 [01:24:18] New patchset: Asher; "- bits event.gif logging to vanadium - esams traffic needs a bouncer on oxygen, tbd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27496 [01:25:17] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27496 [01:34:09] New patchset: Asher; "fix syntax error, move log_format override to after default file is parsed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27497 [01:35:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27497 [01:43:47] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27497 [01:48:37] !log reenabling nagios notifications [01:48:48] Logged the message, Mistress of the network gear. 
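For reference, one common way to do the notification toggling logged above ("!log disabled nagios notifications" / "reenabling nagios notifications") from a shell, assuming access to the Nagios host; the command-file path is an assumption and varies per install, and the toggle may equally well have been done through the web UI:

```bash
# Minimal sketch of globally silencing Nagios around planned maintenance.
CMDFILE=/var/lib/nagios/rw/nagios.cmd   # assumed path; check command_file in nagios.cfg

# before the maintenance window
printf '[%s] DISABLE_NOTIFICATIONS\n' "$(date +%s)" > "$CMDFILE"

# ... disruptive work (e.g. draining esams) happens here ...

# afterwards, turn notifications back on
printf '[%s] ENABLE_NOTIFICATIONS\n' "$(date +%s)" > "$CMDFILE"
```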
[01:52:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:55:26] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:55:31] !log moved traffic back to normal with authdns-scenario [01:55:42] Logged the message, Mistress of the network gear. [01:55:44] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:55:53] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:56:13] New patchset: Asher; "bits servers don't run nrpe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27499 [01:56:20] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:56:47] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:57:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27499 [01:57:33] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27499 [01:57:38] LeslieCarr: should i re-enable nagios notifs? [01:57:41] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:57:51] oh i did [01:58:07] so my phone tells me! hmm [01:58:10] oh noes [01:58:56] what's been up with rendering this week [01:59:56] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:01:20] indeed image rendering in production is pretty broken right now, is this being investigated? [02:01:29] https://commons.wikimedia.org/wiki/Special:NewFiles [02:01:34] thumbs not loading [02:02:20] hrm [02:03:12] ping apergos binasher AaronSchulz TimStarling woosters [02:03:51] i'm watching an imagescaler apache, its spent the last 20+ seconds waiting on this socket: srv220.pmtpa.wmnet:34909->ms-fe.svc.pmtpa.wmnet:www (ESTABLISHED) [02:04:07] so, looks swift fe related [02:04:42] binasher: potentially realted to https://bugzilla.wikimedia.org/show_bug.cgi?id=40514 ? [02:04:52] oh, no [02:04:53] derp [02:06:31] are the netapps ready to take over if we have to disable swift for now? [02:06:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.319 seconds [02:07:22] load on swift boxes is crazy [02:07:30] what time is it in greece? [02:07:34] on backends, specifically [02:07:38] like 3 am i think ? [02:07:39] 5am [02:08:12] load of almost 30.... [02:10:58] alright, I'll give faidon a call [02:11:50] wait io rising over the past...3 hours or so [02:12:18] getting timeouts to ms-be1 - Oct 11 02:08:33 10.0.6.215 proxy-server ERROR with Object server 10.0.6.200:6000/sdk1 re: Trying to DELETE /AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/wikipedia-commons-local-thumb.df/d/df/1stAlameinBritDefense.jpg/250px-1stAlameinBritDefense.jpg: Timeout (10s) [02:12:40] paravoid is coming on irc shortly [02:13:01] he is so nice [02:13:15] here [02:13:29] ms-be1 has a bunch of disks at 100% util [02:14:53] Oct 11 02:00:17 ms-be1 kernel: [4862557.684153] swift-object-re: page allocation failure. order:5, mode:0x44d0 [02:15:31] yeah, seeing those too [02:16:26] robla, fun with the swift cluster. investigation ongoing. [02:16:36] phooey [02:16:53] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [02:17:23] paravoid: re: the page aloc failures, i increased /proc/sys/vm/min_free_kbytes from 28k to 64k.. 
not that it has anything to do with other issues [02:18:04] queries per second is triple of what it used to be [02:18:19] LeslieCarr: did the site failover stuff you were doing a bit earlier impact uploads? [02:18:20] on the backends [02:18:38] where qps used to be all of a couple hundred? [02:18:46] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=^ms-be[1-9]&mreg[]=swift_[A-Z]%2B_hits%24&z=large&gtype=stack&title=Swift+queries+per+second&aggregate=1&r=hour [02:19:14] started about ~2h ago [02:19:32] but doesn't seem to be the same on the frontends [02:19:50] so this could be the proxies not getting a reply in time from one backend and rerequesting it from a second one [02:19:56] then a third one [02:20:16] paravoid: it also seems like deletes are up significantly. any reason why that might be? [02:20:57] hm, I take that back, the frontends have also tripled [02:21:03] didn't see the graphs right [02:21:08] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=^ms-fe[1-9].pmtpa&mreg[]=swift_[A-Z]%2B_hits%24&z=large&gtype=stack&title=Swift+queries+per+second&aggregate=1&r=hour [02:21:18] LeslieCarr: ping [02:22:20] we keep a full swift access log, I'll try digging into that [02:23:06] the frontends also started dipping into swap recently, it would seem [02:23:41] they need their usual restart, I see they still have some memory left [02:27:51] !log LocalisationUpdate completed (1.21wmf1) at Thu Oct 11 02:27:51 UTC 2012 [02:28:06] Logged the message, Master [02:28:20] deletes seem to be up a lot more than 3x - http://ganglia.wikimedia.org/latest/graph.php?r=day&z=large&c=Swift+pmtpa&h=ms-fe2.pmtpa.wmnet&v=28.5945319297&m=swift_DELETE_204_hits&jr=&js=&vl=hps [02:28:21] binasher: so, I saw an spike earlier during the day that couldn't figure out why [02:28:43] and it correlated -and correlates again- with a spike on MySQL/memcache [02:28:46] https://gdash.wikimedia.org/dashboards/datastores/ [02:29:24] it was a larger spike back then [02:29:56] could this be just a cascading issue? [02:30:26] oh hm, that's an unusual number of DELETEs [02:30:29] * paravoid greps the logs [02:32:40] swift deletes probably have corresponding mw db writes, which would explain what graphite shows.. but the performance implications there should be negligible [02:33:12] yeah I didn't mention it for performance implications but rather to try to understand what's going on [02:35:01] mwalker: what exactly are you doing? [02:35:43] for each page that referred to Special:BannerController; I'm running a action=purge against it [02:36:17] what's the list you're working from? [02:36:44] oh? [02:36:46] it's a list analytics gave me; not sure where they're storing it, but I have a copy on aluminium if you'd like to take a look [02:36:49] action=purge deletes thumbs [02:36:57] doesn't it? [02:37:04] I think it does. [02:37:13] binasher: hey [02:37:14] sorry, i got food [02:37:22] paravoid: hurm; I was hoping it was just the page content [02:37:38] let me kill it and not destroy the image thumbnailer...
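A rough sketch of what "digging into the swift access log" might look like for this DELETE spike; the log path and field numbers are assumptions, modelled on the syslog-style swift lines quoted earlier:

```bash
LOG=/var/log/swift/proxy-access.log     # assumed path, wherever the proxies log

# DELETEs per minute -- when did the spike start?
grep '"DELETE ' "$LOG" | awk '{print $1, $2, substr($3,1,5)}' | sort | uniq -c | tail -30

# which client/proxy IPs are issuing them (field number guessed from the quoted lines)
grep '"DELETE ' "$LOG" | awk '{print $6}' | sort | uniq -c | sort -rn | head
```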
[02:37:46] I hope I wasn't completely destroying it [02:37:51] binasher: all upload vips were redirected to pmtpa [02:38:03] LeslieCarr: i wanted to ask if upload.wikimedia.org was moved around during your maintenance earlier, but i don't think it's related at all [02:38:10] okay [02:38:18] I'm not sure, though...I know it purges thumbs when you do it on an image page, but not sure if just normal content pages would trigger a thumbnail regen [02:38:48] robla: I know that some image pages are in my list; not sure how many of them though [02:38:52] which IP are you doing that from? [02:38:52] so I can filter them out [02:38:59] mwalker's list could contain image pages [02:39:16] 208.80.154.6 [02:39:39] oh yeah, mwalker is causing it [02:39:52] Oct 11 02:24:58 10.0.6.204 object-server 10.0.6.215 - - [11/Oct/2012:02:24:58 +0000] "DELETE /sdd1/36784/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/wikipedia-commons-local-thumb.ca/c/ca/Languages_of_Africa_map.svg/500px-Languages_of_Africa_map.svg.png" 204 - "-" "tx4739b658b83d49deb885b4868561d5fc" "PHP-CloudFiles/1.7.10" 0.6187 [02:40:05] which mapped to [02:40:07] UPDATE /* LocalFile::upgradeRow 208.80.154.6 */ `image` SET img_size = '768360',img_width = '1534',img_height = '1461',img_bits = '0',img_media_type = 'DRAWING',img_major_mime = 'image',img_minor_mime = 'svg+xml',img_metadata = 'a:6:{s:5:\"width\";i:1534;s:6:\"height\";i:1461;s:13:\"originalWidth\";s:4:\"1534\";s:14:\"originalHeight\";s:9:\"1461.4756\";s:8:\"metadata\";s:330:\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:cc=\"http://creativecommons.org/ns#\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\">\n \n image/svg+xml\n \n \n \";s:7:\"version\";i:2;}',img_sha1 = 'aml2rgrl75caj9gfigmcpsqzqxsvje9' WHERE img_name = [02:40:08] 'Languages_of_Africa_map.svg' [02:41:48] all thumbs for around 10k images in the last 40 minutes just commons [02:41:57] so, script is killed [02:42:11] mwalker: congrats on your first site outage! [02:42:22] load on ms-be boxes dropping like a stone [02:42:25] now we know another way to kill swift [02:42:26] mwalker: woohoo! you're a member of the team [02:42:27] :) [02:42:31] qps dropping fast [02:42:35] try deleting 10k files over an entire hour.. [02:42:37] heh [02:42:38] * mwalker bows [02:42:41] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.070 second response time [02:42:43] binasher: by using it =P [02:43:11] robla: good catch, thanks [02:43:13] robla: thanks for pointing to mwalker [02:43:17] hehe [02:43:17] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.758 second response time [02:43:26] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.869 second response time [02:43:45] no prob! [02:43:55] robla: yes, ty very much! [02:43:55] problem is, the swift logs in this case are useless since the hits come from mediawiki [02:44:11] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 58880 bytes in 0.919 seconds [02:44:14] once i found an ip to go with the write queries in the common master's binlogs, i probably would have just nuked aluminium [02:44:28] it was an artifact of me being a latecomer and reading the logs rather than being in realtime :) [02:44:39] good catch indeed. thanks paravoid for coming online at an unreasonable hour to help with this one. [02:45:11] most sincere apologies all; I was watching the application servers loads; not swifts [02:45:13] aye, thanks paravoid! 
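A hypothetical sketch of the gentler rerun discussed here: skip file description pages (whose purge deletes their thumbnails via WikiFilePage::doPurge()) and throttle hard. The title list, wiki URL and interval are assumptions, not mwalker's actual script:

```bash
# Purge pages from a list, one every couple of seconds, skipping file pages.
while IFS= read -r title; do
    case "$title" in
        File:*|Image:*) continue ;;   # purging these deletes their thumbs in swift
    esac
    curl -s -X POST 'https://en.wikipedia.org/w/api.php' \
         --data-urlencode 'action=purge' \
         --data-urlencode 'format=json' \
         --data-urlencode "titles=$title" > /dev/null
    sleep 2                           # well under one request per second
done < titles.txt
```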
[02:45:14] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.980 second response time [02:45:24] mwalker: did you also run your script ~10 hours ago? [02:45:55] or some other time during the past hours? [02:46:25] paravoid: ya, you should see me spread out since about 2000 UTC yesterday [02:46:30] there was a similar spike earlier today which I noticed and scratched my head for a bit, but it was very brief [02:46:35] action=purge for image description pages causes thumbnails to be deleted, see WikiFilePage::doPurge() [02:46:37] paravoid: ya, that was me [02:46:45] funny thing is that had mwalker made his announcement 10 min earlier, I wouldn't have caught it: http://bots.wmflabs.org/~petrb/logs/%23wikimedia-operations/20121011.txt [02:47:12] robla: I was actually talking to CT at the time... [02:47:27] i missed that announcement [02:48:12] so... how many people do I owe cookies/beer to? [02:48:19] mwalker: announcing is great, but do it with !log when doing something that could effect the cluster [02:48:29] will do [02:49:07] interesting that this killed the cluster [02:49:16] swiftly [02:49:26] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.302 second response time [02:49:26] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time [02:50:29] i think the sqlite container actions make deletes suck for swift [02:50:52] these would be on the SSDs though [02:50:55] but that could be one of several things [02:51:26] it still results in wild card selects and then sqlite deletes [02:51:56] each DELETE also corresponds to three backend requests [02:52:04] to actually delete all copies [02:52:05] do the sqlite deletes also get applied via writes to the async log file? [02:52:18] no idea [02:53:11] http://ganglia.wikimedia.org/latest/graph_all_periods.php?m=swift_object_change&z=small&h=Swift+pmtpa+prod&c=Swift+pmtpa&r=hour [02:53:46] I should have looked at that graph before [03:00:35] okay, I think I can go back sleeping now [03:00:53] I'll have a look tomorrow [03:01:00] bye [03:03:53] night faidon [03:04:14] soooo... would it be ok to go back to poking the caches if I filter out all media and lower my limit for number of queries per second? or should I let it sit till tomorrow? [03:06:55] mwalker: better to wait til tomorrow, just in case [03:07:01] kk [03:07:37] aah … looks like I missed the action [03:07:55] New patchset: Asher; "splitting nrpe monitoring out of varnish::logging_config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27502 [03:07:56] heh [03:08:43] funny thing was I spoke to mwalker earlier [03:08:53] and i asked him to throttle [03:08:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27502 [03:08:59] I was throttling! [03:09:07] just not apprarently the right thing [03:09:09] to the max [03:09:12] heehee [03:09:21] hey now; that counts doesn't it? :p [03:09:37] good thing you told me to announce though [03:09:41] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27502 [03:09:45] * mwalker has much more knowledge [03:09:56] * mwalker ... now [03:10:12] u can now say u brought on a partial outage [03:10:51] dubious honour that one [03:11:27] it really is one per ops tradition :) [03:11:56] anyone heard db42 being mentioned today? [03:13:12] is that db the answer to all of our problems? 
[03:13:21] cmhohnson1 working on it [03:13:31] some h/w problem …cannot come back up [03:13:38] mutante: i didn't get a chance to ask chris about it [03:14:08] i'd like to know if he pulled the new raid controller, or what steps were taken to try getting it back up [03:14:52] binasher: i know he already removed the new raid controller yesterday [03:15:03] but it did not come back up with or without it [03:15:33] i guess it must be mainboard-ish [03:15:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:16:12] oh he tried replacing the raid controller with the old one [03:16:13] as mutante said [03:16:23] pgehres_: woosters : analytics/research just asks for it.. [03:16:33] okay, i is going to leave the irc channel and stop breaking the site [03:16:38] but they still have db1047 and Dario said a couple days are ok [03:16:45] au revoir [03:16:47] * pgehres_ was trying to make a joke about 42 ... [03:17:07] ah, of course:) hehe, getting it now [03:17:09] it apparently was as well received as image purges in swift [03:17:10] it is [03:17:46] this sure is the day for putting the "fun" in fundraising [03:17:47] RECOVERY - Puppet freshness on sq69 is OK: puppet ran at Thu Oct 11 03:17:28 UTC 2012 [03:18:07] first we lose our master db and then we bring down the cluster :-) [03:18:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.265 seconds [03:25:03] New patchset: Asher; "when a varnish instance name isn't specified, the hostname is used for shm access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27503 [03:26:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27503 [03:27:14] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27503 [03:37:40] pgehres_: better now than later, right? /me quits [03:37:56] New patchset: Asher; "variables asigned as undef in a manifest exist in an erb, oops" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27504 [03:38:55] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27504 [03:39:40] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27504 [03:39:47] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [03:39:47] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [03:52:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:05:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.167 seconds [04:12:53] !log marmontel (blog): fixed some file permissions in ./blog/wp-content for www-data the wordpress caching plugin needs to be re-activated [04:12:56] !log activated W3 Total Cache performance plugin in Wordpress (blog.wikimedia.org) [04:13:06] Logged the message, Master [04:13:17] Logged the message, Master [04:40:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.017 seconds [05:28:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:42:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.583 seconds [06:16:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:32:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.169 seconds [07:03:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:21:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.065 seconds [07:33:34] apergos: hey Ariel :-] Do you have anyway to fix up the labs primary NS server ? [07:33:51] labs-ns1.wikimedia.org seems to be missing entries :( [07:34:13] I have no idea about labs at all, I am sorry [07:36:05] apergos: I understand :-D [07:36:50] I will use my host file meanwhile [07:38:39] ok [07:38:42] good luck [07:53:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:01:21] Hey tech team, did someone re-enable editing for wikimania2010wiki recently? 
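A quick way to confirm the sort of gap hashar describes for labs-ns1 is to ask both labs nameservers for the same record and compare; the first nameserver's name and the instance name below are assumptions used only for illustration:

```bash
# Compare answers from the two labs nameservers for one instance record.
for ns in labs-ns0.wikimedia.org labs-ns1.wikimedia.org; do
    echo "== $ns"
    dig +short @"$ns" some-instance.pmtpa.wmflabs
done
```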
[08:03:09] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [08:03:10] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [08:03:10] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [08:06:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.298 seconds [08:07:35] oh, nevermind, there was a bug for that zzz [08:34:39] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:36:09] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:36:36] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:37:11] swift pmtpa spike [08:37:33] lots more network traffic [08:38:06] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [08:39:18] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.035 second response time [08:39:27] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [08:39:36] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 59453 bytes in 0.269 seconds [08:41:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:50:35] meh memory [08:50:46] !log restarted ms-fe4 swift-proxy [08:50:59] Logged the message, Master [08:56:06] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [08:57:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [08:59:28] !log restarted swift proxy on ms-fe3 and ms-fe2 with some delay between. note I had to shoot stale processes from Sept on all these boxes as well [08:59:37] I was about to ask about that [08:59:40] Logged the message, Master [08:59:56] apache logs have numerous swift related warnings [09:00:21] they would. those should go away now [09:02:35] how does it look? [09:29:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:39:48] They're filtering out slowly [09:40:27] The logs dont get as many lines as they did before, so when you're using the last 1000 lines... takes a while for them ot disappear [09:40:38] Oh, no. All gone [09:41:04] yay [09:44:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [09:50:59] New patchset: DamianZaremba; "Adding in nscd file to override cache times." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27517 [09:51:01] so about that wgCacheEpoch... [09:51:25] ? [09:51:27] I think in about a month we're going to want to move that up again [09:51:29] here's why [09:52:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27517 [09:52:46] pages with math images, not edited since... 
dunno but sometime early September, revalidated by squid after sept 8 00:00 2012 but before oct 5 12:17 2012 [09:53:07] (after epoch date value but before the sync of the setting change) [09:53:17] have the bad math paths in 'em [09:54:06] if we get lucky they'll fall out of the parser cache but if they are viewed enough I guess they won't [09:54:17] and revalidation will give us the bad paths yet again [09:55:09] (I looked at a pile of pages to see what was going on: this was it.) [09:57:39] I'm a little uncertain about the parser cache end of things actually [09:57:44] Ahh [09:57:52] but I know what the squids are doing, exactly [09:58:51] so I don't know if revalidation in a month is going to give us a good value or a bad one yet [09:58:51] we could make it increase 1 day/day for the next month [09:59:32] on Oct 22 I can check an example I have which will be up for revalidation then [09:59:49] if I can find one that expires sooner I'll check it [10:02:16] * apergos goes back to watching dtraces, most of the requests have no referer and it's impossible to find the page that contains them [10:04:33] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [10:09:13] the stuff I saw revalidated after oct 5 was ok so maybe if we just wait the rest of them will be fixed up [10:17:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:24:10] meh, not unless they fall out of squid cache, which they may not, I have one here with a date of sept 10 >_< [10:28:29] 60 days, I see *sigh* [10:32:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.167 seconds [10:52:33] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [11:03:41] why does a revalidation give bad paths again? [11:04:09] the ones that happened before oct 5 and after sept 8 [11:04:22] got whatever the parser cache had in there [11:04:33] which was likely old since wgcacheepoch was from a year ago [11:04:40] yes [11:04:58] I guess that in 60 days from oct 5 whatever squid has will be good [11:05:06] yes [11:05:09] and probably before that [11:05:16] I don't think it revalidates at the very last moment [11:06:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:06:29] so until then we are going to have GETs to ms7 for these bad math paths [11:06:41] yes [11:09:04] the real lesson to be learned was that the cache epoch should have changed when the math paths were changed. oh well [11:09:18] yes [11:09:57] and now it's time for lunch [11:10:37] * apergos waits for the next "yes" :-D [11:11:08] no [11:11:40] too bad, eating anyways [11:14:29] hmm [11:14:42] so to copy thumbs onto the netapp [11:14:47] I wonder what the best way to go is [11:16:27] we don't care if we get them all, just some large amount right? [11:16:41] actually we do need them all for purging [11:16:50] uugghhhhh [11:17:13] they'll have to be copied form siwft then, there's no hope for it [11:17:41] yes [11:17:45] aaron had a way to do that iirc [11:18:20] his sync script [11:18:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.149 seconds [11:19:04] if that will fix up an out of date copy, you could rsync the thumbs that are on ms5 (many were removed) and then have him run his script [11:19:14] but I don't know what his thumb script does [11:19:37] many were removed? 
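One way to spot-check the squid-revalidation behaviour described above is to fetch an affected article anonymously and look at the cache headers; Date/Age indicate when the text squids last (re)validated their copy. The article title is a placeholder:

```bash
# Anonymous HEAD request so it hits the squid cache rather than bypassing it.
curl -sI 'https://en.wikipedia.org/wiki/Some_article_with_math' \
  | grep -iE '^(Date|Age|Last-Modified|Cache-Control|X-Cache)'
```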
[11:19:46] yes, [11:20:05] ah right [11:20:41] not randomly, it was large sizes and not in use on a project, iirc [11:20:48] but it will be way out of sync with swift most likely [11:21:05] still there's 4.6T of em over there [11:21:11] that's why ms6 is larger than ms5 [11:21:42] how up to date is ms6? [11:21:55] should be very up to date [11:22:02] it follows the purge stream [11:22:12] maybe the rsync should come from there [11:22:12] whether one can trust that, I dunno [11:53:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:09:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds [12:19:11] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [12:42:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:44:32] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [12:54:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.854 seconds [13:00:15] hashar: why do you need python-git et al for lucid? [13:00:19] why isn't precise enough? [13:00:25] (it's much more complicated) [13:00:55] the new Gerrit/Jenkins gateway (zuul) is written in python and need to run on the jenkins host [13:00:57] which is gallium [13:01:00] and still run Lucid [13:01:07] will be happy to upgrade the box to Precise though ;-D [13:01:31] but that looked harder to me compared to backporting the packages [13:01:37] hello btw :-] [13:02:15] hi :) [13:04:37] paravoid: I am afraid the python modules would need a lot of dependencies to be backported [13:05:06] a workaround would be to ship all the modules as a zip file and extract them directly [13:05:08] not ideal though [13:05:15] or we could upgrade Gallium to Precise ;-] [13:05:23] or setup a new box and migrated jenkins to it [13:05:46] it's puppetized isnt it [13:05:53] kind of [13:06:23] I installed Jenkins on labs after some fix (committed them) [13:06:55] so that should be "easy" to replicate gallium on another host [13:07:52] we'll need to do it at some point, we might just as well do it now [13:08:26] I am all for it [13:08:38] backporting from quantal to lucid is usually tricky, it's a lot of distance/years between them [13:10:10] upgrading to Precise might just leave us with back porting of php5-parsekit and making our jenkins package available to Precise [13:10:18] (both in RT 3579 with bug # attached) [13:10:59] then the 4 python modules (git, gitdb, async, smmap) would need back porting from Quantal to Precise [13:11:13] another possible way would be to have our own pip repository, but that is probably unwanted ;-] [13:12:03] that's what I'm doing now [13:12:27] tried to do it for lucid too, but it's tricky and I'd prefer to avoid spending time for lucid [13:12:34] yeah that's nonproductive [13:14:42] so how do we migrate to Precise ? Simply backup + do-precise-upgrade or do we setup a new box with a new installation? [13:14:53] new box I'd say [13:14:54] s/we/ops/ ;-]] [13:15:02] if it's easy/automated, reinstalls or migrations to different hw are best [13:15:12] most probably [13:15:49] the added value is that it will let me have the new gateway system installed in production without interfering with prod [13:15:57] err with the current setup [13:16:16] so whenever we do a switch, we will be almost sure that everything is fine [13:16:24] mark: do we have an available box? 
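A sketch of the rsync being discussed, seeding the netapp from ms6 (which follows the purge stream and so should be the fresher copy); the source and destination paths and the bandwidth cap are assumptions:

```bash
# Copy the thumb tree from ms6 onto the netapp mount, rate-limited.
rsync -a --bwlimit=20000 ms6:/export/thumbs/ /mnt/netapp/thumbs/
```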
[13:16:39] so a simple reinstall won't work here? [13:16:39] should I ask Rob? [13:16:57] oh we could [13:17:08] hashar: what do you think? what are the uptime requirements for the box? [13:17:13] i.e. can we withstand a downtime [13:17:33] we can take it down for a few hours, ideally during european morning [13:17:38] er, s/withstand/tolerate/ [13:17:55] would need to warn the developers on wikitech-l [13:18:01] if we can't even tolerate a bit of downtime of these kinds of services anymore we better triple our team size ;) [13:19:35] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=Miscellaneous+eqiad&h=gallium.wikimedia.org&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [13:19:43] gallium has ton of unused memory ;-D [13:27:53] ah yeah [13:28:00] i have 2 new swift frontend nodes in esams as well [13:28:40] New patchset: Demon; "Tweaks to gerrit package:" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/27531 [13:29:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:29:32] oh really? [13:29:34] what kind of boxes? [13:29:49] - [+] 2 frontend nodes? [13:29:50] - [+] configuration [13:29:50] - [+] Dell R620 [13:29:50] - [+] 2x intel E5-2640 [13:29:50] - [+] 16 GB memory [13:29:50] - [+] small SATA drives [13:30:03] nice [13:30:13] also two database class machines [13:30:13] with ssds [13:40:20] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [13:40:20] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [13:46:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [14:09:36] heya paravoid, you around? can I ask you a ganglia question? [14:09:58] I am [14:10:10] I'm not a ganglia expert but shoot anyway [14:10:30] k, i don't think I need an expert [14:10:41] i'm trying to set up ganglia monitoring for hadoop [14:10:48] i kinnnnnda understand ganglia and our setup, but not really [14:10:59] so I think I just need to type out what I think I understand, and you can correct me where I'm wrong [14:11:19] so the analytics machines are part of the 'Miscellaneous eqiad' cluster data source [14:11:28] which in ganglia.pp are defined as [14:11:31] "carbon.wikimedia.org ms1004.eqiad.wmnet" [14:12:07] so those two machines [14:12:13] are set up as ganglia aggregators (right)? [14:12:24] for misc eqiad machines [14:12:29] yes [14:12:53] they each listen to a multicast addy that the misc eqiad machines send to? [14:12:55] correct? [14:13:04] yes [14:13:31] ok, so, the analytics machines gmond is already sending to this multicast addy [14:13:35] hadoop has built in ganglia support [14:13:44] but I need to give it some configs on where to send its stuff [14:13:51] woudl that be the local machine name :8649 then? [14:13:59] !log apt: including php-parsekit, python-async, python-git, python-gitdb, python-smmap backports for precise [14:14:06] somehow connecting to the local gmond? [14:14:11] Logged the message, Master [14:14:17] either the local machine or the aggregators I think [14:14:23] or, should I configure hadoop to also point at the multicast addy [14:14:23] but you should setup a special group for the analytics machines [14:14:24] ? [14:14:31] yeah that's what notpeter suggested too [14:14:44] can I try to see if I can get hadoop into the misc eqiad stuff first, and then it up as a separate cluster? 
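For reference, a generic sketch of the kind of no-change rebuild under discussion (a quantal source package rebuilt on precise); this is not necessarily how paravoid actually built the backports, and it assumes a quantal deb-src entry is configured on the build host:

```bash
# Fetch the quantal source package and rebuild it on a precise box/chroot.
apt-get source python-git           # needs a quantal deb-src line in sources.list
cd python-git-*/
dch --local '~precise1' 'No-change backport to precise'
dpkg-buildpackage -us -uc -b
```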
[14:15:04] it probably makes more sense to do it the other way round [14:15:06] i *think* the first step (getting hadoop stats into ganglia) is just a matter of a single hadoop config file [14:15:09] oh? [14:15:09] pok [14:15:10] ok [14:15:12] if you run a git log (or blame then log), you'll find some commits doing that [14:15:16] e.g. I did that for LVS [14:15:22] ok cool, i'll read that commit then [14:15:36] not sure what this comment means: [14:15:36] # NOTE: Do *not* add new clusters *per site* anymore, [14:15:36] # the site name will automatically be appended now, [14:15:36] # and a different IP prefix will be used. [14:16:44] that you make an "Analytics" group, and automatically site names will be appended, so "Analytics eqiad" etc [14:17:03] ahhhh site == eqiad,pmtpa etc. [14:17:04] ok [14:17:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:17:20] ahh i see [14:17:21] cool [14:17:59] ok, so to make my own group, should I set up gmond as an aggregator on one or two of the analytics machines? [14:18:29] yes, by setting $ganglia_aggregator = "true" in the puppet node entries [14:18:40] right, saw that [14:18:56] and ip_oct is arbitrary, as long as it is not used? [14:19:10] use the next free one [14:19:15] right, cool [14:19:35] ok, then, once that is all configed, then I should see these machines as their own group in the ganglia.wm.org gui? [14:19:53] if you make sure the gmetad.conf is adapted as well [14:19:57] oo [14:20:01] ah right [14:20:05] the data_sources [14:20:39] looks like the data_sources should have the site name appended, eh? [14:20:45] yes [14:20:46] Analytics eqiad [14:20:46] k [14:20:51] it's not very consistent [14:20:54] hehe [14:20:59] see a previous commit that does that [14:21:03] i think that file should be generated from the other data structure [14:21:04] you'll see it fixing all the bits [14:21:10] but whatever [14:21:43] yeah i'm reading yours now paravoid, the LVS one [14:21:52] pretty simple [14:23:01] ok hmm, while I have your attention i'm going to keep asking qs before I do this part [14:23:20] is there an example of someone setting up a, um, custom ganglia metric generator thingee? [14:23:20] heh [14:23:35] there are a couple of ways to do that [14:23:39] i see the mysql.pyconf stuff [14:23:45] but I don't think I need a custom ganglia module for this [14:24:04] ottomata: it was missing something, there was a subsequent commit [14:24:06] so from what I've read, there can be modules, and um, a way to 'spoof' ganglia metrics by just sending packets? [14:24:13] paravoid: oh your commit? 
[14:24:17] yes [14:24:24] yes [14:24:51] so I assume that the hadoop stuff is going to do the 'spoof' way [14:24:55] just by sending stuff over [14:25:00] no idea [14:25:11] aye, well, the instructions are in a hadoop-metrics.properties file [14:25:13] tha thte hadoop daemons read [14:25:17] nothing ganglia specific to configure [14:25:26] but, they need to be told where to send metrics to [14:25:37] probably the multicast group [14:25:42] hmmmmmmmmmmm right [14:25:46] hmmmmm, right [14:25:46] ah [14:25:50] hmm making sense now [14:25:52] ok cool [14:25:55] will try this, thanks guys [14:34:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds [14:35:27] welcome back Rob :) [14:35:41] take some pictures there, RobH [14:36:57] Danny_B|backup: gotta clear that well in advance and have smarthands on site [14:37:00] so thats not happening [14:37:06] but recently we pushed a bunch of them. [14:55:57] !log shutting down db1012 to remove evaluation ssd card [14:56:10] Logged the message, RobH [14:59:18] PROBLEM - Host db1012 is DOWN: PING CRITICAL - Packet loss = 100% [15:03:39] RECOVERY - Host db1012 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [15:04:33] RECOVERY - Host db42 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [15:05:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:18] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:08:27] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:08:36] PROBLEM - SSH on db42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:36] PROBLEM - MySQL Idle Transactions on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:08:54] PROBLEM - mysqld processes on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:09:03] New patchset: Ottomata; "Setting up ganglia aggregators for Analytics cluster." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27535 [15:09:03] PROBLEM - MySQL disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:09:03] PROBLEM - Full LVS Snapshot on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:09:03] PROBLEM - MySQL Recent Restart on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:09:21] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:10:06] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/27535 [15:10:30] whaaaat? [15:10:36] oop [15:11:13] good ol' lint check, what would we do without you? :) [15:11:14] New patchset: Ottomata; "Setting up ganglia aggregators for Analytics cluster." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27535 [15:12:12] PROBLEM - Host db42 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27535 [15:14:24] paravoid, would you mind reviewing that real quick? [15:23:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [15:26:01] mark, if you are still around, could you review this real quick? 
[15:26:01] https://gerrit.wikimedia.org/r/#/c/27535/ [15:26:01] It is the discussed ganglia changes for analytics [15:26:38] i could merge it, and it would probably be ok since we discussed it, but it would be better to ask for review, right? [15:26:56] oh sorry mark [15:26:59] i miseed that you +1ed it [15:27:01] just saw that [15:27:12] i'm going to go ahead and merge it then [15:27:26] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27535 [15:28:24] :) [15:29:17] ottomata: so we resolved the redirectoin for serial in post on the one system yesterday [15:29:25] oh? [15:29:31] but i see ticket https://rt.wikimedia.org/Ticket/Display.html?id=3582 shows 1007 isnt booting right? [15:29:49] we fixed 1002 yesterday, different issue [15:30:01] just checkign with you that the status on this (1007) hasnt changed since ticket was made [15:30:11] cuz im going to go poke at it if its still busted. [15:30:20] yes, both 02 and 07 are busted [15:30:25] slightly different problems [15:30:28] both had the redirection reset [15:30:31] but we fixed that on both of them [15:30:32] now: [15:30:40] 07 will PXE boot and fully install, but will not boot from HDD [15:30:45] no matter what I set boot order too [15:31:07] 02 looks like it tries to PXE boot, but hangs at a certain point (or at least, I don't see anymore console output after a certain point) [15:32:21] ok, i see no open ticket for 1002, i will take a look at 1007 now [15:32:31] i'll reopen the old 1002 ticket as well [15:33:07] hrmm [15:33:12] i guess we didnt do a 1002 ticket [15:35:17] 1007 is hung at some efi shell [15:35:20] tht is locked up [15:35:40] !log working on analytics1007 per rt 3582 [15:35:51] Logged the message, RobH [15:37:03] damn it [15:37:13] i just sheared through my headphone cables a third time on a rack [15:37:15] they are gone now. [15:37:28] rip ue earphones [15:37:31] last pair i get of them [15:37:46] mark: you get any shure earbuds? [15:39:07] New patchset: Ottomata; "Need to set $cluster for analytics" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27536 [15:39:08] wait i mad a 1002 ticket yesterday [15:39:22] https://rt.wikimedia.org/Ticket/Display.html?id=3680 [15:39:27] pssh, i assigned it to you [15:39:42] hmm, i can't cahnge the owner! [15:40:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27536 [15:40:53] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27536 [15:42:43] heh [15:42:51] ahh, ops requests [15:42:54] i was lookin in eqiad queue [15:43:15] moved, and taken [15:43:23] bwerp, I am a noob RT creator [15:44:00] so ops requests are for folks outside of ops to put in requests [15:44:09] your an odd duck in your cross departmental responsibilities [15:44:28] but, if its something that is an onsite issue, you should feel free to create the relevant ticket in the queue for that site [15:44:34] eqiad for here, sdtpa for tampa, etc. [15:45:03] hmm, ok cool [15:45:05] good to know [15:45:25] sdtpa? not pmtpa? [15:45:33] (what do these letters stand for, btw?) [15:45:40] robh: want to run my troubleshooting by you on db42 see if i am missing something...u have a few mins? [15:46:13] sd switch and data and pm power medium ...tpa is the airport code [15:46:26] lemme ping you in a bit, cisco is giving me shit right now [15:46:37] cool...thx! [15:46:46] ARGHAEFQWEIPOFwed [15:46:53] ottomata: 1007 reset its redirection to off again!! 
[15:47:02] ;_; [15:47:07] its trying to drive us insane. [15:47:43] grwaaaaaa [15:47:47] that is annoying [15:47:50] unless something in cli is doing it [15:47:54] hmmm [15:47:54] not me! [15:47:56] im not seeing this on virts [15:48:03] lemme check cli [15:48:08] lets try to do all stuff in gui from here on out though other than the console redirect view [15:48:18] i set it back to the right stuff in gui [15:48:21] and booting it now. [15:48:22] boot order remanes teh same [15:48:27] well, even if no console [15:48:30] if boot order is good [15:48:32] it should boot, right? [15:48:38] and we shoudl eventually be able to log in? [15:48:50] should, unless its hitting an error and we dont see due to redirection not working [15:48:59] so its all set right and posting now [15:49:07] but it should boot regardless of redirection, yes [15:49:27] ok, i see memtest [15:49:30] did you just reboot it? [15:49:32] no [15:49:37] just connected to console [15:49:39] odd, nm, didnt reboot [15:49:43] it just spun all the fans like crazy [15:49:44] heh [15:49:57] fan test! [15:49:58] interesting, ciscos let two users on console [15:50:05] yeah, i saw a setting in bios for that [15:50:08] concurrent sessions or something [15:50:10] was set to 4 [15:52:38] nice firefox. i have 4gb memory and you are taking 1.9gb. [15:54:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:06] datacenter is loud sans headphones ;_; [16:02:36] RECOVERY - Host analytics1007 is UP: PING WARNING - Packet loss = 64%, RTA = 26.57 ms [16:03:55] RobH: no, my UEs died recently as I noticed on my SF flight [16:03:59] but I soldered them [16:04:19] now they're working fine again [16:04:23] my cable has been cut and replaced now two other times [16:04:31] its all frayed up and getting too short =P [16:04:38] hehe [16:04:43] and now the right monitor is staticy, this is logitec build [16:04:52] not the older more reliably style =P [16:04:57] yeah [16:04:58] reliable even [16:05:14] supposedly the UE 10 is still okish, but i dunno [16:05:27] the triplefi one, now at much lower price than back in the days [16:05:40] the logitec build has a cheaper monitor cable, and the earbuds slip off too easily. [16:05:55] but still pretty high right? [16:06:10] yeah, still over $100 iirc [16:06:19] hrmm, im going to take them apart when i get home, cannot break them any worse than they are. [16:06:43] that's what I was thinking as well [16:06:54] worked out fine [16:08:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.206 seconds [16:10:34] ottomata: so slllllooooowwwwww [16:10:36] =P [16:10:42] hehe [16:10:52] should we try the same on an02 while we wait? [16:11:14] lemme take alook at its settings [16:11:20] mark, ok, i've made ganglia group changes, and I ran puppet on nickel [16:11:28] or rather, I let puppet run [16:11:34] I see the change in the gmetad.conf file [16:12:00] I don't yet see the Analytics cluster in the 'Wikimedia grid' list at ganglia.wm.org [16:12:05] ottomata: so boot order is wrong on 1002 as well, going to set it right and do reinstall per the steps i just didn on 1007 [16:12:07] do I need to restart anything to get it to show up? or should I just wait [16:12:13] ok [16:13:53] i set it properly and its rebooting now as well 1002 that is [16:14:00] 1007 seems to have finished installer, and is reposting [16:14:02] k, i'm watching in console [16:14:03] if it loads os its fixed. 
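Returning to the hadoop-metrics.properties question from earlier: a minimal, assumed example of pointing the Hadoop daemons' metrics at the ganglia multicast group. GangliaContext31 is the stock Hadoop metrics-v1 ganglia context, but the file path and the multicast address are placeholders; only the :8649 port appears in the log:

```bash
# Write a minimal metrics config telling the dfs/mapred daemons where to send.
cat > /etc/hadoop/conf/hadoop-metrics.properties <<'EOF'
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=239.192.1.32:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=239.192.1.32:8649
EOF
```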
[16:14:13] hrmm, nm, still updating [16:14:19] grub, but will be done soon. [16:14:21] aye [16:14:38] not used to seeing it take so long. [16:15:30] there it goes (1007) rebooting [16:15:51] I am still super bummed that I am in the datacenter with no music! [16:15:56] =P [16:16:35] mark: i have lost so much of the UE cable I may just splice these monitors onto a ipod cable [16:16:37] heh [16:16:58] cuz the ue cable is fine to the y spilit to each monitor, will look funny but meh. [16:19:43] ...... [16:19:49] ottomata: 1007 just posted and rebooted. [16:20:36] 1002 is set to boot on disk first [16:20:45] but it seems to just blank screen, os may not be correct, but dunno yet [16:20:50] i kinda wanna see what 1007 does [16:21:40] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:22:15] RECOVERY - Host db42 is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [16:25:06] now its just hitting blank screen when os should load. [16:25:07] wtf. [16:29:33] ottomata: so when i say to boot from the disk [16:29:34] it fails [16:29:42] when i specify it intentionally [16:29:50] this partman script has been used to make successful builds right? [16:39:31] !log authdns update for pc1001-1003 mgmt ip [16:39:42] Logged the message, RobH [16:43:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:43:33] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [16:44:00] PROBLEM - Host cp1031 is DOWN: PING CRITICAL - Packet loss = 100% [16:45:30] PROBLEM - NTP on db42 is CRITICAL: NTP CRITICAL: No response from NTP server [16:49:18] ping robh [16:49:20] RobH, sorry [16:49:23] was afk for lunch [16:49:24] yes [16:49:35] all of the other ciscos have been installed using same partman [16:50:45] !log dist-upgrade and reboot on oxygen [16:50:56] Logged the message, notpeter [16:50:56] !log preilly synchronized php-1.21wmf1/extensions/MobileFrontend 'update for landing page' [16:51:08] Logged the message, Master [16:51:11] notpeter ns0 is down - pls restart it [16:52:28] !log restarting pdns on ns0 [16:52:39] Logged the message, notpeter [16:52:41] woosters_: thanks! :) [16:53:01] thks! [16:53:10] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.044 seconds response time. www.wikipedia.org returns 208.80.154.225 [16:54:08] woosters_: yes? [16:54:11] heya, notpeter [16:54:18] i've created a new analytics ganglia group [16:54:21] ottomata: hrmm, i dunno wtf is up with them. [16:54:22] robh - peter fixed it [16:54:34] how do I get it to show up in ganglia.wm.org? [16:54:52] ottomata: im going to work on a couple other tickets to clear my brain and loop back to them shortly. [16:55:38] ok cool, thanks [16:56:07] ottomata: you should be able to just run puppet on nickel [16:56:10] working on ciscos is annoying since the boot process is so slow. [16:56:29] notpeter, I did that [16:56:35] hrm [16:56:35] ok [16:56:36] i see the new conf in the gmetad file [16:56:40] do I need to restart the web gui? 
[16:56:46] I don't think so [16:56:51] it also might not be getting traffic yet [16:56:58] hmmm [16:57:01] it doesn't show things if they're completely empty [16:57:07] well, i know that the analytics machines do show up in ganglia [16:57:09] I'd trace the flow of packets with tcpdump [16:57:10] they did before [16:57:10] running a job now [16:57:14] or just leave it for 30 minutes [16:57:17] when they were in the misc eqiad group [16:57:21] hehe, it's been an hour [16:57:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [16:57:31] ok, I can take a look if you'd like [16:57:36] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=analytics1001.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [16:57:39] it is still in misc eqiad [16:57:54] can you link to the checkins you made? [16:57:57] yup [16:58:23] https://gerrit.wikimedia.org/r/#/c/27535/ [16:58:24] https://gerrit.wikimedia.org/r/#/c/27536/ [16:59:38] RECOVERY - udp2log log age for lucene on oxygen is OK: OK: all log files active [17:00:51] in fact, going to snag food, back shortly [17:00:58] hrm, ok. that looks right. I shall inwestigate further [17:01:35] New patchset: Reedy; "Initial stab at Wikidata config" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/27546 [17:01:43] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:01:43] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:10] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:12] ottomata: can you take a look at oxygen and double check that everything came back up properly?
[17:02:19] it looks like it to me, but I'd love confirmation [17:02:28] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:28] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:28] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:28] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:28] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:29] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:29] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:30] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:13] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:13] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:13] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:13] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:13] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:22] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:22] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:22] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:04:09] wt...? [17:07:07] uuuuhhh, weird [17:08:02] notpeter, oxygen looks cool [17:08:12] ottomata: ok, data gets to the aggrigators and that's then sent to the correct multicast address. but nickel doesn't pick it up [17:08:23] hm [17:08:34] also, hm, actually [17:08:41] I didn't get the aggregator conf to apply on an1001 [17:08:42] not sure why [17:08:47] 1010 picked it up no problem [17:08:53] but puppet didn't make any changes to 1001 [17:08:58] weird [17:09:02] maybe my hostname match is bad [17:09:06] seems like it should work though [17:09:17] so it still says deaf = yes [17:09:30] huh [17:09:39] so the data gets from 1010 to the multicast addy? [17:09:58] looks like it [17:10:11] I just don't see it get to nickel :/ [17:10:11] ok, was gonna wonder if gmond needed a restart, but if that's happening then that hsould be fine [17:10:13] hm [17:11:27] wwell those two processes on hose cp boxes have been there for more than a month [17:11:29] *those [17:14:38] ottomata: I would probably ask someone more knowledgeable about networking than I for wisdom on this issue [17:15:08] bwphhhh, mk, to me that means lesliecarr, but she is not on my list over there [17:17:45] New patchset: Matthias Mullie; "Make abusefilter emergency disable more sensible" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25855 [17:17:59] notpeter, how did you see the multicast traffic? [17:18:00] netcat? [17:18:12] tcpdump port 8649 [17:18:25] from 1010? or nickel? 
[17:18:37] 1002, 1010 and nickel [17:20:10] hmmm [17:20:17] i see analytics traffic on nickel [17:20:26] this is the old misc one: [17:20:26] analytics1001.wikimedia.org.55985 > 239.192.1.8.8649: [17:20:51] but also analytics1010.eqiad.wmnet.8649 > nickel.wikimedia.org.53259 [17:22:01] didn't look at 1001, tbh [17:22:44] ok, i'm going to try to get it to apply confs, maybe it is messing with the whole thing since it is not doing the correct multicast addy yet [17:23:12] I see it going to both, tbh [17:23:15] which is weird...... [17:23:23] yeah, applying conf++ [17:25:55] New patchset: Ottomata; "site.pp - manually specificying ganglia aggregator hostnames" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27549 [17:26:14] uh.... what's up with analytics1001? why doesn't it send snmp? [17:26:58] there are funny iptables rules maybe! [17:27:05] it should allow all internal traffic [17:27:08] New review: Werdna; "Fine by me." [operations/mediawiki-config] (master); V: 1 C: 1; - https://gerrit.wikimedia.org/r/25855 [17:27:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27549 [17:27:11] but maybe something is not right? [17:27:21] it should at least try to send in puppet [17:27:26] ? [17:27:38] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27549 [17:28:40] analytics1001 is being really weird [17:28:43] it won't apply these configs... [17:29:03] New patchset: Matthias Mullie; "Make abusefilter emergency disable more sensible" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25855 [17:30:00] notpeter, am I a dummy? [17:30:01] if ($hostname == "analytics1001" or $hostname == "analytics1010") { [17:30:01] $ganglia_aggregator = "true" [17:30:01] } [17:30:03] what's wrong with that? [17:30:19] that works for 1010 [17:30:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:47] I think that /analytics(10[0-9][0-9])\.(wikimedia\.org|eqiad\.wmnet)/ [17:30:53] isn't getting 1001 somehow.... [17:30:58] which is weird [17:31:04] but that's really what it looks like [17:31:17] hmmmmm [17:31:39] I'd try separating that into two node defs [17:31:41] that are simpler [17:31:43] won't hurt [17:31:48] ok [17:31:56] analytics1001 is diff enough that i'll have to do that eventually anyway [17:31:59] that regex def matches though [17:32:09] I know.... [17:32:14] might be puppet weirdness... [17:32:28] but yeah, that box doesn't seem to be getting the standard class....
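The ganglia back-and-forth above comes down to two questions: is the new aggregator actually listening on the multicast group, and is gmetad on nickel polling it? A minimal sketch of those checks, assuming the stock /etc/ganglia file locations and default init script names; the interface name is a guess and should be adjusted.

    # On the intended aggregator (analytics1001 / analytics1010):
    # it must not be deaf, and it needs a udp_recv_channel for the group.
    grep -E 'deaf|mute' /etc/ganglia/gmond.conf
    grep -A4 'udp_recv_channel' /etc/ganglia/gmond.conf    # expect the multicast group and port 8649
    sudo tcpdump -n -i eth0 udp port 8649                  # are metric packets actually arriving?
    sudo service gmond restart                             # after puppet rewrites the conf; the init script may be named ganglia-monitor instead

    # On nickel: gmetad only shows a cluster it has a data_source line for,
    # and it usually needs a restart after gmetad.conf changes.
    grep '^data_source' /etc/ganglia/gmetad.conf
    sudo service gmetad restart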
[17:32:28] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25855 [17:32:47] hmmmmmmmmmmmmmmmmmmm [17:32:59] yeah, seperate it out and see what happene [17:33:00] s [17:33:15] ahhh, i think i know, hang on [17:34:43] do tell [17:45:22] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [17:45:31] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [17:45:31] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [17:45:40] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [17:45:46] ok, how would I guess there were supposed to be three processes [17:45:48] sheesh [17:45:49] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [17:45:49] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa [17:45:49] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [17:45:49] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:07] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:07] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:07] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:07] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:07] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:16] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:16] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [17:46:34] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:34] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:34] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:52] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [17:50:26] robh - on rt3644,is it 2 or 3 that you set up? [18:03:58] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [18:03:58] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [18:03:58] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [18:11:58] notpeter: can you please merge and push this change https://gerrit.wikimedia.org/r/#/c/27554/ [18:12:07] sure [18:12:28] notpeter: thanks! [18:12:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27554 [18:18:25] binasher: for the Wikidata test wiki, are we alright to use s3 for its database? I know it can be moved later if necessary... 
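Back on the varnishncsa flaps earlier in this stretch: the alert is a Nagios process-count check, so "how many processes are supposed to be there" is encoded in the check's threshold rather than anywhere on the host. A rough sketch of what that check looks like, assuming the stock nagios-plugins path; the exact thresholds used in production may differ.

    pgrep -c varnishncsa                                        # what is actually running on the cache host
    /usr/lib/nagios/plugins/check_procs -c 3:3 -C varnishncsa   # CRITICAL unless exactly three loggers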
[18:19:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:21] Reedy: yep, that looks ok [18:27:55] New review: Tychay; "LGTM" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25855 [18:31:16] PROBLEM - Host db42 is DOWN: PING CRITICAL - Packet loss = 100% [18:32:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.389 seconds [18:39:04] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [18:49:04] binasher: update on db42 [18:49:18] it wont load into the os, so cmjohnson1 is going to burn and load it up on rescue cd [18:49:27] get networking up and ssh server, so you can login to it [18:49:53] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:49:53] PROBLEM - Host virt1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:49:53] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:49:56] then either cmjohnson1, you, or myself can attempt to mount the filesystem and pull data. [18:50:14] !log i need to pull virt1001-1003 out of nagios later, as they are now renamed [18:50:25] Logged the message, RobH [18:51:06] RobH: fyi pc1001-1003 are all ready to go [18:52:12] RobH: cmjohnson1: sounds good, let me know when its ssh'able [18:53:30] binasher: so pc1001-1003 need ip's assigned in dns, adding to dhcpd files (can just change existing virt1001-1003 entries), and os load [18:53:37] did you want to do that or have me handle it later? [18:55:09] RobH: would you be able to handle it by the end of day monday? [18:56:44] binasher: Hey, when would be a good time in the near future for you to get hijacked by the fundraising tech corner? We'd love to have a long convoluted conversation about caching with you. [18:57:04] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [18:57:55] binasher: i assumed i would handle it today or tomorrow, so yes =] [18:58:14] perfect [18:58:40] binasher: cool, stealing back rt 3644 and will update [18:58:46] LeslieCarr_afk: thanks for handling networking=] [19:02:56] !log stopping puppet on brewster [19:03:07] Logged the message, notpeter [19:06:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:09:02] fyi, cleaned up the https://noc.wikimedia.org/ index a bit, lemme know if anything else obvious is missing there [19:12:11] <^demon> Eloquence: integration.mw.o is available over https, maybe use that link. [19:12:51] <^demon> Same w/ the others, actually. [19:13:11] * AaronSchulz runs into https://lists.ubuntu.com/archives/kernel-bugs/2010-August/136118.html :) [19:14:44] AaronSchulz: Have you tried installing Linux? [19:14:47] done [19:14:54] Reedy: ? [19:17:59] The requested URL /CephWiki/core/Main_Page/mw-config/index.php was not found on this server. [19:18:06] Reedy: did someone break the installer? [19:18:20] "Please set up the wiki first. " link is broken [19:18:40] * AaronSchulz works around by changing the url [19:19:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.047 seconds [19:47:10] RECOVERY - Host db42 is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [19:49:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26956 [19:49:35] New patchset: Eloquence; "Replace SVN link with Gerrit link and add a few more helpful links." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27574 [19:53:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:04] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [20:06:13] PROBLEM - Host db42 is DOWN: PING CRITICAL - Packet loss = 100% [20:07:57] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27517 [20:09:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [20:12:06] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [20:12:32] !log rebooting analytics1007 [20:12:43] Logged the message, Master [20:13:18] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [20:16:18] New patchset: Hashar; "(bug 40686) zuul role for production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [20:16:20] Reedy, on OTRSwiki I believe we usually have ENotif enabled to receive emails for changes to watchlisted pages - but it doesn't seem to be working anymore - last known email was October 8, do you happen to know if it was disabled? [20:17:10] Not that I know of... [20:17:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [20:17:44] binasher: i was stoked to see a few log lines in nc on vanadium this morning but the link seems to have died -- curling the pixel service doesn't seem to generate any logs, and i am hitting the eqiad its afaik [20:17:53] *eqiad bits [20:18:24] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.096 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [20:19:27] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.072 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [20:31:13] hmm, is there any other way to figure out why it's not working then? [20:32:03] PROBLEM - Puppet freshness on db1038 is CRITICAL: Puppet has not run in the last 10 hours [20:32:03] PROBLEM - Puppet freshness on mw38 is CRITICAL: Puppet has not run in the last 10 hours [20:32:03] PROBLEM - Puppet freshness on db1043 is CRITICAL: Puppet has not run in the last 10 hours [20:32:03] PROBLEM - Puppet freshness on mw44 is CRITICAL: Puppet has not run in the last 10 hours [20:32:03] PROBLEM - Puppet freshness on db55 is CRITICAL: Puppet has not run in the last 10 hours [20:32:04] PROBLEM - Puppet freshness on mw74 is CRITICAL: Puppet has not run in the last 10 hours [20:32:04] PROBLEM - Puppet freshness on mw70 is CRITICAL: Puppet has not run in the last 10 hours [20:32:05] PROBLEM - Puppet freshness on srv263 is CRITICAL: Puppet has not run in the last 10 hours [20:32:05] PROBLEM - Puppet freshness on srv249 is CRITICAL: Puppet has not run in the last 10 hours [20:32:06] PROBLEM - Puppet freshness on srv291 is CRITICAL: Puppet has not run in the last 10 hours [20:32:06] PROBLEM - Puppet freshness on srv190 is CRITICAL: Puppet has not run in the last 10 hours [20:33:18] There's nothing that changed with regards to otrswiki on thatday [20:37:02] otrswiki should've been swapped to 1.21wmf1 on the 3rd.. 
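On the db42 recovery plan a bit further up (boot a rescue CD, bring networking and an ssh server up, then mount the filesystem and pull the data off), the sequence from inside the rescue environment is roughly the following. Addresses, device names and paths here are placeholders, not db42's real layout.

    # Bring up networking and sshd so others can log in.
    ip addr add 10.0.0.42/24 dev eth0          # placeholder address
    ip route add default via 10.0.0.1          # placeholder gateway
    service ssh start                          # or /etc/init.d/ssh start, depending on the rescue image

    # Mount the old filesystem read-only and copy the data somewhere safe.
    mkdir -p /mnt/db42
    mount -o ro /dev/sda1 /mnt/db42            # placeholder device; may be an LVM volume instead
    rsync -aH --progress /mnt/db42/srv/sqldata/ remotehost:/srv/db42-recovery/   # placeholder paths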
[20:42:33] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:42:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:48:15] !log flipping payments.wikimedia.org to eqiad cluster to live-test [20:48:26] Logged the message, Master [20:54:06] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [20:55:54] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [20:57:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [21:07:27] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [21:09:03] ^ lies ...it's not back up [21:18:14] New patchset: Ryan Lane; "Update project search" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27630 [21:18:55] Change merged: Ryan Lane; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27630 [21:29:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:36:42] !log bsitu Started syncing Wikimedia installation... : Update ArticleFeedbackv5, MoodBar, PageTriage to master [21:36:54] Logged the message, Master [21:42:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.105 seconds [21:53:37] New patchset: Pyoungmeister; "adding some more macs for eqiad mc hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27633 [21:54:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27633 [21:58:09] !log bsitu Finished syncing Wikimedia installation... : Update ArticleFeedbackv5, MoodBar, PageTriage to master [21:58:20] Logged the message, Master [22:08:13] New patchset: Ryan Lane; "If nick is taken ghost old nick and change nick" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27634 [22:08:28] Change merged: Ryan Lane; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27634 [22:16:44] !log olivneh synchronized php-1.21wmf1/extensions/E3Experiments [22:16:56] Logged the message, Master [22:17:05] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27633 [22:17:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:19] New patchset: Ryan Lane; "Up version to 1.3" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27638 [22:17:53] Change merged: Ryan Lane; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27638 [22:20:03] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [22:23:49] how did we choose swift over ceph in the first place, anyway? [22:24:06] was there something good about swift? [22:25:16] it was before i started - you may want to search for emails from back then, especially from ben ? [22:25:28] he was very good about doing performance analysis and stuff [22:27:12] TimStarling: ceph wasn't even close to stable when we chose swift [22:27:31] LeslieCarr: ben had nothing to do with the selection process [22:27:43] ah okay [22:27:45] LeslieCarr: it was…. what's his name…. 
[22:27:48] russ [22:27:50] nelson [22:27:57] there was an evaluation process [22:28:00] it's documented on wikitech [22:28:17] TimStarling: http://wikitech.wikimedia.org/view/Media_server/Distributed_File_Storage_choices [22:28:33] most of its header files say copyright 2004-2006, but I gather it was mostly a research project back then [22:28:39] yes [22:28:45] github shows a lot of commits in the last couple of years [22:29:02] dreamhost started providing money support for it within the last few years [22:29:35] Mark Shuttleworth has invested in it also [22:29:40] http://www.inktank.com/news-events/new/shuttleworth-invests-1-million-in-ceph-storage-startup-inktank/ [22:29:48] * Ryan_Lane nods [22:29:53] it's more of a choice now [22:30:21] * AaronSchulz struggles setting up a rados gateway [22:30:44] is it harder than swift to install? [22:30:48] yes [22:31:03] because I looked at the swift "all-in-one" documents and that looked pretty difficult [22:31:06] and its error messages are cryptic [22:31:07] the all-in-one might be easier than swift [22:31:15] how to set up a test server in 15 easy steps [22:31:19] and its documentation is total shit [22:31:27] ceph is much more complicated than swift from what I've seen [22:31:29] though messing with the gateway is annoying [22:31:33] it also does much more than swift [22:31:37] yep [22:31:44] well, like I said on the ML, ceph is enormous in terms of code size [22:31:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [22:31:48] NoSuchKey [22:31:49] ceph is way more interesting from a filesystem perspective [22:31:51] * AaronSchulz sighs [22:32:05] TimStarling: are you including cephFS? [22:32:14] I wonder how large that code is [22:33:26] I'm including everything in https://github.com/ceph/ceph.git [22:33:38] I see a few files with CephFS in the name [22:33:49] ./src/client/hadoop/ceph/CephFS.java [22:34:24] that's a client, I think the server is probably here also but called something different internally [22:35:12] they've kinda messed up by overloading the term ceph [22:35:34] they use it for both the name of the whole project and the posix filesystem layered over rados [22:35:48] right, so CephFS is the kernel module, according to the website [22:36:22] /a/ kernel module, yes [22:36:29] they also have rbd which is also a kernel module [22:36:39] (all of the above afaik, which is not much) [22:38:33] it's in here somewhere, the mount.ceph utility is here [22:38:43] as is some fuse stuff [22:55:02] New patchset: Pyoungmeister; "adding macs for pc2 and pc3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27643 [22:56:00] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27643 [22:56:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27643 [23:05:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:08:28] !log olivneh synchronized php-1.21wmf1/extensions/E3Experiments [23:08:40] Logged the message, Master [23:11:37] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27574 [23:13:13] !log dzahn synchronized docroot/noc/index.html [23:13:24] Logged the message, Master [23:15:13] !log dzahn synchronized docroot/noc/index.html [23:15:25] Logged the message, Master [23:19:11] !log manually copy index.html for noc from /common/docroot/noc to /h/w/htdocs/noc/ [23:19:23] Logged the message, Master [23:19:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.044 seconds [23:19:56] <--it is in git repo and ./common/docroot but Apache still uses /h/w/htdocs [23:24:15] New patchset: Dzahn; "fix document root for noc.wikimedia.org to ./common/.. dir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27646 [23:25:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27646 [23:25:36] New patchset: Dzahn; "fix document root for noc.wikimedia.org to ./common/.. dir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27646 [23:26:29] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27646 [23:36:26] I'm looking at the slope of new objects in swift [23:36:40] it's quite impressive [23:37:09] it includes thumbs, which complicates the data a bit though [23:40:59] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [23:40:59] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [23:46:12] cmjohnson1: hey, are you still in pmtpa? [23:52:29] mutante: what about I349b83b860a539e51d7f477f4fa3eb21e9eb24e8 ? [23:52:31] !g I349b83b860a539e51d7f477f4fa3eb21e9eb24e8 [23:52:31] https://gerrit.wikimedia.org/r/#q,I349b83b860a539e51d7f477f4fa3eb21e9eb24e8,n,z [23:52:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:51] Aha [23:57:00] New review: Reedy; "I'd already done this in https://gerrit.wikimedia.org/r/#/c/23425/ :p" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27646 [23:59:03] Switched to branch 'production' [23:59:03] Your branch is behind 'origin/production' by 376 commits, and can be fast-forwarded.
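For the noc.wikimedia.org document-root mismatch noted above (the files live in the git checkout under ./common/docroot while Apache still serves /h/w/htdocs), the quickest way to see which DocumentRoot Apache is really using, and to verify the puppet fix once it has applied, is something like this; the vhost layout is assumed, not checked.

    apache2ctl -S                                              # which vhost answers for noc, and from which config file
    grep -rn 'DocumentRoot' /etc/apache2/sites-enabled/ | grep -i noc

    # After the merged puppet change has run:
    apache2ctl configtest && sudo service apache2 reload
    curl -sI http://noc.wikimedia.org/ | head -n1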
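Going back to the rados gateway struggles mentioned earlier (cryptic errors, NoSuchKey): before debugging the gateway itself it is worth confirming that the underlying cluster is healthy and that a gateway user exists at all. A minimal sketch, assuming radosgw is already declared in ceph.conf and fronted by a web server; the uid and endpoint below are placeholders.

    ceph -s                 # overall cluster health
    rados lspools           # the gateway's pools should show up here once it has started

    # Create an S3-style gateway user; note the generated access and secret keys.
    radosgw-admin user create --uid=testuser --display-name="Test User"

    # An unauthenticated GET against the gateway should return an S3-style XML
    # error document rather than a plain web-server 404.
    curl -s http://radosgw.example.org/    # placeholder endpoint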