[00:03:37] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [00:06:26] fyi: analytics gave me a list of stale pages that are referring to an old version of CentralNotice; I'm purging them all currently (which will last a while) [00:06:43] which... is why... if anyone is curious... the cluster is running a bit warm right now [00:07:04] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27485 [00:07:04] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26423 [00:11:44] !log reedy synchronized php-1.21wmf1/includes/api/ApiUpload.php [00:11:55] Logged the message, Master [00:13:48] fyi folks office network will be going off and on for a bit [00:15:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:30:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.194 seconds [00:33:22] going to drain AMS again [00:33:30] !log draining esams via authdns-scenarios [00:33:42] Logged the message, Mistress of the network gear. [00:51:37] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [01:04:04] PROBLEM - Host amssq32 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:13] PROBLEM - Host amssq36 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:13] PROBLEM - Host amssq38 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:13] PROBLEM - Host amssq39 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:13] PROBLEM - Host amssq35 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:13] PROBLEM - Host amssq40 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:14] PROBLEM - Host amssq46 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:14] PROBLEM - Host amssq37 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:15] PROBLEM - Host amssq52 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:15] PROBLEM - Host amssq41 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:16] PROBLEM - Host amssq43 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:16] PROBLEM - Host amssq45 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:17] PROBLEM - Host amssq61 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:17] PROBLEM - Host amssq54 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:18] PROBLEM - Host amssq53 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:18] PROBLEM - Host amssq58 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:19] PROBLEM - Host amssq48 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:19] PROBLEM - Host amssq60 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:20] PROBLEM - Host amssq55 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:20] PROBLEM - Host amssq44 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:21] PROBLEM - Host amssq59 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:21] PROBLEM - Host amssq57 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:22] PROBLEM - Host amssq56 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:22] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:23] PROBLEM - Host amssq62 is DOWN: PING CRITICAL - Packet loss = 100% [01:04:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:05:52] PROBLEM - Host amssq33 is DOWN: PING CRITICAL - Packet loss = 100% [01:05:52] PROBLEM - Host amssq47 is DOWN: PING CRITICAL - Packet loss = 100% [01:05:52] PROBLEM - Host amssq42 is DOWN: PING CRITICAL - Packet loss = 100% [01:05:52] PROBLEM - 
Host amssq49 is DOWN: PING CRITICAL - Packet loss = 100% [01:05:52] PROBLEM - Host amssq50 is DOWN: PING CRITICAL - Packet loss = 100% [01:05:53] PROBLEM - Host amssq31 is DOWN: PING CRITICAL - Packet loss = 100% [01:05:53] PROBLEM - Host amssq34 is DOWN: PING CRITICAL - Packet loss = 100% [01:06:37] PROBLEM - LVS HTTP IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [01:06:46] PROBLEM - Host ms6 is DOWN: PING CRITICAL - Packet loss = 100% [01:06:55] PROBLEM - LVS HTTP IPv4 on upload.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:06:56] PROBLEM - Host csw1-esams is DOWN: PING CRITICAL - Packet loss = 54%, RTA = 6785.20 ms [01:07:04] PROBLEM - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:07:13] PROBLEM - Host csw2-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.244) [01:07:22] PROBLEM - LVS HTTP IPv6 on mediawiki-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [01:07:26] all those are expected, please ignore [01:07:49] PROBLEM - LVS HTTP IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [01:07:49] PROBLEM - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:07:53] YOU HAVE ANGERED THE NAGIOS! [01:07:59] :) [01:08:07] RECOVERY - LVS HTTP IPv6 on wikibooks-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 59307 bytes in 0.574 seconds [01:08:25] RECOVERY - LVS HTTP IPv4 on upload.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 652 bytes in 0.218 seconds [01:08:57] oh man.. i scrolled up before reading LeslieCarr's last msg [01:09:19] RECOVERY - LVS HTTPS IPv4 on wiktionary-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 64821 bytes in 1.180 seconds [01:09:46] PROBLEM - LVS HTTP IPv4 on mediawiki-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:10:04] RECOVERY - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 634 bytes in 0.228 seconds [01:10:40] PROBLEM - Host knsq16 is DOWN: PING CRITICAL - Packet loss = 100% [01:10:40] PROBLEM - Host amslvs1 is DOWN: PING CRITICAL - Packet loss = 100% [01:10:40] PROBLEM - Host knsq19 is DOWN: PING CRITICAL - Packet loss = 100% [01:10:58] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:10:58] PROBLEM - Host knsq24 is DOWN: PING CRITICAL - Packet loss = 100% [01:10:58] PROBLEM - Host knsq27 is DOWN: PING CRITICAL - Packet loss = 100% [01:11:07] RECOVERY - LVS HTTP IPv6 on wikiquote-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK HTTP/1.1 200 OK - 59307 bytes in 0.569 seconds [01:11:25] RECOVERY - Host knsq27 is UP: PING WARNING - Packet loss = 58%, RTA = 108.06 ms [01:11:25] RECOVERY - Host amslvs1 is UP: PING OK - Packet loss = 0%, RTA = 107.83 ms [01:11:25] PROBLEM - Host amslvs3 is DOWN: PING CRITICAL - Packet loss = 100% [01:11:25] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:11:25] PROBLEM - Host wiktionary-lb.esams.wikimedia.org_ipv6_https is DOWN: PING CRITICAL - Packet loss = 100% [01:11:34] PROBLEM - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:11:34] RECOVERY - Host knsq16 is UP: PING OK - Packet loss = 0%, RTA = 108.04 ms [01:11:34] PROBLEM - Host 91.198.174.6 is DOWN: 
PING CRITICAL - Packet loss = 100% [01:11:34] PROBLEM - Host amslvs4 is DOWN: PING CRITICAL - Packet loss = 100% [01:11:43] RECOVERY - Host knsq19 is UP: PING OK - Packet loss = 0%, RTA = 108.17 ms [01:11:43] RECOVERY - Host knsq24 is UP: PING OK - Packet loss = 0%, RTA = 107.78 ms [01:11:43] PROBLEM - Host ns2.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [01:11:43] PROBLEM - Host nescio is DOWN: PING CRITICAL - Packet loss = 100% [01:11:52] PROBLEM - Host hooft is DOWN: PING CRITICAL - Packet loss = 100% [01:11:52] PROBLEM - Host mediawiki-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:12:01] PROBLEM - LVS HTTP IPv6 on wikiversity-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway [01:12:01] PROBLEM - Host wikinews-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:12:01] PROBLEM - Host bits-lb.esams.wikimedia.org_ipv6_https is DOWN: PING CRITICAL - Packet loss = 100% [01:12:02] PROBLEM - Host wikibooks-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:12:02] PROBLEM - Host bits-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:12:29] PROBLEM - Host wikimedia-lb.esams.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:12:29] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39745 bytes in 0.766 seconds [01:12:31] is this still all expected? [01:12:37] RECOVERY - Host csw1-esams is UP: PING OK - Packet loss = 0%, RTA = 107.94 ms [01:12:46] RECOVERY - Host 91.198.174.6 is UP: PING OK - Packet loss = 0%, RTA = 108.05 ms [01:12:55] RECOVERY - LVS HTTP IPv4 on mediawiki-lb.esams.wikimedia.org is OK: HTTP OK - HTTP/1.0 301 Moved Permanently - 0.216 second response time [01:12:55] RECOVERY - Host ns2.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 108.27 ms [01:12:55] RECOVERY - Host csw2-esams is UP: PING OK - Packet loss = 0%, RTA = 109.25 ms [01:13:04] RECOVERY - LVS HTTPS IPv4 on wikimedia-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 79039 bytes in 0.768 seconds [01:13:23] !log disabled nagios notifications [01:13:37] Logged the message, Master [01:13:45] thanks binasher [01:17:56] New patchset: Asher; "- bits event.gif logging to vanadium - esams traffic needs a bouncer on oxygen, tbd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27496 [01:24:18] New patchset: Asher; "- bits event.gif logging to vanadium - esams traffic needs a bouncer on oxygen, tbd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27496 [01:25:17] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27496 [01:34:09] New patchset: Asher; "fix syntax error, move log_format override to after default file is parsed" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27497 [01:35:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27497 [01:43:47] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27497 [01:48:37] !log reenabling nagios notifications [01:48:48] Logged the message, Mistress of the network gear. 
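For reference, one common way to do the notification toggling logged above ("!log disabled nagios notifications" / "reenabling nagios notifications") from a shell, assuming access to the Nagios host; the command-file path is an assumption and varies per install, and the toggle may equally well have been done through the web UI:

```bash
# Minimal sketch of globally silencing Nagios around planned maintenance.
CMDFILE=/var/lib/nagios/rw/nagios.cmd   # assumed path; check command_file in nagios.cfg

# before the maintenance window
printf '[%s] DISABLE_NOTIFICATIONS\n' "$(date +%s)" > "$CMDFILE"

# ... disruptive work (e.g. draining esams) happens here ...

# afterwards, turn notifications back on
printf '[%s] ENABLE_NOTIFICATIONS\n' "$(date +%s)" > "$CMDFILE"
```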
[01:52:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:55:26] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:55:31] !log moved traffic back to normal with authdns-scenario [01:55:42] Logged the message, Mistress of the network gear. [01:55:44] PROBLEM - Apache HTTP on srv221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:55:53] PROBLEM - Apache HTTP on srv224 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:56:13] New patchset: Asher; "bits servers don't run nrpe" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27499 [01:56:20] PROBLEM - Apache HTTP on srv220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:56:47] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:57:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27499 [01:57:33] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27499 [01:57:38] LeslieCarr: should i re-enable nagios notifs? [01:57:41] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:57:51] oh i did [01:58:07] so my phone tells me! hmm [01:58:10] oh noes [01:58:56] what's been up with rendering this week [01:59:56] PROBLEM - Apache HTTP on srv222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:01:20] indeed image rendering in production is pretty broken right now, is this being investigated? [02:01:29] https://commons.wikimedia.org/wiki/Special:NewFiles [02:01:34] thumbs not loading [02:02:20] hrm [02:03:12] ping apergos binasher AaronSchulz TimStarling woosters [02:03:51] i'm watching an imagescaler apache, its spent the last 20+ seconds waiting on this socket: srv220.pmtpa.wmnet:34909->ms-fe.svc.pmtpa.wmnet:www (ESTABLISHED) [02:04:07] so, looks swift fe related [02:04:42] binasher: potentially realted to https://bugzilla.wikimedia.org/show_bug.cgi?id=40514 ? [02:04:52] oh, no [02:04:53] derp [02:06:31] are the netapps ready to take over if we have to disable swift for now? [02:06:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.319 seconds [02:07:22] load on swift boxes is crazy [02:07:30] what time is it in greece? [02:07:34] on backends, specifically [02:07:38] like 3 am i think ? [02:07:39] 5am [02:08:12] load of almost 30.... [02:10:58] alright, I'll give faidon a call [02:11:50] wait io rising over the past...3 hours or so [02:12:18] getting timeouts to ms-be1 - Oct 11 02:08:33 10.0.6.215 proxy-server ERROR with Object server 10.0.6.200:6000/sdk1 re: Trying to DELETE /AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/wikipedia-commons-local-thumb.df/d/df/1stAlameinBritDefense.jpg/250px-1stAlameinBritDefense.jpg: Timeout (10s) [02:12:40] paravoid is coming on irc shortly [02:13:01] he is so nice [02:13:15] here [02:13:29] ms-be1 has a bunch of disks at 100% util [02:14:53] Oct 11 02:00:17 ms-be1 kernel: [4862557.684153] swift-object-re: page allocation failure. order:5, mode:0x44d0 [02:15:31] yeah, seeing those too [02:16:26] robla, fun with the swift cluster. investigation ongoing. [02:16:36] phooey [02:16:53] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [02:17:23] paravoid: re: the page aloc failures, i increased /proc/sys/vm/min_free_kbytes from 28k to 64k.. 
not that it has anything to do with other issues [02:18:04] queries per second is triple of what it used to be [02:18:19] LeslieCarr: did the site failover stuff you were doing a bit earlier impact uploads? [02:18:20] on the backends [02:18:38] where qps used to be all of a couple hundred? [02:18:46] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=^ms-be[1-9]&mreg[]=swift_[A-Z]%2B_hits%24&z=large&gtype=stack&title=Swift+queries+per+second&aggregate=1&r=hour [02:19:14] started about ~2h ago [02:19:32] but doesn't seem to be the same on the frontends [02:19:50] so this could be the proxies not getting a reply in time from one backend and rerequesting it from a second one [02:19:56] then a third one [02:20:16] paravoid: it also seems like deletes are up significantly. any reason why that might be? [02:20:57] hm, I take that back, the frontends have also tripled [02:21:03] didn't see the graphs right [02:21:08] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=^ms-fe[1-9].pmtpa&mreg[]=swift_[A-Z]%2B_hits%24&z=large&gtype=stack&title=Swift+queries+per+second&aggregate=1&r=hour [02:21:18] LeslieCarr: ping [02:22:20] we keep a full swift access log, I'll try digging into that [02:23:06] the frontends also started dipping into swap recently, it would seem [02:23:41] they need their usual restart, I see they still have some memory left [02:27:51] !log LocalisationUpdate completed (1.21wmf1) at Thu Oct 11 02:27:51 UTC 2012 [02:28:06] Logged the message, Master [02:28:20] deletes seem to be up a lot more than 3x - http://ganglia.wikimedia.org/latest/graph.php?r=day&z=large&c=Swift+pmtpa&h=ms-fe2.pmtpa.wmnet&v=28.5945319297&m=swift_DELETE_204_hits&jr=&js=&vl=hps [02:28:21] binasher: so, I saw an spike earlier during the day that couldn't figure out why [02:28:43] and it correlated -and correlates again- with a spike on MySQL/memcache [02:28:46] https://gdash.wikimedia.org/dashboards/datastores/ [02:29:24] it was a larger spike back then [02:29:56] could this be just a cascading issue? [02:30:26] oh hm, that's an unusual number of DELETEs [02:30:29] * paravoid greps the logs [02:32:40] swift deletes probably have corresponding mw db writes, which would explain what graphite shows.. but the performance implications there should be negligible [02:33:12] yeah I didn't mention it for performance implications but rather to try to understand what's going on [02:35:01] mwalker: what exactly are you doing? [02:35:43] for each page that referred to Special:BannerController; I'm running a action=purge against it [02:36:17] what's the list you're working from? [02:36:44] oh? [02:36:46] it's a list analytics gave me; not sure where they're storing it, but I have a copy on aluminium if you'd like to take a look [02:36:49] action=purge deletes thumbs [02:36:57] doesn't it? [02:37:04] I think it does. [02:37:13] binasher: hey [02:37:14] sorry, i got food [02:37:22] paravoid: hurm; I was hoping it was just the page content [02:37:38] let me kill it and not destroy the image thumbnailer...
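A rough sketch of what "digging into the swift access log" might look like for this DELETE spike; the log path and field numbers are assumptions, modelled on the syslog-style swift lines quoted earlier:

```bash
LOG=/var/log/swift/proxy-access.log     # assumed path, wherever the proxies log

# DELETEs per minute -- when did the spike start?
grep '"DELETE ' "$LOG" | awk '{print $1, $2, substr($3,1,5)}' | sort | uniq -c | tail -30

# which client/proxy IPs are issuing them (field number guessed from the quoted lines)
grep '"DELETE ' "$LOG" | awk '{print $6}' | sort | uniq -c | sort -rn | head
```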
[02:37:46] I hope I wasn't completely destroying it [02:37:51] binasher: all upload vips were redirected to pmtpa [02:38:03] LeslieCarr: i wanted to ask if upload.wikimedia.org was moved around during your maintenance earlier, but i don't think it's related at all [02:38:10] okay [02:38:18] I'm not sure, though...I know it purges thumbs when you do it on an image page, but not sure if just normal content pages would trigger a thumbnail regen [02:38:48] robla: I know that some image pages are in my list; not sure how many of them though [02:38:52] which IP are you doing that from? [02:38:52] so I can filter them out [02:38:59] mwalker's list could contain image pages [02:39:16] 208.80.154.6 [02:39:39] oh yeah, mwalker is causing it [02:39:52] Oct 11 02:24:58 10.0.6.204 object-server 10.0.6.215 - - [11/Oct/2012:02:24:58 +0000] "DELETE /sdd1/36784/AUTH_43651b15-ed7a-40b6-b745-47666abf8dfe/wikipedia-commons-local-thumb.ca/c/ca/Languages_of_Africa_map.svg/500px-Languages_of_Africa_map.svg.png" 204 - "-" "tx4739b658b83d49deb885b4868561d5fc" "PHP-CloudFiles/1.7.10" 0.6187 [02:40:05] which mapped to [02:40:07] UPDATE /* LocalFile::upgradeRow 208.80.154.6 */ `image` SET img_size = '768360',img_width = '1534',img_height = '1461',img_bits = '0',img_media_type = 'DRAWING',img_major_mime = 'image',img_minor_mime = 'svg+xml',img_metadata = 'a:6:{s:5:\"width\";i:1534;s:6:\"height\";i:1461;s:13:\"originalWidth\";s:4:\"1534\";s:14:\"originalHeight\";s:9:\"1461.4756\";s:8:\"metadata\";s:330:\" xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\" xmlns:cc=\"http://creativecommons.org/ns#\" xmlns:dc=\"http://purl.org/dc/elements/1.1/\">\n \n image/svg+xml\n \n \n \";s:7:\"version\";i:2;}',img_sha1 = 'aml2rgrl75caj9gfigmcpsqzqxsvje9' WHERE img_name = [02:40:08] 'Languages_of_Africa_map.svg' [02:41:48] all thumbs for around 10k images in the last 40 minutes just commons [02:41:57] so, script is killed [02:42:11] mwalker: congrats on your first site outage! [02:42:22] load on ms-be boxes dropping like a stone [02:42:25] now we know another way to kill swift [02:42:26] mwalker: woohoo! you're a member of the team [02:42:27] :) [02:42:31] qps dropping fast [02:42:35] try deleting 10k files over an entire hour.. [02:42:37] heh [02:42:38] * mwalker bows [02:42:41] RECOVERY - Apache HTTP on srv220 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.070 second response time [02:42:43] binasher: by using it =P [02:43:11] robla: good catch, thanks [02:43:13] robla: thanks for pointing to mwalker [02:43:17] hehe [02:43:17] RECOVERY - Apache HTTP on srv221 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.758 second response time [02:43:26] RECOVERY - Apache HTTP on srv222 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.869 second response time [02:43:45] no prob! [02:43:55] robla: yes, ty very much! [02:43:55] problem is, the swift logs in this case are useless since the hits come from mediawiki [02:44:11] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 58880 bytes in 0.919 seconds [02:44:14] once i found an ip to go with the write queries in the common master's binlogs, i probably would have just nuked aluminium [02:44:28] it was an artifact of me being a latecomer and reading the logs rather than being in realtime :) [02:44:39] good catch indeed. thanks paravoid for coming online at an unreasonable hour to help with this one. [02:45:11] most sincere apologies all; I was watching the application servers loads; not swifts [02:45:13] aye, thanks paravoid! 
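A hypothetical sketch of the gentler rerun discussed here: skip file description pages (whose purge deletes their thumbnails via WikiFilePage::doPurge()) and throttle hard. The title list, wiki URL and interval are assumptions, not mwalker's actual script:

```bash
# Purge pages from a list, one every couple of seconds, skipping file pages.
while IFS= read -r title; do
    case "$title" in
        File:*|Image:*) continue ;;   # purging these deletes their thumbs in swift
    esac
    curl -s -X POST 'https://en.wikipedia.org/w/api.php' \
         --data-urlencode 'action=purge' \
         --data-urlencode 'format=json' \
         --data-urlencode "titles=$title" > /dev/null
    sleep 2                           # well under one request per second
done < titles.txt
```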
[02:45:14] RECOVERY - Apache HTTP on srv224 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.980 second response time [02:45:24] mwalker: did you also run your script ~10 hours ago? [02:45:55] or some other time during the past hours? [02:46:25] paravoid: ya, you should see me spread out since about 2000 UTC yesterday [02:46:30] there was a similar spike earlier today which I noticed and scratched my head for a bit, but it was very brief [02:46:35] action=purge for image description pages causes thumbnails to be deleted, see WikiFilePage::doPurge() [02:46:37] paravoid: ya, that was me [02:46:45] funny thing is that had mwalker made his announcement 10 min earlier, I wouldn't have caught it: http://bots.wmflabs.org/~petrb/logs/%23wikimedia-operations/20121011.txt [02:47:12] robla: I was actually talking to CT at the time... [02:47:27] i missed that announcement [02:48:12] so... how many people do I owe cookies/beer to? [02:48:19] mwalker: announcing is great, but do it with !log when doing something that could effect the cluster [02:48:29] will do [02:49:07] interesting that this killed the cluster [02:49:16] swiftly [02:49:26] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.302 second response time [02:49:26] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time [02:50:29] i think the sqlite container actions make deletes suck for swift [02:50:52] these would be on the SSDs though [02:50:55] but that could be one of several things [02:51:26] it still results in wild card selects and then sqlite deletes [02:51:56] each DELETE also corresponds to three backend requests [02:52:04] to actually delete all copies [02:52:05] do the sqlite deletes also get applied via writes to the async log file? [02:52:18] no idea [02:53:11] http://ganglia.wikimedia.org/latest/graph_all_periods.php?m=swift_object_change&z=small&h=Swift+pmtpa+prod&c=Swift+pmtpa&r=hour [02:53:46] I should have looked at that graph before [03:00:35] okay, I think I can go back sleeping now [03:00:53] I'll have a look tomorrow [03:01:00] bye [03:03:53] night faidon [03:04:14] soooo... would it be ok to go back to poking the caches if I filter out all media and lower my limit for number of queries per second? or should I let it sit till tomorrow? [03:06:55] mwalker: better to wait til tomorrow, just in case [03:07:01] kk [03:07:37] aah … looks like I missed the action [03:07:55] New patchset: Asher; "splitting nrpe monitoring out of varnish::logging_config" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27502 [03:07:56] heh [03:08:43] funny thing was I spoke to mwalker earlier [03:08:53] and i asked him to throttle [03:08:57] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27502 [03:08:59] I was throttling! [03:09:07] just not apprarently the right thing [03:09:09] to the max [03:09:12] heehee [03:09:21] hey now; that counts doesn't it? :p [03:09:37] good thing you told me to announce though [03:09:41] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27502 [03:09:45] * mwalker has much more knowledge [03:09:56] * mwalker ... now [03:10:12] u can now say u brought on a partial outage [03:10:51] dubious honour that one [03:11:27] it really is one per ops tradition :) [03:11:56] anyone heard db42 being mentioned today? [03:13:12] is that db the answer to all of our problems? 
[03:13:21] cmhohnson1 working on it [03:13:31] some h/w problem …cannot come back up [03:13:38] mutante: i didn't get a chance to ask chris about it [03:14:08] i'd like to know if he pulled the new raid controller, or what steps were taken to try getting it back up [03:14:52] binasher: i know he already removed the new raid controller yesterday [03:15:03] but it did not come back up with or without it [03:15:33] i guess it must be mainboard-ish [03:15:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:16:12] oh he tried replacing the raid controller with the old one [03:16:13] as mutante said [03:16:23] pgehres_: woosters : analytics/research just asks for it.. [03:16:33] okay, i is going to leave the irc channel and stop breaking the site [03:16:38] but they still have db1047 and Dario said a couple days are ok [03:16:45] au revoir [03:16:47] * pgehres_ was trying to make a joke about 42 ... [03:17:07] ah, of course:) hehe, getting it now [03:17:09] it apparently was as well received as image purges in swift [03:17:10] it is [03:17:46] this sure is the day for putting the "fun" in fundraising [03:17:47] RECOVERY - Puppet freshness on sq69 is OK: puppet ran at Thu Oct 11 03:17:28 UTC 2012 [03:18:07] first we lose our master db and then we bring down the cluster :-) [03:18:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.265 seconds [03:25:03] New patchset: Asher; "when a varnish instance name isn't specified, the hostname is used for shm access" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27503 [03:26:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27503 [03:27:14] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27503 [03:37:40] pgehres_: better now than later, right? /me quits [03:37:56] New patchset: Asher; "variables asigned as undef in a manifest exist in an erb, oops" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27504 [03:38:55] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27504 [03:39:40] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27504 [03:39:47] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [03:39:47] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [03:52:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:05:26] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.167 seconds [04:12:53] !log marmontel (blog): fixed some file permissions in ./blog/wp-content for www-data the wordpress caching plugin needs to be re-activated [04:12:56] !log activated W3 Total Cache performance plugin in Wordpress (blog.wikimedia.org) [04:13:06] Logged the message, Master [04:13:17] Logged the message, Master [04:40:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [04:55:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.017 seconds [05:28:19] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:42:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.583 seconds [06:16:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:32:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.169 seconds [07:03:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:21:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.065 seconds [07:33:34] apergos: hey Ariel :-] Do you have anyway to fix up the labs primary NS server ? [07:33:51] labs-ns1.wikimedia.org seems to be missing entries :( [07:34:13] I have no idea about labs at all, I am sorry [07:36:05] apergos: I understand :-D [07:36:50] I will use my host file meanwhile [07:38:39] ok [07:38:42] good luck [07:53:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:01:21] Hey tech team, did someone re-enable editing for wikimania2010wiki recently? 
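A quick way to confirm the sort of gap hashar describes for labs-ns1 is to ask both labs nameservers for the same record and compare; the first nameserver's name and the instance name below are assumptions used only for illustration:

```bash
# Compare answers from the two labs nameservers for one instance record.
for ns in labs-ns0.wikimedia.org labs-ns1.wikimedia.org; do
    echo "== $ns"
    dig +short @"$ns" some-instance.pmtpa.wmflabs
done
```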
[08:03:09] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [08:03:10] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [08:03:10] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [08:06:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.298 seconds [08:07:35] oh, nevermind, there was a bug for that zzz [08:34:39] PROBLEM - Apache HTTP on srv223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:36:09] PROBLEM - Apache HTTP on srv219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:36:36] PROBLEM - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:37:11] swift pmtpa spike [08:37:33] lots more network traffic [08:38:06] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [08:39:18] RECOVERY - Apache HTTP on srv223 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.035 second response time [08:39:27] RECOVERY - Apache HTTP on srv219 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [08:39:36] RECOVERY - LVS HTTP IPv4 on rendering.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 59453 bytes in 0.269 seconds [08:41:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:50:35] meh memory [08:50:46] !log restarted ms-fe4 swift-proxy [08:50:59] Logged the message, Master [08:56:06] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [08:57:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.039 seconds [08:59:28] !log restarted swift proxy on ms-fe3 and ms-fe2 with some delay between. note I had to shoot stale processes from Sept on all these boxes as well [08:59:37] I was about to ask about that [08:59:40] Logged the message, Master [08:59:56] apache logs have numerous swift related warnings [09:00:21] they would. those should go away now [09:02:35] how does it look? [09:29:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:39:48] They're filtering out slowly [09:40:27] The logs dont get as many lines as they did before, so when you're using the last 1000 lines... takes a while for them ot disappear [09:40:38] Oh, no. All gone [09:41:04] yay [09:44:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [09:50:59] New patchset: DamianZaremba; "Adding in nscd file to override cache times." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27517 [09:51:01] so about that wgCacheEpoch... [09:51:25] ? [09:51:27] I think in about a month we're going to want to move that up again [09:51:29] here's why [09:52:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27517 [09:52:46] pages with math images, not edited since... 
dunno but sometime early September, revalidated by squid after sept 8 00:00 2012 but before oct 5 12:17 2012 [09:53:07] (after epoch date value but before the sync of the setting change) [09:53:17] have the bad math paths in 'em [09:54:06] if we get lucky they'll fall out of the parser cache but if they are viewed enough I guess they won't [09:54:17] and revalidation will give us the bad paths yet again [09:55:09] (I looked at a pile of pages to see what was going on: this was it.) [09:57:39] I'm a little uncertain about the parser cache end of things actually [09:57:44] Ahh [09:57:52] but I know what the squids are doing, exactly [09:58:51] so I don't know if revalidation in a month is going to give us a good value or a bad one yet [09:58:51] we could make it increase 1 day/day for the next month [09:59:32] on Oct 22 I can check an example I have which will be up for revalidation then [09:59:49] if I can find one that expires sooner I'll check it [10:02:16] * apergos goes back to watching dtraces, most of the requests have no referer and it's impossible to find the page that contains them [10:04:33] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [10:09:13] the stuff I saw revalidated after oct 5 was ok so maybe if we just wait the rest of them will be fixed up [10:17:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:24:10] meh, not unless they fall out of squid cache, which they may not, I have one here with a date of sept 10 >_< [10:28:29] 60 days, I see *sigh* [10:32:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.167 seconds [10:52:33] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [11:03:41] why does a revalidation give bad paths again? [11:04:09] the ones that happened before oct 5 and after sept 8 [11:04:22] got whatever the parser cache had in there [11:04:33] which was likely old since wgcacheepoch was from a year ago [11:04:40] yes [11:04:58] I guess that in 60 days from oct 5 whatever squid has will be good [11:05:06] yes [11:05:09] and probably before that [11:05:16] I don't think it revalidates at the very last moment [11:06:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:06:29] so until then we are going to have GETs to ms7 for these bad math paths [11:06:41] yes [11:09:04] the real lesson to be learned was that the cache epoch should have changed when the math paths were changed. oh well [11:09:18] yes [11:09:57] and now it's time for lunch [11:10:37] * apergos waits for the next "yes" :-D [11:11:08] no [11:11:40] too bad, eating anyways [11:14:29] hmm [11:14:42] so to copy thumbs onto the netapp [11:14:47] I wonder what the best way to go is [11:16:27] we don't care if we get them all, just some large amount right? [11:16:41] actually we do need them all for purging [11:16:50] uugghhhhh [11:17:13] they'll have to be copied form siwft then, there's no hope for it [11:17:41] yes [11:17:45] aaron had a way to do that iirc [11:18:20] his sync script [11:18:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.149 seconds [11:19:04] if that will fix up an out of date copy, you could rsync the thumbs that are on ms5 (many were removed) and then have him run his script [11:19:14] but I don't know what his thumb script does [11:19:37] many were removed? 
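One way to spot-check the squid-revalidation behaviour described above is to fetch an affected article anonymously and look at the cache headers; Date/Age indicate when the text squids last (re)validated their copy. The article title is a placeholder:

```bash
# Anonymous HEAD request so it hits the squid cache rather than bypassing it.
curl -sI 'https://en.wikipedia.org/wiki/Some_article_with_math' \
  | grep -iE '^(Date|Age|Last-Modified|Cache-Control|X-Cache)'
```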
[11:19:46] yes, [11:20:05] ah right [11:20:41] not randomly, it was large sizes and not in use on a project, iirc [11:20:48] but it will be way out of sync with swift most likely [11:21:05] still there's 4.6T of em over there [11:21:11] that's why ms6 is larger than ms5 [11:21:42] how up to date is ms6? [11:21:55] should be very up to date [11:22:02] it follows the purge stream [11:22:12] maybe the rsync should come from there [11:22:12] whether one can trust that, I dunno [11:53:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:09:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds [12:19:11] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [12:42:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:44:32] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100% [12:54:53] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.854 seconds [13:00:15] hashar: why do you need python-git et al for lucid? [13:00:19] why isn't precise enough? [13:00:25] (it's much more complicated) [13:00:55] the new Gerrit/Jenkins gateway (zuul) is written in python and need to run on the jenkins host [13:00:57] which is gallium [13:01:00] and still run Lucid [13:01:07] will be happy to upgrade the box to Precise though ;-D [13:01:31] but that looked harder to me compared to backporting the packages [13:01:37] hello btw :-] [13:02:15] hi :) [13:04:37] paravoid: I am afraid the python modules would need a lot of dependencies to be backported [13:05:06] a workaround would be to ship all the modules as a zip file and extract them directly [13:05:08] not ideal though [13:05:15] or we could upgrade Gallium to Precise ;-] [13:05:23] or setup a new box and migrated jenkins to it [13:05:46] it's puppetized isnt it [13:05:53] kind of [13:06:23] I installed Jenkins on labs after some fix (committed them) [13:06:55] so that should be "easy" to replicate gallium on another host [13:07:52] we'll need to do it at some point, we might just as well do it now [13:08:26] I am all for it [13:08:38] backporting from quantal to lucid is usually tricky, it's a lot of distance/years between them [13:10:10] upgrading to Precise might just leave us with back porting of php5-parsekit and making our jenkins package available to Precise [13:10:18] (both in RT 3579 with bug # attached) [13:10:59] then the 4 python modules (git, gitdb, async, smmap) would need back porting from Quantal to Precise [13:11:13] another possible way would be to have our own pip repository, but that is probably unwanted ;-] [13:12:03] that's what I'm doing now [13:12:27] tried to do it for lucid too, but it's tricky and I'd prefer to avoid spending time for lucid [13:12:34] yeah that's nonproductive [13:14:42] so how do we migrate to Precise ? Simply backup + do-precise-upgrade or do we setup a new box with a new installation? [13:14:53] new box I'd say [13:14:54] s/we/ops/ ;-]] [13:15:02] if it's easy/automated, reinstalls or migrations to different hw are best [13:15:12] most probably [13:15:49] the added value is that it will let me have the new gateway system installed in production without interfering with prod [13:15:57] err with the current setup [13:16:16] so whenever we do a switch, we will be almost sure that everything is fine [13:16:24] mark: do we have an available box? 
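A sketch of the rsync being discussed, seeding the netapp from ms6 (which follows the purge stream and so should be the fresher copy); the source and destination paths and the bandwidth cap are assumptions:

```bash
# Copy the thumb tree from ms6 onto the netapp mount, rate-limited.
rsync -a --bwlimit=20000 ms6:/export/thumbs/ /mnt/netapp/thumbs/
```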
[13:16:39] so a simple reinstall won't work here? [13:16:39] should I ask Rob? [13:16:57] oh we could [13:17:08] hashar: what do you think? what are the uptime requirements for the box? [13:17:13] i.e. can we withstand a downtime [13:17:33] we can take it down for a few hours, ideally during european morning [13:17:38] er, s/withstand/tolerate/ [13:17:55] would need to warn the developers on wikitech-l [13:18:01] if we can't even tolerate a bit of downtime of these kinds of services anymore we better triple our team size ;) [13:19:35] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&c=Miscellaneous+eqiad&h=gallium.wikimedia.org&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS [13:19:43] gallium has ton of unused memory ;-D [13:27:53] ah yeah [13:28:00] i have 2 new swift frontend nodes in esams as well [13:28:40] New patchset: Demon; "Tweaks to gerrit package:" [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/27531 [13:29:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:29:32] oh really? [13:29:34] what kind of boxes? [13:29:49] - [+] 2 frontend nodes? [13:29:50] - [+] configuration [13:29:50] - [+] Dell R620 [13:29:50] - [+] 2x intel E5-2640 [13:29:50] - [+] 16 GB memory [13:29:50] - [+] small SATA drives [13:30:03] nice [13:30:13] also two database class machines [13:30:13] with ssds [13:40:20] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [13:40:20] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [13:46:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds [14:09:36] heya paravoid, you around? can I ask you a ganglia question? [14:09:58] I am [14:10:10] I'm not a ganglia expert but shoot anyway [14:10:30] k, i don't think I need an expert [14:10:41] i'm trying to set up ganglia monitoring for hadoop [14:10:48] i kinnnnnda understand ganglia and our setup, but not really [14:10:59] so I think I just need to type out what I think I understand, and you can correct me where I'm wrong [14:11:19] so the analytics machines are part of the 'Miscellaneous eqiad' cluster data source [14:11:28] which in ganglia.pp are defined as [14:11:31] "carbon.wikimedia.org ms1004.eqiad.wmnet" [14:12:07] so those two machines [14:12:13] are set up as ganglia aggregators (right)? [14:12:24] for misc eqiad machines [14:12:29] yes [14:12:53] they each listen to a multicast addy that the misc eqiad machines send to? [14:12:55] correct? [14:13:04] yes [14:13:31] ok, so, the analytics machines gmond is already sending to this multicast addy [14:13:35] hadoop has built in ganglia support [14:13:44] but I need to give it some configs on where to send its stuff [14:13:51] woudl that be the local machine name :8649 then? [14:13:59] !log apt: including php-parsekit, python-async, python-git, python-gitdb, python-smmap backports for precise [14:14:06] somehow connecting to the local gmond? [14:14:11] Logged the message, Master [14:14:17] either the local machine or the aggregators I think [14:14:23] or, should I configure hadoop to also point at the multicast addy [14:14:23] but you should setup a special group for the analytics machines [14:14:24] ? [14:14:31] yeah that's what notpeter suggested too [14:14:44] can I try to see if I can get hadoop into the misc eqiad stuff first, and then it up as a separate cluster? 
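For reference, a generic sketch of the kind of no-change rebuild under discussion (a quantal source package rebuilt on precise); this is not necessarily how paravoid actually built the backports, and it assumes a quantal deb-src entry is configured on the build host:

```bash
# Fetch the quantal source package and rebuild it on a precise box/chroot.
apt-get source python-git           # needs a quantal deb-src line in sources.list
cd python-git-*/
dch --local '~precise1' 'No-change backport to precise'
dpkg-buildpackage -us -uc -b
```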
[14:15:04] it probably makes more sense to do it the other way round [14:15:06] i *think* the first step (getting hadoop stats into ganglia) is just a matter of a single hadoop config file [14:15:09] oh? [14:15:09] pok [14:15:10] ok [14:15:12] if you run a git log (or blame then log), you'll find some commits doing that [14:15:16] e.g. I did that for LVS [14:15:22] ok cool, i'll read that commit then [14:15:36] not sure what this comment means: [14:15:36] # NOTE: Do *not* add new clusters *per site* anymore, [14:15:36] # the site name will automatically be appended now, [14:15:36] # and a different IP prefix will be used. [14:16:44] that you make an "Analytics" group, and automatically site names will be appended, so "Analytics eqiad" etc [14:17:03] ahhhh site == eqiad,pmtpa etc. [14:17:04] ok [14:17:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:17:20] ahh i see [14:17:21] cool [14:17:59] ok, so to make my own group, should I set up gmond as an aggregator on one or two of the analytics machines? [14:18:29] yes, by setting $ganglia_aggregator = "true" in the puppet node entries [14:18:40] right, saw that [14:18:56] and ip_oct is arbitrary, as long as it is not used? [14:19:10] use the next free one [14:19:15] right, cool [14:19:35] ok, then, once that is all configed, then I should see these machines as their own group in the ganglia.wm.org gui? [14:19:53] if you make sure the gmetad.conf is adapted as well [14:19:57] oo [14:20:01] ah right [14:20:05] the data_sources [14:20:39] looks like the data_sources should have the site name appended, eh? [14:20:45] yes [14:20:46] Analytics eqiad [14:20:46] k [14:20:51] it's not very consistent [14:20:54] hehe [14:20:59] see a previous commit that does that [14:21:03] i think that file should be generated from the other data structure [14:21:04] you'll see it fixing all the bits [14:21:10] but whatever [14:21:43] yeah i'm reading yours now paravoid, the LVS one [14:21:52] pretty simple [14:23:01] ok hmm, while I have your attention i'm going to keep asking qs before I do this part [14:23:20] is there an example of someone setting up a, um, custom ganglia metric generator thingee? [14:23:20] heh [14:23:35] there are a couple of ways to do that [14:23:39] i see the mysql.pyconf stuff [14:23:45] but I don't think I need a custom ganglia module for this [14:24:04] ottomata: it was missing something, there was a subsequent commit [14:24:06] so from what I've read, there can be modules, and um, a way to 'spoof' ganglia metrics by just sending packets? [14:24:13] paravoid: oh your commit? 
[14:24:17] yes [14:24:24] yes [14:24:51] so I assume that the hadoop stuff is going to do the 'spoof' way [14:24:55] just by sending stuff over [14:25:00] no idea [14:25:11] aye, well, the instructions are in a hadoop-metrics.properties file [14:25:13] tha thte hadoop daemons read [14:25:17] nothing ganglia specific to configure [14:25:26] but, they need to be told where to send metrics to [14:25:37] probably the multicast group [14:25:42] hmmmmmmmmmmm right [14:25:46] hmmmmm, right [14:25:46] ah [14:25:50] hmm making sense now [14:25:52] ok cool [14:25:55] will try this, thanks guys [14:34:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds [14:35:27] welcome back Rob :) [14:35:41] take some pictures there, RobH [14:36:57] Danny_B|backup: gotta clear that well in advance and have smarthands on site [14:37:00] so thats not happening [14:37:06] but recently we pushed a bunch of them. [14:55:57] !log shutting down db1012 to remove evaluation ssd card [14:56:10] Logged the message, RobH [14:59:18] PROBLEM - Host db1012 is DOWN: PING CRITICAL - Packet loss = 100% [15:03:39] RECOVERY - Host db1012 is UP: PING OK - Packet loss = 0%, RTA = 26.51 ms [15:04:33] RECOVERY - Host db42 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [15:05:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:18] PROBLEM - MySQL Slave Delay on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:08:27] PROBLEM - MySQL Slave Running on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:08:36] PROBLEM - SSH on db42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:08:36] PROBLEM - MySQL Idle Transactions on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:08:54] PROBLEM - mysqld processes on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:09:03] New patchset: Ottomata; "Setting up ganglia aggregators for Analytics cluster." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27535 [15:09:03] PROBLEM - MySQL disk space on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:09:03] PROBLEM - Full LVS Snapshot on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:09:03] PROBLEM - MySQL Recent Restart on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:09:21] PROBLEM - MySQL Replication Heartbeat on db42 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:10:06] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/27535 [15:10:30] whaaaat? [15:10:36] oop [15:11:13] good ol' lint check, what would we do without you? :) [15:11:14] New patchset: Ottomata; "Setting up ganglia aggregators for Analytics cluster." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27535 [15:12:12] PROBLEM - Host db42 is DOWN: PING CRITICAL - Packet loss = 100% [15:12:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27535 [15:14:24] paravoid, would you mind reviewing that real quick? [15:23:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds [15:26:01] mark, if you are still around, could you review this real quick? 
[15:26:01] https://gerrit.wikimedia.org/r/#/c/27535/ [15:26:01] It is the discussed ganglia changes for analytics [15:26:38] i could merge it, and it would probably be ok since we discussed it, but it would be better to ask for review, right? [15:26:56] oh sorry mark [15:26:59] i miseed that you +1ed it [15:27:01] just saw that [15:27:12] i'm going to go ahead and merge it then [15:27:26] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27535 [15:28:24] :) [15:29:17] ottomata: so we resolved the redirectoin for serial in post on the one system yesterday [15:29:25] oh? [15:29:31] but i see ticket https://rt.wikimedia.org/Ticket/Display.html?id=3582 shows 1007 isnt booting right? [15:29:49] we fixed 1002 yesterday, different issue [15:30:01] just checkign with you that the status on this (1007) hasnt changed since ticket was made [15:30:11] cuz im going to go poke at it if its still busted. [15:30:20] yes, both 02 and 07 are busted [15:30:25] slightly different problems [15:30:28] both had the redirection reset [15:30:31] but we fixed that on both of them [15:30:32] now: [15:30:40] 07 will PXE boot and fully install, but will not boot from HDD [15:30:45] no matter what I set boot order too [15:31:07] 02 looks like it tries to PXE boot, but hangs at a certain point (or at least, I don't see anymore console output after a certain point) [15:32:21] ok, i see no open ticket for 1002, i will take a look at 1007 now [15:32:31] i'll reopen the old 1002 ticket as well [15:33:07] hrmm [15:33:12] i guess we didnt do a 1002 ticket [15:35:17] 1007 is hung at some efi shell [15:35:20] tht is locked up [15:35:40] !log working on analytics1007 per rt 3582 [15:35:51] Logged the message, RobH [15:37:03] damn it [15:37:13] i just sheared through my headphone cables a third time on a rack [15:37:15] they are gone now. [15:37:28] rip ue earphones [15:37:31] last pair i get of them [15:37:46] mark: you get any shure earbuds? [15:39:07] New patchset: Ottomata; "Need to set $cluster for analytics" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27536 [15:39:08] wait i mad a 1002 ticket yesterday [15:39:22] https://rt.wikimedia.org/Ticket/Display.html?id=3680 [15:39:27] pssh, i assigned it to you [15:39:42] hmm, i can't cahnge the owner! [15:40:15] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27536 [15:40:53] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27536 [15:42:43] heh [15:42:51] ahh, ops requests [15:42:54] i was lookin in eqiad queue [15:43:15] moved, and taken [15:43:23] bwerp, I am a noob RT creator [15:44:00] so ops requests are for folks outside of ops to put in requests [15:44:09] your an odd duck in your cross departmental responsibilities [15:44:28] but, if its something that is an onsite issue, you should feel free to create the relevant ticket in the queue for that site [15:44:34] eqiad for here, sdtpa for tampa, etc. [15:45:03] hmm, ok cool [15:45:05] good to know [15:45:25] sdtpa? not pmtpa? [15:45:33] (what do these letters stand for, btw?) [15:45:40] robh: want to run my troubleshooting by you on db42 see if i am missing something...u have a few mins? [15:46:13] sd switch and data and pm power medium ...tpa is the airport code [15:46:26] lemme ping you in a bit, cisco is giving me shit right now [15:46:37] cool...thx! [15:46:46] ARGHAEFQWEIPOFwed [15:46:53] ottomata: 1007 reset its redirection to off again!! 
[15:47:02] ;_; [15:47:07] its trying to drive us insane. [15:47:43] grwaaaaaa [15:47:47] that is annoying [15:47:50] unless something in cli is doing it [15:47:54] hmmm [15:47:54] not me! [15:47:56] im not seeing this on virts [15:48:03] lemme check cli [15:48:08] lets try to do all stuff in gui from here on out though other than the console redirect view [15:48:18] i set it back to the right stuff in gui [15:48:21] and booting it now. [15:48:22] boot order remanes teh same [15:48:27] well, even if no console [15:48:30] if boot order is good [15:48:32] it should boot, right? [15:48:38] and we shoudl eventually be able to log in? [15:48:50] should, unless its hitting an error and we dont see due to redirection not working [15:48:59] so its all set right and posting now [15:49:07] but it should boot regardless of redirection, yes [15:49:27] ok, i see memtest [15:49:30] did you just reboot it? [15:49:32] no [15:49:37] just connected to console [15:49:39] odd, nm, didnt reboot [15:49:43] it just spun all the fans like crazy [15:49:44] heh [15:49:57] fan test! [15:49:58] interesting, ciscos let two users on console [15:50:05] yeah, i saw a setting in bios for that [15:50:08] concurrent sessions or something [15:50:10] was set to 4 [15:52:38] nice firefox. i have 4gb memory and you are taking 1.9gb. [15:54:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:06] datacenter is loud sans headphones ;_; [16:02:36] RECOVERY - Host analytics1007 is UP: PING WARNING - Packet loss = 64%, RTA = 26.57 ms [16:03:55] RobH: no, my UEs died recently as I noticed on my SF flight [16:03:59] but I soldered them [16:04:19] now they're working fine again [16:04:23] my cable has been cut and replaced now two other times [16:04:31] its all frayed up and getting too short =P [16:04:38] hehe [16:04:43] and now the right monitor is staticy, this is logitec build [16:04:52] not the older more reliably style =P [16:04:57] yeah [16:04:58] reliable even [16:05:14] supposedly the UE 10 is still okish, but i dunno [16:05:27] the triplefi one, now at much lower price than back in the days [16:05:40] the logitec build has a cheaper monitor cable, and the earbuds slip off too easily. [16:05:55] but still pretty high right? [16:06:10] yeah, still over $100 iirc [16:06:19] hrmm, im going to take them apart when i get home, cannot break them any worse than they are. [16:06:43] that's what I was thinking as well [16:06:54] worked out fine [16:08:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.206 seconds [16:10:34] ottomata: so slllllooooowwwwww [16:10:36] =P [16:10:42] hehe [16:10:52] should we try the same on an02 while we wait? [16:11:14] lemme take alook at its settings [16:11:20] mark, ok, i've made ganglia group changes, and I ran puppet on nickel [16:11:28] or rather, I let puppet run [16:11:34] I see the change in the gmetad.conf file [16:12:00] I don't yet see the Analytics cluster in the 'Wikimedia grid' list at ganglia.wm.org [16:12:05] ottomata: so boot order is wrong on 1002 as well, going to set it right and do reinstall per the steps i just didn on 1007 [16:12:07] do I need to restart anything to get it to show up? or should I just wait [16:12:13] ok [16:13:53] i set it properly and its rebooting now as well 1002 that is [16:14:00] 1007 seems to have finished installer, and is reposting [16:14:02] k, i'm watching in console [16:14:03] if it loads os its fixed. 
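Returning to the hadoop-metrics.properties question from earlier: a minimal, assumed example of pointing the Hadoop daemons' metrics at the ganglia multicast group. GangliaContext31 is the stock Hadoop metrics-v1 ganglia context, but the file path and the multicast address are placeholders; only the :8649 port appears in the log:

```bash
# Write a minimal metrics config telling the dfs/mapred daemons where to send.
cat > /etc/hadoop/conf/hadoop-metrics.properties <<'EOF'
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=239.192.1.32:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=239.192.1.32:8649
EOF
```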
[16:14:13] hrmm, nm, still updating [16:14:19] grub, but will be done soon. [16:14:21] aye [16:14:38] not used to seeing it take so long. [16:15:30] there it goes (1007) rebooting [16:15:51] I am still super bummed that I am in the datacenter with no music! [16:15:56] =P [16:16:35] mark: i have lost so much of the UE cable I may just splice these monitors onto a ipod cable [16:16:37] heh [16:16:58] cuz the ue cable is fine to the y spilit to each monitor, will look funny but meh. [16:19:43] ...... [16:19:49] ottomata: 1007 just posted and rebooted. [16:20:36] 1002 is set to boot on disk first [16:20:45] but it seems to just blank screen, os may not be correct, but dunno yet [16:20:50] i kinda wanna see what 1007 does [16:21:40] PROBLEM - Host analytics1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:22:15] RECOVERY - Host db42 is UP: PING OK - Packet loss = 0%, RTA = 1.18 ms [16:25:06] now its just hitting blank screen when os should load. [16:25:07] wtf. [16:29:33] ottomata: so when i say to boot from the disk [16:29:34] it fails [16:29:42] when i specify it intentionally [16:29:50] this partman script has been used to make successful builds right? [16:39:31] !log authdns update for pc1001-1003 mgmt ip [16:39:42] Logged the message, RobH [16:43:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:43:33] PROBLEM - Auth DNS on ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [16:44:00] PROBLEM - Host cp1031 is DOWN: PING CRITICAL - Packet loss = 100% [16:45:30] PROBLEM - NTP on db42 is CRITICAL: NTP CRITICAL: No response from NTP server [16:49:18] ping robh [16:49:20] RobH, sorry [16:49:23] was afk for lunch [16:49:24] yes [16:49:35] all of the other ciscos have been installed using same partman [16:50:45] !log dist-upgrade and reboot on oxygen [16:50:56] Logged the message, notpeter [16:50:56] !log preilly synchronized php-1.21wmf1/extensions/MobileFrontend 'update for landing page' [16:51:08] Logged the message, Master [16:51:11] notpeter ns0 is down - pls restart it [16:52:28] !log restarting pdns on ns0 [16:52:39] Logged the message, notpeter [16:52:41] woosters_: thanks! :) [16:53:01] thks! [16:53:10] RECOVERY - Auth DNS on ns0.wikimedia.org is OK: DNS OK: 0.044 seconds response time. www.wikipedia.org returns 208.80.154.225 [16:54:08] woosters_: yes? [16:54:11] heya, notpeter [16:54:18] i've created a new analytics ganglia group [16:54:21] ottomata: hrmm, i dunno wtf is up with them. [16:54:22] robh - peter fixed it [16:54:34] how do I get it to show up in ganglia.wm.org? [16:54:52] ottomata: im going to work on a couple other tickets to clear my brain and loop back to them shortly. [16:55:38] ok cool, thanks [16:56:07] ottomata: you should be able to just run puppet on nickel [16:56:10] working on ciscos is annoying since the boot process is so slow. [16:56:29] notpeter, I did that [16:56:35] hrm [16:56:35] ok [16:56:36] i see the new conf in the gmetad file [16:56:40] do I need to restart the web gui? 
[16:56:46] I don't think so [16:56:51] it also might not be getting traffic yet [16:56:58] hmmm [16:57:01] it doesn't show things if they're completely empty [16:57:07] well, i know that the analytics machines do show up in ganglia [16:57:09] I'd trace the flow of packets with tcpdump [16:57:10] they did before [16:57:10] running a job now [16:57:14] or just leave it for 30 minutes [16:57:17] when they were in the misc eqiad group [16:57:21] hehe, it's been an hour [16:57:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds [16:57:31] ok, I can take a look if you'd like [16:57:36] http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=analytics1001.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [16:57:39] it is still in misc eqiad [16:57:54] can you link to the checkins you made? [16:57:57] yup [16:58:23] https://gerrit.wikimedia.org/r/#/c/27535/ [16:58:24] https://gerrit.wikimedia.org/r/#/c/27536/ [16:59:38] RECOVERY - udp2log log age for lucene on oxygen is OK: OK: all log files active [17:00:51] in fact, going to snag food, back shortly [17:00:58] hrm, ok. that looks right. I shall inwestigate further [17:01:35] New patchset: Reedy; "Initial stab at Wikidata config" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/27546 [17:01:43] PROBLEM - Varnish traffic logger on cp1042 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:01:43] PROBLEM - Varnish traffic logger on cp1044 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:10] PROBLEM - Varnish traffic logger on cp1023 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:12] ottomata: can you take a look at oxygen and double check that everything came back up properly?
[17:02:19] it looks like it to me, but I'd love confirmation [17:02:28] PROBLEM - Varnish traffic logger on cp1033 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:28] PROBLEM - Varnish traffic logger on cp1025 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:28] PROBLEM - Varnish traffic logger on cp1021 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:28] PROBLEM - Varnish traffic logger on cp1027 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:28] PROBLEM - Varnish traffic logger on cp1029 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:29] PROBLEM - Varnish traffic logger on cp1035 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:29] PROBLEM - Varnish traffic logger on cp1041 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:02:30] PROBLEM - Varnish traffic logger on cp1043 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:13] PROBLEM - Varnish traffic logger on cp1032 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:13] PROBLEM - Varnish traffic logger on cp1036 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:13] PROBLEM - Varnish traffic logger on cp1034 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:13] PROBLEM - Varnish traffic logger on cp1030 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:13] PROBLEM - Varnish traffic logger on cp1028 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:22] PROBLEM - Varnish traffic logger on cp1022 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:22] PROBLEM - Varnish traffic logger on cp1026 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:03:22] PROBLEM - Varnish traffic logger on cp1024 is CRITICAL: PROCS CRITICAL: 2 processes with command name varnishncsa [17:04:09] wt...? [17:07:07] uuuuhhh, weird [17:08:02] notpeter, oxygen looks cool [17:08:12] ottomata: ok, data gets to the aggrigators and that's then sent to the correct multicast address. but nickel doesn't pick it up [17:08:23] hm [17:08:34] also, hm, actually [17:08:41] I didn't get the aggregator conf to apply on an1001 [17:08:42] not sure why [17:08:47] 1010 picked it up no problem [17:08:53] but puppet didn't make any changes to 1001 [17:08:58] weird [17:09:02] maybe my hostname match is bad [17:09:06] seems like it should work though [17:09:17] so it still says deaf = yes [17:09:30] huh [17:09:39] so the data gets from 1010 to the multicast addy? [17:09:58] looks like it [17:10:11] I just don't see it get to nickel :/ [17:10:11] ok, was gonna wonder if gmond needed a restart, but if that's happening then that hsould be fine [17:10:13] hm [17:11:27] wwell those two processes on hose cp boxes have been there for more than a month [17:11:29] *those [17:14:38] ottomata: I would probably ask someone more knowledgeable about networking than I for wisdom on this issue [17:15:08] bwphhhh, mk, to me that means lesliecarr, but she is not on my list over there [17:17:45] New patchset: Matthias Mullie; "Make abusefilter emergency disable more sensible" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25855 [17:17:59] notpeter, how did you see the multicast traffic? [17:18:00] netcat? [17:18:12] tcpdump port 8649 [17:18:25] from 1010? or nickel? 
[17:18:37] 1002, 1010 and nickel [17:20:10] hmmm [17:20:17] i see analytics traffic on nickel [17:20:26] this is the old misc one: [17:20:26] analytics1001.wikimedia.org.55985 > 239.192.1.8.8649: [17:20:51] but also analytics1010.eqiad.wmnet.8649 > nickel.wikimedia.org.53259 [17:22:01] didn't look at 1001, tbh [17:22:44] ok, i'm going to try to get it to apply confs, maybe it is messing with the whole thing since it is not doing the correct multicast addy yet [17:23:12] I see it going to both, tbh [17:23:15] which is weird...... [17:23:23] yeah, applying conf++ [17:25:55] New patchset: Ottomata; "site.pp - manually specificying ganglia aggregator hostnames" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27549 [17:26:14] uh.... what's up with analytics1001? why doesn't it send snmp? [17:26:58] there are funny iptables rules maybe! [17:27:05] it should allow all internal traffic [17:27:08] New review: Werdna; "Fine by me." [operations/mediawiki-config] (master); V: 1 C: 1; - https://gerrit.wikimedia.org/r/25855 [17:27:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27549 [17:27:11] but maybe something is not right? [17:27:21] it should at least try to send in puppet [17:27:26] ? [17:27:38] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27549 [17:28:40] analytics1001 is being really weird [17:28:43] it won't apply these configs... [17:29:03] New patchset: Matthias Mullie; "Make abusefilter emergency disable more sensible" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25855 [17:30:00] notpeter, am I a dummy? [17:30:01] if ($hostname == "analytics1001" or $hostname == "analytics1010") { [17:30:01] $ganglia_aggregator = "true" [17:30:01] } [17:30:03] what's wrong with that? [17:30:19] that works for 1010 [17:30:40] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:30:47] I think that /analytics(10[0-9][0-9])\.(wikimedia\.org|eqiad\.wmnet)/ [17:30:53] isn't getting 1001 somehow.... [17:30:58] which is weird [17:31:04] but that's really what it looks like [17:31:17] hmmmmm [17:31:39] I'd try separating that into two node defs [17:31:41] that are simpler [17:31:43] won't hurt [17:31:48] ok [17:31:56] analytics1001 is diff enough that i'll have to do that eventually anyway [17:31:59] that regex def matches though [17:32:09] I know.... [17:32:14] might be puppet weirdness... [17:32:28] but yeah, that box doesn't seem to be getting the standard class....
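The ganglia back-and-forth above comes down to two questions: is the new aggregator actually listening on the multicast group, and is gmetad on nickel polling it? A minimal sketch of those checks, assuming the stock /etc/ganglia file locations and default init script names; the interface name is a guess and should be adjusted.

    # On the intended aggregator (analytics1001 / analytics1010):
    # it must not be deaf, and it needs a udp_recv_channel for the group.
    grep -E 'deaf|mute' /etc/ganglia/gmond.conf
    grep -A4 'udp_recv_channel' /etc/ganglia/gmond.conf    # expect the multicast group and port 8649
    sudo tcpdump -n -i eth0 udp port 8649                  # are metric packets actually arriving?
    sudo service gmond restart                             # after puppet rewrites the conf; the init script may be named ganglia-monitor instead

    # On nickel: gmetad only shows a cluster it has a data_source line for,
    # and it usually needs a restart after gmetad.conf changes.
    grep '^data_source' /etc/ganglia/gmetad.conf
    sudo service gmetad restart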
[17:32:28] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25855 [17:32:47] hmmmmmmmmmmmmmmmmmmm [17:32:59] yeah, seperate it out and see what happene [17:33:00] s [17:33:15] ahhh, i think i know, hang on [17:34:43] do tell [17:45:22] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 3 processes with command name varnishncsa [17:45:31] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 3 processes with command name varnishncsa [17:45:31] RECOVERY - Varnish traffic logger on cp1024 is OK: PROCS OK: 3 processes with command name varnishncsa [17:45:40] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 3 processes with command name varnishncsa [17:45:46] ok, how would I guess there were supposed to be three processes [17:45:48] sheesh [17:45:49] RECOVERY - Varnish traffic logger on cp1028 is OK: PROCS OK: 3 processes with command name varnishncsa [17:45:49] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 3 processes with command name varnishncsa [17:45:49] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 3 processes with command name varnishncsa [17:45:49] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:07] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:07] RECOVERY - Varnish traffic logger on cp1030 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:07] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:07] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:07] RECOVERY - Varnish traffic logger on cp1027 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:16] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:16] RECOVERY - Varnish traffic logger on cp1021 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.022 seconds [17:46:34] RECOVERY - Varnish traffic logger on cp1029 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:34] RECOVERY - Varnish traffic logger on cp1035 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:34] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 3 processes with command name varnishncsa [17:46:52] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 3 processes with command name varnishncsa [17:50:26] robh - on rt3644,is it 2 or 3 that you set up? [18:03:58] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [18:03:58] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours [18:03:58] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [18:11:58] notpeter: can you please merge and push this change https://gerrit.wikimedia.org/r/#/c/27554/ [18:12:07] sure [18:12:28] notpeter: thanks! [18:12:31] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27554 [18:18:25] binasher: for the Wikidata test wiki, are we alright to use s3 for its database? I know it can be moved later if necessary... 
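Back on the varnishncsa flaps earlier in this stretch: the alert is a Nagios process-count check, so "how many processes are supposed to be there" is encoded in the check's threshold rather than anywhere on the host. A rough sketch of what that check looks like, assuming the stock nagios-plugins path; the exact thresholds used in production may differ.

    pgrep -c varnishncsa                                        # what is actually running on the cache host
    /usr/lib/nagios/plugins/check_procs -c 3:3 -C varnishncsa   # CRITICAL unless exactly three loggers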
[18:19:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:19:21] Reedy: yep, that looks ok [18:27:55] New review: Tychay; "LGTM" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25855 [18:31:16] PROBLEM - Host db42 is DOWN: PING CRITICAL - Packet loss = 100% [18:32:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.389 seconds [18:39:04] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours [18:49:04] binasher: update on db42 [18:49:18] it wont load into the os, so cmjohnson1 is going to burn and load it up on rescue cd [18:49:27] get networking up and ssh server, so you can login to it [18:49:53] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:49:53] PROBLEM - Host virt1003 is DOWN: PING CRITICAL - Packet loss = 100% [18:49:53] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [18:49:56] then either cmjohnson1, you, or myself can attempt to mount the filesystem and pull data. [18:50:14] !log i need to pull virt1001-1003 out of nagios later, as they are now renamed [18:50:25] Logged the message, RobH [18:51:06] RobH: fyi pc1001-1003 are all ready to go [18:52:12] RobH: cmjohnson1: sounds good, let me know when its ssh'able [18:53:30] binasher: so pc1001-1003 need ip's assigned in dns, adding to dhcpd files (can just change existing virt1001-1003 entries), and os load [18:53:37] did you want to do that or have me handle it later? [18:55:09] RobH: would you be able to handle it by the end of day monday? [18:56:44] binasher: Hey, when would be a good time in the near future for you to get hijacked by the fundraising tech corner? We'd love to have a long convoluted conversation about caching with you. [18:57:04] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours [18:57:55] binasher: i assumed i would handle it today or tomorrow, so yes =] [18:58:14] perfect [18:58:40] binasher: cool, stealing back rt 3644 and will update [18:58:46] LeslieCarr_afk: thanks for handling networking=] [19:02:56] !log stopping puppet on brewster [19:03:07] Logged the message, notpeter [19:06:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:09:02] fyi, cleaned up the https://noc.wikimedia.org/ index a bit, lemme know if anything else obvious is missing there [19:12:11] <^demon> Eloquence: integration.mw.o is available over https, maybe use that link. [19:12:51] <^demon> Same w/ the others, actually. [19:13:11] * AaronSchulz runs into https://lists.ubuntu.com/archives/kernel-bugs/2010-August/136118.html :) [19:14:44] AaronSchulz: Have you tried installing Linux? [19:14:47] done [19:14:54] Reedy: ? [19:17:59] The requested URL /CephWiki/core/Main_Page/mw-config/index.php was not found on this server. [19:18:06] Reedy: did someone break the installer? [19:18:20] "Please set up the wiki first. " link is broken [19:18:40] * AaronSchulz works around by changing the url [19:19:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.047 seconds [19:47:10] RECOVERY - Host db42 is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [19:49:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/26956 [19:49:35] New patchset: Eloquence; "Replace SVN link with Gerrit link and add a few more helpful links." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27574 [19:53:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:04] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [20:06:13] PROBLEM - Host db42 is DOWN: PING CRITICAL - Packet loss = 100% [20:07:57] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27517 [20:09:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds [20:12:06] PROBLEM - Auth DNS on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [20:12:32] !log rebooting analytics1007 [20:12:43] Logged the message, Master [20:13:18] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [20:16:18] New patchset: Hashar; "(bug 40686) zuul role for production" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27611 [20:16:20] Reedy, on OTRSwiki I believe we usually have ENotif enabled to receive emails for changes to watchlisted pages - but it doesn't seem to be working anymore - last known email was October 8, do you happen to know if it was disabled? [20:17:10] Not that I know of... [20:17:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27611 [20:17:44] binasher: i was stoked to see a few log lines in nc on vanadium this morning but the link seems to have died -- curling the pixel service doesn't seem to generate any logs, and i am hitting the eqiad its afaik [20:17:53] *eqiad bits [20:18:24] RECOVERY - Auth DNS on labs-ns1.wikimedia.org is OK: DNS OK: 0.096 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [20:19:27] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 0.072 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [20:31:13] hmm, is there any other way to figure out why it's not working then? [20:32:03] PROBLEM - Puppet freshness on db1038 is CRITICAL: Puppet has not run in the last 10 hours [20:32:03] PROBLEM - Puppet freshness on mw38 is CRITICAL: Puppet has not run in the last 10 hours [20:32:03] PROBLEM - Puppet freshness on db1043 is CRITICAL: Puppet has not run in the last 10 hours [20:32:03] PROBLEM - Puppet freshness on mw44 is CRITICAL: Puppet has not run in the last 10 hours [20:32:03] PROBLEM - Puppet freshness on db55 is CRITICAL: Puppet has not run in the last 10 hours [20:32:04] PROBLEM - Puppet freshness on mw74 is CRITICAL: Puppet has not run in the last 10 hours [20:32:04] PROBLEM - Puppet freshness on mw70 is CRITICAL: Puppet has not run in the last 10 hours [20:32:05] PROBLEM - Puppet freshness on srv263 is CRITICAL: Puppet has not run in the last 10 hours [20:32:05] PROBLEM - Puppet freshness on srv249 is CRITICAL: Puppet has not run in the last 10 hours [20:32:06] PROBLEM - Puppet freshness on srv291 is CRITICAL: Puppet has not run in the last 10 hours [20:32:06] PROBLEM - Puppet freshness on srv190 is CRITICAL: Puppet has not run in the last 10 hours [20:33:18] There's nothing that changed with regards to otrswiki on thatday [20:37:02] otrswiki should've been swapped to 1.21wmf1 on the 3rd.. 
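On the db42 recovery plan a bit further up (boot a rescue CD, bring networking and an ssh server up, then mount the filesystem and pull the data off), the sequence from inside the rescue environment is roughly the following. Addresses, device names and paths here are placeholders, not db42's real layout.

    # Bring up networking and sshd so others can log in.
    ip addr add 10.0.0.42/24 dev eth0          # placeholder address
    ip route add default via 10.0.0.1          # placeholder gateway
    service ssh start                          # or /etc/init.d/ssh start, depending on the rescue image

    # Mount the old filesystem read-only and copy the data somewhere safe.
    mkdir -p /mnt/db42
    mount -o ro /dev/sda1 /mnt/db42            # placeholder device; may be an LVM volume instead
    rsync -aH --progress /mnt/db42/srv/sqldata/ remotehost:/srv/db42-recovery/   # placeholder paths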
[20:42:33] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [20:42:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:48:15] !log flipping payments.wikimedia.org to eqiad cluster to live-test [20:48:26] Logged the message, Master [20:54:06] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours [20:55:54] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100% [20:57:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.032 seconds [21:07:27] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.44 ms [21:09:03] ^ lies ...it's not back up [21:18:14] New patchset: Ryan Lane; "Update project search" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27630 [21:18:55] Change merged: Ryan Lane; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27630 [21:29:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:36:42] !log bsitu Started syncing Wikimedia installation... : Update ArticleFeedbackv5, MoodBar, PageTriage to master [21:36:54] Logged the message, Master [21:42:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.105 seconds [21:53:37] New patchset: Pyoungmeister; "adding some more macs for eqiad mc hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27633 [21:54:41] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27633 [21:58:09] !log bsitu Finished syncing Wikimedia installation... : Update ArticleFeedbackv5, MoodBar, PageTriage to master [21:58:20] Logged the message, Master [22:08:13] New patchset: Ryan Lane; "If nick is taken ghost old nick and change nick" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27634 [22:08:28] Change merged: Ryan Lane; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27634 [22:16:44] !log olivneh synchronized php-1.21wmf1/extensions/E3Experiments [22:16:56] Logged the message, Master [22:17:05] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27633 [22:17:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:17:19] New patchset: Ryan Lane; "Up version to 1.3" [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27638 [22:17:53] Change merged: Ryan Lane; [operations/debs/adminbot] (master) - https://gerrit.wikimedia.org/r/27638 [22:20:03] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours [22:23:49] how did we choose swift over ceph in the first place, anyway? [22:24:06] was there something good about swift? [22:25:16] it was before i started - you may want to search for emails from back then, especially from ben ? [22:25:28] he was very good about doing performance analysis and stuff [22:27:12] TimStarling: ceph wasn't even close to stable when we chose swift [22:27:31] LeslieCarr: ben had nothing to do with the selection process [22:27:43] ah okay [22:27:45] LeslieCarr: it was…. what's his name…. 
[22:27:48] russ [22:27:50] nelson [22:27:57] there was an evaluation process [22:28:00] it's documented on wikitech [22:28:17] TimStarling: http://wikitech.wikimedia.org/view/Media_server/Distributed_File_Storage_choices [22:28:33] most of its header files say copyright 2004-2006, but I gather it was mostly a research project back then [22:28:39] yes [22:28:45] github shows a lot of commits in the last couple of years [22:29:02] dreamhost started providing money support for it within the last few years [22:29:35] Mark Shuttleworth has invested in it also [22:29:40] http://www.inktank.com/news-events/new/shuttleworth-invests-1-million-in-ceph-storage-startup-inktank/ [22:29:48] * Ryan_Lane nods [22:29:53] it's more of a choice now [22:30:21] * AaronSchulz struggles setting up a rados gateway [22:30:44] is it harder than swift to install? [22:30:48] yes [22:31:03] because I looked at the swift "all-in-one" documents and that looked pretty difficult [22:31:06] and its error messages are cryptic [22:31:07] the all-in-one might be easier than swift [22:31:15] how to set up a test server in 15 easy steps [22:31:19] and its documentation is total shit [22:31:27] ceph is much more complicated than swift from what I've seen [22:31:29] though messing with the gateway is annoying [22:31:33] it also does much more than swift [22:31:37] yep [22:31:44] well, like I said on the ML, ceph is enormous in terms of code size [22:31:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.034 seconds [22:31:48] NoSuchKey [22:31:49] ceph is way more interesting from a filesystem perspective [22:31:51] * AaronSchulz sighs [22:32:05] TimStarling: are you including cephFS? [22:32:14] I wonder how large that code is [22:33:26] I'm including everything in https://github.com/ceph/ceph.git [22:33:38] I see a few files with CephFS in the name [22:33:49] ./src/client/hadoop/ceph/CephFS.java [22:34:24] that's a client, I think the server is probably here also but called something different internally [22:35:12] they've kinda messed up by overloading the term ceph [22:35:34] they use it for both the name of the whole project and the posix filesystem layered over rados [22:35:48] right, so CephFS is the kernel module, according to the website [22:36:22] /a/ kernel module, yes [22:36:29] they also have rbd which is also a kernel module [22:36:39] (all of the above afaik, which is not much) [22:38:33] it's in here somewhere, the mount.ceph utility is here [22:38:43] as is some fuse stuff [22:55:02] New patchset: Pyoungmeister; "adding macs for pc2 and pc3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27643 [22:56:00] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27643 [22:56:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27643 [23:05:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:08:28] !log olivneh synchronized php-1.21wmf1/extensions/E3Experiments [23:08:40] Logged the message, Master [23:11:37] Change merged: Dzahn; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27574 [23:13:13] !log dzahn synchronized docroot/noc/index.html [23:13:24] Logged the message, Master [23:15:13] !log dzahn synchronized docroot/noc/index.html [23:15:25] Logged the message, Master [23:19:11] !log manually copy index.html for noc from /common/docroot/noc to /h/w/htdocs/noc/ [23:19:23] Logged the message, Master [23:19:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.044 seconds [23:19:56] <--it is in git repo and ./common/docroot but Apache still uses /h/w/htdocs [23:24:15] New patchset: Dzahn; "fix document root for noc.wikimedia.org to ./common/.. dir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27646 [23:25:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27646 [23:25:36] New patchset: Dzahn; "fix document root for noc.wikimedia.org to ./common/.. dir" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27646 [23:26:29] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27646 [23:36:26] I'm looking at the slope of new objects in swift [23:36:40] it's quite impressive [23:37:09] it includes thumbs, which complicates the data a bit though [23:40:59] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours [23:40:59] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [23:46:12] cmjohnson1: hey, are you still in pmtpa? [23:52:29] mutante: what about I349b83b860a539e51d7f477f4fa3eb21e9eb24e8 ? [23:52:31] !g I349b83b860a539e51d7f477f4fa3eb21e9eb24e8 [23:52:31] https://gerrit.wikimedia.org/r/#q,I349b83b860a539e51d7f477f4fa3eb21e9eb24e8,n,z [23:52:50] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:53:51] Aha [23:57:00] New review: Reedy; "I'd already done this in https://gerrit.wikimedia.org/r/#/c/23425/ :p" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27646 [23:59:03] Switched to branch 'production' [23:59:03] Your branch is behind 'origin/production' by 376 commits, and can be fast-forwarded.
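For the noc.wikimedia.org document-root mismatch noted above (the files live in the git checkout under ./common/docroot while Apache still serves /h/w/htdocs), the quickest way to see which DocumentRoot Apache is really using, and to verify the puppet fix once it has applied, is something like this; the vhost layout is assumed, not checked.

    apache2ctl -S                                              # which vhost answers for noc, and from which config file
    grep -rn 'DocumentRoot' /etc/apache2/sites-enabled/ | grep -i noc

    # After the merged puppet change has run:
    apache2ctl configtest && sudo service apache2 reload
    curl -sI http://noc.wikimedia.org/ | head -n1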
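Going back to the rados gateway struggles mentioned earlier (cryptic errors, NoSuchKey): before debugging the gateway itself it is worth confirming that the underlying cluster is healthy and that a gateway user exists at all. A minimal sketch, assuming radosgw is already declared in ceph.conf and fronted by a web server; the uid and endpoint below are placeholders.

    ceph -s                 # overall cluster health
    rados lspools           # the gateway's pools should show up here once it has started

    # Create an S3-style gateway user; note the generated access and secret keys.
    radosgw-admin user create --uid=testuser --display-name="Test User"

    # An unauthenticated GET against the gateway should return an S3-style XML
    # error document rather than a plain web-server 404.
    curl -s http://radosgw.example.org/    # placeholder endpoint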